Video SLCM-02
Module/Code SLCM-02: Introduction to Classification Models II
Introduction to Data Mining
https://tau-data.id/course/adm/
Supervised Learning - Classification 02
https://tau-data.id/lesson/adm-classification-02/
(C) Taufik Sutanto
Outline:¶
- Decision Tree
- Random Forest
- Support Vector Machines
- Hyperparameter Optimization
- Model Selection
# Run this cell ONLY if you are using Google Colab.
# On a local Jupyter Notebook, run these commands in a terminal (command prompt) instead, without the leading "!".
!pip install graphviz dtreeviz
!mkdir data
!wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/diabetes_data.csv
!wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/hr_data.csv
# Import the modules used in this notebook
import warnings; warnings.simplefilter('ignore')
# The star import comes first so that the explicit imports below take precedence over any name it brings in
from dtreeviz.trees import *
from collections import Counter
import graphviz, pandas as pd, matplotlib.pyplot as plt, numpy as np, seaborn as sns
from pandas.plotting import scatter_matrix
from matplotlib.colors import ListedColormap
from IPython.display import display, HTML
from sklearn import model_selection, tree, svm, preprocessing, neighbors
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.gaussian_process.kernels import RBF
from sklearn.svm import SVC
from sklearn.datasets import make_blobs, make_moons, make_circles, make_classification
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
sns.set(style="ticks", color_codes=True)
# Reload the iris data
df = sns.load_dataset("iris")
g = sns.pairplot(df, hue="species")
df.describe(include='all')
# Separate Data
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df['species']
seed = 99
validation_size = 0.3
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
print(x_train.shape, x_test.shape, len(y_test))
Decision Tree¶
Effect of tree depth on the shape of the model¶
- Another example: http://www.saedsayad.com/decision_tree.htm
- Ross Quinlan Website: https://www.rulequest.com/Personal/
When to use:
- Target: binomial/nominal.
- Predictors (input): binomial, nominal, and/or interval (ratio).
Advantage:
- Fast and embarrassingly parallel.
- No iterations, so it suits Big Data technology (e.g. Hadoop) [map-reduce friendly].
- Interpretability.
- Robust to outliers and missing values.
Disadvantage:
- Non-probabilistic (ad hoc heuristic) +/-
- Targets with many classes.
- Sensitive to small changes in the data (instability).
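The effect of tree depth can be explored with a small sketch. It is not part of the original notebook: it reloads iris through `load_iris` so that it is self-contained, and the names `depth_results`, `Xtr`, `Xte` are only illustrative. It fits trees with different `max_depth` values and reports the actual depth, the number of leaves, and the train/test accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X_iris, y_iris, test_size=0.3, random_state=99)

depth_results = {}
for depth in (1, 2, 3, None):  # None lets the tree grow until its leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    depth_results[depth] = (model.get_depth(), model.get_n_leaves(),
                            model.score(Xtr, ytr), model.score(Xte, yte))
    # (actual depth, number of leaves, train accuracy, test accuracy)
    print(depth, depth_results[depth])
```

A very shallow tree cannot separate three classes, while an unrestricted tree keeps splitting until it fits the training data; comparing the train and test columns shows where extra depth stops helping.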
# Decision Tree: http://scikit-learn.org/stable/modules/tree.html
from sklearn import tree
DT = tree.DecisionTreeClassifier()
# Deliberately using the default parameters; (hyper)parameter optimization is discussed later
DT = DT.fit(x_train, y_train)
y_DT = DT.predict(x_test)
print(accuracy_score(y_test, y_DT))
print(confusion_matrix(y_test, y_DT))
print(classification_report(y_test, y_DT))
# Variable importance: one of the advantages of a decision tree
DT.feature_importances_
# Another advantage of a decision tree that other models do not have: the fitted model can be drawn
# WARNING
# 1. This cell cannot be run on Google Colab.
# 2. It needs the Graphviz software plus a system variable (PATH) setting;
#    instructions: https://stackoverflow.com/questions/49471867/installing-graphviz-for-use-with-python-3-on-windows-10
import graphviz
dot_data = tree.export_graphviz(DT, out_file=None)
graph = graphviz.Source(dot_data)
graph.render("iris")
var_names = ['sepal_length','sepal_width','petal_length','petal_width']
categories = ['Setosa', 'Versicolor', 'Virginica']
dot_data = tree.export_graphviz(DT, out_file=None,
                                feature_names=var_names,
                                class_names=categories,
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
graph
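When Graphviz is not available, a plain-text view of the same kind of tree can be printed instead. The sketch below is an alternative that is not in the original notebook; it uses `sklearn.tree.export_text` on a small depth-2 tree fitted to iris (the names `text_tree` and `rules` are illustrative).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
text_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One line per split or leaf, indented by depth; no external software is needed
rules = export_text(text_tree, feature_names=list(iris.feature_names))
print(rules)
```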
Curse of Dimensionality¶
# Let's try to improve on this with a Random Forest
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
y_rf = rf.predict(x_test)
print('Accuracy = ', accuracy_score(y_test, y_rf))
print(confusion_matrix(y_test, y_rf))
print(classification_report(y_test, y_rf))
# Variable importance
importances = rf.feature_importances_
# Spread of the importances across the individual trees ("est" avoids reusing the name of the sklearn "tree" module)
std = np.std([est.feature_importances_ for est in rf.estimators_], axis=0)
indices = np.argsort(importances)[::-1]
# Print the feature ranking
print("Feature ranking:")
for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
Support Vector Machine (SVM)¶
Suppose the data are given as $\{(\bar{x}_1,y_1),...,(\bar{x}_n,y_n)\}$, where $\bar{x}_i$ is the input pattern of the $i$-th data point and $y_i$ is its desired target value. The categories (classes) are represented by $y_i \in \{-1,1\}$. A hyperplane that separates the two classes (when they are "linearly separable") is: $$ \bar{w}'\bar{x} + b=0 $$ where $\bar{x}$ is the input vector (predictors), $\bar{w}$ is the weight vector, and $b$ is called the bias.
SVM Model (Hard Margin):¶
- Let $X_0$ be a vector on the plane (line) $w'X_0 + b = -1$.
- Let $r$ be the distance between the two margin planes, i.e. between the support vectors (SV) of the two classes.
- Let $X$ be the point on the plane $w'X + b = 1$ closest to $X_0$, so that $X = X_0 + r\,\frac{w}{||w||}$.
- Note that $w$ is perpendicular to the hyperplane $w'X + b = 0$, and $\frac{w}{||w||}$ is its unit vector.
- Hence $w'X + b = 1$ can be written as $w'\left(X_0 + r\,\frac{w}{||w||}\right) + b = 1$,
- or $w'X_0 + r\,\frac{||w||^2}{||w||} + b = 1 \Rightarrow w'X_0 + b = 1 - r||w|| \Rightarrow -1 = 1 - r||w||$,
- which gives $r = \frac{2}{||w||}$.
- In conclusion, the optimal hyperplane is obtained by maximizing $\frac{2}{||w||}$, or equivalently by minimizing $\frac{||w||}{2}$.
- More details here: https://nlp.stanford.edu/IR-book/html/htmledition/support-vector-machines-the-linearly-separable-case-1.html
- What is the effect of an outlier on this model?
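The result $r = \frac{2}{||w||}$ can be checked numerically. The sketch below is an illustration that is not in the original notebook: it assumes a separable toy dataset from `make_blobs` and approximates a hard margin with a large `C`. It computes the margin width from the fitted weights and shows that the support vectors lie on the planes where the decision value is close to -1 or +1.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# 20 linearly separable points; a large C approximates the hard-margin model
X_sep, y_sep = make_blobs(n_samples=20, centers=2, random_state=6)
hard = SVC(kernel='linear', C=1000).fit(X_sep, y_sep)

w = hard.coef_[0]
margin_width = 2 / np.linalg.norm(w)                      # r = 2 / ||w||
sv_values = hard.decision_function(hard.support_vectors_)  # w'x + b at each support vector

print('margin width r =', margin_width)
print("w'x + b at the support vectors:", np.round(sv_values, 3))
```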
Support Vector Machine: Soft Margin¶
- Is the effect of an outlier still the same in this model? How does it relate to the value of C?
- A large C means a low tolerance for outliers (margin violations), and vice versa.
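One way to see the role of C is to count support vectors as it changes. The following sketch is not from the original notebook and the toy data and C values are arbitrary choices: a small C tolerates many margin violations, so the margin is wide and many points become support vectors, while a large C approaches the hard-margin solution.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X_c, y_c = make_blobs(n_samples=20, centers=2, random_state=6)

n_sv = {}
for C in (0.001, 1, 1000):
    # n_support_ holds the number of support vectors per class
    n_sv[C] = int(SVC(kernel='linear', C=C).fit(X_c, y_c).n_support_.sum())
    print('C =', C, '-> number of support vectors:', n_sv[C])
```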
Dual and Quadratic Solver¶
- The optimization above is usually solved through its dual form.
- The optimal parameters are then approximated numerically with a Quadratic Programming solver.
- Note that the objective function is convex ==> it has a global minimum.
- The optimum depends only on the data points on the margin (the support vectors), so the model can be more efficient (once the SVs are known).
- The SVs can also be used to analyze the "error bound": http://www.svms.org/vc-dimension/
Interpretation¶
- Recursive Feature Elimination (RFE) method : https://link.springer.com/content/pdf/10.1023/A:1012487302797.pdf
- Look at the square of each component of w (higher is better).
- Be careful: some discussions on the internet claim that the sign (+/-) indicates the importance of each variable, but this is not always true, and a single counterexample is enough to show it.
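A minimal sketch of both ideas, assuming a binary iris subset (setosa vs. versicolor) with standardized features; the variable names are illustrative and this is not the authors' code. It prints the squared weights of a linear SVM next to the ranking produced by scikit-learn's `RFE`, which repeatedly drops the feature with the smallest squared weight.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
keep = iris.target < 2                       # binary problem: setosa vs. versicolor
X_bin = StandardScaler().fit_transform(iris.data[keep])
y_bin = iris.target[keep]

lin_svm = SVC(kernel='linear', C=1).fit(X_bin, y_bin)
w_squared = lin_svm.coef_[0] ** 2            # squared weight of each feature

# Rank 1 is the last feature left after eliminating one feature per round
selector = RFE(SVC(kernel='linear', C=1), n_features_to_select=1).fit(X_bin, y_bin)

for name, w2, rank in zip(iris.feature_names, w_squared, selector.ranking_):
    print(f'{name:20s} w^2 = {w2:.3f}  RFE rank = {rank}')
```

Because RFE refits the model after every elimination, its ranking can differ from a single sort of the squared weights.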
What about categorical data?¶
- The same as in (logistic) regression ==> dummy (indicator) variables.
- For example, X1 = {a,b,c} ==> X1_a = [1,0,0], X1_b = [0,1,0], X1_c = [0,0,1]
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html
# Example
df = pd.DataFrame({'X1': ['a', 'b', 'a','c','a'],'X2': [1, 2, 3, 2, 1]})
df = pd.get_dummies(df) # get_dummies(df, prefix=['dummy'])
df
Data Normalization/Standardization¶
- As with (logistic) regression, the predictors/features of an SVM model need to be standardized/normalized.
- http://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range
- Be careful: standardize the data only after outliers have been handled properly.
scaler = preprocessing.StandardScaler(with_mean=True, with_std=True)
df['X2'] = scaler.fit_transform(df[['X2']])
df
# Example: plotting the optimal hyperplane
# http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane.html#example-svm-plot-separating-hyperplane-py
X, y = make_blobs(n_samples=20, centers=2, random_state=6) # we create 20 separable points
clf = svm.SVC(kernel='linear', C=1000) # fit the model, don't regularize for illustration purposes
clf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, s=30, cmap=plt.cm.Paired)
ax = plt.gca();xlim = ax.get_xlim(); ylim = ax.get_ylim()
# create grid to evaluate model
xx = np.linspace(xlim[0], xlim[1], 30);yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)
ax.contour(XX, YY, Z, colors='k', levels=[-1, 0, 1], alpha=0.5,linestyles=['--', '-', '--'])# plot decision boundary and margins
ax.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=100,linewidth=1, facecolors='none', edgecolors='k')# plot support vectors
plt.show()
SVM Kernel (trick)
Definition of a Kernel Function¶
- If for all $\bar{x},\bar{z} \in X$ we have
$$\kappa (\bar{x},\bar{z})=\langle\phi (\bar{x}),\phi (\bar{z})\rangle$$ then $\kappa$ is called a kernel function (and the function $\phi$ is called the feature map).
- Note that the value of the kernel function is a scalar (an inner product).
- This function is used in SVM (and in other DM/ML models that can be expressed in terms of inner products).
- Note that the SVM model is mostly expressed in terms of inner products (i.e. w.x).
- See here for more details: https://nlp.stanford.edu/IR-book/html/htmledition/nonlinear-svms-1.html
Example 1¶
- Let $X\subseteq \Re^2$ and $\phi : \bar{x}=(x_1,x_2)\rightarrow \phi (\bar{x})=(x_1^2, x_2^2,\sqrt{2}x_1x_2)\in F=\Re^3$.
- Then
$\langle\phi(\bar{x}),\phi(\bar{z})\rangle$
$=\langle(x_1^2,x_2^2,\sqrt{2}x_1x_2),(z_1^2,z_2^2,\sqrt{2}z_1z_2)\rangle$
$=x_1^2z_1^2+x_2^2z_2^2+2x_1x_2z_1z_2$
$=(x_1z_1+x_2z_2)^2=\langle\bar{x},\bar{z}\rangle^2$
- Hence $\kappa(\bar{x},\bar{z})=\langle\bar{x},\bar{z}\rangle^2$ is a kernel function and $F$ is its feature space.
Example 2¶
- Let x = (x1, x2, x3) and y = (y1, y2, y3),
- with the feature map f(x) = (x1², x1x2, x1x3, x2x1, x2², x2x3, x3x1, x3x2, x3²);
- then the kernel is K(x, y) = <f(x), f(y)> = <x, y>².
- Numerical example: let x = (1, 2, 3) and y = (4, 5, 6). Then:
- f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)
- f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)
- <f(x), f(y)> = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024
- Complicated! Using the kernel function, the computation simplifies to:
- K(x, y) = (4 + 10 + 18)² = 32² = 1024
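The arithmetic above can be reproduced in a few lines of NumPy. This is only a check of the worked example; the helper name `feature_map` is illustrative.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

def feature_map(v):
    """Explicit map f(v): all pairwise products v_i * v_j, flattened to 9 values."""
    return np.outer(v, v).ravel()

explicit = int(feature_map(x) @ feature_map(y))  # inner product in the 9-dimensional feature space
kernel = int((x @ y) ** 2)                       # same value computed from the 3-dimensional inputs
print(explicit, kernel)  # 1024 1024
```

The kernel route needs one 3-term inner product and a square, while the explicit route builds two 9-dimensional vectors first; the gap grows quickly with the input dimension.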
Well-Known Kernel Functions
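The original list of kernels is not reproduced here. As a stand-in, the sketch below evaluates the four kernels that `svm.SVC` accepts by name (`linear`, `poly`, `rbf`, `sigmoid`) through `sklearn.metrics.pairwise`. The parameter values are arbitrary assumptions chosen for illustration; the two rows of `A` are the vectors from Example 2.

```python
import numpy as np
from sklearn.metrics.pairwise import (linear_kernel, polynomial_kernel,
                                      rbf_kernel, sigmoid_kernel)

A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

K_lin = linear_kernel(A)                                   # <x, z>
K_poly = polynomial_kernel(A, degree=2, gamma=1, coef0=0)  # (gamma <x, z> + coef0)^degree
K_rbf = rbf_kernel(A, gamma=0.1)                           # exp(-gamma ||x - z||^2)
K_sig = sigmoid_kernel(A, gamma=0.01, coef0=0)             # tanh(gamma <x, z> + coef0)

print(K_lin, K_poly, K_rbf, K_sig, sep='\n\n')
```

With `degree=2`, `gamma=1`, `coef0=0`, the off-diagonal entry of `K_poly` is the value `<x, y>²` from Example 2.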
SVM Binary to MultiClass
Pros
- Good accuracy
- Works well for relatively small data samples
- Depends only on the SVs ==> improves efficiency
- Convex ==> global minimum ==> convergence is guaranteed
Cons
- Inefficient for large data
- Accuracy is sometimes low for multiclass problems (it is hard to capture the relationships between categories in the model)
- Not robust to noise
# Example: binary SVM (with and without a kernel)
# Loading Data
df = sns.load_dataset("iris")
df2 = df[df['species'].isin(['setosa','versicolor'])]
print(df2.shape)
df2.sample(7)
# Separate the data
X = df2[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df2['species']
seed = 9
validation_size = 0.3
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
print(X_train.shape, len(Y_test))
# Fitting and evaluate the model
dSVM = svm.SVC(C = 10**5, kernel = 'linear')
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, y_SVM))
print(confusion_matrix(Y_test, y_SVM))
print(classification_report(Y_test, y_SVM))
# The Support Vectors
print('Indices of the SVs: ', dSVM.support_)
print('Support vectors: \n', dSVM.support_vectors_)
# Model Weights for interpretations
print('w = ',dSVM.coef_)
print('b = ',dSVM.intercept_)
# Using kernels: http://scikit-learn.org/stable/modules/svm.html#svm-kernels
for kernel in ('sigmoid', 'poly', 'rbf'):
    dSVM = svm.SVC(kernel=kernel)
    dSVM.fit(X_train, Y_train)
    y_SVM = dSVM.predict(X_test)
    print(kernel, accuracy_score(Y_test, y_SVM))
# Example: multiclass SVM (with and without a kernel)
# Separate the data
X = df[['sepal_length','sepal_width','petal_length','petal_width']]
Y = df['species']
seed = 9
validation_size = 0.3
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
print(X_train.shape, len(Y_test))
# One versus all (one-vs-rest): http://www.jmlr.org/papers/volume5/rifkin04a/rifkin04a.pdf
dSVM = svm.LinearSVC()
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, y_SVM))
y_SVM
# There are 3 classifiers (as expected)
dSVM.coef_
# One-vs-one: svm.SVC trains one classifier per pair of classes; decision_function_shape only changes the shape of decision_function
# For the "all at once" method see http://www.jmlr.org/papers/volume2/crammer01a/crammer01a.pdf (in scikit-learn: LinearSVC(multi_class='crammer_singer'))
dSVM = svm.SVC(decision_function_shape='ovo')
dSVM.fit(X_train, Y_train)
y_SVM = dSVM.predict(X_test)
print('Accuracy = ', accuracy_score(Y_test, y_SVM))
y_SVM
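With three classes, one-vs-rest and one-vs-one both happen to need three binary classifiers, so iris cannot show the difference. The sketch below is an illustration on synthetic data (not from the original notebook) that uses four classes, where one-vs-rest needs 4 classifiers and one-vs-one needs 4·3/2 = 6.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

X4, y4 = make_blobs(n_samples=200, centers=4, random_state=0)

ovr = LinearSVC(max_iter=10000).fit(X4, y4)                             # one classifier per class
ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(X4, y4)   # one classifier per pair of classes

n_ovr = ovr.coef_.shape[0]                      # rows of coef_ = number of one-vs-rest classifiers
n_ovo = ovo.decision_function(X4[:1]).shape[1]  # columns = number of pairwise classifiers
print(n_ovr, n_ovo)  # 4 6
```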
Hyperparameter Optimization¶
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
file = 'data/diabetes_data.csv'
try:
    # Local Jupyter notebook, assuming "file" is in the "data" directory
    data = pd.read_csv(file, names=names).values  # convert to a numpy array
except FileNotFoundError:
    # Probably Google Colab: create the "data" folder, then download the file from GitHub
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/diabetes_data.csv
    data = pd.read_csv(file, names=names).values  # convert to a numpy array
print(data.shape)
prop_test = 0.2
X, Y = data[:,0:8], data[:,8] # Slice data
Y = [int(y) for y in Y]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=prop_test)
print(set(Y), x_train.shape, x_test.shape, sep='\n')
clf = LogisticRegression(solver='liblinear')
kNN = neighbors.KNeighborsClassifier()
gnb = GaussianNB()
dt = tree.DecisionTreeClassifier()
rf = RandomForestClassifier()
svm_ = svm.SVC()
Models = [('Logistic Regression', clf), ('k-NN',kNN), ('Naive Bayes',gnb), ('Decision Tree', dt), ('Random Forest', rf), ('SVM', svm_)]
Scores = {}
for model_name, model in Models:
    Scores[model_name] = cross_val_score(model, x_train, y_train, cv=10, scoring='accuracy')
df_scores = pd.DataFrame.from_dict(Scores)  # named df_scores so the decision tree "dt" above is not overwritten
ax = sns.boxplot(data=df_scores)
for m, s in Scores.items():
    print(m, list(s)[:4])
Hyperparameter optimization¶
- For example, k-NN and SVM.
- As an exercise, do the same for the other models.
- Preprocessing in ML is tuned per model.
- The parameters differ from model to model, and their optimal values differ from case to case.
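Since preprocessing is tuned per model, a preprocessing step can itself be treated as a hyperparameter. The sketch below is an assumption-laden illustration, not the notebook's code: it uses scikit-learn's built-in breast cancer data so that it runs on its own, and lets `GridSearchCV` choose between standardizing the features and passing them through unchanged, together with the number of neighbors of a k-NN model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# make_pipeline names each step after its class in lowercase
pipe_demo = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid_demo = {
    'standardscaler': [StandardScaler(), 'passthrough'],  # scale the features, or skip the step
    'kneighborsclassifier__n_neighbors': [3, 5, 15],
}
search_demo = GridSearchCV(pipe_demo, grid_demo, cv=5, scoring='accuracy')
search_demo.fit(X_demo, y_demo)

print(search_demo.best_score_)
print(search_demo.best_params_)
```

Keeping the scaler inside the pipeline also means it is refitted on the training part of every fold, so the validation part never influences the scaling.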
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import make_pipeline
# Hyperparameter optimization for the kNN model using GridSearchCV
kCV = 10
metric = 'accuracy'
params = {}
params['kneighborsclassifier__n_neighbors'] = [1, 3, 5, 10, 15, 20, 25, 30]
params['kneighborsclassifier__weights'] = ('distance', 'uniform')
pipe = make_pipeline(neighbors.KNeighborsClassifier())
optKnn = GridSearchCV(pipe, params, cv=kCV, scoring=metric, verbose=1, n_jobs=-2) # , pre_dispatch='2*n_jobs', pre_dispatch min 2* n_jobs
optKnn.fit(x_train, y_train)
print(optKnn.best_score_)
print(optKnn.best_params_)
# Hyperparameter optimization for the SVM model using RandomizedSearchCV
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
# The following shows how to find out which parameters can be optimized.
# Use theoretical/analytical knowledge to optimize only the most important parameters.
pipeSVM = make_pipeline(svm.SVC())
print(sorted(pipeSVM.get_params().keys()))
# Optimal SVM parameters with RandomizedSearchCV
# WARNING: this cell takes quite a long time to compute
kCV = 10
paramsSVM = {}
paramsSVM['svc__C'] = [1, 5, 10] #sp.stats.uniform(scale=100)
paramsSVM['svc__gamma'] = [0.1, 1, 10]
paramsSVM['svc__kernel'] = ['rbf', 'sigmoid', 'linear'] # , 'poly'
#paramsSVM['svc__decision_function_shape'] = ['ovo', 'ovr']
optSvm = RandomizedSearchCV(pipeSVM, paramsSVM, cv=kCV, scoring=metric, verbose=2, n_jobs=-2) # refit=True, pre_dispatch='2*n_jobs' pre_dispatch min 2* n_jobs
optSvm.fit(x_train, y_train)
print(optSvm.best_score_)
print(optSvm.best_params_)
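`RandomizedSearchCV` also accepts continuous distributions instead of fixed lists, which is what the commented-out `sp.stats.uniform` hint above points at. The following sketch is a hedged variation rather than the notebook's own search: the dataset, the log-uniform ranges, and `n_iter=10` are all assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_rs, y_rs = load_breast_cancer(return_X_y=True)
pipe_rs = make_pipeline(StandardScaler(), SVC(kernel='rbf'))

# Each candidate draws C and gamma at random from these log-uniform ranges
dist_rs = {
    'svc__C': loguniform(1e-2, 1e2),
    'svc__gamma': loguniform(1e-4, 1e0),
}
search_rs = RandomizedSearchCV(pipe_rs, dist_rs, n_iter=10, cv=5,
                               scoring='accuracy', random_state=0)
search_rs.fit(X_rs, y_rs)

print(search_rs.best_score_)
print(search_rs.best_params_)
```

The cost is controlled by `n_iter` rather than by the size of a grid, so the search budget stays fixed even when more parameters are added.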
Model Selection¶
import seaborn as sns, matplotlib.pyplot as plt
kCV = 10
# Using the optimal parameters found above (these values come from one run; check optKnn.best_params_ and optSvm.best_params_ for your own)
kNN = neighbors.KNeighborsClassifier(n_neighbors=20, weights='uniform')
svm_ = svm.SVC(kernel='linear', gamma=10, C=10)  # gamma is ignored by the linear kernel
models = ['kNN', 'SVM']
knn_score = cross_val_score(kNN, x_train, y_train, cv=kCV, scoring='accuracy', n_jobs=-2, verbose=1)
svm_score = cross_val_score(svm_, x_train, y_train, cv=kCV, scoring='accuracy', n_jobs=-2, verbose=1)
scores = [knn_score, svm_score]
data = {m:s for m,s in zip(models, scores)}
for name in data.keys():
    print("Accuracy %s: %0.2f (+/- %0.2f)" % (name, data[name].mean(), data[name].std() * 2))
sns.boxplot(data=pd.DataFrame(data), orient='h')
plt.show()
End of Module
Code Lesson SLCM-02 [Click Here]
At that link you can edit and run the code directly. Further explanation is given in the accompanying video. It is strongly recommended to open the code and the video side by side for a good learning experience. Feel free to modify and experiment with things beyond what is shown in the video to gain a deeper understanding. You are of course also encouraged to consult other references to enrich your knowledge, and then discuss them in the forum provided.