― Phil Dourado
Module/Code SLCM-03: Introduction to Classification Models III
Video SLCM-03
Code Lesson SLCM-03 [Click Here]
The code for this lesson is available at the following link (Google/Gmail login required): Code SLCM-03 [Click Here]
From that link you can edit and run the code directly. Further explanation is given in the accompanying video. It is strongly recommended to open the code and the video side by side for a better learning experience (figure below). Feel free to modify and experiment with things beyond what the video shows to deepen your understanding. You are also encouraged to consult other references to broaden your knowledge, then discuss them in the forum provided.
taudata Analytics
Supervised Learning - Classification 03
https://taudata.blogspot.com/2022/04/slcm-03.html
In [8]:
print("Detecting environment: ", end=' ')
try:
    import google.colab
    IN_COLAB = True
    print("Running the code in Google Colab. Installing and downloading dependencies.\nPlease wait...")
    !pip install --upgrade pandas
except ImportError:
    # google.colab is only importable inside Colab
    IN_COLAB = False
    print("Running the code locally.")
# Please visit https://github.com/taudataid/PINN-DCAI for further detail such as requirements.txt file.
Detecting environment: Running the code locally.
In [9]:
# Loading Modules
import warnings; warnings.simplefilter('ignore')
import pickle, time, numpy as np, seaborn as sns
import pandas as pd, matplotlib.pyplot as plt
from sklearn import svm, preprocessing
from sklearn import tree, neighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from collections import Counter
from tqdm import tqdm
sns.set(style="ticks", color_codes=True)
print(pd.__version__)
"Done"
1.3.4
Out[9]:
'Done'
In [10]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
file = 'data/diabetes_data.csv'
try:
    # Local Jupyter notebook, assuming "file" is in the "data" directory
    data = pd.read_csv(file, names=names)
except FileNotFoundError:
    # Google Colab: create the data folder, then download the file from GitHub
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/{file}
    data = pd.read_csv(file, names=names)
print(data.shape, set(data['class']))
data.sample(5)
(768, 9) {0, 1}
Out[10]:
 | preg | plas | pres | skin | test | mass | pedi | age | class |
---|---|---|---|---|---|---|---|---|---|
40 | 3 | 180 | 64 | 25 | 70 | 34.0 | 0.271 | 26 | 0 |
643 | 4 | 90 | 0 | 0 | 0 | 28.0 | 0.610 | 31 | 0 |
623 | 0 | 94 | 70 | 27 | 115 | 43.5 | 0.347 | 21 | 0 |
629 | 4 | 94 | 65 | 22 | 0 | 24.7 | 0.148 | 21 | 0 |
294 | 0 | 161 | 50 | 0 | 0 | 21.9 | 0.254 | 65 | 0 |
In [11]:
# Split Train-Test
X = data.values[:,:8] # Slice the data (note that the data structure here is a NumPy array)
Y = data.values[:,8]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=99)
print(set(Y), x_train.shape, x_test.shape, sep=', ')
{0.0, 1.0}, (614, 8), (154, 8)
Ensemble Model ¶
- What? A learning algorithm that constructs a set of classifiers and then classifies new data points by taking a (weighted) vote of their predictions.
- Why? Better prediction, More stable model
- How? Bagging & Boosting
“meta-algorithms” : Bagging & Boosting
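To make the "(weighted) vote" idea concrete, here is a minimal sketch in plain Python. The function name `weighted_vote` and the example weights are illustrative assumptions, not part of the lesson's code.

```python
from collections import defaultdict

def weighted_vote(predictions, weights=None):
    """Return the label with the largest total weight among the votes."""
    if weights is None:
        weights = [1.0] * len(predictions)  # unweighted: plain majority vote
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Three hypothetical classifiers predict the class of one sample
votes = [1, 0, 1]
majority = weighted_vote(votes)                           # class 1 gets two votes
weighted = weighted_vote(votes, weights=[0.2, 0.9, 0.3])  # class 0: 0.9, class 1: 0.5
print(majority, weighted)
```

With equal weights the majority wins. Once a single classifier is trusted more than the other two combined, its vote decides the outcome.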
Property of Boosting ¶
AdaBoost ¶
In [12]:
# Voting (Bagging) example in Python
# Note: Random Forest is also a Bagging ensemble (albeit a modified one)
# Best practice: every model in the ensemble should use its optimal parameters
kNN = neighbors.KNeighborsClassifier(3)
kNN.fit(x_train, y_train)
Y_kNN = kNN.score(x_test, y_test)
DT = tree.DecisionTreeClassifier(random_state=1)
DT.fit(x_train, y_train)
Y_DT = DT.score(x_test, y_test)
model = VotingClassifier(estimators=[('k-NN', kNN), ('Decision Tree', DT)], voting='hard')
model.fit(x_train,y_train)
Y_Vot = model.score(x_test,y_test)
print('k-NN accuracy', Y_kNN)
print('Decision Tree accuracy', Y_DT)
print('Voting accuracy', Y_Vot)
k-NN accuracy 0.7142857142857143
Decision Tree accuracy 0.6818181818181818
Voting accuracy 0.7337662337662337
In [13]:
# Averaging can also be used for classification (not only regression),
# but here we average the predicted probability of each class
T = tree.DecisionTreeClassifier()
K = neighbors.KNeighborsClassifier()
R = LogisticRegression()
T.fit(x_train,y_train)
K.fit(x_train,y_train)
R.fit(x_train,y_train)
y_T=T.predict_proba(x_test)
y_K=K.predict_proba(x_test)
y_R=R.predict_proba(x_test)
Ave = (y_T+y_K+y_R)/3
print(Ave[:5]) # Print just first 5
prediction = [v.index(max(v)) for v in Ave.tolist()]
print(prediction[:5]) # Print just first 5
print('Averaging accuracy', accuracy_score(y_test, prediction))
[[0.86747807 0.13252193]
 [0.96569616 0.03430384]
 [0.90409317 0.09590683]
 [0.81735062 0.18264938]
 [0.97683155 0.02316845]]
[0, 0, 0, 0, 0]
Averaging accuracy 0.7402597402597403
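The manual averaging above corresponds to what scikit-learn calls soft voting. The sketch below shows the same idea with `VotingClassifier(voting="soft")`. It uses synthetic data from `make_classification` as a stand-in (an assumption, so the numbers will not match the diabetes results).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption; the lesson uses the diabetes data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

soft = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average predict_proba across models, then take the argmax
)
soft.fit(Xtr, ytr)
avg_proba = soft.predict_proba(Xte)   # averaged class probabilities
soft_accuracy = soft.score(Xte, yte)
print(avg_proba[:3], soft_accuracy)
```

Soft voting requires every estimator to implement `predict_proba`, which is also what the manual version relies on.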
In [14]:
# AdaBoost
num_trees = 100
kfold = model_selection.KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=33)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
0.7421565276828435
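For intuition about what boosting does between rounds, here is a sketch of one sample-weight update in the classic textbook (discrete, binary) AdaBoost formulation. It is an illustration only: scikit-learn's `AdaBoostClassifier` may use a different variant internally, and the helper name `adaboost_round` is hypothetical.

```python
import math

def adaboost_round(weights, is_wrong):
    """One discrete AdaBoost update; assumes 0 < weighted error < 1."""
    error = sum(w for w, wrong in zip(weights, is_wrong) if wrong)
    alpha = 0.5 * math.log((1.0 - error) / error)  # vote strength of this weak learner
    # Increase the weight of misclassified samples, decrease the rest
    updated = [w * math.exp(alpha if wrong else -alpha)
               for w, wrong in zip(weights, is_wrong)]
    total = sum(updated)
    return alpha, [w / total for w in updated]     # renormalise to sum to 1

# Five samples with equal weights; the weak learner misclassifies only the last one
start_weights = [0.2] * 5
alpha, new_weights = adaboost_round(start_weights, [False, False, False, False, True])
print(alpha, new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.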
Imbalance Data ¶
- Metric Trap
- Accuracy on a particular class matters more
- Example cases
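The metric trap can be illustrated with a few lines of plain Python. The 95/5 class split below is a made-up example, not taken from the lesson's data.

```python
# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
# A useless "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_minority = true_positives / y_true.count(1)

print(f"accuracy = {accuracy:.2f}, minority recall = {recall_minority:.2f}")
```

The accuracy looks excellent even though not a single minority sample is detected, which is why per-class precision and recall matter on imbalanced data.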
Imbalance Learning ¶
- Undersampling, Oversampling, Model Based (weight adjustment)
- https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
- Comparison plot: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html#sphx-glr-auto-examples-combine-plot-comparison-combine-py
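As a rough sketch of oversampling, the helper below duplicates randomly chosen minority samples until every class is as large as the biggest one. The function `random_oversample` is a hypothetical illustration written with the standard library only. In practice a library such as imbalanced-learn would normally be used, and resampling should be applied to the training split only.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate random samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels

# Toy data: 8 samples of class 0 and 2 samples of class 1
toy_rows = [[i] for i in range(10)]
toy_labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_oversample(toy_rows, toy_labels)
print(Counter(new_labels))
```

Undersampling is the mirror image: randomly drop majority samples until the classes match, at the cost of discarding data.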
In [15]:
Counter(Y)
Out[15]:
Counter({1.0: 268, 0.0: 500})
In [16]:
# Baseline: fit a linear SVM without class weights
svm_ = svm.SVC(kernel='linear')
svm_.fit(x_train, y_train)
y_SVMib = svm_.predict(x_test)
print(confusion_matrix(y_test, y_SVMib))
print(classification_report(y_test, y_SVMib))
[[93 12]
 [19 30]]
              precision    recall  f1-score   support

         0.0       0.83      0.89      0.86       105
         1.0       0.71      0.61      0.66        49

    accuracy                           0.80       154
   macro avg       0.77      0.75      0.76       154
weighted avg       0.79      0.80      0.79       154
In [17]:
# fit the model and get the separating hyperplane using weighted classes
# x_train, x_test, y_train, y_test
svm_balanced = svm.SVC(kernel='linear', class_weight={1: 3}) #WEIGHTED SVM
svm_balanced.fit(x_train, y_train)
y_SVMb = svm_balanced.predict(x_test)
print(confusion_matrix(y_test, y_SVMb))
print(classification_report(y_test, y_SVMb))
[[67 38]
 [ 7 42]]
              precision    recall  f1-score   support

         0.0       0.91      0.64      0.75       105
         1.0       0.53      0.86      0.65        49

    accuracy                           0.71       154
   macro avg       0.72      0.75      0.70       154
weighted avg       0.78      0.71      0.72       154
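The trade-off between the two SVMs can be read directly from their confusion matrices. The sketch below recomputes precision and recall for class 1 from the two matrices printed above, assuming scikit-learn's layout (rows are true classes, columns are predicted classes). The helper name `precision_recall` is illustrative.

```python
def precision_recall(cm, positive=1):
    """Precision and recall of one class from a confusion matrix (rows = true, cols = predicted)."""
    tp = cm[positive][positive]
    fp = sum(row[positive] for row in cm) - tp  # predicted positive but actually another class
    fn = sum(cm[positive]) - tp                 # actually positive but predicted otherwise
    return tp / (tp + fp), tp / (tp + fn)

unweighted_cm = [[93, 12], [19, 30]]  # plain linear SVM
weighted_cm = [[67, 38], [7, 42]]     # SVM with class_weight={1: 3}

p_u, r_u = precision_recall(unweighted_cm)
p_w, r_w = precision_recall(weighted_cm)
print(f"unweighted: precision={p_u:.3f} recall={r_u:.3f}")
print(f"weighted:   precision={p_w:.3f} recall={r_w:.3f}")
```

Weighting class 1 more heavily catches more positives (recall goes up) at the cost of more false alarms (precision and overall accuracy go down). Whether that is a good trade depends on which error is more costly in the application.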
In [18]:
# Example of model-based imbalance treatment - SVM
from sklearn.datasets import make_blobs
n_samples_1, n_samples_2 = 1000, 100
centers = [[0.0, 0.0], [2.0, 2.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(n_samples=[n_samples_1, n_samples_2],centers=centers,cluster_std=clusters_std,random_state=33, shuffle=False)
# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)
# fit the model and get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10}) #WEIGHTED SVM
wclf.fit(X, y)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')# plot the samples
ax = plt.gca()# plot the decision functions for both classifiers
xlim = ax.get_xlim(); ylim = ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)# create grid to evaluate model
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T
Z = clf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane
a = ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5, linestyles=['-']) # plot decision boundary and margins
Z = wclf.decision_function(xy).reshape(XX.shape)# get the separating hyperplane for weighted classes
b = ax.contour(XX, YY, Z, colors='r', levels=[0], alpha=0.5, linestyles=['-'])# plot decision boundary and margins for weighted classes
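# Use proxy handles for the legend: ContourSet.collections is deprecated since Matplotlib 3.8
from matplotlib.lines import Line2D
plt.legend([Line2D([], [], color='k', alpha=0.5), Line2D([], [], color='r', alpha=0.5)], ["non weighted", "weighted"], loc="upper right")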
plt.show()
Weighted Decision Tree ¶
In [19]:
T = tree.DecisionTreeClassifier(random_state = 33)
T.fit(x_train,y_train)
y_DT = T.predict(x_test)
print('Accuracy (plain Decision Tree) = ', accuracy_score(y_test, y_DT))
print(classification_report(y_test, y_DT))
T = tree.DecisionTreeClassifier(class_weight = 'balanced', random_state = 33)
T.fit(x_train, y_train)
y_DT = T.predict(x_test)
print('Accuracy (Weighted Decision Tree) = ', accuracy_score(y_test, y_DT))
print(classification_report(y_test, y_DT))
Accuracy (plain Decision Tree) =  0.6883116883116883
              precision    recall  f1-score   support

         0.0       0.79      0.73      0.76       105
         1.0       0.51      0.59      0.55        49

    accuracy                           0.69       154
   macro avg       0.65      0.66      0.65       154
weighted avg       0.70      0.69      0.69       154

Accuracy (Weighted Decision Tree) =  0.7207792207792207
              precision    recall  f1-score   support

         0.0       0.83      0.74      0.78       105
         1.0       0.55      0.67      0.61        49

    accuracy                           0.72       154
   macro avg       0.69      0.71      0.69       154
weighted avg       0.74      0.72      0.73       154
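For reference, `class_weight='balanced'` in scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the full-data class counts shown earlier (500 versus 268). This is only an illustration: the tree above is fitted on the training split, whose exact counts are not shown, so its actual weights will differ slightly.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weights as used by class_weight='balanced': n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts of the full diabetes data, as reported by Counter(Y)
all_labels = [0.0] * 500 + [1.0] * 268
balanced = balanced_class_weights(all_labels)
print(balanced)
```

The minority class gets a weight above 1 and the majority class a weight below 1, so both classes contribute equally to the training criterion in total.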
Case Study (Exercise) ENB2012: Predicting Building Energy Use ¶
Task ¶
- Filter the EcoTest data and keep only the rows whose target category (heat-cat) appears at least 10 times
- Perform EDA (preprocessing and basic visualization)
- Determine the best model (with optimal parameters and cross-validation)
- Note that Naive Bayes, Decision Tree, and Random Forest do not require one-hot encoding.
- Use the micro F1-score metric to choose the best model.
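The filtering and model-selection steps could be approached along the following lines. This is a sketch under stated assumptions, not the exercise answer: the toy DataFrame, its feature columns, and the parameter grid are made up, and only the column name `heat-cat` and the micro F1 scoring come from the task description.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy frame standing in for the ENB2012 data
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "heat-cat": [10] * 90 + [11] * 70 + [12] * 35 + [99] * 5,
})

# Keep only the rows whose target category appears at least 10 times
counts = toy["heat-cat"].value_counts()
frequent = counts[counts >= 10].index
filtered = toy[toy["heat-cat"].isin(frequent)]

# Tune one model with cross-validation, scored by micro-averaged F1
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},  # illustrative grid
    scoring="f1_micro",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(filtered[["feature_a", "feature_b"]], filtered["heat-cat"])
print(len(filtered), search.best_params_, search.best_score_)
```

Repeating the search for each candidate model and comparing `best_score_` gives a like-for-like comparison, since every model is scored with the same metric and folds.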
Optional ¶
- Compare the best model above with an ensemble model.
- Is there an imbalance problem? If so, try to address it with over/under sampling.
In [20]:
file_ = "data/building-energy-efficiency-ENB2012_data.csv"
try: # Running locally; make sure "file_" is in the "data" folder
    # on_bad_lines='skip' replaces the deprecated error_bad_lines=False (pandas >= 1.3)
    data = pd.read_csv(file_, on_bad_lines='skip', low_memory = False, encoding='utf8')
except FileNotFoundError: # Running in Google Colab
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/{file_}
    data = pd.read_csv(file_, on_bad_lines='skip', low_memory = False, encoding='utf8')
print(data.shape)
data.sample(5)
(768, 12)
Out[20]:
 | compactness | surface-area | wall-area | roof-area | overall-height | orientation | glazing-area | glazing-dist | heating-load | cooling-load | heat-cat | cool-cat |
---|---|---|---|---|---|---|---|---|---|---|---|---|
375 | 0.66 | 759.5 | 318.5 | 220.5 | 3.5 | 5 | 0.25 | 2 | 13.00 | 15.87 | 13 | 15 |
636 | 0.82 | 612.5 | 318.5 | 147.0 | 7.0 | 2 | 0.40 | 3 | 28.67 | 32.43 | 28 | 32 |
201 | 0.86 | 588.0 | 294.0 | 147.0 | 7.0 | 3 | 0.10 | 4 | 25.37 | 31.76 | 25 | 31 |
542 | 0.82 | 612.5 | 318.5 | 147.0 | 7.0 | 4 | 0.40 | 1 | 29.53 | 28.99 | 29 | 28 |
506 | 0.74 | 686.0 | 245.0 | 220.5 | 3.5 | 4 | 0.25 | 5 | 11.64 | 14.81 | 11 | 14 |
In [21]:
# Exercise answers start in this cell
End of Module SLCM-03 ¶