Module/Code SLCM-03: Introduction to Classification Models III
Video SLCM-03
Code Lesson SLCM-03 [Click Here]
The code for this lesson can be accessed at the following link (a Google/Gmail login is required): Code SLCM-03 [Click Here]
At that link you can modify the code and run it directly. Further explanation is given in the accompanying video. It is highly recommended to open the code and the video side by side for the best learning experience (see the image below). Please also experiment with modifications beyond what is shown in the video to deepen your understanding. Of course, feel free to consult other references to enrich your knowledge, then discuss them in the forum provided.
taudata Analytics
Supervised Learning - Classification 03
https://taudata.blogspot.com/2022/04/slcm-03.html
In [8]:
print("Detecting environment: ", end=' ')
try:
import google.colab
IN_COLAB = True
print("Running the code in Google Colab. Installing and downloading dependencies.\nPlease wait...")
!pip install --upgrade pandas
except:
IN_COLAB = False
print("Running the code locally.")
# Please visit https://github.com/taudataid/PINN-DCAI for further detail such as requirements.txt file.
Detecting environment: Running the code locally.
In [9]:
# Loading Modules
import warnings; warnings.simplefilter('ignore')
import pickle, time, numpy as np, seaborn as sns
import pandas as pd, matplotlib.pyplot as plt
from sklearn import svm, preprocessing
from sklearn import tree, neighbors
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.model_selection import cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import VotingClassifier
from sklearn import model_selection
from collections import Counter
from tqdm import tqdm
sns.set(style="ticks", color_codes=True)
print(pd.__version__)
"Done"
1.3.4
Out[9]:
'Done'
In [10]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
file = 'data/diabetes_data.csv'
try:
    # Local jupyter notebook, assuming "file" is in the "data" directory
    data = pd.read_csv(file, names=names)
except:
    # it's a google colab... create folder data and then download the file from github
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/{file}
    data = pd.read_csv(file, names=names)
print(data.shape, set(data['class']))
data.sample(5)
(768, 9) {0, 1}
Out[10]:
|     | preg | plas | pres | skin | test | mass | pedi  | age | class |
|-----|------|------|------|------|------|------|-------|-----|-------|
| 40  | 3    | 180  | 64   | 25   | 70   | 34.0 | 0.271 | 26  | 0     |
| 643 | 4    | 90   | 0    | 0    | 0    | 28.0 | 0.610 | 31  | 0     |
| 623 | 0    | 94   | 70   | 27   | 115  | 43.5 | 0.347 | 21  | 0     |
| 629 | 4    | 94   | 65   | 22   | 0    | 24.7 | 0.148 | 21  | 0     |
| 294 | 0    | 161  | 50   | 0    | 0    | 21.9 | 0.254 | 65  | 0     |
In [11]:
# Split Train-Test
X = data.values[:,:8] # Slice the data (note that the data structure here is a NumPy array)
Y = data.values[:,8]
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=99)
print(set(Y), x_train.shape, x_test.shape, sep=', ')
{0.0, 1.0}, (614, 8), (154, 8)
Ensemble Model
- What? Learning algorithms that construct a set of classifiers and then classify new data points by taking a (weighted) vote of their predictions.
- Why? Better predictions and a more stable model.
- How? Bagging & Boosting (a bagging sketch follows below).
“meta-algorithms”: Bagging & Boosting
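To make the bagging idea concrete before the voting and AdaBoost examples, here is a minimal sketch using scikit-learn's BaggingClassifier on the same diabetes split; the base estimator and the number of estimators are assumptions for illustration, not tuned values.

# Minimal bagging sketch (illustrative parameters): train several decision trees
# on bootstrap samples of x_train, then aggregate their votes on x_test.
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(
    tree.DecisionTreeClassifier(random_state=1),  # base estimator (assumption)
    n_estimators=50,    # number of bootstrap replicates (assumption)
    bootstrap=True,     # sample the training set with replacement
    random_state=33)
bag.fit(x_train, y_train)
print('Bagging (Decision Tree) accuracy:', bag.score(x_test, y_test))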
Property of Boosting
AdaBoost
In [12]:
# Example of Voting (Bagging) in Python
# Note: Random Forest is also a bagging ensemble (although modified)
# Best practice: every model in the ensemble should use its optimal parameters
kNN = neighbors.KNeighborsClassifier(3)
kNN.fit(x_train, y_train)
Y_kNN = kNN.score(x_test, y_test)
DT = tree.DecisionTreeClassifier(random_state=1)
DT.fit(x_train, y_train)
Y_DT = DT.score(x_test, y_test)
model = VotingClassifier(estimators=[('k-NN', kNN), ('Decision Tree', DT)], voting='hard')
model.fit(x_train,y_train)
Y_Vot = model.score(x_test,y_test)
print('k-NN accuracy', Y_kNN)
print('Decision Tree accuracy', Y_DT)
print('Voting accuracy', Y_Vot)
k-NN accuracy 0.7142857142857143
Decision Tree accuracy 0.6818181818181818
Voting accuracy 0.7337662337662337
In [13]:
# Averaging can also be used for classification (not only regression),
# but then we average the predicted probability of each class
# (a soft-voting sketch follows after this cell)
T = tree.DecisionTreeClassifier()
K = neighbors.KNeighborsClassifier()
R = LogisticRegression()
T.fit(x_train,y_train)
K.fit(x_train,y_train)
R.fit(x_train,y_train)
y_T=T.predict_proba(x_test)
y_K=K.predict_proba(x_test)
y_R=R.predict_proba(x_test)
Ave = (y_T+y_K+y_R)/3
print(Ave[:5]) # Print just first 5
prediction = [v.index(max(v)) for v in Ave.tolist()]
print(prediction[:5]) # Print just first 5
print('Averaging accuracy', accuracy_score(y_test, prediction))
[[0.86747807 0.13252193]
 [0.96569616 0.03430384]
 [0.90409317 0.09590683]
 [0.81735062 0.18264938]
 [0.97683155 0.02316845]]
[0, 0, 0, 0, 0]
Averaging accuracy 0.7402597402597403
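The same averaging can be written more compactly with VotingClassifier(voting='soft'), which averages predict_proba across the estimators; the sketch below reuses the same three default-parameter models and should behave comparably.

# Soft-voting sketch: averages class probabilities across estimators,
# mirroring the manual averaging above.
soft = VotingClassifier(
    estimators=[('DT', tree.DecisionTreeClassifier()),
                ('kNN', neighbors.KNeighborsClassifier()),
                ('LogReg', LogisticRegression())],
    voting='soft')
soft.fit(x_train, y_train)
print('Soft-voting accuracy', soft.score(x_test, y_test))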
In [14]:
# AdaBoost
num_trees = 100
kfold = model_selection.KFold(n_splits=10)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=33)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
0.7421565276828435
Imbalanced Data
- Metric Trap
- The accuracy of certain classes matters more than that of others
- Example case (a metric-trap sketch follows below)
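As a sketch of the metric trap (using scikit-learn's DummyClassifier, which is not used elsewhere in this lesson), a "model" that always predicts the majority class already scores roughly 68% accuracy on this test split (105 of the 154 test rows are class 0), while never detecting class 1:

# Metric-trap sketch: always predict the majority class (0).
# Accuracy looks acceptable, but recall for class 1 is zero.
from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(x_train, y_train)
print('Majority-class accuracy:', dummy.score(x_test, y_test))
print(classification_report(y_test, dummy.predict(x_test)))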
Imbalanced Learning
- Undersampling, oversampling, model-based (weight adjustment); a simple resampling sketch follows after the links below
- https://www.kaggle.com/rafjaa/resampling-strategies-for-imbalanced-datasets
- Plot perbandingan: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/combine/plot_comparison_combine.html#sphx-glr-auto-examples-combine-plot-comparison-combine-py
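Below is a minimal random-oversampling sketch using only sklearn.utils.resample; imbalanced-learn (linked above) provides more principled tools such as RandomOverSampler and SMOTE. The resampling is applied to the training split only, and the choice to oversample rather than undersample here is an assumption for illustration.

# Random oversampling sketch: duplicate minority-class rows in the TRAINING set
# until both classes are equally represented, then refit any model on x_bal, y_bal.
from sklearn.utils import resample

train_df = pd.DataFrame(x_train)
train_df['class'] = y_train
majority = train_df[train_df['class'] == 0]
minority = train_df[train_df['class'] == 1]

minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=33)
balanced = pd.concat([majority, minority_up])

x_bal = balanced.drop(columns='class').values
y_bal = balanced['class'].values
print(Counter(y_bal))  # both classes now have the same count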
In [15]:
Counter(Y)
Out[15]:
Counter({1.0: 268, 0.0: 500})
In [16]:
# fit the model and get the separating hyperplane (unweighted SVM, as a baseline)
svm_ = svm.SVC(kernel='linear')
svm_.fit(x_train, y_train)
y_SVMib = svm_.predict(x_test)
print(confusion_matrix(y_test, y_SVMib))
print(classification_report(y_test, y_SVMib))
[[93 12]
 [19 30]]
              precision    recall  f1-score   support

         0.0       0.83      0.89      0.86       105
         1.0       0.71      0.61      0.66        49

    accuracy                           0.80       154
   macro avg       0.77      0.75      0.76       154
weighted avg       0.79      0.80      0.79       154
In [17]:
# fit the model and get the separating hyperplane using weighted classes
# x_train, x_test, y_train, y_test
svm_balanced = svm.SVC(kernel='linear', class_weight={1: 3}) #WEIGHTED SVM
svm_balanced.fit(x_train, y_train)
y_SVMb = svm_balanced.predict(x_test)
print(confusion_matrix(y_test, y_SVMb))
print(classification_report(y_test, y_SVMb))
[[67 38]
 [ 7 42]]
              precision    recall  f1-score   support

         0.0       0.91      0.64      0.75       105
         1.0       0.53      0.86      0.65        49

    accuracy                           0.71       154
   macro avg       0.72      0.75      0.70       154
weighted avg       0.78      0.71      0.72       154
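The weight {1: 3} above was chosen by hand. A hedged sketch of tuning it with the already-imported GridSearchCV follows; the candidate weights and the macro-F1 scoring are assumptions.

# Sketch: search over candidate minority-class weights, scored by macro F1
# so class 1 is not drowned out by the majority class.
param_grid = {'class_weight': [{1: w} for w in (1, 2, 3, 4, 5)]}
gs = GridSearchCV(svm.SVC(kernel='linear'), param_grid, scoring='f1_macro', cv=5)
gs.fit(x_train, y_train)
print('Best weight:', gs.best_params_, '| CV macro-F1:', gs.best_score_)
print(classification_report(y_test, gs.predict(x_test)))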
In [18]:
# Example of model-based imbalance treatment - SVM
from sklearn.datasets import make_blobs

n_samples_1, n_samples_2 = 1000, 100
centers = [[0.0, 0.0], [2.0, 2.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(n_samples=[n_samples_1, n_samples_2], centers=centers,
                  cluster_std=clusters_std, random_state=33, shuffle=False)

# fit the model and get the separating hyperplane
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X, y)

# fit the model and get the separating hyperplane using weighted classes
wclf = svm.SVC(kernel='linear', class_weight={1: 10})  # WEIGHTED SVM
wclf.fit(X, y)

# plot the samples
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired, edgecolors='k')

# plot the decision functions for both classifiers
ax = plt.gca()
xlim = ax.get_xlim(); ylim = ax.get_ylim()

# create a grid to evaluate the models
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
xy = np.vstack([XX.ravel(), YY.ravel()]).T

# decision boundary of the unweighted SVM
Z = clf.decision_function(xy).reshape(XX.shape)
a = ax.contour(XX, YY, Z, colors='k', levels=[0], alpha=0.5, linestyles=['-'])

# decision boundary of the weighted SVM
Z = wclf.decision_function(xy).reshape(XX.shape)
b = ax.contour(XX, YY, Z, colors='r', levels=[0], alpha=0.5, linestyles=['-'])

plt.legend([a.collections[0], b.collections[0]], ["non weighted", "weighted"], loc="upper right")
plt.show()
Weighted Decision Tree
In [19]:
T = tree.DecisionTreeClassifier(random_state = 33)
T.fit(x_train,y_train)
y_DT = T.predict(x_test)
print('Accuracy (plain Decision Tree) = ', accuracy_score(y_test, y_DT))
print(classification_report(y_test, y_DT))
T = tree.DecisionTreeClassifier(class_weight = 'balanced', random_state = 33)
T.fit(x_train, y_train)
y_DT = T.predict(x_test)
print('Accuracy (weighted Decision Tree) = ', accuracy_score(y_test, y_DT))
print(classification_report(y_test, y_DT))
Accuracy (plain Decision Tree) =  0.6883116883116883
              precision    recall  f1-score   support

         0.0       0.79      0.73      0.76       105
         1.0       0.51      0.59      0.55        49

    accuracy                           0.69       154
   macro avg       0.65      0.66      0.65       154
weighted avg       0.70      0.69      0.69       154

Accuracy (weighted Decision Tree) =  0.7207792207792207
              precision    recall  f1-score   support

         0.0       0.83      0.74      0.78       105
         1.0       0.55      0.67      0.61        49

    accuracy                           0.72       154
   macro avg       0.69      0.71      0.69       154
weighted avg       0.74      0.72      0.73       154
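For reference, class_weight='balanced' assigns each class a weight of n_samples / (n_classes * count(class)); the small sketch below uses scikit-learn's compute_class_weight to show what those weights are on this training split.

# Sketch: the per-class weights implied by class_weight='balanced' on y_train.
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
print(dict(zip(classes, weights)))  # the minority class (1) gets the larger weight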
Case Study (Exercise) ENB2012: Predicting Building Energy Use
Task
- Filter the EcoTest data and keep only rows whose target-variable category (heat-cat) appears at least 10 times
- Perform EDA (preprocessing and basic visualization)
- Determine the best model (with optimal parameters and cross-validation)
- Be careful: Naive Bayes, Decision Tree, and Random Forest do not require one-hot encoding.
- Use the micro F1-score metric to decide the best model.
Optional
- Try comparing the best model above with an ensemble model.
- Is there an imbalance problem? Try to address it with over-/under-sampling.
In [20]:
file_ = "data/building-energy-efficiency-ENB2012_data.csv"
try:    # Running locally; make sure "file_" is in the "data" folder
    data = pd.read_csv(file_, error_bad_lines=False, low_memory=False, encoding='utf8')
except: # Running in Google Colab
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/{file_}
    data = pd.read_csv(file_, error_bad_lines=False, low_memory=False, encoding='utf8')
print(data.shape)
data.sample(5)
(768, 12)
Out[20]:
|     | compactness | surface-area | wall-area | roof-area | overall-height | orientation | glazing-area | glazing-dist | heating-load | cooling-load | heat-cat | cool-cat |
|-----|-------------|--------------|-----------|-----------|----------------|-------------|--------------|--------------|--------------|--------------|----------|----------|
| 375 | 0.66 | 759.5 | 318.5 | 220.5 | 3.5 | 5 | 0.25 | 2 | 13.00 | 15.87 | 13 | 15 |
| 636 | 0.82 | 612.5 | 318.5 | 147.0 | 7.0 | 2 | 0.40 | 3 | 28.67 | 32.43 | 28 | 32 |
| 201 | 0.86 | 588.0 | 294.0 | 147.0 | 7.0 | 3 | 0.10 | 4 | 25.37 | 31.76 | 25 | 31 |
| 542 | 0.82 | 612.5 | 318.5 | 147.0 | 7.0 | 4 | 0.40 | 1 | 29.53 | 28.99 | 29 | 28 |
| 506 | 0.74 | 686.0 | 245.0 | 220.5 | 3.5 | 4 | 0.25 | 5 | 11.64 | 14.81 | 11 | 14 |
In [21]:
# Exercise answers start in this cell
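A hedged starting point (not the full solution): the sketch below assumes 'heat-cat' is the target, drops the load columns and 'cool-cat' from the features, and evaluates a single candidate model with micro-F1 cross-validation; all of these choices are assumptions you should revisit in your own analysis.

# Starter sketch for the exercise (assumed target and feature choices).
counts = data['heat-cat'].value_counts()
keep = counts[counts >= 10].index                      # categories appearing >= 10 times
df = data[data['heat-cat'].isin(keep)]

X_enb = df.drop(columns=['heating-load', 'cooling-load', 'heat-cat', 'cool-cat'])
y_enb = df['heat-cat']

rf = RandomForestClassifier(random_state=33)           # one candidate model (assumption)
scores = cross_val_score(rf, X_enb, y_enb, cv=5, scoring='f1_micro')
print('Micro F1 (5-fold CV):', scores.mean())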
End of Module SLCM-03