Video EDA-03
Code Lesson EDA-03 [Click Here]
The code for this lesson can be accessed at the following link (login to a Google/Gmail account required): Code EDA-03 [Click Here]
At that link you can edit the code directly and run it. Further details are in the accompanying video. It is highly recommended to open the code and the video side by side for a better learning experience (see the image below). Please also modify the code and experiment beyond what is shown in the video to deepen your understanding. Of course, feel free to consult other references to enrich your knowledge, then discuss them in the forum provided.
Module/Code EDA-03: Introduction to Clustering
taudata Analytics
EDA-03: Unsupervised Learning - Introduction to Clustering
https://taudata.blogspot.com/
https://www.youtube.com/c/taudataAnalytics
# Run this cell ONLY if this notebook is run from Google Colab
# If you run it locally (Anaconda/WinPython), install the modules from a terminal/command prompt,
# then download the required files manually and place them in your Python working folder.
import warnings; warnings.simplefilter('ignore')
try:
    import google.colab; IN_COLAB = True
    !pip install umap-learn
    !wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/tau_unsup.py
except:
    IN_COLAB = False
    print("Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded")
Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded
# Import the modules used in this notebook
import umap, numpy as np, tau_unsup as tau, matplotlib.pyplot as plt, pandas as pd, seaborn as sns
from sklearn import cluster, datasets
from sklearn.metrics import silhouette_score as siluet
from sklearn.metrics.cluster import homogeneity_score as purity
from sklearn.metrics import normalized_mutual_info_score as NMI
sns.set(style="ticks", color_codes=True)
random_state = 99
Definition
Clustering is a process of finding group structures within data such that each instance within a group is similar to one another and dissimilar to instances in other groups [1]
[1] Jain, A.K., Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 2010. 31(8): p. 651-666.
Applications
Clustering analysis applications can be divided into two broad categories:
- clustering for utility (e.g., data compression and indexing) and
- clustering for understanding data (e.g., finding latent structures or insights in the data)
Methods developed in Data Mining fall into the second category.
[2] Pang-Ning, T., M. Steinbach, and V. Kumar, Introduction to data mining. Vol. 74. 2006, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
Real-world Clustering Applications
- Recommendation engines
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection
k-Means
Image source: https://imgflip.com/
The k-Means Algorithm
Important:
- What effect do the use of centroids and this algorithm have on the shape of the clusters?
- From the previous question, understand the bias introduced by the choice of clustering algorithm.
- k-Means is not robust to outliers. Why?
- So what should be done instead? (see the sketch below)
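Because a centroid is an arithmetic mean, a single extreme point can drag it away from the bulk of its cluster. The cell below is a minimal sketch of this effect on synthetic data (a hypothetical example, not one of the lesson's datasets; it reuses the numpy and sklearn imports from the setup cell). Common remedies include removing outliers first or using a median-based method such as k-medoids.
# Sketch (synthetic data): one extreme point drags a k-means centroid.
rng = np.random.default_rng(99)
X_clean = rng.normal(loc=0.0, scale=0.5, size=(50, 2))   # one tight, well-behaved cluster
X_dirty = np.vstack([X_clean, [[10.0, 10.0]]])           # the same cluster plus a single outlier

c_clean = cluster.KMeans(n_clusters=1, n_init=10, random_state=99).fit(X_clean).cluster_centers_[0]
c_dirty = cluster.KMeans(n_clusters=1, n_init=10, random_state=99).fit(X_dirty).cluster_centers_[0]
print("centroid without outlier:", c_clean)
print("centroid with outlier   :", c_dirty)  # pulled noticeably toward (10, 10)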
Challenges in Clustering
- Computational Complexity
- Evaluation
- Interpretation
- Heavily depends on domain knowledge
# We will use two datasets: [1] Iris and [2] a customer-segmentation case study dataset - at the end
# Load the iris data
df = sns.load_dataset("iris")
X = df[['sepal_length','sepal_width','petal_length','petal_width']].values
C = df['species'].values
print(X.shape)
df.sample(7)
(150, 4)
|     | sepal_length | sepal_width | petal_length | petal_width | species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 114 | 5.8          | 2.8         | 5.1          | 2.4         | virginica  |
| 62  | 6.0          | 2.2         | 4.0          | 1.0         | versicolor |
| 33  | 5.5          | 4.2         | 1.4          | 0.2         | setosa     |
| 107 | 7.3          | 2.9         | 6.3          | 1.8         | virginica  |
| 7   | 5.0          | 3.4         | 1.5          | 0.2         | setosa     |
| 100 | 6.3          | 3.3         | 6.0          | 2.5         | virginica  |
| 40  | 5.0          | 3.5         | 1.3          | 0.3         | setosa     |
g = sns.pairplot(df, hue="species")
# k-means: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
# Remove the random_state argument if you want to see the effect of randomized initial centroids.
k = 3
km = cluster.KMeans(n_clusters=k, init='random', max_iter=300, tol=0.0001,
random_state = 99)
km.fit(X)
# The clustering results
C_km = km.predict(X)
p = sns.countplot(C_km)
C_km
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2])
X2D = umap.UMAP(n_neighbors=5, min_dist=0.3, random_state=random_state).fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_km)
plt.show()
How do these labels differ from classification "labels"?
C_km
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2])
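One way to see the difference: the integers k-means assigns are arbitrary cluster IDs with no inherent meaning (another random seed can permute them), whereas classification labels are fixed target categories. A small sketch, not in the original notebook, relating the two with a cross-tabulation (it reuses C and C_km from the cells above):
# Sketch: compare the arbitrary cluster IDs with the actual species labels.
pd.crosstab(pd.Series(C, name='species'), pd.Series(C_km, name='cluster'))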
Evaluation? - Inertia: Intra-Cluster Distance
- How do we interpret it?
- It is not an error! ... Why?
- It does not yet account for an "inter-cluster distance" factor ==> later: the Silhouette Score
Image source: https://www.unioviedo.es/compnum/labs/new/kmeans.html
km.inertia_
78.85144142614602
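As a sanity check, inertia can be recomputed by hand: it is the sum of squared Euclidean distances from each point to the centroid of its assigned cluster. A short sketch, not in the original notebook (it reuses X, km, and C_km from the cells above):
# Recompute inertia manually; the result should match km.inertia_ above.
d = np.linalg.norm(X - km.cluster_centers_[C_km], axis=1)  # distance of each point to its assigned centroid
print((d ** 2).sum())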
Optimal Number of Clusters? - The Elbow Method -
- Uses inertia
- A recommendation ... not a "must" ==> So what matters more?
distortions, k1, kN = [], 2, 10
for k in range(k1, kN):
    kmeans = cluster.KMeans(n_clusters=k).fit(X)
    distortions.append(kmeans.inertia_)

plt.plot(range(k1, kN), distortions); plt.grid(True)
plt.title('Elbow curve')
Text(0.5, 1.0, 'Elbow curve')
tau.km_initializations()
[Figure: Evaluation of KMeans with k-means++ init; KMeans with random init; MiniBatchKMeans with k-means++ init; MiniBatchKMeans with random init]
k-Means++
- The original k-means starts by picking the initial centroids at random, and k-means is not "robust" to this initialization (what does that mean?).
- k-Means can therefore produce different results each time it is run!
- k-Means++ "addresses" this:
- the initial centroids are not chosen uniformly at random; instead, each new centroid is sampled with a probability that favors points far from the centroids already chosen.
- Besides being more robust, it usually needs far fewer iterations than plain k-means.
- Reference: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
Image source: https://medium.com/@phil.busko/animation-of-k-means-clustering-31a484c30ba5
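The seeding idea is short enough to sketch: the first centroid is drawn uniformly at random, and each subsequent one is drawn with probability proportional to D(x)^2, the squared distance to the nearest centroid chosen so far. The function below is a simplified illustration of this D^2 sampling (a sketch only, not scikit-learn's actual implementation, which adds local trials and other refinements):
# Sketch of D^2 sampling, the core idea of k-means++ seeding.
def kmeans_pp_init(X, k, seed=99):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]   # first centroid: uniform at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest already-chosen centroid
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # far points are more likely
    return np.array(centers)

print(kmeans_pp_init(X, 3))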
# k-means++ clustering http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
k=3
kmPP = cluster.KMeans(n_clusters=k, init='k-means++', max_iter=300, tol=0.0001, random_state = random_state)
kmPP.fit(X)
C_kmpp = kmPP.predict(X)
sns.countplot(C_kmpp)
C_kmpp[:10]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_kmpp)
plt.show()
Handling "Large Data" : Mini-Batch k-Means¶
- Referensi: *Sculley, D. (2010, April). Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web (pp. 1177-1178). ACM.
# MiniBatch k-Means
# http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
# Note: mini-batch k-means "cannot be parallelized"!
# The important parameter is batch_size ... in real applications a "minimum" of 3*k is recommended
mbkm = cluster.MiniBatchKMeans(n_clusters=k, init='random', max_iter=300, tol=0.0001, batch_size = 100, random_state = random_state)
mbkm.fit(X)
C_mbkm = mbkm.predict(X)
sns.countplot(C_mbkm)
C_mbkm[:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_mbkm)
plt.show()
Mini-Batch k-Means++
# MiniBatch k-Means++
mbkmPP = cluster.MiniBatchKMeans(n_clusters=k, init='k-means++', max_iter=300, tol=0.0001, random_state = random_state)
mbkmPP.fit(X)
C_mbkmPP = mbkmPP.predict(X)
sns.countplot(C_mbkmPP)
C_mbkmPP[:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_mbkmPP)
plt.show()
k-Means vs. Mini-Batch k-Means?
tau.sil_based_optimal_km()
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
For n_clusters = 5 The average silhouette_score is : 0.56376469026194
For n_clusters = 6 The average silhouette_score is : 0.4504666294372765
# Evaluation: internal. Example: the Silhouette Coefficient ==> warning: only suitable for k-means (centroid-based clustering)
Hasil_Clustering = [C_km, C_kmpp, C_mbkm, C_mbkmPP]
for res in Hasil_Clustering:
    print(siluet(X, res), end=', ')
# How does it work, and how do we interpret it?
0.5528190123564102, 0.5528190123564102, 0.5528190123564102, 0.5528190123564102,
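For intuition: the silhouette of a single sample i is s(i) = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its smallest mean distance to any other cluster, so s(i) ranges from -1 (misplaced) to +1 (well separated). A sketch, not in the original notebook, computing it for one sample (it reuses X and C_km from above):
# Sketch: silhouette value of one sample, s(i) = (b - a) / max(a, b).
def silhouette_one(X, labels, i):
    d = np.linalg.norm(X - X[i], axis=1)   # distances from sample i to all samples
    own = labels == labels[i]
    a = d[own].sum() / (own.sum() - 1)     # mean intra-cluster distance, excluding i itself
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

print(silhouette_one(X, C_km, 0))  # cross-check with sklearn.metrics.silhouette_samples(X, C_km)[0]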
Clustering?
There is no "ground truth" in unsupervised learning/clustering.
One of the biggest "biases" is the clustering algorithm we choose.
Important notes when evaluating clustering internally:
- There is no "correct" clustering
- What matters most is interpretability/the information obtained (non-trivial information)
- A given internal metric is only suitable for particular algorithms, so in research or professional applications do not compare two different kinds of clustering using an internal measure that is specific to one clustering method (e.g., Silhouette for k-Means).
- Kleinberg, J. M. (2003). An impossibility theorem for clustering. In Advances in neural information processing systems (pp. 463-470).
- Reference 1: http://papers.nips.cc/paper/2340-an-impossibility-theorem-for-clustering.pdf
- Reference 2: https://core.ac.uk/download/pdf/34638775.pdf
# What about external evaluation?
# "C" is the ground truth/gold standard
for res in Hasil_Clustering:
    print(purity(C, res), end=', ')
0.7514854021988338, 0.7514854021988338, 0.7514854021988339, 0.7514854021988339,
# External evaluation: NMI
for res in Hasil_Clustering:
    print(NMI(C, res), end=', ')
# Code and an explanation for the F-Score are also available in the blog post above
0.7581756800057784, 0.7581756800057784, 0.7581756800057785, 0.7581756800057785,
Please read more here: https://tau-data.id/evaluasi-eksternal/
How to Draw Conclusions from k-Means: Interpretation
kmPP.cluster_centers_
array([[5.006     , 3.428     , 1.462     , 0.246     ],
       [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [6.85      , 3.07368421, 5.74210526, 2.07105263]])
# Evaluation is actually not that important in unsupervised learning.
# This is what distinguishes "clustering" from "clustering analysis":
# what matters more is the interpretation. But how?
# Example with k-means++
cols = ['sepal_length','sepal_width','petal_length','petal_width']
dfC = pd.DataFrame(kmPP.cluster_centers_, columns=cols)
dfC['cluster'] = dfC.index
pd.plotting.parallel_coordinates(dfC, 'cluster', color=('r', 'g', 'b'))
plt.show()
k-Means Best Practices
- Beware of the scale of the data ==> normalize/standardize (see the sketch below).
- Beware of the topology (shape) assumptions k-means makes about the data.
- Nope... k-Means cannot handle categorical data.
- It is strongly discouraged for large-scale unstructured data. If the data is not large, it is enough to replace the Euclidean distance with cosine similarity (also in the sketch below).
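The cell below sketches two of these practices on the Iris data (a hypothetical pipeline, not from the lesson): standardizing features before k-means so that no feature dominates by scale, and L2-normalizing rows so that Euclidean k-means behaves like cosine-based clustering (on unit vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity).
# Sketch: (1) standardize features before k-means; (2) emulate cosine distance via L2 normalization.
from sklearn.preprocessing import StandardScaler, normalize

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
km_scaled = cluster.KMeans(n_clusters=3, n_init=10, random_state=random_state).fit(X_scaled)

X_unit = normalize(X)                          # each row rescaled to unit length
km_cosine = cluster.KMeans(n_clusters=3, n_init=10, random_state=random_state).fit(X_unit)
print(km_scaled.inertia_, km_cosine.inertia_)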
End of Module
References
- Everitt, B. S., Landau, S., & Leese, M. (1993). Cluster analysis. Edward Arnold and Halsted Press.
- Arthur, D., & Vassilvitskii, S. (2006). k-means++: The advantages of careful seeding. Stanford.
- Sculley, D. (2010, April). Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web (pp. 1177-1178).
- Jain, A.K., Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 2010. 31(8): p. 651-666.
- Pang-Ning, T., M. Steinbach, and V. Kumar, Introduction to data mining. Vol. 74. 2006, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
- Kleinberg, J. M. (2003). An impossibility theorem for clustering. In Advances in neural information processing systems (pp. 463-470).