Module/Code EDA-03: Introduction to Clustering
taudata Analytics
EDA-03: Unsupervised Learning - Introduction to Clustering
# Run this cell ONLY if this notebook runs on Google Colab.
# If you run it locally (Anaconda/WinPython), install the packages from a terminal/command prompt,
# then manually download the required files and place them in your Python folder.
import warnings; warnings.simplefilter('ignore')
try:
    import google.colab; IN_COLAB = True
    !pip install umap-learn
    !wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/tau_unsup.py
except ImportError:
    # google.colab is not importable, so we are running locally
    IN_COLAB = False
    print("Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded")
Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded
# Import the modules used in this notebook
import umap, numpy as np, tau_unsup as tau, matplotlib.pyplot as plt, pandas as pd, seaborn as sns
from sklearn import cluster, datasets
from sklearn.metrics import silhouette_score as siluet
from sklearn.metrics.cluster import homogeneity_score as purity
from sklearn.metrics import normalized_mutual_info_score as NMI
sns.set(style="ticks", color_codes=True)
random_state = 99
Clustering is the process of finding group structures within data such that instances within a group are similar to one another and dissimilar to instances in other groups [1].
[1]. Jain, A.K., Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 2010. 31(8): p. 651-666.
Clustering analysis applications can be divided into two broad categories:
- clustering for utility (e.g., data compression and indexing) and
- clustering for understanding data (e.g., finding latent structures or insights in the data)
Methods developed in Data Mining fall into the second category.
[2]. Tan, P.-N., M. Steinbach, and V. Kumar, Introduction to data mining. Vol. 74. 2006, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
Real-World Clustering Applications
- Recommendation engines
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection
The k-Means Algorithm
- How do the use of centroids and this algorithm affect the shape of the clusters?
- Building on the previous question, understand the bias introduced by the choice of clustering algorithm.
- k-Means is not robust to outliers. Why?
- So what should be done about it?
Clustering Challenges
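The questions above are easier to reason about with the algorithm in front of you. The following is a minimal NumPy sketch of the standard Lloyd iteration (assign each point to the nearest centroid, then recompute each centroid as a mean); it is an illustration, not scikit-learn's implementation, and the names `lloyd_kmeans`, `X_demo`, `mean_shift` and `median_shift` are made up for this example. The last lines hint at the outlier question: a mean moves a lot when one extreme value is added, a median barely moves.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iterations (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # start from k distinct data points chosen uniformly at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # each centroid becomes the mean of its members (kept as-is if empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# two tight, well-separated synthetic blobs (assumed toy data)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(50, 2)),
])
labels, centroids = lloyd_kmeans(X_demo, k=2)

# why outliers matter: the mean is pulled by one extreme value, the median is not
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(values, 100.0)
mean_shift = with_outlier.mean() - values.mean()
median_shift = np.median(with_outlier) - np.median(values)
print(mean_shift, median_shift)
```

Because every point is assigned by Euclidean distance to a mean, the partition boundaries are straight lines (hyperplanes) between centroids, which is why the method favours compact, roughly convex clusters.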
- Computational Complexity
- Evaluation
- Interpretation
- Heavily depends on domain knowledge
# We will use 2 datasets: [1] Iris and [2] the case-study data (customer segmentation), at the end
# load the iris data
df = sns.load_dataset("iris")
X = df[['sepal_length','sepal_width','petal_length','petal_width']].values
C = df['species'].values
(150, 4)
 | sepal_length | sepal_width | petal_length | petal_width | species |
114 | 5.8 | 2.8 | 5.1 | 2.4 | virginica |
62 | 6.0 | 2.2 | 4.0 | 1.0 | versicolor |
33 | 5.5 | 4.2 | 1.4 | 0.2 | setosa |
107 | 7.3 | 2.9 | 6.3 | 1.8 | virginica |
7 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
100 | 6.3 | 3.3 | 6.0 | 2.5 | virginica |
40 | 5.0 | 3.5 | 1.3 | 0.3 | setosa |
g = sns.pairplot(df, hue="species")
# k-means: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
# Remove "random_state = random_state" to see the effect of randomized centroids.
k = 3
km = cluster.KMeans(n_clusters=k, init='random', max_iter=300, tol=0.0001,
                    random_state = random_state)
km.fit(X)  # the model must be fitted before predict() can be called
# The clustering result
C_km = km.predict(X)
p = sns.countplot(x=C_km)  # number of members per cluster
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2])
X2D = umap.UMAP(n_neighbors=5, min_dist=0.3, random_state=random_state).fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_km)
How do these labels differ from classification "labels"?
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2])
Evaluation? - Inertia: Intra-Cluster Distance
- How should it be interpreted?
- It is not an error! ... Why?
- It does not yet include an "inter-cluster distance" component ==> see the Silhouette Score later
image source: https://www.unioviedo.es/compnum/labs/new/kmeans.html
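To make inertia concrete, here is a small sketch that recomputes it by hand as the sum of squared distances between each point and the centroid of its assigned cluster, then compares it with the `inertia_` attribute of a fitted scikit-learn model. It loads Iris through `sklearn.datasets.load_iris` rather than seaborn, purely so the snippet stands on its own; `X_iris`, `model` and `manual_inertia` are names chosen for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X_iris = load_iris().data  # same 150 x 4 measurements as the iris data used here
model = KMeans(n_clusters=3, n_init=10, random_state=99).fit(X_iris)

# centroid assigned to each sample, shape (150, 4)
assigned_centroids = model.cluster_centers_[model.labels_]

# inertia = sum over samples of the squared distance to the assigned centroid
manual_inertia = ((X_iris - assigned_centroids) ** 2).sum()

print(manual_inertia, model.inertia_)
```

The value only measures how tight the clusters are, and it always decreases as more clusters are added, which is one reason it should not be read as an error rate.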
Optimal Number of Clusters? - Elbow Method -
- Uses inertia
- A recommendation ... not a "requirement" ==> so what matters more?
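distortions, k1, kN = [], 2, 10
for n_clusters in range(k1, kN):
    # a separate loop variable keeps k (= 3), set earlier, from being overwritten
    kmeans = cluster.KMeans(n_clusters=n_clusters).fit(X)
    distortions.append(kmeans.inertia_)  # inertia for this number of clusters
#fig = plt.figure(figsize=(15, 5))
plt.plot(range(k1, kN), distortions); plt.grid(True)
plt.title('Elbow curve')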
Text(0.5, 1.0, 'Elbow curve')
Evaluation of KMeans with k-means++ init
Evaluation of KMeans with random init
Evaluation of MiniBatchKMeans with k-means++ init
Evaluation of MiniBatchKMeans with random init
- The original k-means starts by choosing the initial centroids at random, and k-means is not "robust" to these initial centroids (what does that mean?).
- k-Means can produce different results when it is run several times!
- k-Means++ "addresses" this:
- Centroid initialization is not uniformly random: each subsequent initial centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen.
- Besides being more robust, it usually needs far fewer iterations than plain k-means.
- Reference: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
image source: https://medium.com/@phil.busko/animation-of-k-means-clustering-31a484c30ba5
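The seeding step of k-means++ can be sketched in a few lines of NumPy. This is an illustrative version of the sampling rule from the referenced paper (pick the first centroid uniformly, then pick each next one with probability proportional to the squared distance to the nearest centroid chosen so far), not the code scikit-learn runs; `kmeans_pp_init` and `X_demo` are hypothetical names.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ style seeding (hypothetical helper for illustration)."""
    rng = np.random.default_rng(seed)
    # first centroid: one data point chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        chosen = np.array(centroids)
        # squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # far-away points are proportionally more likely to be picked next
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# toy data: three locations, each repeated 20 times (assumed for the demo)
X_demo = np.repeat(np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]]), 20, axis=0)
seeds = kmeans_pp_init(X_demo, k=3)
print(seeds)
```

On this toy data a point that coincides with an already chosen centroid has probability zero of being picked again, so the three seeds always land on three different locations; uniform random initialization gives no such guarantee.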
# k-means++ clustering http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
kmPP = cluster.KMeans(n_clusters=k, init='k-means++', max_iter=300, tol=0.0001, random_state = random_state)
kmPP.fit(X)  # fit before predicting
C_kmpp = kmPP.predict(X)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_kmpp)
Handling "Large Data": Mini-Batch k-Means
- Reference: Sculley, D. (2010, April). Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web (pp. 1177-1178). ACM.
# MiniBatch k-Means
# http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
# mini-batch k-means "cannot be parallelized"!
# key parameter: batch_size ... in real applications "at least" 3 x k is recommended
mbkm = cluster.MiniBatchKMeans(n_clusters=k, init='random', max_iter=300, tol=0.0001, batch_size = 100, random_state = random_state)
mbkm.fit(X)  # fit before predicting
C_mbkm = mbkm.predict(X)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_mbkm)
Mini-Batch k-Means++
# MiniBatch k-Means++
mbkmPP = cluster.MiniBatchKMeans(n_clusters=k, init='k-means++', max_iter=300, tol=0.0001, random_state = random_state)
mbkmPP.fit(X)  # fit before predicting
C_mbkmPP = mbkmPP.predict(X)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_mbkmPP)
k-Means vs. MiniBatch k-Means?
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
For n_clusters = 5 The average silhouette_score is : 0.56376469026194
For n_clusters = 6 The average silhouette_score is : 0.4504666294372765
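The code that printed the silhouette values above is not shown. A loop of the following shape would produce output in the same format; this sketch runs on Iris (loaded through `sklearn.datasets.load_iris` so that it is self-contained), which is an assumption, so its numbers need not match the ones printed above. The names `X_iris` and `scores` are chosen for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X_iris = load_iris().data
scores = {}
for n_clusters in range(2, 7):
    # fit k-means and get the cluster id of every sample
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=99).fit_predict(X_iris)
    # mean silhouette over all samples, always between -1 and 1
    scores[n_clusters] = silhouette_score(X_iris, labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", scores[n_clusters])
```

Unlike inertia, the silhouette does not improve automatically as the number of clusters grows, because it also penalizes clusters that sit close to each other.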
# Internal evaluation, e.g. the Silhouette Coefficient ==> warning: only suitable for k-means (centroid-based clustering)
Hasil_Clustering = [C_km, C_kmpp, C_mbkm, C_mbkmPP]
for res in Hasil_Clustering:
    print(siluet(X,res), end=', ')
# How does it work, and how is it interpreted?
0.5528190123564102, 0.5528190123564102, 0.5528190123564102, 0.5528190123564102,
There is no "ground truth" in unsupervised learning/clustering.
One of the biggest sources of "bias" is the clustering algorithm we choose.
Important notes on evaluating clustering internally:
- There is no "correct" clustering
- What matters most is interpretability and the (non-trivial) information obtained
- Certain internal metrics only suit certain algorithms, so in research or professional applications do not compare two different kinds of clustering using an internal measure that is specific to one clustering method (e.g. Silhouette for k-Means).
- Kleinberg, J. M. (2003). An impossibility theorem for clustering. In Advances in neural information processing systems (pp. 463-470).
- Reference 1: http://papers.nips.cc/paper/2340-an-impossibility-theorem-for-clustering.pdf
- Reference 2: https://core.ac.uk/download/pdf/34638775.pdf
# What about external evaluation?
# "C" is the ground truth/gold standard
for res in Hasil_Clustering:
    print(purity(C,res), end=', ')
0.7514854021988338, 0.7514854021988338, 0.7514854021988339, 0.7514854021988339,
# External evaluation: NMI
for res in Hasil_Clustering:
    print(NMI(C,res), end=', ')
# for the F-Score, code and an explanation are also available in the blog post linked below
0.7581756800057784, 0.7581756800057784, 0.7581756800057785, 0.7581756800057785,
Please read more here: https://tau-data.id/evaluasi-eksternal/
How to Draw Conclusions from k-Means: Interpretation
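One detail worth keeping in mind: in this notebook `purity` is an alias for scikit-learn's `homogeneity_score`, which is an entropy-based measure. The classical purity measure (the fraction of samples that belong to the majority class of their cluster) is a different quantity. The sketch below computes both on a tiny hand-made example; `classical_purity`, `y_true` and `y_pred` are names invented for this illustration.

```python
import numpy as np
from sklearn.metrics import homogeneity_score
from sklearn.metrics.cluster import contingency_matrix

def classical_purity(y_true, y_pred):
    """Share of samples that fall in the majority class of their cluster."""
    cm = contingency_matrix(y_true, y_pred)  # rows: true classes, columns: clusters
    return cm.max(axis=0).sum() / cm.sum()

# toy labels: cluster 0 is pure, cluster 1 mixes one "a" with three "b"
y_true = np.array(["a", "a", "a", "b", "b", "b"])
y_pred = np.array([0, 0, 1, 1, 1, 1])

purity_value = classical_purity(y_true, y_pred)           # (2 + 3) / 6
homogeneity_value = homogeneity_score(y_true, y_pred)     # entropy-based
print(purity_value, homogeneity_value)
```

Both measures reach 1 when every cluster contains a single class, but they generally give different numbers otherwise, so it helps to state which one is being reported.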
array([[5.006 , 3.428 , 1.462 , 0.246 ], [5.9016129 , 2.7483871 , 4.39354839, 1.43387097], [6.85 , 3.07368421, 5.74210526, 2.07105263]])
# Evaluation is actually not that important in unsupervised learning.
# This is what distinguishes "clustering" from "clustering analysis":
# what matters more is interpretation, but how?
# Example with k-means++
cols = ['sepal_length','sepal_width','petal_length','petal_width']
dfC = pd.DataFrame(kmPP.cluster_centers_, columns=cols)
dfC['cluster'] = dfC.index
pd.plotting.parallel_coordinates(dfC, 'cluster', color=('r', 'g', 'b'))
array([[5.006 , 3.428 , 1.462 , 0.246 ], [5.9016129 , 2.7483871 , 4.39354839, 1.43387097], [6.85 , 3.07368421, 5.74210526, 2.07105263]])
k-Means Best Practices
- Be careful with the scale of the features ==> normalize/standardize.
- Be careful with the assumptions k-means makes about the topology of the data.
- No: k-means cannot be used on categorical data.
- Strongly discouraged for large-scale unstructured data. If the data is not large, simply replace the Euclidean distance with cosine similarity.
End of Module
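The first point can be illustrated with a small synthetic experiment. The sketch below builds two groups that differ only in a small-scale feature and adds a pure-noise feature on a much larger scale, then runs k-means with and without `StandardScaler`. The data and the names `X_demo`, `group` and `agreement` are assumptions made for this example, not part of the module's case study.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
group = np.repeat([0, 1], 100)                           # true group of each sample
informative = group + rng.normal(scale=0.05, size=200)   # small scale, separates the groups
noisy_large = rng.normal(scale=1000.0, size=200)         # huge scale, carries no group information
X_demo = np.column_stack([informative, noisy_large])

# k-means on the raw features: Euclidean distance is dominated by the large-scale column
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# same model after standardizing every feature to zero mean and unit variance
scaled_model = make_pipeline(StandardScaler(),
                             KMeans(n_clusters=2, n_init=10, random_state=0))
scaled_labels = scaled_model.fit_predict(X_demo)

def agreement(labels, truth):
    """Match rate with the true groups, ignoring which cluster id is which."""
    acc = (labels == truth).mean()
    return max(acc, 1 - acc)

print(agreement(raw_labels, group), agreement(scaled_labels, group))
```

Whether to standardize, min-max scale, or leave features untouched is still a modelling decision: scaling gives every feature the same weight in the distance, which is only appropriate when that is what the analysis intends.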
