Video EDA-03
Code Lesson EDA-03 [Click Here]
The code for this lesson can be accessed at the following link (login to a Google/Gmail account required): Code EDA-03 [Click Here]
At that link you can edit the code directly and run it. Further details are in the accompanying video. It is highly recommended to open the code and the video side by side for a better learning experience (see the image below). Please also modify the code and experiment beyond what is shown in the video to deepen your understanding. Of course, feel free to consult other references to enrich your knowledge, then discuss them in the forum provided.
Module/Code EDA-03: Introduction to Clustering
taudata Analytics
EDA-03: Unsupervised Learning - Introduction to Clustering
https://taudata.blogspot.com/
https://www.youtube.com/c/taudataAnalytics
# Run this cell ONLY if this notebook is run from Google Colab
# If you run it locally (Anaconda/WinPython), install the modules from a terminal/command prompt,
# then download the required files manually and place them in your Python working folder.
import warnings; warnings.simplefilter('ignore')
try:
    import google.colab; IN_COLAB = True
    !pip install umap-learn
    !wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/tau_unsup.py
except:
    IN_COLAB = False
    print("Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded")
Running the code locally, please make sure all the python module versions agree with colab environment and all data/assets downloaded
# Import the modules used in this notebook
import umap, numpy as np, tau_unsup as tau, matplotlib.pyplot as plt, pandas as pd, seaborn as sns
from sklearn import cluster, datasets
from sklearn.metrics import silhouette_score as siluet
from sklearn.metrics.cluster import homogeneity_score as purity
from sklearn.metrics import normalized_mutual_info_score as NMI
sns.set(style="ticks", color_codes=True)
random_state = 99
Definition
Clustering is a process of finding group structures within data such that each instance within a group is similar to one another and dissimilar to instances in other groups [1]
[1] Jain, A.K., Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 2010. 31(8): p. 651-666.
Applications
Clustering analysis applications can be divided into two broad categories:
- clustering for utility (e.g., data compression and indexing) and
- clustering for understanding data (e.g., finding latent structures or insights in the data)
Methods developed in Data Mining fall into the second category.
[2] Pang-Ning, T., M. Steinbach, and V. Kumar, Introduction to data mining. Vol. 74. 2006, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
Real-world Clustering Applications
- Recommendation engines
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection
k-Means
Image source: https://imgflip.com/
The k-Means Algorithm
Important:
- What effect do the use of centroids and this algorithm have on the shape of the clusters?
- From the previous question, understand the bias introduced by the choice of clustering algorithm.
- k-Means is not robust to outliers. Why?
- So what should be done instead? (see the sketch below)
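Because a centroid is an arithmetic mean, a single extreme point can drag it away from the bulk of its cluster. The cell below is a minimal sketch of this effect on synthetic data (a hypothetical example, not one of the lesson's datasets; it reuses the numpy and sklearn imports from the setup cell). Common remedies include removing outliers first or using a median-based method such as k-medoids.
# Sketch (synthetic data): one extreme point drags a k-means centroid.
rng = np.random.default_rng(99)
X_clean = rng.normal(loc=0.0, scale=0.5, size=(50, 2))   # one tight, well-behaved cluster
X_dirty = np.vstack([X_clean, [[10.0, 10.0]]])           # the same cluster plus a single outlier

c_clean = cluster.KMeans(n_clusters=1, n_init=10, random_state=99).fit(X_clean).cluster_centers_[0]
c_dirty = cluster.KMeans(n_clusters=1, n_init=10, random_state=99).fit(X_dirty).cluster_centers_[0]
print("centroid without outlier:", c_clean)
print("centroid with outlier   :", c_dirty)  # pulled noticeably toward (10, 10)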
Challenges in Clustering
- Computational Complexity
- Evaluation
- Interpretation
- Heavily depends on domain knowledge
# We will use two datasets: [1] Iris and [2] a customer-segmentation case study dataset - at the end
# Load the iris data
df = sns.load_dataset("iris")
X = df[['sepal_length','sepal_width','petal_length','petal_width']].values
C = df['species'].values
print(X.shape)
df.sample(7)
(150, 4)
|     | sepal_length | sepal_width | petal_length | petal_width | species    |
|-----|--------------|-------------|--------------|-------------|------------|
| 114 | 5.8          | 2.8         | 5.1          | 2.4         | virginica  |
| 62  | 6.0          | 2.2         | 4.0          | 1.0         | versicolor |
| 33  | 5.5          | 4.2         | 1.4          | 0.2         | setosa     |
| 107 | 7.3          | 2.9         | 6.3          | 1.8         | virginica  |
| 7   | 5.0          | 3.4         | 1.5          | 0.2         | setosa     |
| 100 | 6.3          | 3.3         | 6.0          | 2.5         | virginica  |
| 40  | 5.0          | 3.5         | 1.3          | 0.3         | setosa     |
g = sns.pairplot(df, hue="species")
# k-means: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans
# Remove the random_state argument if you want to see the effect of randomized initial centroids.
k = 3
km = cluster.KMeans(n_clusters=k, init='random', max_iter=300, tol=0.0001,
random_state = 99)
km.fit(X)
# The clustering results
C_km = km.predict(X)
p = sns.countplot(C_km)
C_km
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2])
X2D = umap.UMAP(n_neighbors=5, min_dist=0.3, random_state=random_state).fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_km)
plt.show()
How do these labels differ from classification "labels"?
C_km
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2])
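One way to see the difference: the integers k-means assigns are arbitrary cluster IDs with no inherent meaning (another random seed can permute them), whereas classification labels are fixed target categories. A small sketch, not in the original notebook, relating the two with a cross-tabulation (it reuses C and C_km from the cells above):
# Sketch: compare the arbitrary cluster IDs with the actual species labels.
pd.crosstab(pd.Series(C, name='species'), pd.Series(C_km, name='cluster'))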
Evaluation? - Inertia: Intra-Cluster Distance
- How do we interpret it?
- It is not an error! ... Why?
- It does not yet account for an "inter-cluster distance" factor ==> later: the Silhouette Score
Image source: https://www.unioviedo.es/compnum/labs/new/kmeans.html
km.inertia_
78.85144142614602
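As a sanity check, inertia can be recomputed by hand: it is the sum of squared Euclidean distances from each point to the centroid of its assigned cluster. A short sketch, not in the original notebook (it reuses X, km, and C_km from the cells above):
# Recompute inertia manually; the result should match km.inertia_ above.
d = np.linalg.norm(X - km.cluster_centers_[C_km], axis=1)  # distance of each point to its assigned centroid
print((d ** 2).sum())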
Optimal Number of Clusters? - The Elbow Method -
- Uses inertia
- A recommendation ... not a "must" ==> So what matters more?
distortions, k1, kN = [], 2, 10
for k in range(k1, kN):
    kmeans = cluster.KMeans(n_clusters=k).fit(X)
    distortions.append(kmeans.inertia_)

plt.plot(range(k1, kN), distortions); plt.grid(True)
plt.title('Elbow curve')
Text(0.5, 1.0, 'Elbow curve')
tau.km_initializations()
[Figure: Evaluation of KMeans with k-means++ init; KMeans with random init; MiniBatchKMeans with k-means++ init; MiniBatchKMeans with random init]
k-Means++
- The original k-means starts by picking the initial centroids at random, and k-means is not "robust" to this initialization (what does that mean?).
- k-Means can therefore produce different results each time it is run!
- k-Means++ "addresses" this:
- the initial centroids are not chosen uniformly at random; instead, each new centroid is sampled with a probability that favors points far from the centroids already chosen.
- Besides being more robust, it usually needs far fewer iterations than plain k-means.
- Reference: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf
Image source: https://medium.com/@phil.busko/animation-of-k-means-clustering-31a484c30ba5
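The seeding idea is short enough to sketch: the first centroid is drawn uniformly at random, and each subsequent one is drawn with probability proportional to D(x)^2, the squared distance to the nearest centroid chosen so far. The function below is a simplified illustration of this D^2 sampling (a sketch only, not scikit-learn's actual implementation, which adds local trials and other refinements):
# Sketch of D^2 sampling, the core idea of k-means++ seeding.
def kmeans_pp_init(X, k, seed=99):
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]   # first centroid: uniform at random
    for _ in range(k - 1):
        # squared distance from each point to its nearest already-chosen centroid
        d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # far points are more likely
    return np.array(centers)

print(kmeans_pp_init(X, 3))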
# k-means++ clustering http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
k=3
kmPP = cluster.KMeans(n_clusters=k, init='k-means++', max_iter=300, tol=0.0001, random_state = random_state)
kmPP.fit(X)
C_kmpp = kmPP.predict(X)
sns.countplot(C_kmpp)
C_kmpp[:10]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_kmpp)
plt.show()
Handling "Large Data" : Mini-Batch k-Means¶
- Referensi: *Sculley, D. (2010, April). Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web (pp. 1177-1178). ACM.
# MiniBatch k-Means
# http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
# Note: mini-batch k-means "cannot be parallelized"!
# The important parameter is batch_size ... in real applications a "minimum" of 3*k is recommended
mbkm = cluster.MiniBatchKMeans(n_clusters=k, init='random', max_iter=300, tol=0.0001, batch_size = 100, random_state = random_state)
mbkm.fit(X)
C_mbkm = mbkm.predict(X)
sns.countplot(C_mbkm)
C_mbkm[:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_mbkm)
plt.show()
Mini-Batch k-Means++
# MiniBatch k-Means++
mbkmPP = cluster.MiniBatchKMeans(n_clusters=k, init='k-means++', max_iter=300, tol=0.0001, random_state = random_state)
mbkmPP.fit(X)
C_mbkmPP = mbkmPP.predict(X)
sns.countplot(C_mbkmPP)
C_mbkmPP[:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_mbkmPP)
plt.show()
k-Means vs. Mini-Batch k-Means?
tau.sil_based_optimal_km()
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721
For n_clusters = 4 The average silhouette_score is : 0.6505186632729437
For n_clusters = 5 The average silhouette_score is : 0.56376469026194
For n_clusters = 6 The average silhouette_score is : 0.4504666294372765
# Evaluation: internal. Example: the Silhouette Coefficient ==> warning: only suitable for k-means (centroid-based clustering)
Hasil_Clustering = [C_km, C_kmpp, C_mbkm, C_mbkmPP]
for res in Hasil_Clustering:
    print(siluet(X, res), end=', ')
# How does it work, and how do we interpret it?
0.5528190123564102, 0.5528190123564102, 0.5528190123564102, 0.5528190123564102,
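For intuition: the silhouette of a single sample i is s(i) = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its smallest mean distance to any other cluster, so s(i) ranges from -1 (misplaced) to +1 (well separated). A sketch, not in the original notebook, computing it for one sample (it reuses X and C_km from above):
# Sketch: silhouette value of one sample, s(i) = (b - a) / max(a, b).
def silhouette_one(X, labels, i):
    d = np.linalg.norm(X - X[i], axis=1)   # distances from sample i to all samples
    own = labels == labels[i]
    a = d[own].sum() / (own.sum() - 1)     # mean intra-cluster distance, excluding i itself
    b = min(d[labels == c].mean() for c in set(labels) if c != labels[i])
    return (b - a) / max(a, b)

print(silhouette_one(X, C_km, 0))  # cross-check with sklearn.metrics.silhouette_samples(X, C_km)[0]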
Clustering?
There is no "ground truth" in unsupervised learning/clustering.
One of the biggest "biases" is the clustering algorithm we choose.
Important notes when evaluating clustering internally:
- There is no "correct" clustering
- What matters most is interpretability/the information obtained (non-trivial information)
- A given internal metric is only suitable for particular algorithms, so in research or professional applications do not compare two different kinds of clustering using an internal measure that is specific to one clustering method (e.g., Silhouette for k-Means).
- Kleinberg, J. M. (2003). An impossibility theorem for clustering. In Advances in neural information processing systems (pp. 463-470).
- Reference 1: http://papers.nips.cc/paper/2340-an-impossibility-theorem-for-clustering.pdf
- Reference 2: https://core.ac.uk/download/pdf/34638775.pdf
# What about external evaluation?
# "C" is the ground truth/gold standard
for res in Hasil_Clustering:
    print(purity(C, res), end=', ')
0.7514854021988338, 0.7514854021988338, 0.7514854021988339, 0.7514854021988339,
# External evaluation: NMI
for res in Hasil_Clustering:
    print(NMI(C, res), end=', ')
# Code and an explanation for the F-Score are also available in the blog post above
0.7581756800057784, 0.7581756800057784, 0.7581756800057785, 0.7581756800057785,
Please read more here: https://tau-data.id/evaluasi-eksternal/
How to Draw Conclusions from k-Means: Interpretation
kmPP.cluster_centers_
array([[5.006     , 3.428     , 1.462     , 0.246     ],
       [5.9016129 , 2.7483871 , 4.39354839, 1.43387097],
       [6.85      , 3.07368421, 5.74210526, 2.07105263]])
# Evaluation is actually not that important in unsupervised learning.
# This is what distinguishes "clustering" from "clustering analysis":
# what matters more is the interpretation. But how?
# Example with k-means++
cols = ['sepal_length','sepal_width','petal_length','petal_width']
dfC = pd.DataFrame(kmPP.cluster_centers_, columns=cols)
dfC['cluster'] = dfC.index
pd.plotting.parallel_coordinates(dfC, 'cluster', color=('r', 'g', 'b'))
plt.show()
k-Means Best Practices
- Beware of the scale of the data ==> normalize/standardize (see the sketch below).
- Beware of the topology (shape) assumptions k-means makes about the data.
- Nope... k-Means cannot handle categorical data.
- It is strongly discouraged for large-scale unstructured data. If the data is not large, it is enough to replace the Euclidean distance with cosine similarity (also in the sketch below).
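The cell below sketches two of these practices on the Iris data (a hypothetical pipeline, not from the lesson): standardizing features before k-means so that no feature dominates by scale, and L2-normalizing rows so that Euclidean k-means behaves like cosine-based clustering (on unit vectors, the squared Euclidean distance equals 2 - 2 * cosine similarity).
# Sketch: (1) standardize features before k-means; (2) emulate cosine distance via L2 normalization.
from sklearn.preprocessing import StandardScaler, normalize

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
km_scaled = cluster.KMeans(n_clusters=3, n_init=10, random_state=random_state).fit(X_scaled)

X_unit = normalize(X)                          # each row rescaled to unit length
km_cosine = cluster.KMeans(n_clusters=3, n_init=10, random_state=random_state).fit(X_unit)
print(km_scaled.inertia_, km_cosine.inertia_)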
End of Module
References
- Everitt, B. S., Landau, S., & Leese, M. (1993). Cluster analysis. Edward Arnold and Halsted Press.
- Arthur, D., & Vassilvitskii, S. (2006). k-means++: The advantages of careful seeding. Stanford.
- Sculley, D. (2010, April). Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web (pp. 1177-1178).
- Jain, A.K., Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 2010. 31(8): p. 651-666.
- Pang-Ning, T., M. Steinbach, and V. Kumar, Introduction to data mining. Vol. 74. 2006, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
- Kleinberg, J. M. (2003). An impossibility theorem for clustering. In Advances in neural information processing systems (pp. 463-470).