Video EDA-04: https://www.youtube.com/watch?v=oC-WyfkNdn8
Code Lesson EDA-04 [Click Here]
The code for this lesson can be accessed at the following link (Google/Gmail login required): Code EDA-04 [Click Here]
At that link you can edit the code and run it directly; further explanation is given in the accompanying video. It is highly recommended to open the code and the video "side-by-side" for a better learning experience (see the image below). Feel free to modify and experiment with things beyond what is shown in the video to gain a deeper understanding. Also consult other references to enrich your knowledge, then discuss them in the forum provided.
Module/Code EDA-04: Introduction to Clustering
EDA-04: Unsupervised Learning - Clustering Part 02 ¶
(C) Taufik Sutanto - 2020
tau-data Indonesia ~ https://tau-data.id/eda-04/
# Run this cell ONLY if this notebook is run from Google Colab
# If running locally (Anaconda/WinPython), install the packages from the terminal/command prompt,
# then download the required file manually and place it in your Python working folder.
!pip install --upgrade umap-learn
!wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/tau_unsup.py
# Import the modules used in this notebook
import warnings; warnings.simplefilter('ignore')
import time, umap, numpy as np, tau_unsup as tau, matplotlib.pyplot as plt, pandas as pd, seaborn as sns
from matplotlib.colors import ListedColormap
from sklearn import cluster, datasets
from sklearn.metrics.pairwise import pairwise_distances_argmin
from sklearn.preprocessing import StandardScaler
from itertools import cycle, islice
from sklearn.metrics import silhouette_score as siluet
from sklearn.metrics.cluster import homogeneity_score as purity
from sklearn.metrics import normalized_mutual_info_score as NMI
sns.set(style="ticks", color_codes=True)
random_state = 99
Hierarchical Clustering (Agglomerative)¶
image source: https://www.kdnuggets.com/2019/09/hierarchical-clustering.html¶
- Optimal clustering = cut the dendrogram at the longest vertical line https://www.sciencedirect.com/topics/computer-science/agglomerative-algorithm
Some examples of Linkages¶
Linkage Comparisons¶
- single linkage is fast, and can perform well on non-globular data, but it performs poorly in the presence of noise.
- average and complete linkage perform well on cleanly separated globular clusters, but have mixed results otherwise.
- Ward is the most effective method for noisy data.
- http://scikit-learn.org/stable/auto_examples/cluster/plot_linkage_comparison.html#sphx-glr-auto-examples-cluster-plot-linkage-comparison-py
tau.compare_linkages()
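If tau_unsup.py is not available, the same kind of comparison can be sketched directly with scikit-learn. Below is a minimal sketch on a noisy two-moons dataset (an assumption for illustration; it is not the exact figure produced by tau.compare_linkages()):
# Minimal sketch: compare linkage strategies on a noisy two-moons dataset
Xm, _ = datasets.make_moons(n_samples=300, noise=0.08, random_state=random_state)
Xm = StandardScaler().fit_transform(Xm)
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, link in zip(axes, ['single', 'average', 'complete', 'ward']):
    labels = cluster.AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(Xm)
    ax.scatter(Xm[:, 0], Xm[:, 1], c=labels)
    ax.set_title(link)
plt.show()
Single linkage typically follows the two moons, while the other linkages tend to cut them into compact blobs, matching the comparison above.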
Hierarchical Clustering (Agglomerative vs Divisive)¶
image source: https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/¶
# We will use the same data as in EDA-03
df = sns.load_dataset("iris")
X = df[['sepal_length','sepal_width','petal_length','petal_width']].values
C = df['species'].values
print(X.shape)
df.head()
(150, 4)
  | sepal_length | sepal_width | petal_length | petal_width | species
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa
# Hierarchical http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering
hierarchical = cluster.AgglomerativeClustering(n_clusters=3, linkage='average', affinity='euclidean') # newer scikit-learn (>=1.2) uses metric='euclidean' instead of affinity
hierarchical.fit(X) # Slow ... roughly O(N^2 log N) time and O(N^2) memory
C_h = hierarchical.labels_.astype(int) # np.int was removed in newer NumPy versions
C_h[:10]
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
# Dendrogram Example
# http://seaborn.pydata.org/generated/seaborn.clustermap.html
g = sns.clustermap(X, method="single", metric="euclidean")
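Besides seaborn's clustermap, the dendrogram alone can be drawn with SciPy (a minimal sketch; SciPy is already installed as a scikit-learn dependency). This makes the "cut at the longest vertical line" heuristic mentioned earlier easy to inspect:
# Minimal dendrogram sketch with SciPy, using the same iris matrix X as above
from scipy.cluster.hierarchy import linkage, dendrogram
Z = linkage(X, method='average', metric='euclidean')  # merge history of agglomerative clustering
plt.figure(figsize=(10, 4))
dendrogram(Z, no_labels=True)  # look for the longest vertical gap to choose where to cut
plt.ylabel('distance')
plt.show()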
# Scatter Plot of the hierarchical clustering results
X2D = umap.UMAP(n_neighbors=5, min_dist=0.3, random_state=random_state).fit_transform(X)
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_h)
plt.show()
Hierarchical Clustering Applications¶
image Source: https://www.sciencedirect.com/science/article/pii/S1532046416000307¶
Evaluating Hierarchical Clustering?¶
- Silhouette Coefficient, Dunn index, or Davies–Bouldin index (a short example follows after this list)
- Domain knowledge - interpretability
- External Evaluation
Read more here: https://www.ims.uni-stuttgart.de/document/team/schulte/theses/phd/algorithm.pdf¶
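As a quick illustration (a minimal sketch using the metrics already imported at the top of this notebook): the silhouette coefficient serves as internal evaluation, while purity/homogeneity and NMI compare the clustering against the known species labels as external evaluation.
# Internal evaluation: silhouette on the original feature matrix
print('Silhouette        :', siluet(X, C_h))
# External evaluation: compare against the ground-truth species labels C
print('Purity/Homogeneity:', purity(C, C_h))
print('NMI               :', NMI(C, C_h))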
Spectral Clustering¶
- Convert the tabular data into pairwise similarities S = S(xi, xj)
- Apply a threshold to turn S into a connected, undirected graph
- The Laplacian matrix of this similarity graph is square, symmetric, and positive semi-definite
- Because it is positive semi-definite ==> its eigenvalues are real (and non-negative)
- The set of eigenvalues of this matrix is its "spectrum" ==> Spectral graph theory: https://en.wikipedia.org/wiki/Spectral_graph_theory
- The intuition is similar to "centrality analysis" in social media analytics, namely:
- form graph clusters such that the weights (connectivity) within each cluster are as large as possible.
Reference: Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems (pp. 849-856).
This is why spectral clustering can group data that are not "spherical" (round blobs, as k-means assumes); see the sketch after the iris example below.
# Spectral : http://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html
spectral = cluster.SpectralClustering(n_clusters=3)
spectral.fit(X)
C_spec = spectral.labels_.astype(int) # np.int was removed in newer NumPy versions
sns.countplot(x=C_spec)
C_spec[:10]
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_spec)
plt.show()
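To illustrate the claim above about non-spherical clusters, here is a minimal sketch (not part of the original lesson) comparing k-means and spectral clustering on a "two moons" dataset; k-means tends to split each moon incorrectly, while spectral clustering follows their shape.
# Minimal sketch: k-means vs spectral clustering on non-spherical data
Xm, _ = datasets.make_moons(n_samples=300, noise=0.05, random_state=random_state)
km_labels = cluster.KMeans(n_clusters=2, random_state=random_state).fit_predict(Xm)
sp_labels = cluster.SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                                       random_state=random_state).fit_predict(Xm)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(Xm[:, 0], Xm[:, 1], c=km_labels); axes[0].set_title('k-means')
axes[1].scatter(Xm[:, 0], Xm[:, 1], c=sp_labels); axes[1].set_title('spectral')
plt.show()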
DBSCAN¶
- Because of how the algorithm works (density-based), DBSCAN is often used for (multivariate) outlier detection; a short example of extracting these outliers follows after the plots below.
# DBSCAN http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
# does not need the number of clusters k as an input parameter!!! ... very useful when clustering large datasets
dbscan = cluster.DBSCAN(eps=0.8, min_samples=5, metric='euclidean')
dbscan.fit(X)
C_db = dbscan.labels_.astype(int) # np.int was removed in newer NumPy versions
sns.countplot(x=C_db)
C_db[:10]
# what does cluster label -1 mean?
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
sum([1 for i in C_db if i==-1])
2
fig, ax = plt.subplots()
ax.scatter(X2D[:,0], X2D[:,1], c=C_db)
plt.show()
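Following up on the question about cluster label -1 above: DBSCAN assigns -1 to points that do not fall in any dense region, so they can be pulled out of the DataFrame directly as multivariate outlier candidates (a minimal sketch using the df and C_db defined above).
# Rows that DBSCAN marked as noise (label -1) = outlier candidates
outliers = df[C_db == -1]
print(outliers)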
Evaluation?¶
- Application-based ==> Outlier Detection
- An internal validation index called DBCV by Moulavi et al. The paper is available here: https://epubs.siam.org/doi/pdf/10.1137/1.9781611973440.96
- Python package: https://github.com/christopherjenness/DBCV
try:
    # Should work in Google Colab / Linux
    !wget https://raw.githubusercontent.com/christopherjenness/DBCV/master/DBCV/DBCV.py
except:
    pass # On Windows, download the file manually and place it next to this notebook
import DBCV as dbcv # the downloaded module file is DBCV.py
'wget' is not recognized as an internal or external command, operable program or batch file.
dbcv.DBCV(X, C_db)
-0.09914623310743173
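DBCV can also be used to compare parameter settings. The sweep below is a minimal sketch (the eps values are arbitrary, not from the original lesson); a higher DBCV score indicates a better density-based clustering.
# Hypothetical parameter sweep: score several eps values with DBCV
for eps in [0.4, 0.6, 0.8, 1.0]:
    labels = cluster.DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    try:
        score = dbcv.DBCV(X, labels)
    except Exception:
        score = float('nan')  # DBCV may fail if everything collapses into one cluster or noise
    print('eps =', eps, '-> DBCV =', score)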
Computational Complexity Challenge of Clustering
Study the following case study (Customer Segmentation):¶
http://www.data-mania.com/blog/customer-profiling-and-segmentation-in-python/¶
End of Module¶
References
- Everitt, B. S., Landau, S., & Leese, M. (1993). Cluster Analysis. Edward Arnold and Halsted Press.
- Arthur, D., & Vassilvitskii, S. (2006). k-means++: The advantages of careful seeding. Stanford.
- Sculley, D. (2010, April). Web-scale k-means clustering. In Proceedings of the 19th international conference on World wide web (pp. 1177-1178).
- Jain, A.K., Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 2010. 31(8): p. 651-666.
- Pang-Ning, T., M. Steinbach, and V. Kumar, Introduction to data mining. Vol. 74. 2006, Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc.
- Kleinberg, J. M. (2003). An impossibility theorem for clustering. In Advances in neural information processing systems (pp. 463-470).