“There is no Royal Road to Geometry.” — Euclid
Distance and Similarity
Outline: Similarity and Distance
- Introduction: Distance vs. Similarity
- Distances + Similarity for Numerical Data
- Distance for Categorical Variables
- Distance/Similarity in Statistics
- Distance/Similarity for Text Data: Vector Space
- Distance in Non-Euclidean Space: Haversine
- Applications: Hierarchical Clustering & k-NN
Distance and Structures in Data
- Data keeps growing, both in format and in quantity.
- One key need with large data is finding patterns that can be used to produce information/insight, or even to make predictions.
- Choosing a good distance formula can determine the performance of a (predictive) model.
- Examples of pattern finding in Statistics are Principal Component Analysis (PCA) and cluster analysis (Clustering).
Basic Definitions
Intuitively, distance functions are mathematical functions that assign a numerical value (their distance) to each pair of objects in a given domain.
This numerical value represents an assessment of how similar they are: two very similar objects would be assigned a very low distance, and two very dissimilar objects would be assigned a larger distance.
Similarity functions are the complementary idea, and assign high similarity values to similar objects, and low values to dissimilar pairs of objects.
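A distance can be converted into a similarity (and vice versa); one common convention, assuming the distance is non-negative, is

$$ s(x, y) = \frac{1}{1 + d(x, y)}, \qquad \text{or } s(x, y) = 1 - d(x, y) \text{ when } d \text{ is normalized to } [0, 1]. $$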
History/Taxonomy of Distance
Example of a metric/distance for numerical data: the Minkowski distance
- When p=1 we have the Manhattan distance,
- when p=2 we have the Euclidean distance, and
- when p=∞ it converges to the Chebyshev distance
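For reference, the Minkowski distance of order $p \geq 1$ between two $n$-dimensional vectors $x$ and $y$ is

$$ d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}. $$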
Excellent article: https://towardsdatascience.com/17-types-of-similarity-and-dissimilarity-measures-used-in-data-science-3eb914d2681
Minkowski for various values of p
Real-World Illustration
# Application in Python
# Importing modules for this notebook
import warnings; warnings.simplefilter('ignore')
import matplotlib.pyplot as plt, numpy as np
from scipy.spatial import distance as dist
"Done"
dist.minkowski([1, 0, 0], [0, 1, 0], 1)  # p=1, Manhattan: 2.0
dist.minkowski([1, 0, 0], [0, 1, 0], 2)  # p=2, Euclidean: sqrt(2) ≈ 1.414
dist.minkowski([1, 0, 0], [0, 1, 0], 3)  # p=3: 2**(1/3) ≈ 1.260
Another distance example: the Canberra distance
It is a weighted version of the Manhattan distance, used in clustering (e.g., fuzzy clustering), classification, computer security, and ham/spam detection systems. It is more robust to outliers than the previous metric.
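Concretely, the Canberra distance between vectors $x$ and $y$ is

$$ d(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}, $$

where terms with $x_i = y_i = 0$ are skipped (this is also how SciPy implements it).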
# Python implementation
dist.canberra([1, 0, 0], [0, 1, 0])  # (1/1) + (1/1) = 2.0
Example: Cosine Similarity
This metric is widely used in text mining, natural language processing, and information retrieval systems. For instance, it can be used to measure the similarity between two documents. Because it depends only on the angle between vectors, not on their magnitudes, it is insensitive to document length, which also makes it useful for tasks such as spam detection.
Note that cosine similarity is essentially a (normalized) inner product. Mathematically, this means we only look at how much the vectors agree element by element, regardless of their overall scale.
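Formally, for vectors $x$ and $y$,

$$ \text{sim}(x, y) = \cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|}, \qquad \text{cosine distance} = 1 - \cos\theta. $$

Note that SciPy's dist.cosine returns the cosine distance, not the similarity.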
dist.cosine([1, 1, 0], [0, 1, 0])  # 1 - 1/(sqrt(2)*1) ≈ 0.293
Beware: similarity vs. dissimilarity
dist.cosine([1, 1, 1], [1, 1, 1])  # 0.0: identical vectors have zero *distance* (similarity = 1)
Jaccard Similarity
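For two sets (or binary vectors) $A$ and $B$, the Jaccard similarity and the corresponding distance are

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B). $$

SciPy's dist.jaccard returns the distance $d_J$.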
# Python implementation
dist.jaccard([1, 1, 0], [0, 1, 0])  # 0.5: one disagreement out of two positions where either vector is nonzero
Implementation on Text Documents
from sklearn.feature_extraction.text import CountVectorizer
a = 'halo selamat pagi petang atau senja'
b = 'hai hai selamat siang sore dan malam'
data = [a, b]
vectorizer = CountVectorizer(binary = False)
vsm = vectorizer.fit_transform(data)
vsm.toarray()            # dense document-term matrix
print(vsm.shape)         # (2 documents, vocabulary size)
print(vsm[0].data)       # nonzero counts of document 0
print(vsm[0].indices)    # column indices of those counts
vectorizer.vocabulary_   # term -> column index mapping
dist.cosine(vsm.toarray()[0,:], vsm.toarray()[1,:])   # cosine distance between the two documents
dist.jaccard(vsm.toarray()[0,:], vsm.toarray()[1,:])  # Jaccard distance between the two documents
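Since Jaccard is a set-membership measure, it arguably makes more sense to binarize the term counts first. A minimal variation on the cell above (vectorizer_bin and vsm_bin are new names introduced here):

# Jaccard compares presence/absence, so binarize the counts first
vectorizer_bin = CountVectorizer(binary=True)
vsm_bin = vectorizer_bin.fit_transform(data).toarray()
dist.jaccard(vsm_bin[0], vsm_bin[1])

In this toy example the result happens to coincide with the count-based version, but the two can differ whenever a shared term appears a different number of times in each document.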
Pearson Correlation
- Covariance has a geometric meaning ... it is a cosine! See https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Geometric_interpretation
- Correlation is a normalized covariance.
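In symbols, the Pearson correlation is the covariance normalized by the standard deviations:

$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}, $$

which is exactly the cosine of the angle between the two mean-centered variables.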
import pandas as pd
data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138, 142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
# Correlation and a scatter plot to inspect the data
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))  # np.corrcoef returns the full 2x2 correlation matrix
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Spearman Correlation
Like Pearson correlation, Spearman correlation is used in bivariate analysis. However, unlike Pearson correlation, Spearman correlation is computed on the ranks of the values.
It can be used for ordinal (rank-ordered) categorical attributes as well as numerical ones.
The Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, it does not assume that both datasets are normally distributed. Like other correlation coefficients, it varies between -1 and +1, with 0 implying no correlation. Correlations of -1 or +1 imply an exactly monotonic relationship: positive correlations imply that as x increases, so does y; negative correlations imply that as x increases, y decreases.
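When there are no tied ranks, the Spearman correlation has the closed form

$$ \rho = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)}, $$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$, and $n$ is the number of observations.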
Chi-square similarity: read more here: https://towardsdatascience.com/statistics-in-python-using-chi-square-for-feature-selection-d44f467ca745 (a short sketch follows the Spearman example below)
from scipy.stats import spearmanr, kendalltau, pearsonr
R = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
P = ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
spearmanr(R, P)  # both inputs are ranked internally ('A'/'B' sort lexicographically); returns (correlation, p-value)
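As a companion to the chi-square link above, a minimal sketch of a chi-square test of independence; the 2x2 contingency table here is made up purely for illustration:

# Hypothetical contingency table: rows = feature present/absent, columns = class A/B
from scipy.stats import chi2_contingency
table = [[30, 10],
         [20, 40]]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value suggests the feature and the class are associated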
Haversine Distance (Non-Euclidean)
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great-circle distance in kilometers between two points
    on the Earth (specified in decimal degrees).
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # mean Earth radius in kilometers; use 3956 for miles (determines return-value units)
    return c * r
Computing the Distance from FMIPA UI Depok to FST UIN Ciputat
- Open https://www.latlong.net/ or https://www.gps-coordinates.net/
- Record the latitude and longitude of each location
UI = [-6.379800, 106.807083]
UIN = [-6.306384108601761, 106.75270199775696]
jarak = haversine(UI[1], UI[0], UIN[1], UIN[0])  # note the (lon, lat) argument order
print("Distance = {} km".format(jarak))          # roughly 10 km
Distance for Structured Data
Example application of distance in Statistics/Machine Learning: k-Nearest Neighbour
- The simplest classifier, yet it can also be used for regression (and even clustering).
- Often called an instance-based learner.
- It has no "equation"; the approach is algorithmic, built on the concept of distance/similarity.
- Similar in concept to DBSCAN.
k-NN Neighbourhood Size & Weights
- Uniform: all points in each neighborhood are weighted equally.
- Distance: closer neighbors of a query point have a greater influence than neighbors further away (see the sketch below).
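A minimal NumPy sketch of the idea (illustrative only; knn_predict is a toy helper defined here, and in practice we use scikit-learn as in the code further below):

import numpy as np

def knn_predict(X_train, y_train, x, k=3, weights='uniform'):
    # Distances from the query point x to every training point
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]  # indices of the k nearest neighbours
    # Uniform weights, or inverse-distance weights for the 'distance' scheme
    w = np.ones(k) if weights == 'uniform' else 1.0 / (d[idx] + 1e-12)
    labels = y_train[idx]
    # (Weighted) majority vote over the neighbours' labels
    return max(set(labels), key=lambda c: w[labels == c].sum())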
Pros and Cons
- Pros:
  - Relatively fast (efficient) for data that is not too large
  - Simple and easy to implement
  - Easy to modify, with many choices of distance/similarity formulas
  - Handles multiclass data easily
  - Reasonably accurate if the data is representative
- Cons:
  - Finding the nearest neighbours is inefficient for large data
  - The entire training data must be stored
  - An appropriate distance formula must be chosen carefully
Application in Python
Simple Classification Case 01: Classifying Iris Flower Species
- The Iris flower data serves as a simple classification case study.
- Data link: https://archive.ics.uci.edu/ml/datasets/iris
- Source paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
- The classification problem is to identify the Iris species from the flower's shape (e.g., petal and sepal length and width).
# Load the Iris flower data
import seaborn as sns
data = sns.load_dataset("iris")
print(data.shape)
data.sample(5)
data['species'] = data['species'].astype('category')
print(data.info())
# We build a new dataframe; be careful if the data is large (this copies the data)
X = data[['sepal_length','sepal_width','petal_length','petal_width']]
Y = data['species']
X.shape, Y.shape
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=0.3, random_state=0)
xTrain.shape, xTest.shape
# k-NN: http://scikit-learn.org/stable/modules/neighbors.html
from sklearn import neighbors
n_neighbors = 3
weights = 'distance'  # inverse-distance weighting ('uniform' is the alternative)
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
kNN.fit(xTrain, yTrain)
print('Done!')
# Predict with k-NN
y_kNN1 = kNN.predict(xTest)
y_kNN1[-10:]
from sklearn.metrics import confusion_matrix, classification_report
print("Kasus - Bunga Iris: kNN")
print(confusion_matrix(yTest, y_kNN1))
print(classification_report(yTest, y_kNN1))
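One way to choose n_neighbors and the weighting scheme, rather than fixing them by hand, is cross-validation. A minimal sketch building on the variables above:

# Grid-search over k and the weighting scheme with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': list(range(1, 16)), 'weights': ['uniform', 'distance']}
grid = GridSearchCV(neighbors.KNeighborsClassifier(), param_grid, cv=5)
grid.fit(xTrain, yTrain)
print(grid.best_params_, grid.best_score_)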