“There is no Royal Road to Geometry.” — Euclid
Distance and Similarity
Outline: Similarity and Distance
- Introduction: Distance vs. Similarity
- Distances + Similarity for Numerical Data
- Distance for Categorical Variables
- Distance/Similarity in Statistics
- Distance/Similarity for Text Data: Vector Space
- Distance in Non-Euclidean Space: Haversine
- Applications: Hierarchical Clustering & k-NN
Distance and Structures in Data
- Data keeps growing, both in format and in quantity.
- One key need with large data is finding patterns that can be used to produce information/insight, or even to make predictions.
- Choosing a good distance formula can determine the performance of a (predictive) model.
- Examples of pattern finding in Statistics are Principal Component Analysis (PCA) and cluster analysis (Clustering).
Basic Definitions
Intuitively, distance functions are mathematical functions that assign a numerical value (their distance) to each pair of objects in a given domain.
This numerical value represents an assessment of how similar they are: two very similar objects would be assigned a very low distance, and two very dissimilar objects would be assigned a larger distance.
Similarity functions are the complementary idea, and assign high similarity values to similar objects, and low values to dissimilar pairs of objects.
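A distance can be converted into a similarity (and vice versa); one common convention, assuming the distance is non-negative, is

$$ s(x, y) = \frac{1}{1 + d(x, y)}, \qquad \text{or } s(x, y) = 1 - d(x, y) \text{ when } d \text{ is normalized to } [0, 1]. $$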
History/Taxonomy of Distance
Example of a metric/distance for numerical data: the Minkowski distance
- When p=1 we have the Manhattan distance,
- when p=2 we have the Euclidean distance, and
- when p=∞ it converges to the Chebyshev distance
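For reference, the Minkowski distance of order $p \geq 1$ between two $n$-dimensional vectors $x$ and $y$ is

$$ d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}. $$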
Excellent article: https://towardsdatascience.com/17-types-of-similarity-and-dissimilarity-measures-used-in-data-science-3eb914d2681
Minkowski for various values of p
Real-World Illustration
# Application in Python
# Importing modules for this notebook
import warnings; warnings.simplefilter('ignore')
import matplotlib.pyplot as plt, numpy as np
from scipy.spatial import distance as dist
"Done"
dist.minkowski([1, 0, 0], [0, 1, 0], 1)  # p=1, Manhattan: 2.0
dist.minkowski([1, 0, 0], [0, 1, 0], 2)  # p=2, Euclidean: sqrt(2) ≈ 1.414
dist.minkowski([1, 0, 0], [0, 1, 0], 3)  # p=3: 2**(1/3) ≈ 1.260
Another distance example: the Canberra distance
It is a weighted version of the Manhattan distance, used in clustering (e.g., fuzzy clustering), classification, computer security, and ham/spam detection systems. It is more robust to outliers than the previous metric.
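Concretely, the Canberra distance between vectors $x$ and $y$ is

$$ d(x, y) = \sum_{i=1}^{n} \frac{|x_i - y_i|}{|x_i| + |y_i|}, $$

where terms with $x_i = y_i = 0$ are skipped (this is also how SciPy implements it).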
# Python implementation
dist.canberra([1, 0, 0], [0, 1, 0])  # (1/1) + (1/1) = 2.0
Example: Cosine Similarity
This metric is widely used in text mining, natural language processing, and information retrieval systems. For instance, it can be used to measure the similarity between two documents. Because it depends only on the angle between vectors, not on their magnitudes, it is insensitive to document length, which also makes it useful for tasks such as spam detection.
Note that cosine similarity is essentially a (normalized) inner product. Mathematically, this means we only look at how much the vectors agree element by element, regardless of their overall scale.
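Formally, for vectors $x$ and $y$,

$$ \text{sim}(x, y) = \cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|}, \qquad \text{cosine distance} = 1 - \cos\theta. $$

Note that SciPy's dist.cosine returns the cosine distance, not the similarity.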
dist.cosine([1, 1, 0], [0, 1, 0])  # 1 - 1/(sqrt(2)*1) ≈ 0.293
Beware: similarity vs. dissimilarity
dist.cosine([1, 1, 1], [1, 1, 1])  # 0.0: identical vectors have zero *distance* (similarity = 1)
Jaccard Similarity
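For two sets (or binary vectors) $A$ and $B$, the Jaccard similarity and the corresponding distance are

$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad d_J(A, B) = 1 - J(A, B). $$

SciPy's dist.jaccard returns the distance $d_J$.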
# Python implementation
dist.jaccard([1, 1, 0], [0, 1, 0])  # 0.5: one disagreement out of two positions where either vector is nonzero
Implementation on Text Documents
from sklearn.feature_extraction.text import CountVectorizer
a = 'halo selamat pagi petang atau senja'
b = 'hai hai selamat siang sore dan malam'
data = [a, b]
vectorizer = CountVectorizer(binary = False)
vsm = vectorizer.fit_transform(data)
vsm.toarray()            # dense document-term matrix
print(vsm.shape)         # (2 documents, vocabulary size)
print(vsm[0].data)       # nonzero counts of document 0
print(vsm[0].indices)    # column indices of those counts
vectorizer.vocabulary_   # term -> column index mapping
dist.cosine(vsm.toarray()[0,:], vsm.toarray()[1,:])   # cosine distance between the two documents
dist.jaccard(vsm.toarray()[0,:], vsm.toarray()[1,:])  # Jaccard distance between the two documents
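Since Jaccard is a set-membership measure, it arguably makes more sense to binarize the term counts first. A minimal variation on the cell above (vectorizer_bin and vsm_bin are new names introduced here):

# Jaccard compares presence/absence, so binarize the counts first
vectorizer_bin = CountVectorizer(binary=True)
vsm_bin = vectorizer_bin.fit_transform(data).toarray()
dist.jaccard(vsm_bin[0], vsm_bin[1])

In this toy example the result happens to coincide with the count-based version, but the two can differ whenever a shared term appears a different number of times in each document.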
Pearson Correlation
- Covariance has a geometric meaning ... it is a cosine! See https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Geometric_interpretation
- Correlation is a normalized covariance.
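In symbols, the Pearson correlation is the covariance normalized by the standard deviations:

$$ r_{xy} = \frac{\operatorname{cov}(x, y)}{s_x \, s_y} = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}}, $$

which is exactly the cosine of the angle between the two mean-centered variables.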
import pandas as pd
data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138, 142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
# Correlation and a scatter plot to inspect the data
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))  # np.corrcoef returns the full 2x2 correlation matrix
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Spearman Correlation
Like Pearson correlation, Spearman correlation is used in bivariate analysis. However, unlike Pearson correlation, Spearman correlation is computed on the ranks of the values.
It can be used for ordinal (rank-ordered) categorical attributes as well as numerical ones.
The Spearman correlation is a nonparametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, it does not assume that both datasets are normally distributed. Like other correlation coefficients, it varies between -1 and +1, with 0 implying no correlation. Correlations of -1 or +1 imply an exactly monotonic relationship: positive correlations imply that as x increases, so does y; negative correlations imply that as x increases, y decreases.
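When there are no tied ranks, the Spearman correlation has the closed form

$$ \rho = 1 - \frac{6 \sum_{i} d_i^2}{n(n^2 - 1)}, $$

where $d_i$ is the difference between the ranks of $x_i$ and $y_i$, and $n$ is the number of observations.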
Chi-square similarity: read more here: https://towardsdatascience.com/statistics-in-python-using-chi-square-for-feature-selection-d44f467ca745 (a short sketch follows the Spearman example below)
from scipy.stats import spearmanr, kendalltau, pearsonr
R = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
P = ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
spearmanr(R, P)  # both inputs are ranked internally ('A'/'B' sort lexicographically); returns (correlation, p-value)
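As a companion to the chi-square link above, a minimal sketch of a chi-square test of independence; the 2x2 contingency table here is made up purely for illustration:

# Hypothetical contingency table: rows = feature present/absent, columns = class A/B
from scipy.stats import chi2_contingency
table = [[30, 10],
         [20, 40]]
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)  # a small p-value suggests the feature and the class are associated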
Haversine Distance (Non-Euclidean)
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great-circle distance in kilometers between two points
    on the Earth (specified in decimal degrees).
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # mean Earth radius in kilometers; use 3956 for miles (determines return-value units)
    return c * r
Computing the Distance from FMIPA UI Depok to FST UIN Ciputat
- Open https://www.latlong.net/ or https://www.gps-coordinates.net/
- Record the latitude and longitude of each location
UI = [-6.379800, 106.807083]
UIN = [-6.306384108601761, 106.75270199775696]
jarak = haversine(UI[1], UI[0], UIN[1], UIN[0])  # note the (lon, lat) argument order
print("Distance = {} km".format(jarak))          # roughly 10 km
Distance for Structured Data
Example application of distance in Statistics/Machine Learning: k-Nearest Neighbour
- The simplest classifier, yet it can also be used for regression (and even clustering).
- Often called an instance-based learner.
- It has no "equation"; the approach is algorithmic, built on the concept of distance/similarity.
- Similar in concept to DBSCAN.
k-NN Neighbourhood Size & Weights
- Uniform: all points in each neighborhood are weighted equally.
- Distance: closer neighbors of a query point have a greater influence than neighbors further away (see the sketch below).
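A minimal NumPy sketch of the idea (illustrative only; knn_predict is a toy helper defined here, and in practice we use scikit-learn as in the code further below):

import numpy as np

def knn_predict(X_train, y_train, x, k=3, weights='uniform'):
    # Distances from the query point x to every training point
    d = np.linalg.norm(X_train - x, axis=1)
    idx = np.argsort(d)[:k]  # indices of the k nearest neighbours
    # Uniform weights, or inverse-distance weights for the 'distance' scheme
    w = np.ones(k) if weights == 'uniform' else 1.0 / (d[idx] + 1e-12)
    labels = y_train[idx]
    # (Weighted) majority vote over the neighbours' labels
    return max(set(labels), key=lambda c: w[labels == c].sum())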
Pros and Cons
- Pros:
  - Relatively fast (efficient) for data that is not too large
  - Simple and easy to implement
  - Easy to modify, with many choices of distance/similarity formulas
  - Handles multiclass data easily
  - Reasonably accurate if the data is representative
- Cons:
  - Finding the nearest neighbours is inefficient for large data
  - The entire training data must be stored
  - An appropriate distance formula must be chosen carefully
Application in Python
Simple Classification Case 01: Classifying Iris Flower Species
- The Iris flower data serves as a simple classification case study.
- Data link: https://archive.ics.uci.edu/ml/datasets/iris
- Source paper: Fisher, R.A. "The use of multiple measurements in taxonomic problems", Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
- The classification problem is to identify the Iris species from the flower's shape (e.g., petal and sepal length and width).
# Load the Iris flower data
import seaborn as sns
data = sns.load_dataset("iris")
print(data.shape)
data.sample(5)
data['species'] = data['species'].astype('category')
print(data.info())
# We build a new dataframe; be careful if the data is large (this copies the data)
X = data[['sepal_length','sepal_width','petal_length','petal_width']]
Y = data['species']
X.shape, Y.shape
from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=0.3, random_state=0)
xTrain.shape, xTest.shape
# k-NN: http://scikit-learn.org/stable/modules/neighbors.html
from sklearn import neighbors
n_neighbors = 3
weights = 'distance'  # inverse-distance weighting ('uniform' is the alternative)
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
kNN.fit(xTrain, yTrain)
print('Done!')
# Predict with k-NN
y_kNN1 = kNN.predict(xTest)
y_kNN1[-10:]
from sklearn.metrics import confusion_matrix, classification_report
print("Kasus - Bunga Iris: kNN")
print(confusion_matrix(yTest, y_kNN1))
print(classification_report(yTest, y_kNN1))
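One way to choose n_neighbors and the weighting scheme, rather than fixing them by hand, is cross-validation. A minimal sketch building on the variables above:

# Grid-search over k and the weighting scheme with 5-fold cross-validation
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': list(range(1, 16)), 'weights': ['uniform', 'distance']}
grid = GridSearchCV(neighbors.KNeighborsClassifier(), param_grid, cv=5)
grid.fit(xTrain, yTrain)
print(grid.best_params_, grid.best_score_)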