Distance & Similarity


 

“There is no Royal Road to Geometry.” — Euclid






KompStat-Distance and Similarities

Distance and Similarity

Outline Similarity and Distance:

  • Pengertian/Pendahuluan Distance VS Similarity
  • Distances + Similarity for Numerical Data
  • Distance for Categorical Variables
  • Distance/Similarity in Statistics
  • Distance/Similarity for Text Data: Vector Space
  • Distance in Non-Euclidean Space: Haversine
  • Applications: Hierarchical Clustering & k-NN

Distance and Structures in Data

  • Data semakin berkembang baik dari segi format maupun kuantitasnya.
  • Salah satu kebutuhan dari data yang berjumlah besar ini adalah menemukan pola yang dapat digunakan untuk menghasilkan informasi/insight atau bahkan melakukan prediksi.
  • Menentukan rumus jarak yang baik dapat menentukan performa model (prediksi).
  • Contoh pencarian pola di Sttaistika adalah Principal Component Analysis (PCA) atau analisa pengelompokan (Clustering)

Pengertian Dasar

  • Intuitively, distance functions are mathematical functions that assign a numerical value (their distance) to each pair of objects in a given domain.

  • This numerical value represents an assessment of how similar they are: two very similar objects would be assigned a very low distance, and two very dissimilar objects would be assigned a larger distance.

  • Similarity functions are the complementary idea, and assign high similarity values to similar objects, and low values to dissimilar pairs of objects.

Sejarah/Taksonomi Jarak (Distance)

Definition 01 Distance/Metric:

Apa arti/konsekuensinya?

Contoh metric/Distance untuk Numerical data: Jarak Minkowsky

  • When p=1 we have the Manhattan distance,
  • when p=2 we have the Euclidean distance, and
  • when p=∞ it converges to the Chebyshev distance

Excellent article: https://towardsdatascience.com/17-types-of-similarity-and-dissimilarity-measures-used-in-data-science-3eb914d2681

Minkowsky untuk berbagai nilai "p"

Realworld Ilustration

In [1]:
# Aplikasi di Python
# Importing Modules untuk Notebook ini
import warnings; warnings.simplefilter('ignore')
import matplotlib.pyplot as plt, numpy as np
from scipy.spatial import distance as dist

"Done"
Out[1]:
'Done'
In [2]:
dist.minkowski([1, 0, 0], [0, 1, 0], 1)
Out[2]:
2.0
In [3]:
dist.minkowski([1, 0, 0], [0, 1, 0], 2)
Out[3]:
1.4142135623730951
In [4]:
dist.minkowski([1, 0, 0], [0, 1, 0], 3)
Out[4]:
1.2599210498948732

Contoh Distance lain: Canberra Distance

It is a weighted version of manhattan distance used in Clustering, like Fuzzy Clustering, classification, computer security, and ham/spam detection systems. It is more robust to outliers in contrast to the previous metric.

In [5]:
# Python implemementation

dist.canberra([1, 0, 0], [0, 1, 0])
Out[5]:
2.0

Definition 02 Similarity:

Hati-hati ... Similarity != Distance

  • a similarity function is the complementary concept to a distance function
  • Bayangkan similarity sebagai "interval" dan distance sebagai "ratio"

Contoh Cosine Similarity:

  • This metric is widely used in text mining, natural language processing, and information retrieval systems. For instance, it can be used to measure the similarity between two given documents. It can also be used to identify spam messages based on the length of the message.

  • Perhatikan cosine similarity pada dasarnya inner-product. Secara matematis berarti kita hanya memperhatikan persamaan di elemen vektornya.

In [26]:
dist.cosine([1, 1, 0], [0, 1, 0])
Out[26]:
0

Hati-hati similarity dan Dissimilarity

In [27]:
dist.cosine([1, 1, 1], [1, 1, 1])
Out[27]:
0

Jaccard Similarity

In [8]:
# Implementasi Python
dist.jaccard([1, 1, 0], [0, 1, 0])
Out[8]:
0.5

Implementasi di Text Document

In [41]:
from sklearn.feature_extraction.text import CountVectorizer

a = 'halo selamat pagi petang atau senja'
b = 'hai hai selamat siang sore dan malam'

data = [a, b]
vectorizer = CountVectorizer(binary = False)
vsm = vectorizer.fit_transform(data)
vsm.toarray()
Out[41]:
array([[1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0],
       [0, 1, 2, 0, 1, 0, 0, 1, 0, 1, 1]], dtype=int64)
In [10]:
print(vsm.shape)
print(vsm[0].data) 
print(vsm[0].indices) 
vectorizer.vocabulary_
(2, 5)
[1 1 1]
[1 3 2]
Out[10]:
{'halo': 1, 'selamat': 3, 'pagi': 2, 'hai': 0, 'siang': 4}
In [44]:
dist.cosine(vsm.toarray()[0,:], vsm.toarray()[1,:])
Out[44]:
0.8639172365120457
In [43]:
dist.jaccard(vsm.toarray()[0,:], vsm.toarray()[1,:])
Out[43]:
0.9090909090909091

Korelasi Pearson

In [45]:
import pandas as pd

data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138, 142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
Out[45]:
usia tekanan_darah
0 40 126
1 45 124
2 50 135
3 53 138
4 60 142
5 65 139
6 69 140
7 71 151
In [14]:
# Korelasi dan Scatter Plot untuk melihat datanya
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Covariance =  76.953125
Correlations =  [[1.         0.88746015]
 [0.88746015 1.        ]]

Spearman correlation

  • Like Pearson correlation, Spearman correlation is used whenever we are dealing with bivariate analysis. However, unlike Pearson correlation, Spearman correlation is used when both variables are rank-ordered.

  • It can be used for both categorical and numerical attributes.

  • The Spearman correlation is a nonparametric measure of the linear relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply a monotonic relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases.

In [15]:
from scipy.stats import spearmanr, kendalltau, pearsonr

R = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]

P = ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']

spearmanr(R, P)
Out[15]:
SpearmanrResult(correlation=-0.17407765595569785, pvalue=0.6305360755569764)

Haversine Distance (Non-Euclidean)

In [16]:
from math import radians, cos, sin, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance in kilometers between two points 
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles. Determines return value units.
    return c * r

Menghitung Jarak dari FMIPA UI Depok ke FST UIN Ciputat

In [17]:
UI = [-6.379800, 106.807083]
UIN = [-6.306384108601761, 106.75270199775696]
jarak = haversine(UI[1], UI[0], UIN[1], UIN[0])

print("Jarak = {}".format(jarak))
Jarak = 10.137104668054363

Jarak (Distance) untuk Data Terstruktur

Contoh aplikasi penggunaan jarak di Statistik/Machine Learning: k-Nearest Neighbour

  • Classifier yang paling sederhana, namun dapat juga digunakan untuk regresi (dan bahkan clustering).
  • Sering disebut sebagai Instance based Learner
  • Tidak memiliki "persamaan", pendekatannya lebih ke algoritmik berdasarkan konsep jarak/similarity
  • Mirip konsep DBSCAN
  • </ul>

k-NN Neighbour Size & Weights

  • Uniform: all points in each neighborhood are weighted equally.
  • Distance: closer neighbors of a query point have a greater influence than the neighbors further away.

Kelebihan dan Kekurangan

Pros:
  • Relatif cepat (efisien) untuk data yang tidak terlalu besar
  • Sederhana, mudah untuk diimplementasikan
  • Mudah untuk di modifikasi: Berbagai macam formula jarak/similaritas
  • Menangani data Multiclass dengan mudah
  • Akurasi cukup baik jika data representatif
Cons:
  • Menemukan nearest neighbours tidak efisien untuk data besar
  • Storage of data
  • Meyakinkan rumus jarak yang tepat

Aplikasi di Python

Kasus Sederhana Klasifikasi 01: Klasifikasi Spesies Bunga Iris

  • Data klasifikasi bunga Iris sebagai studi kasus sederhana
  • Link data: https://archive.ics.uci.edu/ml/datasets/iris
  • Paper sumber data: Fisher,R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
  • Masalah klasifikasinya adalah mengklasifikasikan jenis Bunga Iris berdasarkan bentuk (e.g. panjang dan lebar) bunga.

In [18]:
# Load data Bunga Iris
import seaborn as sns

data = sns.load_dataset("iris")
print(data.shape)
data.sample(5)
(150, 5)
Out[18]:
sepal_length sepal_width petal_length petal_width species
86 6.7 3.1 4.7 1.5 versicolor
64 5.6 2.9 3.6 1.3 versicolor
8 4.4 2.9 1.4 0.2 setosa
13 4.3 3.0 1.1 0.1 setosa
29 4.7 3.2 1.6 0.2 setosa
In [19]:
data['species'] = data['species'].astype('category')
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   sepal_length  150 non-null    float64 
 1   sepal_width   150 non-null    float64 
 2   petal_length  150 non-null    float64 
 3   petal_width   150 non-null    float64 
 4   species       150 non-null    category
dtypes: category(1), float64(4)
memory usage: 5.1 KB
None
In [20]:
# Kita membuat dataframe baru, hati-hati jika datanya besar.
X = data[['sepal_length','sepal_width','petal_length','petal_width']]
Y = data['species']
X.shape, Y.shape
Out[20]:
((150, 4), (150,))
In [21]:
from sklearn.model_selection import train_test_split

xTrain, xTest, yTrain, yTest = train_test_split(X, Y, test_size=0.3, random_state=0)

xTrain.shape, xTest.shape
Out[21]:
((105, 4), (45, 4))
In [22]:
# k-NN: http://scikit-learn.org/stable/modules/neighbors.html
from sklearn import neighbors

n_neighbors = 3
weights = 'distance'
kNN = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
kNN.fit(xTrain, yTrain)
print('Done!')
Done!
In [23]:
# Prediksi dengan k-NN
y_kNN1 = kNN.predict(xTest)
y_kNN1[-10:]
Out[23]:
array(['versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'virginica', 'setosa', 'virginica', 'setosa', 'setosa'],
      dtype=object)
In [24]:
from sklearn.metrics import confusion_matrix, classification_report


print("Kasus - Bunga Iris: kNN")
print(confusion_matrix(yTest, y_kNN1))
print(classification_report(yTest, y_kNN1))
Kasus - Bunga Iris: kNN
[[16  0  0]
 [ 0 17  1]
 [ 0  0 11]]
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        16
  versicolor       1.00      0.94      0.97        18
   virginica       0.92      1.00      0.96        11

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

Akhir Modul Similarity and Distance


No comments:

Post a Comment

Relevant & Respectful Comments Only.