Exploratory Data Analysis (EDA) ~ Anomaly & Outlier ¶
Outline: ¶
- Outliers and Anomalies
- Outlier Detection: Univariate & Multivariate
- Outliers in Regression
- Anomaly Detection
What are Outliers? ¶
Outlier vs Noise ¶
Mild and Extreme Outliers ¶
- Mild Outlier: an observation that lies more than 1.5 times the interquartile range (IQR), but less than 3 times the IQR, beyond the 3rd quartile (75th percentile) or below the 1st quartile (25th percentile).
In general, in a set of values such as {2, 6, 11, 13, 14, 15, 16, 17}, a value is a mild outlier if it is less than $Q_1 - 1.5(IQR)$ OR greater than $Q_3 + 1.5(IQR)$.
- Extreme Outlier: an observation that lies more than 3 times the IQR beyond the 3rd quartile (75th percentile) or below the 1st quartile (25th percentile).
A value is an extreme outlier if it is less than $Q_1 - 3(IQR)$ OR greater than $Q_3 + 3(IQR)$; a minimal sketch of these fences follows below.
- The analogous idea applies when using confidence intervals.
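To make the fences concrete, here is a minimal Python sketch (it reuses the example set above; numpy's default quartile interpolation is an assumption, any quartile convention works):
# A minimal sketch of the mild/extreme outlier fences, reusing the example set above
import numpy as np
x = np.array([2, 6, 11, 13, 14, 15, 16, 17])
Q1, Q3 = np.percentile(x, [25, 75])
IQR = Q3 - Q1
print("mild fences   :", Q1 - 1.5 * IQR, "to", Q3 + 1.5 * IQR)
print("extreme fences:", Q1 - 3.0 * IQR, "to", Q3 + 3.0 * IQR)
# values outside the mild fences but inside the extreme fences are mild outliers;
# values outside the extreme fences are extreme outliers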
Handling Outliers ¶
- Extreme outliers make visualizations "stop working", while non-extreme outliers are often part of the information/insight.
- When modeling, some models are robust to outliers and some are not. Simple examples: k-Means vs k-Medoids, or decision-tree models vs OLS regression.
Trimming/Exclusion (Separate Analysis) ¶
- Be careful with this if the outliers are in fact the "data of interest", or at least carry valuable information.
Imputasi¶
- Do this with caution; in general it is not recommended. Only impute when there is a strong suspicion of a data-entry error.
Ignore¶
- As long as the model/visualization remains effective, leaving the outliers in keeps the analysis more natural.
Transformation ¶
- Commonly done for visualization or Deep Learning models, ... Used when our focus is the relationship between variables rather than their exact values (see the sketch below).
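As an illustration, a minimal sketch (both the values and the choice of a log transform are assumptions, not taken from the course data):
# A log transform compresses extreme values while preserving order,
# so relationships between variables stay visible (values here are assumed)
import numpy as np
prices = np.array([3.9e6, 5.4e6, 6.6e6, 7.1e6, 1.5e8]) # one extreme value
print(np.log1p(prices).round(2)) # log(1 + x); the extreme value no longer dominates the scale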
Outliers vs Anomalies ¶
- An outlier is an "extreme" value in our data.
- Statistically: if the data is assumed to come from some distribution, outliers are the observations with very small probability.
- Examples: the wealth of a handful of people in Indonesia (e.g. ...), or the IQ of William James Sidis, etc.
- The value is valid (not noise); the cause of an outlier is usually natural.
- Checking that an outlier is not noise can only be done through domain/business knowledge.
- The median is more robust to outliers than the mean (see the sketch after this list).
- When modeling (e.g., prediction), outliers are often excluded from training.
- However, in some cases we specifically want to model the outliers, e.g., Mean-Reversion with Jumps models in stochastic processes.
- An anomaly is a set of values in the data (usually plural) whose pattern/distribution differs from the other values in the data.
- Statistically, it is as if the data came from a different distribution.
- In time-series settings it may indicate "concept drift".
- In applications it may signal fraud, a cyber attack, terrorist activity, etc.
- https://www.senseon.io/blog/cyber-threats-evading-signatures-outlier-anomaly-or-both
- https://www.slideshare.net/ShantanuDeosthale/outlier-analysis-and-anomaly-detection
- An interesting discussion on why we should not conflate outliers and anomalies: https://datascience.stackexchange.com/questions/24760/what-is-the-difference-between-outlier-detection-and-anomaly-detection/36828#36828
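A tiny numeric demo of the robustness claim above (it reuses the example set from the Mild/Extreme section, plus one assumed extreme value):
# The median barely moves when one extreme outlier is added; the mean jumps
import numpy as np
x = np.array([2, 6, 11, 13, 14, 15, 16, 17])
x_out = np.append(x, 1000) # add one extreme (assumed) outlier
print(np.mean(x), np.median(x)) # 11.75 13.5
print(np.mean(x_out), np.median(x_out)) # ~121.56 vs 14.0: the mean explodes, the median barely moves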
Discussion: An example of an anomaly that is not an outlier & an outlier that is not an anomaly? ¶
Novelty in Data ¶
- Anomaly detection on real-time data processing (streaming data) is usually called "Novelty Detection".
- outlier detection: The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.
- novelty detection: The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.
- In some of the literature, outliers are treated as a subset/special case of anomalies (a minimal sklearn sketch of the two settings follows this list).
- https://scikit-learn.org/stable/modules/outlier_detection.html
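A minimal sketch of these two settings in scikit-learn using LocalOutlierFactor (the synthetic data here is an assumption; novelty=True is the documented switch between the two modes):
# Outlier detection vs novelty detection with LocalOutlierFactor (synthetic data assumed)
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
rng = np.random.RandomState(42)
X_train = rng.normal(0, 1, size=(200, 2)) # (mostly) clean training data
X_new = np.array([[0.1, 0.2], [5.0, 5.0]]) # a normal point and a novelty
# outlier detection: label the training data itself (-1 = outlier, 1 = inlier)
labels_train = LocalOutlierFactor(n_neighbors=20).fit_predict(X_train)
# novelty detection: novelty=True enables predict() on unseen observations
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
print(lof.predict(X_new)) # expected: [ 1 -1 ]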
Case Study ¶
- Data source: http://byebuyhome.com/
- Objective: find houses priced below market value for investment.
- Variables:
- Dist_Taxi – distance to nearest taxi stand from the property
- Dist_Market – distance to nearest grocery market from the property
- Dist_Hospital – distance to nearest hospital from the property
- Carpet – carpet area of the property in square feet
- Builtup – built-up area of the property in square feet
- Parking – type of car parking available with the property
- City_Category – categorization of the city based on the size
- Rainfall – annual rainfall in the area where property is located
- House_Price – price at which the property was sold
In [1]:
# Import some Python modules
import warnings; warnings.simplefilter('ignore')
import pandas as pd
file_ = 'data/price.csv'
try: # Running locally: make sure "file_" is inside the "data" folder
    price = pd.read_csv(file_, on_bad_lines='skip', low_memory=False, encoding='utf8') # on_bad_lines replaces the deprecated error_bad_lines (pandas >= 1.3)
except: # Running in Google Colab: download the data first
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/price.csv
    price = pd.read_csv(file_, on_bad_lines='skip', low_memory=False, encoding='utf8')
N, P = price.shape # data size
print('rows = ', N, ', columns (number of variables) = ', P)
print("Type of df = ", type(price))
price
rows =  936 , columns (number of variables) =  10
Type of df =  <class 'pandas.core.frame.DataFrame'>
Out[1]:
 | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
931 | 932 | 9297.0 | 12537.0 | 14418.0 | 1174.0 | 1429.0 | Covered | CAT C | 1110 | 5434000 |
932 | 933 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
933 | 934 | 9205.0 | 10418.0 | 14496.0 | 1118.0 | 1337.0 | Open | CAT A | 560 | 7227000 |
934 | 935 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
935 | 936 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
936 rows × 10 columns
image source: http://writer.lk/portfolio-item/statistics/¶
In [2]:
price.describe(include='all')
Out[2]:
 | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price
---|---|---|---|---|---|---|---|---|---|---
count | 936.000000 | 923.000000 | 923.000000 | 935.000000 | 928.000000 | 921.000000 | 936 | 936 | 936.000000 | 9.360000e+02 |
unique | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 3 | NaN | NaN |
top | NaN | NaN | NaN | NaN | NaN | NaN | Open | CAT B | NaN | NaN |
freq | NaN | NaN | NaN | NaN | NaN | NaN | 373 | 365 | NaN | NaN |
mean | 468.500000 | 8239.512459 | 11039.122427 | 13082.894118 | 1511.558190 | 1794.610206 | NaN | NaN | 786.730769 | 6.089048e+06 |
std | 270.344225 | 2561.188953 | 2565.058074 | 2586.507654 | 789.370074 | 467.395372 | NaN | NaN | 266.218109 | 5.015046e+06 |
min | 1.000000 | 146.000000 | 1666.000000 | 3227.000000 | 775.000000 | 932.000000 | NaN | NaN | -110.000000 | 3.000000e+04 |
25% | 234.750000 | 6481.500000 | 9366.000000 | 11308.000000 | 1318.000000 | 1583.000000 | NaN | NaN | 600.000000 | 4.661000e+06 |
50% | 468.500000 | 8233.000000 | 11166.000000 | 13179.000000 | 1481.000000 | 1775.000000 | NaN | NaN | 780.000000 | 5.879500e+06 |
75% | 702.250000 | 9967.000000 | 12688.500000 | 14848.000000 | 1653.500000 | 1982.000000 | NaN | NaN | 970.000000 | 7.187250e+06 |
max | 936.000000 | 20662.000000 | 20945.000000 | 23294.000000 | 24300.000000 | 12730.000000 | NaN | NaN | 1560.000000 | 1.500000e+08 |
Some Notes on Descriptive Statistics ¶
- The mode does not always exist.
- When is it more appropriate to use the mean, and when the median (outlier-wise)?
- Min/max can be used to detect noise/outliers (see the minimal check after this list: the Rainfall minimum of -110 above cannot be a real value).
- Distinguishing noise from an outlier can only be done through domain/business knowledge.
- Much of the literature refers to outliers as noise (outliers as a subset/special case of noise).
- Outliers/noise must be "handled" during preprocessing.
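A minimal check of the min/max point above on the data we loaded (deciding that these values are impossible is domain knowledge, not statistics):
# Min/max sanity check on the loaded data: impossible values indicate noise
print(price['Rainfall'].min()) # -110: negative annual rainfall is physically impossible -> noise
print(price['House_Price'].min()) # 30000: far below Q1 (~4.66M) -> suspicious outlier or noise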
Value distribution of each categorical variable ¶
In the module after this one we will examine this further through visualization ¶
- At this stage, the goal of looking at the distribution of categorical variables is part of preprocessing/data cleaning: checking whether the categorical variables contain noise (usually typos).
- If the categorical variable is the target variable and the class proportions differ strikingly, this step also helps prepare for imbalanced-learning modeling later on.
- Can be done via the "value_counts" function in Pandas or the "Counter" class in the collections module.
In [3]:
price['Parking'].value_counts()
Out[3]:
Open            373
Not Provided    230
Covered         188
No Parking      145
Name: Parking, dtype: int64
In [4]:
from collections import Counter
# Again: data structures matter. Counter returns a dictionary-like output that is usually more useful
Counter(price['Parking'])
Out[4]:
Counter({'Open': 373, 'Not Provided': 230, 'Covered': 188, 'No Parking': 145})
Normality Assumption, Confidence Intervals, & Outliers ¶
- E.g., a ~95% interval = $\bar{x}-2\sigma\leq X \leq \bar{x}+2\sigma$; values outside this interval are treated as outliers.
- E.g., a ~99.7% interval (the "3-sigma rule") = $\bar{x}-3\sigma\leq X \leq \bar{x}+3\sigma$; values outside this interval are treated as outliers.
- Which one should we use in the real world?
In [5]:
# Distributions: we start by importing the visualization modules
import matplotlib.pyplot as plt, seaborn as sns
sns.set(style="ticks", color_codes=True)
random_state = 99
plt.style.use('bmh'); sns.set() # visualization style
p = sns.histplot(price['House_Price'], kde=True) # histplot + rugplot replace the deprecated distplot
p = sns.rugplot(price['House_Price'])
# The outlier is quite clearly visible in the plot.
In [6]:
# E.g., assume the data is normally distributed & use a 95% confidence interval around the "House_Price" variable
normal_data = abs(price.House_Price - price.House_Price.mean())<=(2*price.House_Price.std()) # mu-2s<x<mu+2s
print(normal_data.shape, type(normal_data), set(normal_data))
Counter(normal_data)
(936,) <class 'pandas.core.series.Series'> {False, True}
Out[6]:
Counter({True: 935, False: 1})
In [7]:
price2 = price[normal_data] # data without the price outlier
print(price2.shape, price.shape)
# Note that the data with its outlier removed is deliberately
# stored in a new variable, "price2".
# If the data is large, be careful doing this (it duplicates the data in memory)
(935, 10) (936, 10)
In [8]:
# Distribution without the outlier
p = sns.histplot(price2['House_Price'], kde=True) # histplot + rugplot replace the deprecated distplot
p = sns.rugplot(price2['House_Price'])
Boxplots & Outliers ¶
- No (normality) distribution assumption is needed.
- Lower extreme: less than $Q_1 - 1.5(Q_3-Q_1)$; upper extreme: greater than $Q_3 + 1.5(Q_3-Q_1)$.
In [9]:
# With the outlier present the plot becomes unreadable (data = price, not price2)
# The insight obtained will be wrong, or there will be no insight at all
p = sns.boxplot(x="House_Price", data=price)
In [10]:
Q1 = price['House_Price'].quantile(0.25)
Q2 = price['House_Price'].quantile(0.50)
Q3 = price['House_Price'].quantile(0.75)
IQR = Q3 - Q1 #IQR is interquartile range.
print("Q1={}, Q3={}, IQR={}".format(Q1, Q3, IQR))
#outliers_bawah = (price['House_Price'] < (Q1 - 1.5 * IQR)) # lower outliers
#outliers_atas = (price['House_Price'] > (Q3 + 1.5 * IQR)) # upper outliers
#rumah_murah = price.loc[outliers_bawah] # underpriced houses
#rumah_kemahalan = price.loc[outliers_atas] # overpriced houses
no_outlier = (price['House_Price'] >= Q1 - 1.5 * IQR) & (price['House_Price'] <= Q3 + 1.5 *IQR)
price3 = price[no_outlier]
print(price3.shape)
price3.head()
Q1=4661000.0, Q3=7187250.0, IQR=2526250.0
(933, 10)
Out[10]:
 | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
In [11]:
p = sns.boxplot(x="House_Price", data=price3)
Discussion: When should we use the CI approach, and when the boxplot? ¶
DBSCAN: Multivariate Anomaly Detection ¶
- DBSCAN clusters points that lie in dense regions and labels any point that belongs to no dense region as noise (-1). Because of this way of working, DBSCAN is often used for (multivariate) anomaly detection.
In [12]:
df = sns.load_dataset("iris")
X = df[['sepal_length','sepal_width','petal_length','petal_width']].values
C = df['species'].values
print(X.shape)
df.head()
(150, 4)
Out[12]:
 | sepal_length | sepal_width | petal_length | petal_width | species
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
In [13]:
# DBSCAN http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
# does not require the number of clusters k as an input parameter!... very useful for clustering large data
from sklearn import cluster
import numpy as np
from collections import Counter
dbscan = cluster.DBSCAN(eps=0.75, min_samples=5, metric='euclidean')
dbscan.fit(X)
C_db = dbscan.labels_
fig, ax = plt.subplots(figsize=(6,4))
p = sns.countplot(x=C_db.astype(str), ax=ax, dodge=False) # astype(str): np.object was removed in NumPy >= 1.24
print(C_db[:10])
# what does the cluster label -1 mean?
plt.show()
[0 0 0 0 0 0 0 0 0 0]
In [14]:
Counter(C_db)
Out[14]:
Counter({0: 50, 1: 98, -1: 2})
In [15]:
df['outlier'] = C_db
g = sns.pairplot(df, hue="outlier")
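A minimal follow-up sketch: the rows DBSCAN labeled -1 are the detected anomalies and can be inspected directly:
# Inspect the observations DBSCAN flagged as noise (cluster label -1)
outlier_rows = df[df['outlier'] == -1]
print(outlier_rows)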
An Example of Outlier Influence in Regression: the Difference in Fits Method ¶
In [16]:
import seaborn as sns
anscombe = sns.load_dataset("anscombe")
lm = sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
ci=None, scatter_kws={"s": 80});
fig = lm.fig
fig.suptitle("With Outlier", fontsize=12)
# In the case of such outliers, we can use the parameter robust=True.
# This downweights the outlier when estimating the fit, so the line represents the other data points better.
lm2 = sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
robust=True, ci=None, scatter_kws={"s": 80});
fig2 = lm2.fig
fig2.suptitle('With Outlier Excluded', fontsize=12)
Out[16]:
Text(0.5, 0.98, 'With Outlier Excluded')
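The section title mentions the Difference in Fits (DFFITS) method, which the plots above do not compute explicitly. Here is a minimal sketch on the same Anscombe III data (computing DFFITS via statsmodels' get_influence() is an assumption about how one would do it here):
# A minimal DFFITS sketch on the same Anscombe III data (statsmodels assumed available)
import seaborn as sns
import statsmodels.api as sm
d3 = sns.load_dataset("anscombe").query("dataset == 'III'")
res = sm.OLS(d3["y"], sm.add_constant(d3["x"])).fit()
dffits, cutoff = res.get_influence().dffits # DFFITS per observation + rule-of-thumb cutoff 2*sqrt(k/n)
print("cutoff =", round(cutoff, 3))
print(d3[abs(dffits) > cutoff]) # the observation(s) that distort the non-robust fit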
Discuss the simple example above carefully ¶
Module PyOD: Python Outlier Detection ¶
- pip install pyod
- Linear Models for Outlier Detection:
- PCA: Principal Component Analysis (uses the sum of weighted projected distances to the eigenvector hyperplanes as the outlier scores)
- MCD: Minimum Covariance Determinant (uses the Mahalanobis distances as the outlier scores)
- OCSVM: One-Class Support Vector Machines
- Proximity-Based Outlier Detection Models:
- LOF: Local Outlier Factor
- CBLOF: Clustering-Based Local Outlier Factor
- kNN: k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score)
- Median kNN Outlier Detection (use the median distance to k nearest neighbors as the outlier score)
- HBOS: Histogram-based Outlier Score
- Probabilistic Models for Outlier Detection:
- ABOD: Angle-Based Outlier Detection
- Outlier Ensembles and Combination Frameworks
- Isolation Forest
- Feature Bagging
- LSCP
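All PyOD detectors share the same fit/score interface; before the larger demo below, a minimal sketch with IForest (the synthetic data comes from PyOD's own generate_data; the parameter values are assumptions):
# The common PyOD pattern, shown with Isolation Forest (any detector above works the same way)
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data
X_train, y_train = generate_data(n_train=200, train_only=True, n_features=2)
clf = IForest(contamination=0.1).fit(X_train)
print(clf.labels_[:10]) # binary labels on the training data: 0 = inlier, 1 = outlier
print(clf.decision_scores_[:10]) # raw outlier scores (higher = more abnormal)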
In [17]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.font_manager
from pyod.models.abod import ABOD
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data, get_outliers_inliers
random_state = np.random.RandomState(50)
#generate random data with two features
X_train, Y_train = generate_data(n_train=200,train_only=True, n_features=2)
# by default, generate_data uses an outlier fraction (contamination) of 0.1
contamination = 0.1
outlier_fraction = contamination # keep the plotted score threshold consistent with the generated data
# store outliers and inliers in different numpy arrays
x_outliers, x_inliers = get_outliers_inliers(X_train,Y_train)
n_inliers = len(x_inliers)
n_outliers = len(x_outliers)
#separate the two features and use it to plot the data
F1 = X_train[:,[0]].reshape(-1,1)
F2 = X_train[:,[1]].reshape(-1,1)
# create a meshgrid
xx , yy = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))
# scatter plot
plt.scatter(F1,F2)
plt.xlabel('F1')
plt.ylabel('F2')
plt.title('Outliers')
classifiers = {
'Angle-based Outlier Detector (ABOD)' : ABOD(contamination=contamination),
# 'Isolation Forest': IForest(contamination=contamination),
'K Nearest Neighbors (KNN)' : KNN(contamination=contamination)
}
#set the figure size
plt.figure(figsize=(10, 10))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the dataset to the model
    clf.fit(X_train)
    # predict raw anomaly scores
    scores_pred = clf.decision_function(X_train) * -1
    # prediction of a datapoint's category: outlier or inlier
    y_pred = clf.predict(X_train)
    # number of errors in the prediction
    n_errors = (y_pred != Y_train).sum()
    print('No of Errors : ', clf_name, n_errors)
    # the rest of the code creates the visualization
    # threshold value to consider a datapoint an inlier or an outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)
    # decision_function calculates the raw anomaly score for every grid point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)
    subplot = plt.subplot(1, 2, i + 1)
    # fill a blue colormap from the minimum anomaly score to the threshold value
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10), cmap=plt.cm.Blues_r)
    # draw a red contour line where the anomaly score equals the threshold
    a = subplot.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange where the anomaly score ranges from the threshold to the maximum
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    # scatter plot of inliers with white dots
    b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1], c='white', s=20, edgecolor='k')
    # scatter plot of outliers with black dots
    c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1], c='black', s=20, edgecolor='k')
    subplot.axis('tight')
    subplot.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=10),
        loc='lower right')
    subplot.set_title(clf_name)
    subplot.set_xlim((-10, 10))
    subplot.set_ylim((-10, 10))
plt.show()
No of Errors :  Angle-based Outlier Detector (ABOD) 20
No of Errors :  K Nearest Neighbors (KNN) 35
Novelty Detection Summary ¶
- OneClassSVM is known to be sensitive to outliers and therefore does not perform well for outlier detection. This estimator is best suited to novelty detection, when the training set is not contaminated by outliers.
- sklearn.covariance.EllipticEnvelope assumes the data is Gaussian and learns an ellipse. Its performance therefore degrades when the data is not unimodal. Note, however, that this estimator is robust to outliers.
- IsolationForest and LocalOutlierFactor seem to perform reasonably well on multi-modal data sets. The advantage of LocalOutlierFactor over the other estimators shows on the third data set, where the two modes have different densities. This advantage is explained by the local aspect of LOF: it only compares the abnormality score of a sample with the scores of its neighbors.
- Quoted from: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html#sphx-glr-auto-examples-miscellaneous-plot-anomaly-comparison-py
- Source: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html#comparing-anomaly-detection-algorithms-for-outlier-detection-on-toy-datasets
In [18]:
# Outlier Detection menggunakan Scikit Learn: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html#sphx-glr-auto-examples-miscellaneous-plot-anomaly-comparison-py
import time, matplotlib, matplotlib.pyplot as plt, numpy as np
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs, make_moons
from sklearn.ensemble import IsolationForest
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
matplotlib.rcParams["contour.negative_linestyle"] = "solid"
# Example settings
n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers
anomaly_algorithms = [
( "Robust covariance",
EllipticEnvelope(contamination=outliers_fraction, random_state=42),),
("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.1)),
( "One-Class SVM (SGD)",
make_pipeline(
Nystroem(gamma=0.1, random_state=42, n_components=150),
SGDOneClassSVM(
nu=outliers_fraction,
shuffle=True,
fit_intercept=True,
random_state=42,
tol=1e-6,),),),
( "Isolation Forest",
IsolationForest(contamination=outliers_fraction, random_state=42),),
( "Local Outlier Factor",
LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction),),]
# Define datasets
blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5, **blobs_params)[0],
make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5], **blobs_params)[0],
make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, 0.3], **blobs_params)[0],
4.0 * ( make_moons(n_samples=n_samples, noise=0.05, random_state=0)[0]
- np.array([0.5, 0.25])), 14.0 * (np.random.RandomState(42).rand(n_samples, 2) - 0.5),]
# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 150), np.linspace(-7, 7, 150))
plt.figure(figsize=(len(anomaly_algorithms) * 2 + 4, 12.5))
plt.subplots_adjust(left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01)
plot_num = 1
rng = np.random.RandomState(42)
for i_dataset, X in enumerate(datasets):
    # Add outliers
    X = np.concatenate([X, rng.uniform(low=-6, high=6, size=(n_outliers, 2))], axis=0)
    for name, algorithm in anomaly_algorithms:
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        plt.subplot(len(datasets), len(anomaly_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)
        if name == "Local Outlier Factor":
            y_pred = algorithm.fit_predict(X)
        else:
            y_pred = algorithm.fit(X).predict(X)
        # plot the level lines and the points
        if name != "Local Outlier Factor":  # LOF does not implement predict
            Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors="black")
        colors = np.array(["#377eb8", "#ff7f00"])
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])
        plt.xlim(-7, 7)
        plt.ylim(-7, 7)
        plt.xticks(())
        plt.yticks(())
        plt.text(0.99, 0.01,
                 ("%.2fs" % (t1 - t0)).lstrip("0"),
                 transform=plt.gca().transAxes,
                 size=15,
                 horizontalalignment="right")
        plot_num += 1
plt.show()