Exploratory Data Analysis (EDA) ~ Anomaly & Outlier ¶
Outline: ¶
- Outliers and Anomalies
- Outlier Detection: Univariate & Multivariate
- Outliers in Regression
- Anomaly Detection
What are Outliers? ¶
Outlier vs Noise ¶
Mild and Extreme Outliers ¶
- Mild Outlier: an observation that lies more than 1.5 times the interquartile range (IQR), but less than 3 times the IQR, beyond the 3rd quartile (75th percentile) or below the 1st quartile (25th percentile).
In general, in a set of values such as {2, 6, 11, 13, 14, 15, 16, 17}, a value is a mild outlier if it is less than $Q_1 - 1.5(IQR)$ OR greater than $Q_3 + 1.5(IQR)$.
- Extreme Outlier: an observation that lies more than 3 times the IQR beyond the 3rd quartile (75th percentile) or below the 1st quartile (25th percentile).
A value is an extreme outlier if it is less than $Q_1 - 3(IQR)$ OR greater than $Q_3 + 3(IQR)$; a minimal sketch of these fences follows below.
- The analogous idea applies when using confidence intervals.
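To make the fences concrete, here is a minimal Python sketch (it reuses the example set above; numpy's default quartile interpolation is an assumption, any quartile convention works):
# A minimal sketch of the mild/extreme outlier fences, reusing the example set above
import numpy as np
x = np.array([2, 6, 11, 13, 14, 15, 16, 17])
Q1, Q3 = np.percentile(x, [25, 75])
IQR = Q3 - Q1
print("mild fences   :", Q1 - 1.5 * IQR, "to", Q3 + 1.5 * IQR)
print("extreme fences:", Q1 - 3.0 * IQR, "to", Q3 + 3.0 * IQR)
# values outside the mild fences but inside the extreme fences are mild outliers;
# values outside the extreme fences are extreme outliers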
Handling Outliers ¶
- Extreme outliers make visualizations "stop working", while non-extreme outliers are often part of the information/insight.
- When modeling, some models are robust to outliers and some are not. Simple examples: k-Means vs k-Medoids, or decision-tree models vs OLS regression.
Trimming/Exclusion (Separate Analysis) ¶
- Be careful with this if the outliers are in fact the "data of interest", or at least carry valuable information.
Imputasi¶
- Do this with caution; in general it is not recommended. Only impute when there is a strong suspicion of a data-entry error.
Ignore¶
- As long as the model/visualization remains effective, leaving the outliers in keeps the analysis more natural.
Transformation ¶
- Commonly done for visualization or Deep Learning models, ... Used when our focus is the relationship between variables rather than their exact values (see the sketch below).
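As an illustration, a minimal sketch (both the values and the choice of a log transform are assumptions, not taken from the course data):
# A log transform compresses extreme values while preserving order,
# so relationships between variables stay visible (values here are assumed)
import numpy as np
prices = np.array([3.9e6, 5.4e6, 6.6e6, 7.1e6, 1.5e8]) # one extreme value
print(np.log1p(prices).round(2)) # log(1 + x); the extreme value no longer dominates the scale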
Outliers vs Anomalies ¶
- An outlier is an "extreme" value in our data.
- Statistically: if the data is assumed to come from some distribution, outliers are the observations with very small probability.
- Examples: the wealth of a handful of people in Indonesia (e.g. ...), or the IQ of William James Sidis, etc.
- The value is valid (not noise); the cause of an outlier is usually natural.
- Checking that an outlier is not noise can only be done through domain/business knowledge.
- The median is more robust to outliers than the mean (see the sketch after this list).
- When modeling (e.g., prediction), outliers are often excluded from training.
- However, in some cases we specifically want to model the outliers, e.g., Mean-Reversion with Jumps models in stochastic processes.
- An anomaly is a set of values in the data (usually plural) whose pattern/distribution differs from the other values in the data.
- Statistically, it is as if the data came from a different distribution.
- In time-series settings it may indicate "concept drift".
- In applications it may signal fraud, a cyber attack, terrorist activity, etc.
- https://www.senseon.io/blog/cyber-threats-evading-signatures-outlier-anomaly-or-both
- https://www.slideshare.net/ShantanuDeosthale/outlier-analysis-and-anomaly-detection
- An interesting discussion on why we should not conflate outliers and anomalies: https://datascience.stackexchange.com/questions/24760/what-is-the-difference-between-outlier-detection-and-anomaly-detection/36828#36828
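A tiny numeric demo of the robustness claim above (it reuses the example set from the Mild/Extreme section, plus one assumed extreme value):
# The median barely moves when one extreme outlier is added; the mean jumps
import numpy as np
x = np.array([2, 6, 11, 13, 14, 15, 16, 17])
x_out = np.append(x, 1000) # add one extreme (assumed) outlier
print(np.mean(x), np.median(x)) # 11.75 13.5
print(np.mean(x_out), np.median(x_out)) # ~121.56 vs 14.0: the mean explodes, the median barely moves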
Discussion: An example of an anomaly that is not an outlier & an outlier that is not an anomaly? ¶
Novelty in Data ¶
- Anomaly detection on real-time data processing (streaming data) is usually called "Novelty Detection".
- outlier detection: The training data contains outliers which are defined as observations that are far from the others. Outlier detection estimators thus try to fit the regions where the training data is the most concentrated, ignoring the deviant observations.
- novelty detection: The training data is not polluted by outliers and we are interested in detecting whether a new observation is an outlier. In this context an outlier is also called a novelty.
- In some of the literature, outliers are treated as a subset/special case of anomalies (a minimal sklearn sketch of the two settings follows this list).
- https://scikit-learn.org/stable/modules/outlier_detection.html
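A minimal sketch of these two settings in scikit-learn using LocalOutlierFactor (the synthetic data here is an assumption; novelty=True is the documented switch between the two modes):
# Outlier detection vs novelty detection with LocalOutlierFactor (synthetic data assumed)
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
rng = np.random.RandomState(42)
X_train = rng.normal(0, 1, size=(200, 2)) # (mostly) clean training data
X_new = np.array([[0.1, 0.2], [5.0, 5.0]]) # a normal point and a novelty
# outlier detection: label the training data itself (-1 = outlier, 1 = inlier)
labels_train = LocalOutlierFactor(n_neighbors=20).fit_predict(X_train)
# novelty detection: novelty=True enables predict() on unseen observations
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
print(lof.predict(X_new)) # expected: [ 1 -1 ]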
Case Study ¶
- Data source: http://byebuyhome.com/
- Objective: find houses priced below market value for investment.
- Variables:
- Dist_Taxi – distance to nearest taxi stand from the property
- Dist_Market – distance to nearest grocery market from the property
- Dist_Hospital – distance to nearest hospital from the property
- Carpet – carpet area of the property in square feet
- Builtup – built-up area of the property in square feet
- Parking – type of car parking available with the property
- City_Category – categorization of the city based on the size
- Rainfall – annual rainfall in the area where property is located
- House_Price – price at which the property was sold
In [1]:
# Import some Python modules
import warnings; warnings.simplefilter('ignore')
import pandas as pd
file_ = 'data/price.csv'
try: # Running locally: make sure "file_" is inside the "data" folder
    price = pd.read_csv(file_, on_bad_lines='skip', low_memory=False, encoding='utf8') # on_bad_lines replaces the deprecated error_bad_lines (pandas >= 1.3)
except: # Running in Google Colab: download the data first
    !mkdir data
    !wget -P data/ https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/price.csv
    price = pd.read_csv(file_, on_bad_lines='skip', low_memory=False, encoding='utf8')
N, P = price.shape # data size
print('rows = ', N, ', columns (number of variables) = ', P)
print("Type of df = ", type(price))
price
rows =  936 , columns (number of variables) =  10
Type of df =  <class 'pandas.core.frame.DataFrame'>
Out[1]:
 | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
931 | 932 | 9297.0 | 12537.0 | 14418.0 | 1174.0 | 1429.0 | Covered | CAT C | 1110 | 5434000 |
932 | 933 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
933 | 934 | 9205.0 | 10418.0 | 14496.0 | 1118.0 | 1337.0 | Open | CAT A | 560 | 7227000 |
934 | 935 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
935 | 936 | 10915.0 | 17486.0 | 15964.0 | 1549.0 | 1851.0 | Not Provided | CAT C | 1220 | 7062000 |
936 rows × 10 columns
image source: http://writer.lk/portfolio-item/statistics/¶
In [2]:
price.describe(include='all')
Out[2]:
 | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price
---|---|---|---|---|---|---|---|---|---|---
count | 936.000000 | 923.000000 | 923.000000 | 935.000000 | 928.000000 | 921.000000 | 936 | 936 | 936.000000 | 9.360000e+02 |
unique | NaN | NaN | NaN | NaN | NaN | NaN | 4 | 3 | NaN | NaN |
top | NaN | NaN | NaN | NaN | NaN | NaN | Open | CAT B | NaN | NaN |
freq | NaN | NaN | NaN | NaN | NaN | NaN | 373 | 365 | NaN | NaN |
mean | 468.500000 | 8239.512459 | 11039.122427 | 13082.894118 | 1511.558190 | 1794.610206 | NaN | NaN | 786.730769 | 6.089048e+06 |
std | 270.344225 | 2561.188953 | 2565.058074 | 2586.507654 | 789.370074 | 467.395372 | NaN | NaN | 266.218109 | 5.015046e+06 |
min | 1.000000 | 146.000000 | 1666.000000 | 3227.000000 | 775.000000 | 932.000000 | NaN | NaN | -110.000000 | 3.000000e+04 |
25% | 234.750000 | 6481.500000 | 9366.000000 | 11308.000000 | 1318.000000 | 1583.000000 | NaN | NaN | 600.000000 | 4.661000e+06 |
50% | 468.500000 | 8233.000000 | 11166.000000 | 13179.000000 | 1481.000000 | 1775.000000 | NaN | NaN | 780.000000 | 5.879500e+06 |
75% | 702.250000 | 9967.000000 | 12688.500000 | 14848.000000 | 1653.500000 | 1982.000000 | NaN | NaN | 970.000000 | 7.187250e+06 |
max | 936.000000 | 20662.000000 | 20945.000000 | 23294.000000 | 24300.000000 | 12730.000000 | NaN | NaN | 1560.000000 | 1.500000e+08 |
Some Notes on Descriptive Statistics ¶
- The mode does not always exist.
- When is it more appropriate to use the mean, and when the median (outlier-wise)?
- Min/max can be used to detect noise/outliers (see the minimal check after this list: the Rainfall minimum of -110 above cannot be a real value).
- Distinguishing noise from an outlier can only be done through domain/business knowledge.
- Much of the literature refers to outliers as noise (outliers as a subset/special case of noise).
- Outliers/noise must be "handled" during preprocessing.
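A minimal check of the min/max point above on the data we loaded (deciding that these values are impossible is domain knowledge, not statistics):
# Min/max sanity check on the loaded data: impossible values indicate noise
print(price['Rainfall'].min()) # -110: negative annual rainfall is physically impossible -> noise
print(price['House_Price'].min()) # 30000: far below Q1 (~4.66M) -> suspicious outlier or noise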
Value distribution of each categorical variable ¶
In the module after this one we will examine this further through visualization ¶
- At this stage, the goal of looking at the distribution of categorical variables is part of preprocessing/data cleaning: checking whether the categorical variables contain noise (usually typos).
- If the categorical variable is the target variable and the class proportions differ strikingly, this step also helps prepare for imbalanced-learning modeling later on.
- Can be done via the "value_counts" function in Pandas or the "Counter" class in the collections module.
In [3]:
price['Parking'].value_counts()
Out[3]:
Open            373
Not Provided    230
Covered         188
No Parking      145
Name: Parking, dtype: int64
In [4]:
from collections import Counter
# Again: data structures matter. Counter returns a dictionary-like output that is usually more useful
Counter(price['Parking'])
Out[4]:
Counter({'Open': 373, 'Not Provided': 230, 'Covered': 188, 'No Parking': 145})
Normality Assumption, Confidence Intervals, & Outliers ¶
- E.g., a ~95% interval = $\bar{x}-2\sigma\leq X \leq \bar{x}+2\sigma$; values outside this interval are treated as outliers.
- E.g., a ~99.7% interval (the "3-sigma rule") = $\bar{x}-3\sigma\leq X \leq \bar{x}+3\sigma$; values outside this interval are treated as outliers.
- Which one should we use in the real world?
In [5]:
# Distributions: we start by importing the visualization modules
import matplotlib.pyplot as plt, seaborn as sns
sns.set(style="ticks", color_codes=True)
random_state = 99
plt.style.use('bmh'); sns.set() # visualization style
p = sns.histplot(price['House_Price'], kde=True) # histplot + rugplot replace the deprecated distplot
p = sns.rugplot(price['House_Price'])
# The outlier is quite clearly visible in the plot.
In [6]:
# E.g., assume the data is normally distributed & use a 95% confidence interval around the "House_Price" variable
normal_data = abs(price.House_Price - price.House_Price.mean())<=(2*price.House_Price.std()) # mu-2s<x<mu+2s
print(normal_data.shape, type(normal_data), set(normal_data))
Counter(normal_data)
(936,) <class 'pandas.core.series.Series'> {False, True}
Out[6]:
Counter({True: 935, False: 1})
In [7]:
price2 = price[normal_data] # data without the price outlier
print(price2.shape, price.shape)
# Note that the data with its outlier removed is deliberately
# stored in a new variable, "price2".
# If the data is large, be careful doing this (it duplicates the data in memory)
(935, 10) (936, 10)
In [8]:
# Distribution without the outlier
p = sns.histplot(price2['House_Price'], kde=True) # histplot + rugplot replace the deprecated distplot
p = sns.rugplot(price2['House_Price'])
Boxplots & Outliers ¶
- No (normality) distribution assumption is needed.
- Lower extreme: less than $Q_1 - 1.5(Q_3-Q_1)$; upper extreme: greater than $Q_3 + 1.5(Q_3-Q_1)$.
In [9]:
# With the outlier present the plot becomes unreadable (data = price, not price2)
# The insight obtained will be wrong, or there will be no insight at all
p = sns.boxplot(x="House_Price", data=price)
In [10]:
Q1 = price['House_Price'].quantile(0.25)
Q2 = price['House_Price'].quantile(0.50)
Q3 = price['House_Price'].quantile(0.75)
IQR = Q3 - Q1 #IQR is interquartile range.
print("Q1={}, Q3={}, IQR={}".format(Q1, Q3, IQR))
#outliers_bawah = (price['House_Price'] < (Q1 - 1.5 * IQR)) # lower outliers
#outliers_atas = (price['House_Price'] > (Q3 + 1.5 * IQR)) # upper outliers
#rumah_murah = price.loc[outliers_bawah] # underpriced houses
#rumah_kemahalan = price.loc[outliers_atas] # overpriced houses
no_outlier = (price['House_Price'] >= Q1 - 1.5 * IQR) & (price['House_Price'] <= Q3 + 1.5 *IQR)
price3 = price[no_outlier]
print(price3.shape)
price3.head()
Q1=4661000.0, Q3=7187250.0, IQR=2526250.0
(933, 10)
Out[10]:
 | Observation | Dist_Taxi | Dist_Market | Dist_Hospital | Carpet | Builtup | Parking | City_Category | Rainfall | House_Price
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 9796.0 | 5250.0 | 10703.0 | 1659.0 | 1961.0 | Open | CAT B | 530 | 6649000 |
1 | 2 | 8294.0 | 8186.0 | 12694.0 | 1461.0 | 1752.0 | Not Provided | CAT B | 210 | 3982000 |
2 | 3 | 11001.0 | 14399.0 | 16991.0 | 1340.0 | 1609.0 | Not Provided | CAT A | 720 | 5401000 |
3 | 4 | 8301.0 | 11188.0 | 12289.0 | 1451.0 | 1748.0 | Covered | CAT B | 620 | 5373000 |
4 | 5 | 10510.0 | 12629.0 | 13921.0 | 1770.0 | 2111.0 | Not Provided | CAT B | 450 | 4662000 |
In [11]:
p = sns.boxplot(x="House_Price", data=price3)
Discussion: When should we use the CI approach, and when the boxplot? ¶
DBSCAN: Multivariate Anomaly Detection ¶
- DBSCAN clusters points that lie in dense regions and labels any point that belongs to no dense region as noise (-1). Because of this way of working, DBSCAN is often used for (multivariate) anomaly detection.
In [12]:
df = sns.load_dataset("iris")
X = df[['sepal_length','sepal_width','petal_length','petal_width']].values
C = df['species'].values
print(X.shape)
df.head()
(150, 4)
Out[12]:
 | sepal_length | sepal_width | petal_length | petal_width | species
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
In [13]:
# DBSCAN http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
# does not require the number of clusters k as an input parameter!... very useful for clustering large data
from sklearn import cluster
import numpy as np
from collections import Counter
dbscan = cluster.DBSCAN(eps=0.75, min_samples=5, metric='euclidean')
dbscan.fit(X)
C_db = dbscan.labels_
fig, ax = plt.subplots(figsize=(6,4))
p = sns.countplot(x=C_db.astype(str), ax=ax, dodge=False) # astype(str): np.object was removed in NumPy >= 1.24
print(C_db[:10])
# what does the cluster label -1 mean?
plt.show()
[0 0 0 0 0 0 0 0 0 0]
In [14]:
Counter(C_db)
Out[14]:
Counter({0: 50, 1: 98, -1: 2})
In [15]:
df['outlier'] = C_db
g = sns.pairplot(df, hue="outlier")
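A minimal follow-up sketch: the rows DBSCAN labeled -1 are the detected anomalies and can be inspected directly:
# Inspect the observations DBSCAN flagged as noise (cluster label -1)
outlier_rows = df[df['outlier'] == -1]
print(outlier_rows)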
An Example of Outlier Influence in Regression: the Difference in Fits Method ¶
In [16]:
import seaborn as sns
anscombe = sns.load_dataset("anscombe")
lm = sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
ci=None, scatter_kws={"s": 80});
fig = lm.fig
fig.suptitle("With Outlier", fontsize=12)
# In the case of such outliers, we can use the parameter robust=True.
# This downweights the outlier when estimating the fit, so the line represents the other data points better.
lm2 = sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
robust=True, ci=None, scatter_kws={"s": 80});
fig2 = lm2.fig
fig2.suptitle('With Outlier Excluded', fontsize=12)
Out[16]:
Text(0.5, 0.98, 'With Outlier Excluded')
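The section title mentions the Difference in Fits (DFFITS) method, which the plots above do not compute explicitly. Here is a minimal sketch on the same Anscombe III data (computing DFFITS via statsmodels' get_influence() is an assumption about how one would do it here):
# A minimal DFFITS sketch on the same Anscombe III data (statsmodels assumed available)
import seaborn as sns
import statsmodels.api as sm
d3 = sns.load_dataset("anscombe").query("dataset == 'III'")
res = sm.OLS(d3["y"], sm.add_constant(d3["x"])).fit()
dffits, cutoff = res.get_influence().dffits # DFFITS per observation + rule-of-thumb cutoff 2*sqrt(k/n)
print("cutoff =", round(cutoff, 3))
print(d3[abs(dffits) > cutoff]) # the observation(s) that distort the non-robust fit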
Discuss the simple example above carefully ¶
Module PyOD: Python Outlier Detection ¶
- pip install pyod
- Linear Models for Outlier Detection:
- PCA: Principal Component Analysis (uses the sum of weighted projected distances to the eigenvector hyperplanes as the outlier scores)
- MCD: Minimum Covariance Determinant (uses the Mahalanobis distances as the outlier scores)
- OCSVM: One-Class Support Vector Machines
- Proximity-Based Outlier Detection Models:
- LOF: Local Outlier Factor
- CBLOF: Clustering-Based Local Outlier Factor
- kNN: k Nearest Neighbors (use the distance to the kth nearest neighbor as the outlier score)
- Median kNN Outlier Detection (use the median distance to k nearest neighbors as the outlier score)
- HBOS: Histogram-based Outlier Score
- Probabilistic Models for Outlier Detection:
- ABOD: Angle-Based Outlier Detection
- Outlier Ensembles and Combination Frameworks
- Isolation Forest
- Feature Bagging
- LSCP
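All PyOD detectors share the same fit/score interface; before the larger demo below, a minimal sketch with IForest (the synthetic data comes from PyOD's own generate_data; the parameter values are assumptions):
# The common PyOD pattern, shown with Isolation Forest (any detector above works the same way)
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data
X_train, y_train = generate_data(n_train=200, train_only=True, n_features=2)
clf = IForest(contamination=0.1).fit(X_train)
print(clf.labels_[:10]) # binary labels on the training data: 0 = inlier, 1 = outlier
print(clf.decision_scores_[:10]) # raw outlier scores (higher = more abnormal)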
In [17]:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.font_manager
from pyod.models.abod import ABOD
from pyod.models.knn import KNN
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data, get_outliers_inliers
random_state = np.random.RandomState(50)
#generate random data with two features
X_train, Y_train = generate_data(n_train=200,train_only=True, n_features=2)
# by default, generate_data uses an outlier fraction (contamination) of 0.1
contamination = 0.1
outlier_fraction = contamination # keep the plotted score threshold consistent with the generated data
# store outliers and inliers in different numpy arrays
x_outliers, x_inliers = get_outliers_inliers(X_train,Y_train)
n_inliers = len(x_inliers)
n_outliers = len(x_outliers)
#separate the two features and use it to plot the data
F1 = X_train[:,[0]].reshape(-1,1)
F2 = X_train[:,[1]].reshape(-1,1)
# create a meshgrid
xx , yy = np.meshgrid(np.linspace(-10, 10, 200), np.linspace(-10, 10, 200))
# scatter plot
plt.scatter(F1,F2)
plt.xlabel('F1')
plt.ylabel('F2')
plt.title('Outliers')
classifiers = {
'Angle-based Outlier Detector (ABOD)' : ABOD(contamination=contamination),
# 'Isolation Forest': IForest(contamination=contamination),
'K Nearest Neighbors (KNN)' : KNN(contamination=contamination)
}
#set the figure size
plt.figure(figsize=(10, 10))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the dataset to the model
    clf.fit(X_train)
    # predict raw anomaly scores
    scores_pred = clf.decision_function(X_train) * -1
    # prediction of a datapoint's category: outlier or inlier
    y_pred = clf.predict(X_train)
    # number of errors in the prediction
    n_errors = (y_pred != Y_train).sum()
    print('No of Errors : ', clf_name, n_errors)
    # the rest of the code creates the visualization
    # threshold value to consider a datapoint an inlier or an outlier
    threshold = stats.scoreatpercentile(scores_pred, 100 * outlier_fraction)
    # decision_function calculates the raw anomaly score for every grid point
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()]) * -1
    Z = Z.reshape(xx.shape)
    subplot = plt.subplot(1, 2, i + 1)
    # fill a blue colormap from the minimum anomaly score to the threshold value
    subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 10), cmap=plt.cm.Blues_r)
    # draw a red contour line where the anomaly score equals the threshold
    a = subplot.contour(xx, yy, Z, levels=[threshold], linewidths=2, colors='red')
    # fill orange where the anomaly score ranges from the threshold to the maximum
    subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()], colors='orange')
    # scatter plot of inliers with white dots
    b = subplot.scatter(X_train[:-n_outliers, 0], X_train[:-n_outliers, 1], c='white', s=20, edgecolor='k')
    # scatter plot of outliers with black dots
    c = subplot.scatter(X_train[-n_outliers:, 0], X_train[-n_outliers:, 1], c='black', s=20, edgecolor='k')
    subplot.axis('tight')
    subplot.legend(
        [a.collections[0], b, c],
        ['learned decision function', 'true inliers', 'true outliers'],
        prop=matplotlib.font_manager.FontProperties(size=10),
        loc='lower right')
    subplot.set_title(clf_name)
    subplot.set_xlim((-10, 10))
    subplot.set_ylim((-10, 10))
plt.show()
No of Errors :  Angle-based Outlier Detector (ABOD) 20
No of Errors :  K Nearest Neighbors (KNN) 35
Novelty Detection Summary ¶
- OneClassSVM is known to be sensitive to outliers and therefore does not perform well for outlier detection. This estimator is best suited to novelty detection, when the training set is not contaminated by outliers.
- sklearn.covariance.EllipticEnvelope assumes the data is Gaussian and learns an ellipse. Its performance therefore degrades when the data is not unimodal. Note, however, that this estimator is robust to outliers.
- IsolationForest and LocalOutlierFactor seem to perform reasonably well on multi-modal data sets. The advantage of LocalOutlierFactor over the other estimators shows on the third data set, where the two modes have different densities. This advantage is explained by the local aspect of LOF: it only compares the abnormality score of a sample with the scores of its neighbors.
- Quoted from: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html#sphx-glr-auto-examples-miscellaneous-plot-anomaly-comparison-py
- Source: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html#comparing-anomaly-detection-algorithms-for-outlier-detection-on-toy-datasets
In [18]:
# Outlier Detection menggunakan Scikit Learn: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html#sphx-glr-auto-examples-miscellaneous-plot-anomaly-comparison-py
import time, matplotlib, matplotlib.pyplot as plt, numpy as np
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs, make_moons
from sklearn.ensemble import IsolationForest
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline
matplotlib.rcParams["contour.negative_linestyle"] = "solid"
# Example settings
n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers
anomaly_algorithms = [
( "Robust covariance",
EllipticEnvelope(contamination=outliers_fraction, random_state=42),),
("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.1)),
( "One-Class SVM (SGD)",
make_pipeline(
Nystroem(gamma=0.1, random_state=42, n_components=150),
SGDOneClassSVM(
nu=outliers_fraction,
shuffle=True,
fit_intercept=True,
random_state=42,
tol=1e-6,),),),
( "Isolation Forest",
IsolationForest(contamination=outliers_fraction, random_state=42),),
( "Local Outlier Factor",
LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction),),]
# Define datasets
blobs_params = dict(random_state=0, n_samples=n_inliers, n_features=2)
datasets = [
make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5, **blobs_params)[0],
make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[0.5, 0.5], **blobs_params)[0],
make_blobs(centers=[[2, 2], [-2, -2]], cluster_std=[1.5, 0.3], **blobs_params)[0],
4.0 * ( make_moons(n_samples=n_samples, noise=0.05, random_state=0)[0]
- np.array([0.5, 0.25])), 14.0 * (np.random.RandomState(42).rand(n_samples, 2) - 0.5),]
# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 150), np.linspace(-7, 7, 150))
plt.figure(figsize=(len(anomaly_algorithms) * 2 + 4, 12.5))
plt.subplots_adjust(left=0.02, right=0.98, bottom=0.001, top=0.96, wspace=0.05, hspace=0.01)
plot_num = 1
rng = np.random.RandomState(42)
for i_dataset, X in enumerate(datasets):
    # Add outliers
    X = np.concatenate([X, rng.uniform(low=-6, high=6, size=(n_outliers, 2))], axis=0)
    for name, algorithm in anomaly_algorithms:
        t0 = time.time()
        algorithm.fit(X)
        t1 = time.time()
        plt.subplot(len(datasets), len(anomaly_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=18)
        if name == "Local Outlier Factor":
            y_pred = algorithm.fit_predict(X)
        else:
            y_pred = algorithm.fit(X).predict(X)
        # plot the level lines and the points
        if name != "Local Outlier Factor":  # LOF does not implement predict
            Z = algorithm.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
            plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors="black")
        colors = np.array(["#377eb8", "#ff7f00"])
        plt.scatter(X[:, 0], X[:, 1], s=10, color=colors[(y_pred + 1) // 2])
        plt.xlim(-7, 7)
        plt.ylim(-7, 7)
        plt.xticks(())
        plt.yticks(())
        plt.text(0.99, 0.01,
                 ("%.2fs" % (t1 - t0)).lstrip("0"),
                 transform=plt.gca().transAxes,
                 size=15,
                 horizontalalignment="right")
        plot_num += 1
plt.show()