GLM-01 Prerequisites
- This article: https://taudata.blogspot.com/2019/04/asumsi-statistik-antara-benci-butuh.html
- ADSP-01: https://taudata.blogspot.com/2022/04/adsp-01.html (a solid grasp of basic Python)
- Basic Calculus
- Basic Statistics
Video Lesson GLM-01
GLM-01 References:
- G.A.F. Seber and A.J. Lee, Linear Regression Analysis (2nd Ed), 2003, John Wiley & Sons.
- P. McCullagh and J.A. Nelder, Generalized Linear Models (2nd Ed), 1989, Chapman & Hall.
Data Mining: A Review of Regression Models
Introduction to Regression Models
- Used when the dependent variable (Y) is numeric (float/real) and the independent variables can be numeric and/or categorical.
Starting from the Center of the Data and Variance
$\bar{x}=\frac{1}{N}\sum_{i=1}^{N}{x_i}$ and $s^2=\frac{\sum_{i=1}^{N}{(x_i-\bar{x})^2}}{N-1}$
- Pay attention to the meaning of the variance formula, then compare it with the "covariance" formula below:
From Variance to Covariance: Measuring the Linear Relationship between Two Variables
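For reference, a standard textbook definition of the sample covariance (added here to make the comparison with the variance formula concrete):
$cov(x,y)=\frac{\sum_{i=1}^{N}{(x_i-\bar{x})(y_i-\bar{y})}}{N-1}$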
- How does it work? (Statistical Thinking)
- The concept: to "co-vary" is to vary away from the mean together.
- Use "reverse" thinking to understand it.
- Usage: cov(x,y) = 2 vs. cov(x,y) = -2 vs. cov(x,y) = 0
- Covariance = 3000? What does that mean? (see the short sketch after this list)
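A tiny illustration (added, with made-up numbers): the sign of the covariance gives the direction of the joint variation, but its magnitude depends on the units, which is why a raw value such as 3000 is hard to interpret on its own.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y_up = np.array([2, 4, 5, 4, 7], dtype=float)     # tends to rise with x
y_down = y_up[::-1]                               # tends to fall as x rises

print(np.cov(x, y_up)[0, 1])        # positive covariance (~2.5)
print(np.cov(x, y_down)[0, 1])      # negative covariance (~-2.5)
print(np.cov(x, 100 * y_up)[0, 1])  # 100x larger (~250): covariance depends on the units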
From Covariance to Correlation: Statistical Thinking
- Correlation is simply the covariance divided by the two standard deviations.
- What does that mean?
- Correlation has a geometric meaning .... it is a cosine! (see the sketch below)
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Geometric_interpretation
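A minimal sketch (added; the numbers reuse the age and blood-pressure values from the example below) showing that the Pearson correlation is the cosine of the angle between the two mean-centered data vectors:
import numpy as np

x = np.array([40, 45, 50, 53, 60, 65, 69, 71], dtype=float)           # usia
y = np.array([126, 124, 135, 138, 142, 139, 140, 151], dtype=float)   # tekanan_darah

xc, yc = x - x.mean(), y - y.mean()                              # center both vectors
cos_angle = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))  # cosine of the angle
print(cos_angle, np.corrcoef(x, y)[0, 1])                        # both ~0.887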
The "Pearson" (Linear) Correlation Coefficient
- The Pearson correlation coefficient ranges from -1 to +1.
Be Careful
- A correlation coefficient of 0 does not mean there is no relationship between the two variables. What it actually means is that there is no LINEAR relationship; there may still be a relationship of another form, e.g. quadratic or some other non-linear function, as in the example above (and the small sketch below).
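A quick illustration (added): here y is completely determined by x, yet the Pearson correlation is essentially zero because the relationship is quadratic rather than linear.
import numpy as np

x = np.linspace(-1, 1, 101)
y = x ** 2                        # perfect, but non-linear, dependence on x
print(np.corrcoef(x, y)[0, 1])    # ~0: no LINEAR relationship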
Qualitative Ratings of Correlation Values Like These? ... Really? Why? Why not?
- Image source: https://spencermath.weebly.com/home/interpreting-the-correlation-coefficient
- Cases (social, medicine, etc.)
- Objective: prediction vs. insights
A Simple Example Case
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt, numpy as np
import statsmodels.api as sm, statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import scipy.stats as stats
from sklearn.datasets import load_boston   # note: removed in newer scikit-learn releases (>= 1.2)
from sklearn.preprocessing import MinMaxScaler
plt.style.use('bmh'); sns.set()
"Done"
'Done'
data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138, 142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
| | usia | tekanan_darah |
|---|---|---|
| 0 | 40 | 126 |
| 1 | 45 | 124 |
| 2 | 50 | 135 |
| 3 | 53 | 138 |
| 4 | 60 | 142 |
| 5 | 65 | 139 |
| 6 | 69 | 140 |
| 7 | 71 | 151 |
# Correlation and scatter plot to inspect the data
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Covariance =  76.953125
Correlations =  [[1.         0.88746015]
 [0.88746015 1.        ]]
# Better
print(df.corr())
sns.heatmap(df.corr(),cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,annot=True, annot_kws={"size": 16}, square=True)
p = sns.pairplot(df)
                  usia  tekanan_darah
usia           1.00000        0.88746
tekanan_darah  0.88746        1.00000
Interpretation
A value of ~0.89 indicates a strong positive linear correlation between age and blood pressure. There is a tendency for higher age to be associated with higher blood pressure than lower age.
WARNING
Correlation does not imply causation. Note the interpretation above: it does not state that high age causes high blood pressure, only that there is a trend or tendency. Blood pressure may indeed rise with age, but high blood pressure may also be driven not by age but by other factors that are not observed in the data.
Other examples from machine-learning research (beauty and confidence / finger length and IQ)
At this point we know that the two variables are related, but correlation alone cannot tell us what the relationship looks like. That is why we need a regression model.
Simple Linear Regression
From Correlation to Regression
How Do We Compute the Optimal Regression Parameters?
- Why does the formula look the way it does?
- The importance of understanding the "loss function" (see the short sketch after this list).
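A minimal sketch (added, reusing the age and blood-pressure data from above) of the closed-form least-squares solution for simple linear regression: the slope is cov(x, y) / var(x) and the intercept follows from the two means.
import numpy as np

x = np.array([40, 45, 50, 53, 60, 65, 69, 71], dtype=float)           # usia
y = np.array([126, 124, 135, 138, 142, 139, 140, 151], dtype=float)   # tekanan_darah

beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # slope that minimizes the squared error
beta0 = y.mean() - beta1 * x.mean()              # intercept
print(beta0, beta1)                              # ~98.56 and ~0.68, matching the OLS summary below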
Error Evaluation (Mean Squared Error)
- Be careful ... read the formula closely ... it is not robust to outliers.
- $\hat{y}=\beta_0+\beta_1x_1+\dots+\beta_px_p$
- $MSE=\frac{1}{N}\sum_{i=1}^{N}{(y_i-\hat{y}_i)^2}$: the average squared difference between the predictions and the actual values.
- RMSE = $\sqrt{MSE}$ ... why?
- This evaluation matters whenever we want to make predictions (see the sketch after this list).
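A small sketch (added) computing MSE and RMSE for the simple model above; the line is rebuilt from the closed-form coefficients, so the numbers are illustrative only.
import numpy as np

x = np.array([40, 45, 50, 53, 60, 65, 69, 71], dtype=float)
y = np.array([126, 124, 135, 138, 142, 139, 140, 151], dtype=float)

beta1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x                  # model predictions

mse = np.mean((y - y_hat) ** 2)            # average squared error
rmse = np.sqrt(mse)                        # back on the original scale of y (mmHg)
print(mse, rmse)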
# Fit a simple regression model
lm = smf.ols("tekanan_darah ~ usia", data=df[['tekanan_darah','usia']]).fit()
lm.summary()
# Things to check:
# 1. F-statistic
# 2. Tests of the model coefficients
# 3. R^2
# 4. Interpretation of the model
# 5. Durbin-Watson ==> Time Series?
| Dep. Variable: | tekanan_darah | R-squared: | 0.788 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.752 |
| Method: | Least Squares | F-statistic: | 22.25 |
| Date: | Fri, 27 Aug 2021 | Prob (F-statistic): | 0.00327 |
| Time: | 07:07:05 | Log-Likelihood: | -21.920 |
| No. Observations: | 8 | AIC: | 47.84 |
| Df Residuals: | 6 | BIC: | 48.00 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 98.5623 | 8.266 | 11.924 | 0.000 | 78.337 | 118.788 |
| usia | 0.6766 | 0.143 | 4.717 | 0.003 | 0.326 | 1.028 |
| Omnibus: | 3.192 | Durbin-Watson: | 2.005 |
|---|---|---|---|
| Prob(Omnibus): | 0.203 | Jarque-Bera (JB): | 1.016 |
| Skew: | -0.340 | Prob(JB): | 0.602 |
| Kurtosis: | 1.392 | Cond. No. | 311. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Plot the Data
p = sns.regplot(df.usia, df.tekanan_darah)
Evaluating $R^2$: Model vs. No Model?
- $SSR=SST-SSE=\sum{(y_i-\bar{y})^2}-\sum{(y_i-\hat{y}_i)^2}$, so that $R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}$
Adjusted R-Squared? Why?
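A standard definition (added here for reference), with $n$ the number of observations and $p$ the number of predictors:
$R^2_{adj}=1-(1-R^2)\frac{n-1}{n-p-1}$
Adding a predictor always raises $R^2$, but it raises $R^2_{adj}$ only when the gain outweighs the penalty for the extra parameter.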
The Effect of the Dependent Variable on the Model
Understand the Regression Assumptions Well
https://taudata.blogspot.com/2019/04/asumsi-statistik-antara-benci-butuh.html
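A minimal illustration (added, not part of the original notebook): two quick numerical checks of the OLS assumptions for the simple model fitted above, residual normality (Jarque-Bera) and homoscedasticity (Breusch-Pagan), both available in statsmodels.
import statsmodels.stats.api as sms

print(sms.jarque_bera(lm.resid))                       # (JB statistic, p-value, skew, kurtosis)
print(sms.het_breuschpagan(lm.resid, lm.model.exog))   # (LM stat, LM p-value, F stat, F p-value)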
Non-Linear Regression
- Why?
- When is adding model complexity not recommended?
- Regression for insights vs. regression for prediction.
- Still linear in the parameters.
# Load a sample dataset shipped with the module
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head(), df.shape, set(df['Region'])
( Lottery Literacy Wealth Region
0 41 37 73 E
1 38 51 22 N
2 66 13 61 C
3 80 46 76 E
4 79 69 83 E,
(85, 4),
{'C', 'E', 'N', 'S', 'W'})
# Set "Region" sebagai variabel Kategorik
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)
print(res.summary())
Intercept 38.651655
C(Region)[T.E] -15.427785
C(Region)[T.N] -10.016961
C(Region)[T.S] -4.548257
C(Region)[T.W] -10.091276
Literacy -0.185819
Wealth 0.451475
dtype: float64
OLS Regression Results
==============================================================================
Dep. Variable: Lottery R-squared: 0.338
Model: OLS Adj. R-squared: 0.287
Method: Least Squares F-statistic: 6.636
Date: Fri, 27 Aug 2021 Prob (F-statistic): 1.07e-05
Time: 07:07:05 Log-Likelihood: -375.30
No. Observations: 85 AIC: 764.6
Df Residuals: 78 BIC: 781.7
Df Model: 6
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 38.6517 9.456 4.087 0.000 19.826 57.478
C(Region)[T.E] -15.4278 9.727 -1.586 0.117 -34.793 3.938
C(Region)[T.N] -10.0170 9.260 -1.082 0.283 -28.453 8.419
C(Region)[T.S] -4.5483 7.279 -0.625 0.534 -19.039 9.943
C(Region)[T.W] -10.0913 7.196 -1.402 0.165 -24.418 4.235
Literacy -0.1858 0.210 -0.886 0.378 -0.603 0.232
Wealth 0.4515 0.103 4.390 0.000 0.247 0.656
==============================================================================
Omnibus: 3.049 Durbin-Watson: 1.785
Prob(Omnibus): 0.218 Jarque-Bera (JB): 2.694
Skew: -0.340 Prob(JB): 0.260
Kurtosis: 2.454 Cond. No. 371.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Non-linear transformation of a predictor
res = smf.ols(formula='Lottery ~ np.log(Literacy) + Wealth -1', data=df).fit()
print(res.summary())
OLS Regression Results
=======================================================================================
Dep. Variable: Lottery R-squared (uncentered): 0.799
Model: OLS Adj. R-squared (uncentered): 0.794
Method: Least Squares F-statistic: 165.2
Date: Fri, 27 Aug 2021 Prob (F-statistic): 1.16e-29
Time: 07:07:05 Log-Likelihood: -384.16
No. Observations: 85 AIC: 772.3
Df Residuals: 83 BIC: 777.2
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
np.log(Literacy) 4.6426 1.246 3.727 0.000 2.165 7.120
Wealth 0.5853 0.089 6.571 0.000 0.408 0.762
==============================================================================
Omnibus: 4.188 Durbin-Watson: 1.892
Prob(Omnibus): 0.123 Jarque-Bera (JB): 4.034
Skew: -0.480 Prob(JB): 0.133
Kurtosis: 2.533 Cond. No. 25.8
==============================================================================
Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Case Study (Boston House Pricing): Another Property Case Study
# Load the data
boston = load_boston()
# Convert to a pandas DataFrame
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
bos.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
# Data description
print(boston.DESCR)
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
bos.describe(include='all')
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
p = sns.pairplot(bos)
Checking Correlations between Predictors
# Heatmap to inspect correlations between the predictors
corr2 = bos.corr()  # correlation matrix of all columns (including PRICE)
plt.figure(figsize=(12, 10))
sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)],
cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={"size": 8}, square=True);
m = ols('PRICE ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m.summary())
# Don't forget to analyze and interpret the results
OLS Regression Results
==============================================================================
Dep. Variable: PRICE R-squared: 0.679
Model: OLS Adj. R-squared: 0.677
Method: Least Squares F-statistic: 353.3
Date: Fri, 27 Aug 2021 Prob (F-statistic): 2.69e-123
Time: 07:08:10 Log-Likelihood: -1553.0
No. Observations: 506 AIC: 3114.
Df Residuals: 502 BIC: 3131.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 18.5671 3.913 4.745 0.000 10.879 26.255
RM 4.5154 0.426 10.603 0.000 3.679 5.352
PTRATIO -0.9307 0.118 -7.911 0.000 -1.162 -0.700
LSTAT -0.5718 0.042 -13.540 0.000 -0.655 -0.489
==============================================================================
Omnibus: 202.072 Durbin-Watson: 0.901
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1022.153
Skew: 1.700 Prob(JB): 1.10e-222
Kurtosis: 9.076 Cond. No. 402.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
m2 = ols('np.log(PRICE) ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: np.log(PRICE) R-squared: 0.714
Model: OLS Adj. R-squared: 0.713
Method: Least Squares F-statistic: 418.4
Date: Fri, 27 Aug 2021 Prob (F-statistic): 3.96e-136
Time: 07:08:10 Log-Likelihood: 52.201
No. Observations: 506 AIC: -96.40
Df Residuals: 502 BIC: -79.50
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 3.5469 0.164 21.632 0.000 3.225 3.869
RM 0.1044 0.018 5.849 0.000 0.069 0.139
PTRATIO -0.0391 0.005 -7.927 0.000 -0.049 -0.029
LSTAT -0.0353 0.002 -19.974 0.000 -0.039 -0.032
==============================================================================
Omnibus: 44.245 Durbin-Watson: 0.916
Prob(Omnibus): 0.000 Jarque-Bera (JB): 179.110
Skew: 0.246 Prob(JB): 1.28e-39
Kurtosis: 5.873 Cond. No. 402.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# define figure size
fig = plt.figure(figsize=(12,8))
# produce regression diagnostic plots for the RM predictor
fig = sm.graphics.plot_regress_exog(m,'RM', fig=fig)
plt.rc("figure", figsize=(16,12))
plt.rc("font", size=14)
fig = sm.graphics.plot_fit(m, "RM")
fig.tight_layout(pad=1.0)
model_fitted_y = m.fittedvalues                                       # fitted values
model_residuals = m.resid                                             # raw residuals
model_norm_residuals = m.get_influence().resid_studentized_internal  # internally studentized (normalized) residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))  # sqrt of absolute normalized residuals
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = m.get_influence().hat_matrix_diag
# Cook's distance, from statsmodels internals
model_cooks = m.get_influence().cooks_distance[0]
plot_lm_1 = plt.figure()
plot_lm_1.axes[0] = sns.residplot(model_fitted_y, bos["PRICE"], data=bos,
lowess=True,
scatter_kws={'alpha': 0.5},
line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals');
Variable Selection: Stepwise Regression
def forward_selected(data, response):
"""Linear model designed by forward selection.
https://planspace.org/20150423-forward_selection_with_statsmodels/
Parameters:
-----------
data : pandas DataFrame with all possible predictors and response
response: string, name of response column in data
Returns:
--------
model: an "optimal" fitted statsmodels linear model
with an intercept
selected by forward selection
evaluated by adjusted R-squared
"""
remaining = set(data.columns)
remaining.remove(response)
selected = []
current_score, best_new_score = 0.0, 0.0
while remaining and current_score == best_new_score:
scores_with_candidates = []
for candidate in remaining:
formula = "{} ~ {} + 1".format(response,
' + '.join(selected + [candidate]))
score = smf.ols(formula, data).fit().rsquared_adj
scores_with_candidates.append((score, candidate))
scores_with_candidates.sort()
best_new_score, best_candidate = scores_with_candidates.pop()
if current_score < best_new_score:
remaining.remove(best_candidate)
selected.append(best_candidate)
current_score = best_new_score
formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
model = smf.ols(formula, data).fit()
return model
model = forward_selected(bos, 'PRICE')
print(model.model.formula)
print(model.rsquared_adj)
PRICE ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + CRIM + RAD + TAX + 1
0.7348057723274566
# How do we interpret the coefficients?
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: PRICE R-squared: 0.741
Model: OLS Adj. R-squared: 0.735
Method: Least Squares F-statistic: 128.2
Date: Fri, 27 Aug 2021 Prob (F-statistic): 5.54e-137
Time: 07:08:15 Log-Likelihood: -1498.9
No. Observations: 506 AIC: 3022.
Df Residuals: 494 BIC: 3072.
Df Model: 11
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 36.3411 5.067 7.171 0.000 26.385 46.298
LSTAT -0.5226 0.047 -11.019 0.000 -0.616 -0.429
RM 3.8016 0.406 9.356 0.000 3.003 4.600
PTRATIO -0.9465 0.129 -7.334 0.000 -1.200 -0.693
DIS -1.4927 0.186 -8.037 0.000 -1.858 -1.128
NOX -17.3760 3.535 -4.915 0.000 -24.322 -10.430
CHAS 2.7187 0.854 3.183 0.002 1.040 4.397
B 0.0093 0.003 3.475 0.001 0.004 0.015
ZN 0.0458 0.014 3.390 0.001 0.019 0.072
CRIM -0.1084 0.033 -3.307 0.001 -0.173 -0.044
RAD 0.2996 0.063 4.726 0.000 0.175 0.424
TAX -0.0118 0.003 -3.493 0.001 -0.018 -0.005
==============================================================================
Omnibus: 178.430 Durbin-Watson: 1.078
Prob(Omnibus): 0.000 Jarque-Bera (JB): 787.785
Skew: 1.523 Prob(JB): 8.60e-172
Kurtosis: 8.300 Cond. No. 1.47e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.47e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Compare the Durbin-Watson in [15] with the Durbin-Watson in [29]. Comment on the Jarque-Bera statistic.
Data Scaling "for Insights"
- The importance of "scaling" in regression (or clustering) when the goal is to extract insights from the data; a short modelling sketch follows the table below.
- Image source: https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
scaler = MinMaxScaler()
bos[['TAX', 'AGE', 'B']] = scaler.fit_transform(bos[['TAX', 'AGE', 'B']])
bos.head()
# Continue to Modelling
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 0.641607 | 4.0900 | 1.0 | 0.208015 | 15.3 | 1.000000 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 0.782698 | 4.9671 | 2.0 | 0.104962 | 17.8 | 1.000000 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 0.599382 | 4.9671 | 2.0 | 0.104962 | 17.8 | 0.989737 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 0.441813 | 6.0622 | 3.0 | 0.066794 | 18.7 | 0.994276 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 0.528321 | 6.0622 | 3.0 | 0.066794 | 18.7 | 1.000000 | 5.33 | 36.2 |
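One way to continue the modelling (a hypothetical sketch, not from the original notebook): put every predictor on the same [0, 1] scale so the coefficient magnitudes become roughly comparable, then refit and rank them.
# Hypothetical continuation: with all predictors min-max scaled,
# the absolute size of each coefficient hints at its relative influence on PRICE.
predictors = [c for c in bos.columns if c != 'PRICE']
bos_scaled = bos.copy()
bos_scaled[predictors] = MinMaxScaler().fit_transform(bos_scaled[predictors])
m_scaled = smf.ols('PRICE ~ ' + ' + '.join(predictors), data=bos_scaled).fit()
print(m_scaled.params.drop('Intercept').sort_values())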
Pitfalls: Regression is Interpolation, "not" Extrapolation (Forecasting)
Not yet covered:
- Logistic Regression [covered later under the Classification topic]
- Piecewise Regression (non-linear)
- Probit/Tobit Regression (probabilistic)
- Bayesian Regression
- Logic Regression (more robust than logistic regression for fraud detection)
- Quantile Regression (extreme events)
- LAD Regression (L1)
- SVR
- Jackknife Regression
- ARIMA (Time Series)
- Ecologic Regression
Exercise Case Study: Advertising Spend Investment
# Example
# Load the CSV data file
try:
    df = pd.read_csv('data/iklan.csv')  # when running locally
except:
    !wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/iklan.csv  # on Google Colab
    df = pd.read_csv('iklan.csv')
df.head()
| | No | Iklan | Laba | Tipe |
|---|---|---|---|---|
| 0 | 1 | 10 | 9.17 | 1 |
| 1 | 2 | 1 | 1.32 | 0 |
| 2 | 3 | 12 | 8.54 | 1 |
| 3 | 4 | 12 | 7.68 | 1 |
| 4 | 5 | 5 | 7.15 | 1 |
p = sns.pairplot(df, hue="Tipe")
# Do Modelling Here ... Don't forget to interpret.
