GLM-01 Prerequisites
- This article: https://taudata.blogspot.com/2019/04/asumsi-statistik-antara-benci-butuh.html
- ADSP-01: https://taudata.blogspot.com/2022/04/adsp-01.html (a good grasp of basic Python)
- Basic calculus
- Basic statistics
Video Lesson GLM-01
GLM-01 References:
- G. A. F. Seber and A. J. Lee, Linear Regression Analysis, 2nd Ed., John Wiley & Sons, 2003.
- P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd Ed., Chapman & Hall, 1989.
Data Mining: A Review of Regression Models
Introduction to Regression Models
- Used when the dependent variable (Y) is numeric (float/real) and the independent variables are numeric and/or categorical.
Starting from the Center of the Data and the Variance
$\bar{x}=\frac{1}{N}\sum_{i=1}^{N}{x_i}$ and $s^2=\frac{\sum_{i=1}^{N}{(x_i-\bar{x})^2}}{N-1}$
- Think about what the variance formula means, then compare it with the covariance formula: $\mathrm{cov}(x,y)=\frac{\sum_{i=1}^{N}{(x_i-\bar{x})(y_i-\bar{y})}}{N-1}$
From Variance to Covariance: Measuring the Linear Relationship between 2 Variables
- How does it actually work? (Statistical Thinking)
- The concept: to "co-vary" is to vary away from the mean together.
- Use "reverse" thinking to understand it.
- Usage: cov(x,y) = 2 vs cov(x,y) = -2 vs cov(x,y) = 0 (see the sketch after this list).
- Covariance = 3000? What does that mean?
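As a rough illustration (a minimal sketch on made-up toy numbers, not real measurements), the sign of the covariance tells us whether the two variables tend to move away from their means in the same direction or in opposite directions:

import numpy as np

x = np.array([1, 2, 3, 4, 5])          # toy data
y_up = np.array([2, 4, 5, 7, 9])       # tends to rise with x
y_down = y_up[::-1]                    # tends to fall as x rises

# Manual covariance: the average product of the deviations from the means
def cov_manual(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))   # population version (ddof=0)

print(cov_manual(x, y_up), np.cov(x, y_up, ddof=0)[0, 1])      # positive: they co-vary upward together
print(cov_manual(x, y_down), np.cov(x, y_down, ddof=0)[0, 1])  # negative: they move in opposite directions

Note that the magnitude (2, -2, 3000, ...) depends on the units of x and y, which is exactly why we move on to correlation next.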
From Covariance to Correlation: Statistical Thinking
- Correlation is simply the covariance divided by the two standard deviations.
- What does that mean?
- Covariance also has a geometric meaning .... it is a cosine! (see the sketch after the link below)
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Geometric_interpretation
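A minimal sketch (again on toy numbers) showing that the Pearson correlation is the covariance rescaled by the two standard deviations, and that geometrically it is the cosine of the angle between the mean-centered vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data
y = np.array([2.0, 4.0, 5.0, 7.0, 9.0])

# Correlation = covariance / (std_x * std_y)
r_manual = np.cov(x, y, ddof=0)[0, 1] / (x.std(ddof=0) * y.std(ddof=0))

# Geometric view: cosine of the angle between the centered vectors
xc, yc = x - x.mean(), y - y.mean()
r_cosine = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(r_manual, r_cosine, np.corrcoef(x, y)[0, 1])   # all three agree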
The Value of the Pearson (Linear) Correlation Coefficient
- The Pearson correlation coefficient ranges from -1 to +1.
Be Careful
- A correlation coefficient of 0 does not mean there is no relationship between the two variables. The correct statement is: there is no LINEAR relationship, but there may be a relationship of another form, e.g., quadratic or some other non-linear function, as in the sketch below.
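A minimal sketch with synthetic data: a perfect quadratic relationship that is symmetric around zero has a Pearson correlation of (essentially) zero, even though y is completely determined by x:

import numpy as np

x = np.linspace(-3, 3, 101)   # symmetric around 0
y = x**2                      # perfect quadratic relationship, no noise at all

print(np.corrcoef(x, y)[0, 1])   # ~0: no LINEAR relationship, despite a perfect functional one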
Correlation and Causation
- "Everyone who drinks water dies."
A qualitative assessment of correlation values like this? ... Really? Why? Why not?
- Image source: https://spencermath.weebly.com/home/interpreting-the-correlation-coefficient
- Cases (social, medicine, etc.)
- Objective: prediction vs. insights
A Simple Example Case
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt, numpy as np
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import statsmodels.api as sm, scipy.stats as stats
from sklearn.datasets import load_boston   # note: load_boston was removed in scikit-learn >= 1.2, so an older version is needed
from sklearn.preprocessing import MinMaxScaler
plt.style.use('bmh'); sns.set()
"Done"
'Done'
data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138, 142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
usia | tekanan_darah | |
---|---|---|
0 | 40 | 126 |
1 | 45 | 124 |
2 | 50 | 135 |
3 | 53 | 138 |
4 | 60 | 142 |
5 | 65 | 139 |
6 | 69 | 140 |
7 | 71 | 151 |
# Correlation and a scatter plot to inspect the data
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Covariance = 76.953125 Correlations = [[1. 0.88746015] [0.88746015 1. ]]
# Better
print(df.corr())
sns.heatmap(df.corr(),cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,annot=True, annot_kws={"size": 16}, square=True)
p = sns.pairplot(df)
usia tekanan_darah usia 1.00000 0.88746 tekanan_darah 0.88746 1.00000
Interpretation
The value of ~0.89 indicates a strong positive linear correlation between age and blood pressure. There is a tendency for higher ages to be associated with higher blood pressure than lower ages.
WARNING
Correlation does not equal (imply) causation. Look carefully at the interpretation above: it does not say that higher age causes higher blood pressure, only that there is a trend or tendency. It may be that blood pressure rises as age increases, but it may also be that the high blood pressure is not due to age at all, but to some other factor not observed in the data.
Other examples from machine-learning research: beauty and confidence, finger length and IQ.
At this point we know the two variables are related, but correlation alone cannot tell us what that relationship looks like. That is why we need a regression model.
Simple Linear Regression
From Correlation to Regression
How Do We Compute the Optimal Regression Parameters?
- Why does the formula look the way it does?
- The importance of understanding the "loss function" (see the sketch after this list).
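A minimal sketch, assuming the usual squared-error (least-squares) loss: minimizing $\sum{(y_i-\hat{y}_i)^2}$ leads to the normal equations $X^TX\beta=X^Ty$, which we can solve directly on the age/blood-pressure data above and compare with statsmodels:

import numpy as np
import statsmodels.formula.api as smf

# Design matrix with an intercept column, using the usia/tekanan_darah data above
X = np.column_stack([np.ones(len(df)), df.usia.values])
y = df.tekanan_darah.values

# Minimizing the squared-error loss ||y - X b||^2 gives the normal equations X'X b = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # [intercept, slope]; should match smf.ols("tekanan_darah ~ usia", data=df).fit().params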
Error Evaluation (Mean Squared Error)
- Be careful ... look at the formula closely ... it is not robust to outliers.
- $\hat{y}=\beta_0+\beta_1 x_1+\dots+\beta_p x_p$
- $MSE=\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}$: the average squared difference between the predictions and the actual values in the data.
- RMSE = $\sqrt{MSE}$ ... why? (it puts the error back on the same scale/units as $y$; see the sketch after this list)
- Evaluation matters whenever we want to make predictions.
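A minimal sketch on hypothetical numbers (purely for illustration) computing MSE and RMSE by hand and showing how a single outlier inflates them:

import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0])   # hypothetical observed values
y_pred = np.array([10.5, 11.5, 11.0, 12.5])   # hypothetical model predictions

mse = np.mean((y_true - y_pred) ** 2)    # mean of the squared errors
rmse = np.sqrt(mse)                      # back on the same scale/units as y
print(mse, rmse)

# A single outlier dominates the squared error
y_true_outlier = np.array([10.0, 12.0, 11.0, 40.0])
print(np.mean((y_true_outlier - y_pred) ** 2))   # much larger: MSE is not robust to outliers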
# Fit the simple regression model
lm = smf.ols("tekanan_darah ~ usia", data=df[['tekanan_darah','usia']]).fit()
lm.summary()
# Things to check in the summary:
# 1. F-statistic
# 2. Coefficient tests
# 3. R^2
# 4. Model interpretation
# 5. Durbin-Watson ==> time series?
Dep. Variable: | tekanan_darah | R-squared: | 0.788 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.752 |
Method: | Least Squares | F-statistic: | 22.25 |
Date: | Fri, 27 Aug 2021 | Prob (F-statistic): | 0.00327 |
Time: | 07:07:05 | Log-Likelihood: | -21.920 |
No. Observations: | 8 | AIC: | 47.84 |
Df Residuals: | 6 | BIC: | 48.00 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 98.5623 | 8.266 | 11.924 | 0.000 | 78.337 | 118.788 |
usia | 0.6766 | 0.143 | 4.717 | 0.003 | 0.326 | 1.028 |
Omnibus: | 3.192 | Durbin-Watson: | 2.005 |
---|---|---|---|
Prob(Omnibus): | 0.203 | Jarque-Bera (JB): | 1.016 |
Skew: | -0.340 | Prob(JB): | 0.602 |
Kurtosis: | 1.392 | Cond. No. | 311. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
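The items in the checklist above (F-statistic, coefficient tests, R², Durbin-Watson) can also be pulled out of the fitted model programmatically; a minimal sketch using the statsmodels results object lm:

from statsmodels.stats.stattools import durbin_watson

print("F-stat:", lm.fvalue, " p-value:", lm.f_pvalue)      # overall model test
print("Coefficient p-values:\n", lm.pvalues)               # individual coefficient tests
print("R^2:", lm.rsquared, " adj. R^2:", lm.rsquared_adj)
print("Durbin-Watson:", durbin_watson(lm.resid))           # residual autocorrelation check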
# Plot the Data
p = sns.regplot(x="usia", y="tekanan_darah", data=df)   # keyword arguments; newer seaborn versions require them
Evaluating $R^2$: Model vs. No Model?
- $SSR=SST-SSE=\sum{(y_i-\bar{y})^2}-\sum{(y_i-\hat{y}_i)^2}$, so $R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}$ compares the model against simply predicting the mean $\bar{y}$ (see the sketch below).
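A minimal sketch computing $R^2$ for the simple blood-pressure model lm directly from SST and SSE and checking it against lm.rsquared:

import numpy as np

y = df.tekanan_darah.values
y_hat = lm.fittedvalues.values

sst = np.sum((y - y.mean()) ** 2)   # total variation around the mean (the "no model" baseline)
sse = np.sum((y - y_hat) ** 2)      # variation left over after the model
ssr = sst - sse                     # variation explained by the model

print(ssr / sst, 1 - sse / sst, lm.rsquared)   # all three should be ~0.788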
Adjusted R-Squared? Why?
The Effect of the Dependent Variable on the Model
All Models Are Wrong
- A perfect/true best model does not exist, and often is not even needed.
Understand the Regression Assumptions Well
https://taudata.blogspot.com/2019/04/asumsi-statistik-antara-benci-butuh.html
Non-Linear Regression
- Why?
- When is adding model complexity not recommended?
- Regression for insights vs. regression for prediction.
- The model is still linear in the parameters.
# Load a sample dataset from the statsmodels module
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head(), df.shape, set(df['Region'])
( Lottery Literacy Wealth Region 0 41 37 73 E 1 38 51 22 N 2 66 13 61 C 3 80 46 76 E 4 79 69 83 E, (85, 4), {'C', 'E', 'N', 'S', 'W'})
# Treat "Region" as a categorical variable
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)
print(res.summary())
Intercept 38.651655 C(Region)[T.E] -15.427785 C(Region)[T.N] -10.016961 C(Region)[T.S] -4.548257 C(Region)[T.W] -10.091276 Literacy -0.185819 Wealth 0.451475 dtype: float64 OLS Regression Results ============================================================================== Dep. Variable: Lottery R-squared: 0.338 Model: OLS Adj. R-squared: 0.287 Method: Least Squares F-statistic: 6.636 Date: Fri, 27 Aug 2021 Prob (F-statistic): 1.07e-05 Time: 07:07:05 Log-Likelihood: -375.30 No. Observations: 85 AIC: 764.6 Df Residuals: 78 BIC: 781.7 Df Model: 6 Covariance Type: nonrobust ================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------- Intercept 38.6517 9.456 4.087 0.000 19.826 57.478 C(Region)[T.E] -15.4278 9.727 -1.586 0.117 -34.793 3.938 C(Region)[T.N] -10.0170 9.260 -1.082 0.283 -28.453 8.419 C(Region)[T.S] -4.5483 7.279 -0.625 0.534 -19.039 9.943 C(Region)[T.W] -10.0913 7.196 -1.402 0.165 -24.418 4.235 Literacy -0.1858 0.210 -0.886 0.378 -0.603 0.232 Wealth 0.4515 0.103 4.390 0.000 0.247 0.656 ============================================================================== Omnibus: 3.049 Durbin-Watson: 1.785 Prob(Omnibus): 0.218 Jarque-Bera (JB): 2.694 Skew: -0.340 Prob(JB): 0.260 Kurtosis: 2.454 Cond. No. 371. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
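In the output above, Region enters the model as dummy variables: one level (C, the alphabetically first) is absorbed into the intercept as the reference category, and each C(Region)[T.x] coefficient is the shift of that region relative to the reference. If you want a different baseline, patsy lets you set it explicitly; a small sketch (choosing 'E' as the reference here is just an example):

# Same model, but with Region 'E' as the reference category instead of the default
res_ref = ols(formula="Lottery ~ Literacy + Wealth + C(Region, Treatment(reference='E'))", data=df).fit()
print(res_ref.params)   # the Region coefficients are now shifts relative to region 'E'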
# Non Linear transformation
res = smf.ols(formula='Lottery ~ np.log(Literacy) + Wealth -1', data=df).fit()
print(res.summary())
OLS Regression Results ======================================================================================= Dep. Variable: Lottery R-squared (uncentered): 0.799 Model: OLS Adj. R-squared (uncentered): 0.794 Method: Least Squares F-statistic: 165.2 Date: Fri, 27 Aug 2021 Prob (F-statistic): 1.16e-29 Time: 07:07:05 Log-Likelihood: -384.16 No. Observations: 85 AIC: 772.3 Df Residuals: 83 BIC: 777.2 Df Model: 2 Covariance Type: nonrobust ==================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------ np.log(Literacy) 4.6426 1.246 3.727 0.000 2.165 7.120 Wealth 0.5853 0.089 6.571 0.000 0.408 0.762 ============================================================================== Omnibus: 4.188 Durbin-Watson: 1.892 Prob(Omnibus): 0.123 Jarque-Bera (JB): 4.034 Skew: -0.480 Prob(JB): 0.133 Kurtosis: 2.533 Cond. No. 25.8 ============================================================================== Notes: [1] R² is computed without centering (uncentered) since the model does not contain a constant. [2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Case Study (Boston House Pricing): Another Property Case Study
# Loading Data
boston = load_boston()
# Convert to a pandas DataFrame
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
bos.head()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
# Data description
print(boston.DESCR)
.. _boston_dataset: Boston house prices dataset --------------------------- **Data Set Characteristics:** :Number of Instances: 506 :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target. :Attribute Information (in order): - CRIM per capita crime rate by town - ZN proportion of residential land zoned for lots over 25,000 sq.ft. - INDUS proportion of non-retail business acres per town - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) - NOX nitric oxides concentration (parts per 10 million) - RM average number of rooms per dwelling - AGE proportion of owner-occupied units built prior to 1940 - DIS weighted distances to five Boston employment centres - RAD index of accessibility to radial highways - TAX full-value property-tax rate per $10,000 - PTRATIO pupil-teacher ratio by town - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town - LSTAT % lower status of the population - MEDV Median value of owner-occupied homes in $1000's :Missing Attribute Values: None :Creator: Harrison, D. and Rubinfeld, D.L. This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter. The Boston house-price data has been used in many machine learning papers that address regression problems. .. topic:: References - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261. - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
bos.describe(include='all')
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
p = sns.pairplot(bos)
Checking Correlations between Predictors
# Heatmap to inspect the correlations
corr2 = bos.corr()   # correlation matrix of all variables, including PRICE
plt.figure(figsize=(12, 10))
sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)],
cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={"size": 8}, square=True);
m = ols('PRICE ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m.summary())
# Don't forget to analyze and interpret the results
OLS Regression Results ============================================================================== Dep. Variable: PRICE R-squared: 0.679 Model: OLS Adj. R-squared: 0.677 Method: Least Squares F-statistic: 353.3 Date: Fri, 27 Aug 2021 Prob (F-statistic): 2.69e-123 Time: 07:08:10 Log-Likelihood: -1553.0 No. Observations: 506 AIC: 3114. Df Residuals: 502 BIC: 3131. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 18.5671 3.913 4.745 0.000 10.879 26.255 RM 4.5154 0.426 10.603 0.000 3.679 5.352 PTRATIO -0.9307 0.118 -7.911 0.000 -1.162 -0.700 LSTAT -0.5718 0.042 -13.540 0.000 -0.655 -0.489 ============================================================================== Omnibus: 202.072 Durbin-Watson: 0.901 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1022.153 Skew: 1.700 Prob(JB): 1.10e-222 Kurtosis: 9.076 Cond. No. 402. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
m2 = ols('np.log(PRICE) ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m2.summary())
OLS Regression Results ============================================================================== Dep. Variable: np.log(PRICE) R-squared: 0.714 Model: OLS Adj. R-squared: 0.713 Method: Least Squares F-statistic: 418.4 Date: Fri, 27 Aug 2021 Prob (F-statistic): 3.96e-136 Time: 07:08:10 Log-Likelihood: 52.201 No. Observations: 506 AIC: -96.40 Df Residuals: 502 BIC: -79.50 Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 3.5469 0.164 21.632 0.000 3.225 3.869 RM 0.1044 0.018 5.849 0.000 0.069 0.139 PTRATIO -0.0391 0.005 -7.927 0.000 -0.049 -0.029 LSTAT -0.0353 0.002 -19.974 0.000 -0.039 -0.032 ============================================================================== Omnibus: 44.245 Durbin-Watson: 0.916 Prob(Omnibus): 0.000 Jarque-Bera (JB): 179.110 Skew: 0.246 Prob(JB): 1.28e-39 Kurtosis: 5.873 Cond. No. 402. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# define figure size
fig = plt.figure(figsize=(12,8))
# produce regression plots
fig = sm.graphics.plot_regress_exog(m,'RM', fig=fig)
plt.rc("figure", figsize=(16,12))
plt.rc("font", size=14)
fig = sm.graphics.plot_fit(m, "RM")
fig.tight_layout(pad=1.0)
model_fitted_y = m.fittedvalues # fitted values
model_residuals = m.resid # residuals
model_norm_residuals = m.get_influence().resid_studentized_internal # studentized (normalized) residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals)) # sqrt of absolute normalized residuals
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = m.get_influence().hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = m.get_influence().cooks_distance[0]
plot_lm_1 = plt.figure()
plot_lm_1.axes[0] = sns.residplot(x=model_fitted_y, y=bos["PRICE"],
                                  lowess=True,
                                  scatter_kws={'alpha': 0.5},
                                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals');
Variable Selection: Stepwise Selection in Regression Analysis
def forward_selected(data, response):
"""Linear model designed by forward selection.
https://planspace.org/20150423-forward_selection_with_statsmodels/
Parameters:
-----------
data : pandas DataFrame with all possible predictors and response
response: string, name of response column in data
Returns:
--------
model: an "optimal" fitted statsmodels linear model
with an intercept
selected by forward selection
evaluated by adjusted R-squared
"""
remaining = set(data.columns)
remaining.remove(response)
selected = []
current_score, best_new_score = 0.0, 0.0
while remaining and current_score == best_new_score:
scores_with_candidates = []
for candidate in remaining:
formula = "{} ~ {} + 1".format(response,
' + '.join(selected + [candidate]))
score = smf.ols(formula, data).fit().rsquared_adj
scores_with_candidates.append((score, candidate))
scores_with_candidates.sort()
best_new_score, best_candidate = scores_with_candidates.pop()
if current_score < best_new_score:
remaining.remove(best_candidate)
selected.append(best_candidate)
current_score = best_new_score
formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
model = smf.ols(formula, data).fit()
return model
model = forward_selected(bos, 'PRICE')
print(model.model.formula)
print(model.rsquared_adj)
PRICE ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + CRIM + RAD + TAX + 1 0.7348057723274566
# How do we interpret the coefficients?
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: PRICE R-squared: 0.741 Model: OLS Adj. R-squared: 0.735 Method: Least Squares F-statistic: 128.2 Date: Fri, 27 Aug 2021 Prob (F-statistic): 5.54e-137 Time: 07:08:15 Log-Likelihood: -1498.9 No. Observations: 506 AIC: 3022. Df Residuals: 494 BIC: 3072. Df Model: 11 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 36.3411 5.067 7.171 0.000 26.385 46.298 LSTAT -0.5226 0.047 -11.019 0.000 -0.616 -0.429 RM 3.8016 0.406 9.356 0.000 3.003 4.600 PTRATIO -0.9465 0.129 -7.334 0.000 -1.200 -0.693 DIS -1.4927 0.186 -8.037 0.000 -1.858 -1.128 NOX -17.3760 3.535 -4.915 0.000 -24.322 -10.430 CHAS 2.7187 0.854 3.183 0.002 1.040 4.397 B 0.0093 0.003 3.475 0.001 0.004 0.015 ZN 0.0458 0.014 3.390 0.001 0.019 0.072 CRIM -0.1084 0.033 -3.307 0.001 -0.173 -0.044 RAD 0.2996 0.063 4.726 0.000 0.175 0.424 TAX -0.0118 0.003 -3.493 0.001 -0.018 -0.005 ============================================================================== Omnibus: 178.430 Durbin-Watson: 1.078 Prob(Omnibus): 0.000 Jarque-Bera (JB): 787.785 Skew: 1.523 Prob(JB): 8.60e-172 Kurtosis: 8.300 Cond. No. 1.47e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.47e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Compare the Durbin-Watson statistic in [15] with the Durbin-Watson statistic in [29]. Comment on the Jarque-Bera statistic.
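The large condition number reported in the stepwise summary above hints at multicollinearity among the predictors. A minimal sketch checking this with variance inflation factors (VIF) from statsmodels, applied to the predictors chosen by the forward selection (the choice of these columns is just for illustration):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors selected by the forward selection above, plus a constant
X = sm.add_constant(bos[['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN', 'CRIM', 'RAD', 'TAX']])
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != 'const'}
print(vif)   # rule of thumb: VIF well above 10 suggests problematic multicollinearity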
Data Scaling "for Insights"
- The importance of scaling in regression (or clustering) when looking for insights in the data.
- Image source: https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
scaler = MinMaxScaler()
bos[['TAX', 'AGE', 'B']] = scaler.fit_transform(bos[['TAX', 'AGE', 'B']])
bos.head()
# Continue to Modelling
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 0.641607 | 4.0900 | 1.0 | 0.208015 | 15.3 | 1.000000 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 0.782698 | 4.9671 | 2.0 | 0.104962 | 17.8 | 1.000000 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 0.599382 | 4.9671 | 2.0 | 0.104962 | 17.8 | 0.989737 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 0.441813 | 6.0622 | 3.0 | 0.066794 | 18.7 | 0.994276 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 0.528321 | 6.0622 | 3.0 | 0.066794 | 18.7 | 1.000000 | 5.33 | 36.2 |
Pitfalls: Regression Is Interpolation, "Not" Extrapolation (Forecasting)
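A minimal sketch of this pitfall using the simple blood-pressure model lm fitted earlier (ages 40-71): predicting inside the observed age range is interpolation, while asking the same fitted line about an age of 5 or 120 is extrapolation, and the answers, although numerically valid, are not supported by the data:

# lm is the simple model tekanan_darah ~ usia fitted on ages 40-71 above
new_ages = pd.DataFrame({'usia': [45, 60, 5, 120]})   # first two interpolate, last two extrapolate
print(lm.predict(new_ages))
# Age 5 -> ~102 and age 120 -> ~180: the line happily extrapolates,
# but nothing in the data supports these predictions.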
Not covered yet:
- Logistic regression [to be covered in the classification topic]
- Piecewise regression (non-linear)
- Probit/Tobit regression (probabilistic)
- Bayesian regression
- Logic regression (more robust than logistic regression for fraud detection)
- Quantile regression (extreme events)
- LAD regression (L1)
- Jackknife regression
- SVR
- ARIMA (Time Series)
- Ecologic Regression
Exercise Case Study: Advertising Spend Investment
# Example
# Load the CSV data file
try:
df = pd.read_csv('data/iklan.csv') # run locally
except:
!wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/iklan.csv # "Google Colab"
df = pd.read_csv('iklan.csv')
df.head()
No | Iklan | Laba | Tipe | |
---|---|---|---|---|
0 | 1 | 10 | 9.17 | 1 |
1 | 2 | 1 | 1.32 | 0 |
2 | 3 | 12 | 8.54 | 1 |
3 | 4 | 12 | 7.68 | 1 |
4 | 5 | 5 | 7.15 | 1 |
p = sns.pairplot(df, hue="Tipe")
# Do Modelling Here ... Don't forget to interpret.