GLM-01 Prerequisites
- This article: https://taudata.blogspot.com/2019/04/asumsi-statistik-antara-benci-butuh.html
- ADSP-01: https://taudata.blogspot.com/2022/04/adsp-01.html (a good grasp of basic Python)
- Basic calculus
- Basic statistics
Video Lesson GLM-01
GLM-01 References:
- G. A. F. Seber and A. J. Lee, Linear Regression Analysis, 2nd Ed., John Wiley & Sons, 2003.
- P. McCullagh and J. A. Nelder, Generalized Linear Models, 2nd Ed., Chapman & Hall, 1989.
Data Mining: A Review of Regression Models
Introduction to Regression Models
- Used when the dependent variable (Y) is numeric (float/real) and the independent variables are numeric and/or categorical.
Starting from the Center of the Data and the Variance
$\bar{x}=\frac{1}{N}\sum_{i=1}^{N}{x_i}$ and $s^2=\frac{\sum_{i=1}^{N}{(x_i-\bar{x})^2}}{N-1}$
- Think about what the variance formula means, then compare it with the covariance formula: $\mathrm{cov}(x,y)=\frac{\sum_{i=1}^{N}{(x_i-\bar{x})(y_i-\bar{y})}}{N-1}$
From Variance to Covariance: Measuring the Linear Relationship between 2 Variables
- How does it actually work? (Statistical Thinking)
- The concept: to "co-vary" is to vary away from the mean together.
- Use "reverse" thinking to understand it.
- Usage: cov(x,y) = 2 vs cov(x,y) = -2 vs cov(x,y) = 0 (see the sketch after this list).
- Covariance = 3000? What does that mean?
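As a rough illustration (a minimal sketch on made-up toy numbers, not real measurements), the sign of the covariance tells us whether the two variables tend to move away from their means in the same direction or in opposite directions:

import numpy as np

x = np.array([1, 2, 3, 4, 5])          # toy data
y_up = np.array([2, 4, 5, 7, 9])       # tends to rise with x
y_down = y_up[::-1]                    # tends to fall as x rises

# Manual covariance: the average product of the deviations from the means
def cov_manual(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))   # population version (ddof=0)

print(cov_manual(x, y_up), np.cov(x, y_up, ddof=0)[0, 1])      # positive: they co-vary upward together
print(cov_manual(x, y_down), np.cov(x, y_down, ddof=0)[0, 1])  # negative: they move in opposite directions

Note that the magnitude (2, -2, 3000, ...) depends on the units of x and y, which is exactly why we move on to correlation next.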
From Covariance to Correlation: Statistical Thinking
- Correlation is simply the covariance divided by the two standard deviations.
- What does that mean?
- Covariance also has a geometric meaning .... it is a cosine! (see the sketch after the link below)
https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#Geometric_interpretation
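A minimal sketch (again on toy numbers) showing that the Pearson correlation is the covariance rescaled by the two standard deviations, and that geometrically it is the cosine of the angle between the mean-centered vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # toy data
y = np.array([2.0, 4.0, 5.0, 7.0, 9.0])

# Correlation = covariance / (std_x * std_y)
r_manual = np.cov(x, y, ddof=0)[0, 1] / (x.std(ddof=0) * y.std(ddof=0))

# Geometric view: cosine of the angle between the centered vectors
xc, yc = x - x.mean(), y - y.mean()
r_cosine = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(r_manual, r_cosine, np.corrcoef(x, y)[0, 1])   # all three agree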
The Value of the Pearson (Linear) Correlation Coefficient
- The Pearson correlation coefficient ranges from -1 to +1.
Be Careful
- A correlation coefficient of 0 does not mean there is no relationship between the two variables. The correct statement is: there is no LINEAR relationship, but there may be a relationship of another form, e.g., quadratic or some other non-linear function, as in the sketch below.
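A minimal sketch with synthetic data: a perfect quadratic relationship that is symmetric around zero has a Pearson correlation of (essentially) zero, even though y is completely determined by x:

import numpy as np

x = np.linspace(-3, 3, 101)   # symmetric around 0
y = x**2                      # perfect quadratic relationship, no noise at all

print(np.corrcoef(x, y)[0, 1])   # ~0: no LINEAR relationship, despite a perfect functional one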
Correlation and Causation
- "Everyone who drinks water dies."
A qualitative assessment of correlation values like this? ... Really? Why? Why not?
- Image source: https://spencermath.weebly.com/home/interpreting-the-correlation-coefficient
- Cases (social, medicine, etc.)
- Objective: prediction vs. insights
A Simple Example Case
import warnings; warnings.simplefilter('ignore')
import pandas as pd, seaborn as sns, matplotlib.pyplot as plt, numpy as np
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols
import statsmodels.api as sm, scipy.stats as stats
from sklearn.datasets import load_boston   # note: load_boston was removed in scikit-learn >= 1.2, so an older version is needed
from sklearn.preprocessing import MinMaxScaler
plt.style.use('bmh'); sns.set()
"Done"
'Done'
data = {'usia':[40, 45, 50, 53, 60, 65, 69, 71], 'tekanan_darah':[126, 124, 135, 138, 142, 139, 140, 151]}
df = pd.DataFrame.from_dict(data)
df.head(8)
usia | tekanan_darah | |
---|---|---|
0 | 40 | 126 |
1 | 45 | 124 |
2 | 50 | 135 |
3 | 53 | 138 |
4 | 60 | 142 |
5 | 65 | 139 |
6 | 69 | 140 |
7 | 71 | 151 |
# Correlation and a scatter plot to inspect the data
print('Covariance = ', np.cov(df.usia, df.tekanan_darah, ddof=0)[0][1])
print('Correlations = ', np.corrcoef(df.usia, df.tekanan_darah))
plt.scatter(df.usia, df.tekanan_darah)
plt.show()
Covariance = 76.953125 Correlations = [[1. 0.88746015] [0.88746015 1. ]]
# Better
print(df.corr())
sns.heatmap(df.corr(),cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,annot=True, annot_kws={"size": 16}, square=True)
p = sns.pairplot(df)
usia tekanan_darah usia 1.00000 0.88746 tekanan_darah 0.88746 1.00000
Interpretation
The value of ~0.89 indicates a strong positive linear correlation between age and blood pressure. There is a tendency for higher ages to be associated with higher blood pressure than lower ages.
WARNING
Correlation does not equal (imply) causation. Look carefully at the interpretation above: it does not say that higher age causes higher blood pressure, only that there is a trend or tendency. It may be that blood pressure rises as age increases, but it may also be that the high blood pressure is not due to age at all, but to some other factor not observed in the data.
Other examples from machine-learning research: beauty and confidence, finger length and IQ.
At this point we know the two variables are related, but correlation alone cannot tell us what that relationship looks like. That is why we need a regression model.
Simple Linear Regression
From Correlation to Regression
How Do We Compute the Optimal Regression Parameters?
- Why does the formula look the way it does?
- The importance of understanding the "loss function" (see the sketch after this list).
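A minimal sketch, assuming the usual squared-error (least-squares) loss: minimizing $\sum{(y_i-\hat{y}_i)^2}$ leads to the normal equations $X^TX\beta=X^Ty$, which we can solve directly on the age/blood-pressure data above and compare with statsmodels:

import numpy as np
import statsmodels.formula.api as smf

# Design matrix with an intercept column, using the usia/tekanan_darah data above
X = np.column_stack([np.ones(len(df)), df.usia.values])
y = df.tekanan_darah.values

# Minimizing the squared-error loss ||y - X b||^2 gives the normal equations X'X b = X'y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)   # [intercept, slope]; should match smf.ols("tekanan_darah ~ usia", data=df).fit().params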
Error Evaluation (Mean Squared Error)
- Be careful ... look at the formula closely ... it is not robust to outliers.
- $\hat{y}=\beta_0+\beta_1 x_1+\dots+\beta_p x_p$
- $MSE=\frac{1}{n}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2}$: the average squared difference between the predictions and the actual values in the data.
- RMSE = $\sqrt{MSE}$ ... why? (it puts the error back on the same scale/units as $y$; see the sketch after this list)
- Evaluation matters whenever we want to make predictions.
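A minimal sketch on hypothetical numbers (purely for illustration) computing MSE and RMSE by hand and showing how a single outlier inflates them:

import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0])   # hypothetical observed values
y_pred = np.array([10.5, 11.5, 11.0, 12.5])   # hypothetical model predictions

mse = np.mean((y_true - y_pred) ** 2)    # mean of the squared errors
rmse = np.sqrt(mse)                      # back on the same scale/units as y
print(mse, rmse)

# A single outlier dominates the squared error
y_true_outlier = np.array([10.0, 12.0, 11.0, 40.0])
print(np.mean((y_true_outlier - y_pred) ** 2))   # much larger: MSE is not robust to outliers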
# Fit the simple regression model
lm = smf.ols("tekanan_darah ~ usia", data=df[['tekanan_darah','usia']]).fit()
lm.summary()
# Things to check in the summary:
# 1. F-statistic
# 2. Coefficient tests
# 3. R^2
# 4. Model interpretation
# 5. Durbin-Watson ==> time series?
Dep. Variable: | tekanan_darah | R-squared: | 0.788 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.752 |
Method: | Least Squares | F-statistic: | 22.25 |
Date: | Fri, 27 Aug 2021 | Prob (F-statistic): | 0.00327 |
Time: | 07:07:05 | Log-Likelihood: | -21.920 |
No. Observations: | 8 | AIC: | 47.84 |
Df Residuals: | 6 | BIC: | 48.00 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 98.5623 | 8.266 | 11.924 | 0.000 | 78.337 | 118.788 |
usia | 0.6766 | 0.143 | 4.717 | 0.003 | 0.326 | 1.028 |
Omnibus: | 3.192 | Durbin-Watson: | 2.005 |
---|---|---|---|
Prob(Omnibus): | 0.203 | Jarque-Bera (JB): | 1.016 |
Skew: | -0.340 | Prob(JB): | 0.602 |
Kurtosis: | 1.392 | Cond. No. | 311. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
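The items in the checklist above (F-statistic, coefficient tests, R², Durbin-Watson) can also be pulled out of the fitted model programmatically; a minimal sketch using the statsmodels results object lm:

from statsmodels.stats.stattools import durbin_watson

print("F-stat:", lm.fvalue, " p-value:", lm.f_pvalue)      # overall model test
print("Coefficient p-values:\n", lm.pvalues)               # individual coefficient tests
print("R^2:", lm.rsquared, " adj. R^2:", lm.rsquared_adj)
print("Durbin-Watson:", durbin_watson(lm.resid))           # residual autocorrelation check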
# Plot the Data
p = sns.regplot(x="usia", y="tekanan_darah", data=df)   # keyword arguments; newer seaborn versions require them
Evaluating $R^2$: Model vs. No Model?
- $SSR=SST-SSE=\sum{(y_i-\bar{y})^2}-\sum{(y_i-\hat{y}_i)^2}$, so $R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}$ compares the model against simply predicting the mean $\bar{y}$ (see the sketch below).
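A minimal sketch computing $R^2$ for the simple blood-pressure model lm directly from SST and SSE and checking it against lm.rsquared:

import numpy as np

y = df.tekanan_darah.values
y_hat = lm.fittedvalues.values

sst = np.sum((y - y.mean()) ** 2)   # total variation around the mean (the "no model" baseline)
sse = np.sum((y - y_hat) ** 2)      # variation left over after the model
ssr = sst - sse                     # variation explained by the model

print(ssr / sst, 1 - sse / sst, lm.rsquared)   # all three should be ~0.788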
Adjusted R-Squared? Why?
The Effect of the Dependent Variable on the Model
All Models Are Wrong
- A perfect/true best model does not exist, and often is not even needed.
Understand the Regression Assumptions Well
https://taudata.blogspot.com/2019/04/asumsi-statistik-antara-benci-butuh.html
Non-Linear Regression
- Why?
- When is adding model complexity not recommended?
- Regression for insights vs. regression for prediction.
- The model is still linear in the parameters.
# Load a sample dataset from the statsmodels module
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head(), df.shape, set(df['Region'])
( Lottery Literacy Wealth Region 0 41 37 73 E 1 38 51 22 N 2 66 13 61 C 3 80 46 76 E 4 79 69 83 E, (85, 4), {'C', 'E', 'N', 'S', 'W'})
# Treat "Region" as a categorical variable
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
print(res.params)
print(res.summary())
Intercept 38.651655 C(Region)[T.E] -15.427785 C(Region)[T.N] -10.016961 C(Region)[T.S] -4.548257 C(Region)[T.W] -10.091276 Literacy -0.185819 Wealth 0.451475 dtype: float64 OLS Regression Results ============================================================================== Dep. Variable: Lottery R-squared: 0.338 Model: OLS Adj. R-squared: 0.287 Method: Least Squares F-statistic: 6.636 Date: Fri, 27 Aug 2021 Prob (F-statistic): 1.07e-05 Time: 07:07:05 Log-Likelihood: -375.30 No. Observations: 85 AIC: 764.6 Df Residuals: 78 BIC: 781.7 Df Model: 6 Covariance Type: nonrobust ================================================================================== coef std err t P>|t| [0.025 0.975] ---------------------------------------------------------------------------------- Intercept 38.6517 9.456 4.087 0.000 19.826 57.478 C(Region)[T.E] -15.4278 9.727 -1.586 0.117 -34.793 3.938 C(Region)[T.N] -10.0170 9.260 -1.082 0.283 -28.453 8.419 C(Region)[T.S] -4.5483 7.279 -0.625 0.534 -19.039 9.943 C(Region)[T.W] -10.0913 7.196 -1.402 0.165 -24.418 4.235 Literacy -0.1858 0.210 -0.886 0.378 -0.603 0.232 Wealth 0.4515 0.103 4.390 0.000 0.247 0.656 ============================================================================== Omnibus: 3.049 Durbin-Watson: 1.785 Prob(Omnibus): 0.218 Jarque-Bera (JB): 2.694 Skew: -0.340 Prob(JB): 0.260 Kurtosis: 2.454 Cond. No. 371. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
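In the output above, Region enters the model as dummy variables: one level (C, the alphabetically first) is absorbed into the intercept as the reference category, and each C(Region)[T.x] coefficient is the shift of that region relative to the reference. If you want a different baseline, patsy lets you set it explicitly; a small sketch (choosing 'E' as the reference here is just an example):

# Same model, but with Region 'E' as the reference category instead of the default
res_ref = ols(formula="Lottery ~ Literacy + Wealth + C(Region, Treatment(reference='E'))", data=df).fit()
print(res_ref.params)   # the Region coefficients are now shifts relative to region 'E'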
# Non Linear transformation
res = smf.ols(formula='Lottery ~ np.log(Literacy) + Wealth -1', data=df).fit()
print(res.summary())
OLS Regression Results ======================================================================================= Dep. Variable: Lottery R-squared (uncentered): 0.799 Model: OLS Adj. R-squared (uncentered): 0.794 Method: Least Squares F-statistic: 165.2 Date: Fri, 27 Aug 2021 Prob (F-statistic): 1.16e-29 Time: 07:07:05 Log-Likelihood: -384.16 No. Observations: 85 AIC: 772.3 Df Residuals: 83 BIC: 777.2 Df Model: 2 Covariance Type: nonrobust ==================================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------------ np.log(Literacy) 4.6426 1.246 3.727 0.000 2.165 7.120 Wealth 0.5853 0.089 6.571 0.000 0.408 0.762 ============================================================================== Omnibus: 4.188 Durbin-Watson: 1.892 Prob(Omnibus): 0.123 Jarque-Bera (JB): 4.034 Skew: -0.480 Prob(JB): 0.133 Kurtosis: 2.533 Cond. No. 25.8 ============================================================================== Notes: [1] R² is computed without centering (uncentered) since the model does not contain a constant. [2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Case Study (Boston House Pricing): Another Property Case Study
# Loading Data
boston = load_boston()
# Convert to a pandas DataFrame
bos = pd.DataFrame(boston.data)
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
bos.head()
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
# Data description
print(boston.DESCR)
.. _boston_dataset: Boston house prices dataset --------------------------- **Data Set Characteristics:** :Number of Instances: 506 :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target. :Attribute Information (in order): - CRIM per capita crime rate by town - ZN proportion of residential land zoned for lots over 25,000 sq.ft. - INDUS proportion of non-retail business acres per town - CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) - NOX nitric oxides concentration (parts per 10 million) - RM average number of rooms per dwelling - AGE proportion of owner-occupied units built prior to 1940 - DIS weighted distances to five Boston employment centres - RAD index of accessibility to radial highways - TAX full-value property-tax rate per $10,000 - PTRATIO pupil-teacher ratio by town - B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town - LSTAT % lower status of the population - MEDV Median value of owner-occupied homes in $1000's :Missing Attribute Values: None :Creator: Harrison, D. and Rubinfeld, D.L. This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter. The Boston house-price data has been used in many machine learning papers that address regression problems. .. topic:: References - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261. - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
bos.describe(include='all')
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
p = sns.pairplot(bos)
Checking Correlations between Predictors
# Heatmap to inspect the correlations
corr2 = bos.corr()   # correlation matrix of all variables, including PRICE
plt.figure(figsize=(12, 10))
sns.heatmap(corr2[(corr2 >= 0.5) | (corr2 <= -0.4)],
cmap='viridis', vmax=1.0, vmin=-1.0, linewidths=0.1,
annot=True, annot_kws={"size": 8}, square=True);
m = ols('PRICE ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m.summary())
# Don't forget to analyze and interpret the results
OLS Regression Results ============================================================================== Dep. Variable: PRICE R-squared: 0.679 Model: OLS Adj. R-squared: 0.677 Method: Least Squares F-statistic: 353.3 Date: Fri, 27 Aug 2021 Prob (F-statistic): 2.69e-123 Time: 07:08:10 Log-Likelihood: -1553.0 No. Observations: 506 AIC: 3114. Df Residuals: 502 BIC: 3131. Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 18.5671 3.913 4.745 0.000 10.879 26.255 RM 4.5154 0.426 10.603 0.000 3.679 5.352 PTRATIO -0.9307 0.118 -7.911 0.000 -1.162 -0.700 LSTAT -0.5718 0.042 -13.540 0.000 -0.655 -0.489 ============================================================================== Omnibus: 202.072 Durbin-Watson: 0.901 Prob(Omnibus): 0.000 Jarque-Bera (JB): 1022.153 Skew: 1.700 Prob(JB): 1.10e-222 Kurtosis: 9.076 Cond. No. 402. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
m2 = ols('np.log(PRICE) ~ RM + PTRATIO + LSTAT ', bos).fit()
print(m2.summary())
OLS Regression Results ============================================================================== Dep. Variable: np.log(PRICE) R-squared: 0.714 Model: OLS Adj. R-squared: 0.713 Method: Least Squares F-statistic: 418.4 Date: Fri, 27 Aug 2021 Prob (F-statistic): 3.96e-136 Time: 07:08:10 Log-Likelihood: 52.201 No. Observations: 506 AIC: -96.40 Df Residuals: 502 BIC: -79.50 Df Model: 3 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 3.5469 0.164 21.632 0.000 3.225 3.869 RM 0.1044 0.018 5.849 0.000 0.069 0.139 PTRATIO -0.0391 0.005 -7.927 0.000 -0.049 -0.029 LSTAT -0.0353 0.002 -19.974 0.000 -0.039 -0.032 ============================================================================== Omnibus: 44.245 Durbin-Watson: 0.916 Prob(Omnibus): 0.000 Jarque-Bera (JB): 179.110 Skew: 0.246 Prob(JB): 1.28e-39 Kurtosis: 5.873 Cond. No. 402. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# define figure size
fig = plt.figure(figsize=(12,8))
# produce regression plots
fig = sm.graphics.plot_regress_exog(m,'RM', fig=fig)
plt.rc("figure", figsize=(16,12))
plt.rc("font", size=14)
fig = sm.graphics.plot_fit(m, "RM")
fig.tight_layout(pad=1.0)
model_fitted_y = m.fittedvalues # fitted values
model_residuals = m.resid # residuals
model_norm_residuals = m.get_influence().resid_studentized_internal # studentized (normalized) residuals
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals)) # sqrt of absolute normalized residuals
# absolute residuals
model_abs_resid = np.abs(model_residuals)
# leverage, from statsmodels internals
model_leverage = m.get_influence().hat_matrix_diag
# cook's distance, from statsmodels internals
model_cooks = m.get_influence().cooks_distance[0]
plot_lm_1 = plt.figure()
plot_lm_1.axes[0] = sns.residplot(x=model_fitted_y, y=bos["PRICE"],
                                  lowess=True,
                                  scatter_kws={'alpha': 0.5},
                                  line_kws={'color': 'red', 'lw': 1, 'alpha': 0.8})
plot_lm_1.axes[0].set_title('Residuals vs Fitted')
plot_lm_1.axes[0].set_xlabel('Fitted values')
plot_lm_1.axes[0].set_ylabel('Residuals');
Variable Selection: Stepwise Selection in Regression Analysis
def forward_selected(data, response):
"""Linear model designed by forward selection.
https://planspace.org/20150423-forward_selection_with_statsmodels/
Parameters:
-----------
data : pandas DataFrame with all possible predictors and response
response: string, name of response column in data
Returns:
--------
model: an "optimal" fitted statsmodels linear model
with an intercept
selected by forward selection
evaluated by adjusted R-squared
"""
remaining = set(data.columns)
remaining.remove(response)
selected = []
current_score, best_new_score = 0.0, 0.0
while remaining and current_score == best_new_score:
scores_with_candidates = []
for candidate in remaining:
formula = "{} ~ {} + 1".format(response,
' + '.join(selected + [candidate]))
score = smf.ols(formula, data).fit().rsquared_adj
scores_with_candidates.append((score, candidate))
scores_with_candidates.sort()
best_new_score, best_candidate = scores_with_candidates.pop()
if current_score < best_new_score:
remaining.remove(best_candidate)
selected.append(best_candidate)
current_score = best_new_score
formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
model = smf.ols(formula, data).fit()
return model
model = forward_selected(bos, 'PRICE')
print(model.model.formula)
print(model.rsquared_adj)
PRICE ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + CRIM + RAD + TAX + 1 0.7348057723274566
# How do we interpret the coefficients?
print(model.summary())
OLS Regression Results ============================================================================== Dep. Variable: PRICE R-squared: 0.741 Model: OLS Adj. R-squared: 0.735 Method: Least Squares F-statistic: 128.2 Date: Fri, 27 Aug 2021 Prob (F-statistic): 5.54e-137 Time: 07:08:15 Log-Likelihood: -1498.9 No. Observations: 506 AIC: 3022. Df Residuals: 494 BIC: 3072. Df Model: 11 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 36.3411 5.067 7.171 0.000 26.385 46.298 LSTAT -0.5226 0.047 -11.019 0.000 -0.616 -0.429 RM 3.8016 0.406 9.356 0.000 3.003 4.600 PTRATIO -0.9465 0.129 -7.334 0.000 -1.200 -0.693 DIS -1.4927 0.186 -8.037 0.000 -1.858 -1.128 NOX -17.3760 3.535 -4.915 0.000 -24.322 -10.430 CHAS 2.7187 0.854 3.183 0.002 1.040 4.397 B 0.0093 0.003 3.475 0.001 0.004 0.015 ZN 0.0458 0.014 3.390 0.001 0.019 0.072 CRIM -0.1084 0.033 -3.307 0.001 -0.173 -0.044 RAD 0.2996 0.063 4.726 0.000 0.175 0.424 TAX -0.0118 0.003 -3.493 0.001 -0.018 -0.005 ============================================================================== Omnibus: 178.430 Durbin-Watson: 1.078 Prob(Omnibus): 0.000 Jarque-Bera (JB): 787.785 Skew: 1.523 Prob(JB): 8.60e-172 Kurtosis: 8.300 Cond. No. 1.47e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.47e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Compare the Durbin-Watson statistic in [15] with the Durbin-Watson statistic in [29]. Comment on the Jarque-Bera statistic.
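The large condition number reported in the stepwise summary above hints at multicollinearity among the predictors. A minimal sketch checking this with variance inflation factors (VIF) from statsmodels, applied to the predictors chosen by the forward selection (the choice of these columns is just for illustration):

from statsmodels.stats.outliers_influence import variance_inflation_factor

# Predictors selected by the forward selection above, plus a constant
X = sm.add_constant(bos[['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN', 'CRIM', 'RAD', 'TAX']])
vif = {col: variance_inflation_factor(X.values, i)
       for i, col in enumerate(X.columns) if col != 'const'}
print(vif)   # rule of thumb: VIF well above 10 suggests problematic multicollinearity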
Data Scaling "for Insights"
- The importance of scaling in regression (or clustering) when looking for insights in the data.
- Image source: https://medium.com/greyatom/why-how-and-when-to-scale-your-features-4b30ab09db5e
scaler = MinMaxScaler()
bos[['TAX', 'AGE', 'B']] = scaler.fit_transform(bos[['TAX', 'AGE', 'B']])
bos.head()
# Continue to Modelling
CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 0.641607 | 4.0900 | 1.0 | 0.208015 | 15.3 | 1.000000 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 0.782698 | 4.9671 | 2.0 | 0.104962 | 17.8 | 1.000000 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 0.599382 | 4.9671 | 2.0 | 0.104962 | 17.8 | 0.989737 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 0.441813 | 6.0622 | 3.0 | 0.066794 | 18.7 | 0.994276 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 0.528321 | 6.0622 | 3.0 | 0.066794 | 18.7 | 1.000000 | 5.33 | 36.2 |
Pitfalls: Regression Is Interpolation, "Not" Extrapolation (Forecasting)
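A minimal sketch of this pitfall using the simple blood-pressure model lm fitted earlier (ages 40-71): predicting inside the observed age range is interpolation, while asking the same fitted line about an age of 5 or 120 is extrapolation, and the answers, although numerically valid, are not supported by the data:

# lm is the simple model tekanan_darah ~ usia fitted on ages 40-71 above
new_ages = pd.DataFrame({'usia': [45, 60, 5, 120]})   # first two interpolate, last two extrapolate
print(lm.predict(new_ages))
# Age 5 -> ~102 and age 120 -> ~180: the line happily extrapolates,
# but nothing in the data supports these predictions.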
Not covered yet:
- Logistic regression [to be covered in the classification topic]
- Piecewise regression (non-linear)
- Probit/Tobit regression (probabilistic)
- Bayesian regression
- Logic regression (more robust than logistic regression for fraud detection)
- Quantile regression (extreme events)
- LAD regression (L1)
- Jackknife regression
- SVR
- ARIMA (Time Series)
- Ecologic Regression
Exercise Case Study: Advertising Spend Investment
# Example
# Load the CSV data file
try:
df = pd.read_csv('data/iklan.csv') # run locally
except:
!wget https://raw.githubusercontent.com/taudata-indonesia/eLearning/master/data/iklan.csv # "Google Colab"
df = pd.read_csv('iklan.csv')
df.head()
No | Iklan | Laba | Tipe | |
---|---|---|---|---|
0 | 1 | 10 | 9.17 | 1 |
1 | 2 | 1 | 1.32 | 0 |
2 | 3 | 12 | 8.54 | 1 |
3 | 4 | 12 | 7.68 | 1 |
4 | 5 | 5 | 7.15 | 1 |
p = sns.pairplot(df, hue="Tipe")
# Do Modelling Here ... Don't forget to interpret.