kennethleungty / logistic-regression-assumptions

Assumptions of Logistic Regression, Clearly Explained

Home Page: https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290

Language: Jupyter Notebook (100.00%)
Topics: logistic-regression, logistic-regression-algorithm, logistic-regression-assumptions, logistic-regression-classifier, logistic-regression-implementation, logistic-regression-models, python, statistics

logistic-regression-assumptions's Introduction

Assumptions of Logistic Regression, Clearly Explained

Understanding and implementing the assumption checks behind one of the most important statistical techniques in data science: logistic regression

  • Link to TowardsDataScience article: https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290
  • Logistic regression is a highly effective modeling technique that has remained a mainstay in statistics since its development in the 1940s.
  • Given its popularity and utility, data practitioners should understand the fundamentals of logistic regression before using it to tackle data and business problems.
  • In this project, we explore the key assumptions of logistic regression, with theoretical explanations and practical Python implementations of the assumption checks. A minimal model-fitting sketch follows below.
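
As a quick orientation, here is a minimal sketch of fitting the kind of model whose assumptions the notebook then checks. It is a sketch only: the column names ('Survived', 'Age', 'Fare', 'Pclass') and the path data/train.csv follow the standard Kaggle Titanic layout and are assumed, not taken from the notebook.

import pandas as pd
import statsmodels.api as sm

# Hypothetical sketch: fit a logistic regression (binomial GLM) on the
# Titanic train set. Column names and the CSV path are assumed, not
# taken from the notebook.
df = pd.read_csv('data/train.csv').dropna(subset=['Age'])
X = sm.add_constant(df[['Age', 'Fare', 'Pclass']])
y = df['Survived']

logit_results = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(logit_results.summary())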

Contents

(1) Logistic_Regression_Assumptions.ipynb

  • The main notebook containing the Python implementation code (along with explanations) for checking each of the six key assumptions of logistic regression

(2) Box-Tidwell-Test-in-R.ipynb

  • Notebook containing R code for running the Box-Tidwell test (to check the linearity-of-logit assumption); a rough Python sketch of this check follows the Contents list

(3) /data

  • Folder containing the public Titanic dataset (train set)

(4) /references

  • Folder containing several sets of lecture notes explaining advanced regression
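
As referenced in (2) above, here is a rough Python sketch of the Box-Tidwell idea: add an x*ln(x) interaction term for each continuous predictor and test its significance. This uses synthetic data and illustrates the technique; it is not the notebook's R implementation.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative Box-Tidwell-style check on synthetic data: a significant
# coefficient on age*ln(age) would suggest the linearity-of-logit
# assumption is violated for 'age'.
rng = np.random.default_rng(42)
age = rng.uniform(1, 80, 500)
prob = 1 / (1 + np.exp(-(0.03 * age - 1)))
df_bt = pd.DataFrame({'age': age, 'y': (rng.random(500) < prob).astype(int)})

df_bt['age_log_age'] = df_bt['age'] * np.log(df_bt['age'])
X_bt = sm.add_constant(df_bt[['age', 'age_log_age']])
bt_results = sm.Logit(df_bt['y'], X_bt).fit(disp=0)
print(bt_results.pvalues['age_log_age'])  # large p-value: no evidence of non-linearity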

Special Thanks

  • @dataninj4 for correcting the imports and adding .loc referencing in the diagnosis_df cell so that it runs without errors in Python 3.6/3.8
  • @ArneTR for rightly pointing out that the VIF calculation should include a constant and that the correlation matrix should exclude the target variable (a brief sketch follows below; see also the first issue)
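
For instance, the corrected correlation-matrix check is computed over predictors only. Column names are assumed from the standard Titanic data, as in the sketch above.

import pandas as pd

# Correlation matrix over predictors only, with the target excluded.
# 'Survived' and the feature subset are assumed, not taken from the notebook.
df = pd.read_csv('data/train.csv')
predictors = df[['Age', 'Fare', 'Pclass']]
print(predictors.corr())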

logistic-regression-assumptions's People

Contributors

dataninjato, kennethleungty

logistic-regression-assumptions's Issues

VIF needs a constant, and the target should be removed

Hi @kennethleungty ,

thank you for the great notebook.

I just worked through the Python notebook and think the VIF calculation should not include the target variable.

  1. Since you just want to assess the multicollinearity of the covariates, the target must be removed.

  2. Also, to my knowledge, the VIF calculation must include an intercept. The Python function variance_inflation_factor does not add one by itself, so sm.add_constant() must be called on the dataframe beforehand.

For #1 I have no specific source, but remember this from my statistics lectures.
For #2 I have the following sources:

Compare:

import statsmodels.api as sm
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# mtcars is already returned as a pandas DataFrame
df_raw = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = df_raw.loc[:, ['disp', 'hp', 'wt', 'drat']]

## Without intercept
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
vif["features"] = df.columns
vif

## With intercept
df_const = sm.add_constant(df, prepend=False)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(df_const.values, i) for i in range(df_const.shape[1])]
vif["features"] = df_const.columns
vif

Another, more readable way that is even closer to R would be to use formulas:

from patsy import dmatrices

y, X = dmatrices('mpg ~ disp + hp + wt + drat', data=df_raw, return_type='dataframe')  # patsy adds the intercept automatically
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
vif

Looking forward to your feedback.

Assumption 5 - Independence of observations

Hi @kennethleungty ,

In my opinion, plotting residuals vs. index (or time) is usually misleading. I prefer to check the independence assumption via the Ljung-Box test of autocorrelation. I've written a function which does this for me.

from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.genmod.generalized_linear_model import GLMResultsWrapper


def check_independence(model: GLMResultsWrapper, order: int):

    '''
    1. Perform the Ljung-Box test to check if the residuals are autocorrelated
    2. Print both the null hypothesis and the p-values
    '''

    # If the lags parameter is an integer, it is taken to be the largest lag
    # included; the test result is reported for all smaller lag lengths
    ljungbox_pvalues = acorr_ljungbox(x=model.resid_deviance.values, lags=order)['lb_pvalue'].round(2)
    boolean_mask = ljungbox_pvalues > 0.05

    if not ljungbox_pvalues.empty:
        print(f'The null hypothesis of the Ljung-Box test is that there is no autocorrelation in the residuals at any order up to {order}.')
        print('p-value = P(observing a test statistic at least this extreme | H0 true)')
        print(f'p-values of the Ljung-Box test are: {ljungbox_pvalues.tolist()}')
        print(f'p-values > 0.05, so the residuals are uncorrelated at lags {ljungbox_pvalues[boolean_mask].index.tolist()}')
        print(f'p-values < 0.05, so the residuals are autocorrelated at lags {ljungbox_pvalues[~boolean_mask].index.tolist()}')

In your example there is some autocorrelation; check out the output below.

check_independence(model=logit_results, order=10)
The null hypothesis of the Ljung-Box test is that there is no autocorrelation in the residuals at any order up to 10.
p-value = P(observing a test statistic at least this extreme | H0 true)
p-values of the Ljung-Box test are: [0.36, 0.6, 0.62, 0.71, 0.02, 0.01, 0.02, 0.04, 0.04, 0.07]
p-values > 0.05, so the residuals are uncorrelated at lags [1, 2, 3, 4, 10]
p-values < 0.05, so the residuals are autocorrelated at lags [5, 6, 7, 8, 9]
