kennethleungty / logistic-regression-assumptions

Assumptions of Logistic Regression, Clearly Explained

Home Page: https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290

Language: Jupyter Notebook (100.00%)
Topics: logistic-regression, logistic-regression-algorithm, logistic-regression-assumptions, logistic-regression-classifier, logistic-regression-implementation, logistic-regression-models, python, statistics

logistic-regression-assumptions's Introduction

Assumptions of Logistic Regression, Clearly Explained

Understanding and implementing the assumption checks behind one of the most important statistical techniques in data science: logistic regression

  • Link to TowardsDataScience article: https://towardsdatascience.com/assumptions-of-logistic-regression-clearly-explained-44d85a22b290
  • Logistic regression is a highly effective modeling technique that has remained a mainstay in statistics since its development in the 1940s.
  • Given its popularity and utility, data practitioners should understand the fundamentals of logistic regression before using it to tackle data and business problems.
  • In this project, we explore the key assumptions of logistic regression, with theoretical explanations and practical Python implementations of the assumption checks. A minimal model-fitting sketch follows below.
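
As a quick orientation, here is a minimal sketch of fitting the kind of model whose assumptions the notebook then checks. It is a sketch only: the column names ('Survived', 'Age', 'Fare', 'Pclass') and the path data/train.csv follow the standard Kaggle Titanic layout and are assumed, not taken from the notebook.

import pandas as pd
import statsmodels.api as sm

# Hypothetical sketch: fit a logistic regression (binomial GLM) on the
# Titanic train set. Column names and the CSV path are assumed, not
# taken from the notebook.
df = pd.read_csv('data/train.csv').dropna(subset=['Age'])
X = sm.add_constant(df[['Age', 'Fare', 'Pclass']])
y = df['Survived']

logit_results = sm.GLM(y, X, family=sm.families.Binomial()).fit()
print(logit_results.summary())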

Contents

(1) Logistic_Regression_Assumptions.ipynb

  • The main notebook containing the Python implementation code (along with explanations) for checking each of the six key assumptions of logistic regression

(2) Box-Tidwell-Test-in-R.ipynb

  • Notebook containing R code for running the Box-Tidwell test (to check the linearity-of-logit assumption); a rough Python sketch of this check follows the Contents list

(3) /data

  • Folder containing the public Titanic dataset (train set)

(4) /references

  • Folder containing several sets of lecture notes explaining advanced regression
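
As referenced in (2) above, here is a rough Python sketch of the Box-Tidwell idea: add an x*ln(x) interaction term for each continuous predictor and test its significance. This uses synthetic data and illustrates the technique; it is not the notebook's R implementation.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Illustrative Box-Tidwell-style check on synthetic data: a significant
# coefficient on age*ln(age) would suggest the linearity-of-logit
# assumption is violated for 'age'.
rng = np.random.default_rng(42)
age = rng.uniform(1, 80, 500)
prob = 1 / (1 + np.exp(-(0.03 * age - 1)))
df_bt = pd.DataFrame({'age': age, 'y': (rng.random(500) < prob).astype(int)})

df_bt['age_log_age'] = df_bt['age'] * np.log(df_bt['age'])
X_bt = sm.add_constant(df_bt[['age', 'age_log_age']])
bt_results = sm.Logit(df_bt['y'], X_bt).fit(disp=0)
print(bt_results.pvalues['age_log_age'])  # large p-value: no evidence of non-linearity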

Special Thanks

  • @dataninj4 for correcting the imports and adding .loc referencing in the diagnosis_df cell so that it runs without errors in Python 3.6/3.8
  • @ArneTR for rightly pointing out that the VIF calculation should include a constant and that the correlation matrix should exclude the target variable (a brief sketch follows below; see also the first issue)
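
For instance, the corrected correlation-matrix check is computed over predictors only. Column names are assumed from the standard Titanic data, as in the sketch above.

import pandas as pd

# Correlation matrix over predictors only, with the target excluded.
# 'Survived' and the feature subset are assumed, not taken from the notebook.
df = pd.read_csv('data/train.csv')
predictors = df[['Age', 'Fare', 'Pclass']]
print(predictors.corr())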

logistic-regression-assumptions's People

Contributors

dataninjato, kennethleungty

logistic-regression-assumptions's Issues

VIF needs a constant, and the target should be removed

Hi @kennethleungty ,

thank you for the great notebook.

I just worked through the Python notebook and think the VIF calculation should not include the target variable.

  1. Since you just want to assess the multicollinearity of the covariates, the target must be removed.

  2. Also, to my knowledge, the VIF calculation must include an intercept. The Python function variance_inflation_factor does not add one by itself, so sm.add_constant() must be called on the dataframe beforehand.

For #1 I have no specific source, but remember this from my statistics lectures.
For #2 I have the following sources:

Compare:

import statsmodels.api as sm
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# mtcars is already returned as a pandas DataFrame
df_raw = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df = df_raw.loc[:, ['disp', 'hp', 'wt', 'drat']]

## Without intercept
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
vif["features"] = df.columns
vif

## With intercept
df_const = sm.add_constant(df, prepend=False)
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(df_const.values, i) for i in range(df_const.shape[1])]
vif["features"] = df_const.columns
vif

Another, more readable way that is even closer to R would be to use formulas:

from patsy import dmatrices

y, X = dmatrices('mpg ~ disp + hp + wt + drat', data=df_raw, return_type='dataframe')  # patsy adds the intercept automatically
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
vif

Looking forward to your feedback.

Assumption 5 - Independence of observations

Hi @kennethleungty ,

In my opinion, plotting residuals vs. index (or time) is usually misleading. I prefer to check the independence assumption via the Ljung-Box test of autocorrelation. I've written a function which does this for me.

from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.genmod.generalized_linear_model import GLMResultsWrapper


def check_independence(model: GLMResultsWrapper, order: int):

    '''
    1. Perform the Ljung-Box test to check if the residuals are autocorrelated
    2. Print both the null hypothesis and the p-values
    '''

    # If the lags parameter is an integer, it is taken to be the largest lag
    # included; the test result is reported for all smaller lag lengths
    ljungbox_pvalues = acorr_ljungbox(x=model.resid_deviance.values, lags=order)['lb_pvalue'].round(2)
    boolean_mask = ljungbox_pvalues > 0.05

    if not ljungbox_pvalues.empty:
        print(f'The null hypothesis of the Ljung-Box test is that there is no autocorrelation in the residuals at any order up to {order}.')
        print('p-value = P(observing a test statistic at least this extreme | H0 true)')
        print(f'p-values of the Ljung-Box test are: {ljungbox_pvalues.tolist()}')
        print(f'p-values > 0.05, so the residuals are uncorrelated at lags {ljungbox_pvalues[boolean_mask].index.tolist()}')
        print(f'p-values < 0.05, so the residuals are autocorrelated at lags {ljungbox_pvalues[~boolean_mask].index.tolist()}')

In your example there is some autocorrelation; check out the output below.

check_independence(model=logit_results, order=10)
The null hypothesis of the Ljung-Box test is that there is no autocorrelation in the residuals at any order up to 10.
p-value = P(observing a test statistic at least this extreme | H0 true)
p-values of the Ljung-Box test are: [0.36, 0.6, 0.62, 0.71, 0.02, 0.01, 0.02, 0.04, 0.04, 0.07]
p-values > 0.05, so the residuals are uncorrelated at lags [1, 2, 3, 4, 10]
p-values < 0.05, so the residuals are autocorrelated at lags [5, 6, 7, 8, 9]
