Multiple Linear Regression in Statsmodels

Introduction

In this lecture, you'll learn how to run your first multiple linear regression model.

Objectives

You will be able to:

  • Introduce Statsmodels for multiple regression
  • Present alternatives for running regression in Scikit Learn

Statsmodels for multiple linear regression

This lecture will be more of a code-along, where we will walk through a multiple linear regression model using both Statsmodels and Scikit-Learn.

Remember that we previously introduced simple linear regression, which is fit using ordinary least squares (OLS): it determines the line of best fit by minimizing the sum of squared errors between the model's predictions and the actual data. In algebra and statistics classes, this is often limited to the simple two-variable case $y=mx+b$, but the process can be generalized to use multiple predictive variables.
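
Concretely, with $n$ predictors the estimated model becomes $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_n x_n$, and OLS chooses the $\hat{\beta}$'s that minimize the sum of squared errors $\sum_i (y_i - \hat{y}_i)^2$.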

Auto-mpg data

The code below reiterates the steps we've taken before: we've created dummies for our categorical variables and have log-transformed some of our continuous predictors.

import pandas as pd
import numpy as np

data = pd.read_csv("auto-mpg.csv")
# make sure horsepower is stored as an integer column
data['horsepower'] = data['horsepower'].astype(str).astype(int)

# log-transform the skewed continuous predictors
acc = data["acceleration"]
logdisp = np.log(data["displacement"])
loghorse = np.log(data["horsepower"])
logweight = np.log(data["weight"])

# rescale the (log-transformed) predictors
scaled_acc = (acc - min(acc)) / (max(acc) - min(acc))
scaled_disp = (logdisp - np.mean(logdisp)) / np.sqrt(np.var(logdisp))
scaled_horse = (loghorse - np.mean(loghorse)) / (max(loghorse) - min(loghorse))
scaled_weight = (logweight - np.mean(logweight)) / np.sqrt(np.var(logweight))

data_fin = pd.DataFrame([])
data_fin["acc"] = scaled_acc
data_fin["disp"] = scaled_disp
data_fin["horse"] = scaled_horse
data_fin["weight"] = scaled_weight

# dummy-code the categorical predictors
cyl_dummies = pd.get_dummies(data["cylinders"], prefix="cyl")
yr_dummies = pd.get_dummies(data["model year"], prefix="yr")
orig_dummies = pd.get_dummies(data["origin"], prefix="orig")

mpg = data["mpg"]
data_fin = pd.concat([mpg, data_fin, cyl_dummies, yr_dummies, orig_dummies], axis=1)
data_fin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 26 columns):
mpg       392 non-null float64
acc       392 non-null float64
disp      392 non-null float64
horse     392 non-null float64
weight    392 non-null float64
cyl_3     392 non-null uint8
cyl_4     392 non-null uint8
cyl_5     392 non-null uint8
cyl_6     392 non-null uint8
cyl_8     392 non-null uint8
yr_70     392 non-null uint8
yr_71     392 non-null uint8
yr_72     392 non-null uint8
yr_73     392 non-null uint8
yr_74     392 non-null uint8
yr_75     392 non-null uint8
yr_76     392 non-null uint8
yr_77     392 non-null uint8
yr_78     392 non-null uint8
yr_79     392 non-null uint8
yr_80     392 non-null uint8
yr_81     392 non-null uint8
yr_82     392 non-null uint8
orig_1    392 non-null uint8
orig_2    392 non-null uint8
orig_3    392 non-null uint8
dtypes: float64(5), uint8(21)
memory usage: 23.4 KB

This is the data we have worked with so far. Since we want to focus on model interpretation and don't want a massive model just yet, let's only include "acceleration", "weight", and the three "orig" categories in our final data.

data_ols = pd.concat([mpg, scaled_acc, scaled_weight, orig_dummies], axis= 1)
data_ols.head()
    mpg  acceleration    weight  orig_1  orig_2  orig_3
0  18.0      0.238095  0.720986       1       0       0
1  15.0      0.208333  0.908047       1       0       0
2  18.0      0.178571  0.651205       1       0       0
3  16.0      0.238095  0.648095       1       0       0
4  17.0      0.148810  0.664652       1       0       0

A linear model using Statsmodels

Now, let's use statsmodels' formula API to run OLS on our data. Just like for linear regression with a single predictor, you can use a formula of the form $y \sim x_1 + \ldots + x_n$, listing all $n$ predictors on the right-hand side.

import statsmodels.api as sm
from statsmodels.formula.api import ols
formula = "mpg ~ acceleration+weight+orig_1+orig_2+orig_3"
model = ols(formula= formula, data=data_ols).fit()

Having to type out all the predictors isn't practical when you have many. A better way is to separate the outcome variable "mpg" from your DataFrame and build the formula string with "+".join() on the predictor column names, as done below:

outcome = 'mpg'
predictors = data_ols.drop('mpg', axis=1)
pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum
model = ols(formula= formula, data=data_ols).fit()
model.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.726
Model: OLS Adj. R-squared: 0.723
Method: Least Squares F-statistic: 256.7
Date: Thu, 08 Nov 2018 Prob (F-statistic): 1.86e-107
Time: 13:56:09 Log-Likelihood: -1107.2
No. Observations: 392 AIC: 2224.
Df Residuals: 387 BIC: 2244.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 16.1041 0.509 31.636 0.000 15.103 17.105
acceleration 5.0494 1.389 3.634 0.000 2.318 7.781
weight -5.8764 0.282 -20.831 0.000 -6.431 -5.322
orig_1 4.6566 0.363 12.839 0.000 3.944 5.370
orig_2 5.0690 0.454 11.176 0.000 4.177 5.961
orig_3 6.3785 0.430 14.829 0.000 5.533 7.224
Omnibus: 37.427 Durbin-Watson: 0.840
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55.989
Skew: 0.648 Prob(JB): 6.95e-13
Kurtosis: 4.322 Cond. No. 2.18e+15

Or, even easier, simply use OLS() from statsmodels.api. The advantage is that you don't have to create the formula string. Note, however, that an intercept term is not included by default, so you have to make sure your predictors DataFrame includes a constant column. You can add one using sm.add_constant().

import statsmodels.api as sm
predictors_int = sm.add_constant(predictors)
model = sm.OLS(data['mpg'],predictors_int).fit()
model.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.726
Model: OLS Adj. R-squared: 0.723
Method: Least Squares F-statistic: 256.7
Date: Thu, 08 Nov 2018 Prob (F-statistic): 1.86e-107
Time: 13:56:09 Log-Likelihood: -1107.2
No. Observations: 392 AIC: 2224.
Df Residuals: 387 BIC: 2244.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 16.1041 0.509 31.636 0.000 15.103 17.105
acceleration 5.0494 1.389 3.634 0.000 2.318 7.781
weight -5.8764 0.282 -20.831 0.000 -6.431 -5.322
orig_1 4.6566 0.363 12.839 0.000 3.944 5.370
orig_2 5.0690 0.454 11.176 0.000 4.177 5.961
orig_3 6.3785 0.430 14.829 0.000 5.533 7.224
Omnibus: 37.427 Durbin-Watson: 0.840
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55.989
Skew: 0.648 Prob(JB): 6.95e-13
Kurtosis: 4.322 Cond. No. 2.18e+15

Interpretation

Just like for simple linear regression, the coefficients of our model should be interpreted as "how does Y change for each additional unit of X?" Do note that, because we transformed X, interpretation can require a little more attention. Since the model is built on the transformed predictors, the actual relationship is "how does Y change for each additional unit of X'?", where X' is the (log-transformed, min-max scaled, standardized, ...) data matrix.
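
For example, because "weight" was log-transformed and then standardized, one unit of the scaled predictor corresponds to one standard deviation of log(weight). Below is a minimal sketch of how you could translate that back to the original scale; it assumes the logweight series from the preprocessing cell above is still in scope.

# one scaled unit of "weight" equals one standard deviation of log(weight)
sd_logweight = np.sqrt(np.var(logweight))
print(sd_logweight)           # change in log(weight) per scaled unit
print(np.exp(sd_logweight))   # multiplicative change in weight per scaled unit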

Linear regression using scikit learn

You can also repeat this process using Scikit-learn; the code to do this can be found below. Scikit-learn is best known for its machine learning functionality and is very popular for building clear data science workflows. It is also commonly used by data scientists for regression. The disadvantage of Scikit-learn compared to Statsmodels is that it doesn't readily expose some statistical metrics, such as the p-values of the parameter estimates. For a more in-depth comparison of Scikit-learn and Statsmodels, you can read this blogpost: https://blog.thedataincubator.com/2017/11/scikit-learn-vs-statsmodels/.

from sklearn.linear_model import LinearRegression
y = data_ols['mpg']
linreg = LinearRegression()
linreg.fit(predictors, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
# coefficients
linreg.coef_
array([ 5.04941007, -5.87640551, -0.71140721, -0.29903267,  1.01043987])

The intercept of the model is stored in the .intercept_ attribute.

# intercept
linreg.intercept_
21.472164286075383
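
Scikit-learn doesn't print a summary table, but you can still do a quick sanity check: the .score() method returns the R-squared on the data you pass in, which should match the 0.726 reported in the Statsmodels summary. A minimal check, reusing the predictors and y defined above:

# R-squared on the training data; should agree with the Statsmodels summary (~0.726)
linreg.score(predictors, y)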

Why are the coefficients different in scikit learn vs Statsmodels?

You might have noticed that running our regression in Scikit-learn and Statsmodels returned (partially) different parameter estimates. Let's put them side by side:

              Statsmodels  Scikit-learn
intercept         16.1041       21.4722
acceleration       5.0494        5.0494
weight            -5.8764       -5.8764
orig_1             4.6566       -0.7114
orig_2             5.0690       -0.2990
orig_3             6.3785        1.0104

Nevertheless, these models produce equivalent predictions! We'll use an example to illustrate this. Remember that min-max scaling was used on acceleration, and standardization on log(weight).

Let's assume a particular observation with a value of 0.5 for both acceleration and weight after transformation, and assume that the origin of the car corresponds to orig_3. The predicted mpg for this observation is then:

  • 16.1041 + 5.0494 * 0.5 + (-5.8764) * 0.5 + 6.3785 = 22.0691 according to the Statsmodels model
  • 21.4722 + 5.0494 * 0.5 + (-5.8764) * 0.5 + 1.0104 = 22.0691 according to the Scikit-learn model

The eventual result is the same. The estimates for the categorical variables are the same "up to a constant": the constant difference between them, in this case 5.3681, is absorbed into the intercept!
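
You can verify this equivalence directly. Below is a minimal sketch that scores the hypothetical observation from the bullet points above with both fitted models (x_new is an illustrative single-row DataFrame, not part of the original data):

x_new = pd.DataFrame({"acceleration": [0.5], "weight": [0.5],
                      "orig_1": [0], "orig_2": [0], "orig_3": [1]})
# the sm.OLS model was fit with an explicit constant, so add one here too
print(model.predict(sm.add_constant(x_new, has_constant='add')))
# Scikit-learn handles the intercept itself
print(linreg.predict(x_new))  # both should print roughly 22.07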

You can make sure you get the same result in both Statsmodels and Scikit-learn by dropping one of the orig_ levels. This way, you essentially force the coefficient of that level to be zero, and the intercept and the other coefficients will be the same.

This is how you do it in Scikit-learn:

predictors = predictors.drop("orig_3",axis=1)
linreg.fit(predictors, y)
linreg.coef_
array([ 5.04941007, -5.87640551, -1.72184708, -1.30947254])
linreg.intercept_
22.482604160455665

And Statsmodels:

pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum
model = ols(formula= formula, data=data_ols).fit()
model.summary()
OLS Regression Results
Dep. Variable: mpg R-squared: 0.726
Model: OLS Adj. R-squared: 0.723
Method: Least Squares F-statistic: 256.7
Date: Thu, 08 Nov 2018 Prob (F-statistic): 1.86e-107
Time: 13:56:09 Log-Likelihood: -1107.2
No. Observations: 392 AIC: 2224.
Df Residuals: 387 BIC: 2244.
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 22.4826 0.789 28.504 0.000 20.932 24.033
acceleration 5.0494 1.389 3.634 0.000 2.318 7.781
weight -5.8764 0.282 -20.831 0.000 -6.431 -5.322
orig_1 -1.7218 0.653 -2.638 0.009 -3.005 -0.438
orig_2 -1.3095 0.688 -1.903 0.058 -2.662 0.043
Omnibus: 37.427 Durbin-Watson: 0.840
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55.989
Skew: 0.648 Prob(JB): 6.95e-13
Kurtosis: 4.322 Cond. No. 9.59

Summary

Congrats! You now know how to build a linear regression model with multiple predictors in both Scikit-Learn and Statsmodels. Before we discuss the model metrics in detail, let's go ahead and try out this model on the Boston Housing Data Set!
