In this lecture, you'll learn how to run your first multiple linear regression model.
You will be able to:
- Introduce Statsmodels for multiple regression
- Present alternatives for running regression in Scikit-learn
This lecture will be more of a code-along, where we will walk through a multiple linear regression model using both Statsmodels and Scikit-Learn.
Remember that we introduced simple linear regression before, fitted with a method known as ordinary least squares (OLS). It determines a line of best fit by minimizing the sum of squares of the errors between the model's predictions and the actual data. In algebra and statistics classes, this is often limited to the simple two-variable case of a straight line, y = mx + b; multiple linear regression extends the same idea to several predictors.
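To make "minimizing the sum of squares of the errors" concrete, here is a minimal sketch with two predictors. The data is synthetic and the helper name sse is made up for illustration; it is not part of the lesson's dataset or of any library.

```python
import numpy as np

# Synthetic data (made up for illustration): y depends on two predictors plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the coefficients that minimize the sum of squared errors.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

def sse(coefs):
    """Sum of squared errors between the predictions A @ coefs and y (hypothetical helper)."""
    residuals = y - A @ coefs
    return float(residuals @ residuals)

best = sse(beta)
# Nudging a coefficient away from the least-squares solution increases the error.
nudged = sse(beta + np.array([0.0, 0.05, 0.0]))
print(beta, best, nudged)
```

The fitted coefficients land close to the values used to generate the data (1, 2, and -3), and any other choice of coefficients gives a larger sum of squared errors.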
The code below reiterates the steps we've taken before: we've created dummies for our categorical variables and have log-transformed some of our continuous predictors.
import pandas as pd
import numpy as np
data = pd.read_csv("auto-mpg.csv")
data['horsepower'] = data['horsepower'].astype(str).astype(int)
acc = data["acceleration"]
logdisp = np.log(data["displacement"])
loghorse = np.log(data["horsepower"])
logweight= np.log(data["weight"])
scaled_acc = (acc-min(acc))/(max(acc)-min(acc))
scaled_disp = (logdisp-np.mean(logdisp))/np.sqrt(np.var(logdisp))
scaled_horse = (loghorse-np.mean(loghorse))/(max(loghorse)-min(loghorse))
scaled_weight= (logweight-np.mean(logweight))/np.sqrt(np.var(logweight))
data_fin = pd.DataFrame([])
data_fin["acc"]= scaled_acc
data_fin["disp"]= scaled_disp
data_fin["horse"] = scaled_horse
data_fin["weight"] = scaled_weight
cyl_dummies = pd.get_dummies(data["cylinders"], prefix="cyl")
yr_dummies = pd.get_dummies(data["model year"], prefix="yr")
orig_dummies = pd.get_dummies(data["origin"], prefix="orig")
mpg = data["mpg"]
data_fin = pd.concat([mpg, data_fin, cyl_dummies, yr_dummies, orig_dummies], axis=1)
data_fin.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 392 entries, 0 to 391
Data columns (total 26 columns):
mpg 392 non-null float64
acc 392 non-null float64
disp 392 non-null float64
horse 392 non-null float64
weight 392 non-null float64
cyl_3 392 non-null uint8
cyl_4 392 non-null uint8
cyl_5 392 non-null uint8
cyl_6 392 non-null uint8
cyl_8 392 non-null uint8
yr_70 392 non-null uint8
yr_71 392 non-null uint8
yr_72 392 non-null uint8
yr_73 392 non-null uint8
yr_74 392 non-null uint8
yr_75 392 non-null uint8
yr_76 392 non-null uint8
yr_77 392 non-null uint8
yr_78 392 non-null uint8
yr_79 392 non-null uint8
yr_80 392 non-null uint8
yr_81 392 non-null uint8
yr_82 392 non-null uint8
orig_1 392 non-null uint8
orig_2 392 non-null uint8
orig_3 392 non-null uint8
dtypes: float64(5), uint8(21)
memory usage: 23.4 KB
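The preprocessing code above applies three different rescalings: min-max scaling (acceleration), standardization (log displacement and log weight), and mean normalization (log horsepower). As a rough sketch, they can be written as small helpers; the function names and the sample values are made up for illustration.

```python
import numpy as np

def min_max(x):
    """Rescale values to the [0, 1] range."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def standardize(x):
    """Subtract the mean and divide by the population standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()  # x.std() equals np.sqrt(np.var(x))

def mean_normalize(x):
    """Subtract the mean and divide by the range."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.max() - x.min())

values = np.array([8.0, 12.0, 16.0, 24.0])  # made-up numbers, not from auto-mpg.csv
scaled_mm = min_max(values)
scaled_std = standardize(values)
scaled_mn = mean_normalize(values)
print(scaled_mm, scaled_std, scaled_mn)
```

Min-max scaling keeps every value between 0 and 1, while the other two center the data at 0, which matters later when reading the coefficients.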
This is the data we have built so far. Since we want to focus on model interpretation and don't want a massive model for now, let's include only "acceleration", "weight", and the three "orig" categories in our final data.
data_ols = pd.concat([mpg, scaled_acc, scaled_weight, orig_dummies], axis= 1)
data_ols.head()
 | mpg | acceleration | weight | orig_1 | orig_2 | orig_3 |
---|---|---|---|---|---|---|
0 | 18.0 | 0.238095 | 0.720986 | 1 | 0 | 0 |
1 | 15.0 | 0.208333 | 0.908047 | 1 | 0 | 0 |
2 | 18.0 | 0.178571 | 0.651205 | 1 | 0 | 0 |
3 | 16.0 | 0.238095 | 0.648095 | 1 | 0 | 0 |
4 | 17.0 | 0.148810 | 0.664652 | 1 | 0 | 0 |
Now, let's use Statsmodels to run OLS on all of our data. Just like for linear regression with a single predictor, you can use the formula interface: write the outcome, a tilde, and then the predictors joined by "+".
import statsmodels.api as sm
from statsmodels.formula.api import ols
formula = "mpg ~ acceleration+weight+orig_1+orig_2+orig_3"
model = ols(formula= formula, data=data_ols).fit()
Having to type out all the predictors isn't practical when you have many. A better way is to separate the outcome variable "mpg" from your data frame and use "+".join() on the names of the predictor columns, as done below:
outcome = 'mpg'
predictors = data_ols.drop('mpg', axis=1)
pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum
model = ols(formula= formula, data=data_ols).fit()
model.summary()
Dep. Variable: | mpg | R-squared: | 0.726 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.723 |
Method: | Least Squares | F-statistic: | 256.7 |
Date: | Thu, 08 Nov 2018 | Prob (F-statistic): | 1.86e-107 |
Time: | 13:56:09 | Log-Likelihood: | -1107.2 |
No. Observations: | 392 | AIC: | 2224. |
Df Residuals: | 387 | BIC: | 2244. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 16.1041 | 0.509 | 31.636 | 0.000 | 15.103 | 17.105 |
acceleration | 5.0494 | 1.389 | 3.634 | 0.000 | 2.318 | 7.781 |
weight | -5.8764 | 0.282 | -20.831 | 0.000 | -6.431 | -5.322 |
orig_1 | 4.6566 | 0.363 | 12.839 | 0.000 | 3.944 | 5.370 |
orig_2 | 5.0690 | 0.454 | 11.176 | 0.000 | 4.177 | 5.961 |
orig_3 | 6.3785 | 0.430 | 14.829 | 0.000 | 5.533 | 7.224 |
Omnibus: | 37.427 | Durbin-Watson: | 0.840 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 55.989 |
Skew: | 0.648 | Prob(JB): | 6.95e-13 |
Kurtosis: | 4.322 | Cond. No. | 2.18e+15 |
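The formula string built with "+".join() earlier can be inspected without fitting a model. The sketch below uses a plain list of column names in place of the data frame; build_formula is a hypothetical helper, not a Statsmodels function.

```python
def build_formula(outcome, columns):
    """Build an 'outcome~x1+x2+...' string from a list of column names."""
    predictors = [col for col in columns if col != outcome]
    return outcome + "~" + "+".join(predictors)

# Column names as they appear in data_ols above.
columns = ["mpg", "acceleration", "weight", "orig_1", "orig_2", "orig_3"]
formula = build_formula("mpg", columns)
print(formula)  # mpg~acceleration+weight+orig_1+orig_2+orig_3
```

This only works as-is when every column name is a valid identifier; a name containing a space, such as "model year", would need to be renamed first.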
Or, even easier, use the OLS class from statsmodels.api directly. The advantage is that you don't have to build the formula string. Note, however, that sm.OLS does not add an intercept term by default, so you need to add a constant column to your predictors dataframe yourself. You can do this with sm.add_constant().
import statsmodels.api as sm
predictors_int = sm.add_constant(predictors)
model = sm.OLS(data['mpg'],predictors_int).fit()
model.summary()
Dep. Variable: | mpg | R-squared: | 0.726 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.723 |
Method: | Least Squares | F-statistic: | 256.7 |
Date: | Thu, 08 Nov 2018 | Prob (F-statistic): | 1.86e-107 |
Time: | 13:56:09 | Log-Likelihood: | -1107.2 |
No. Observations: | 392 | AIC: | 2224. |
Df Residuals: | 387 | BIC: | 2244. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
const | 16.1041 | 0.509 | 31.636 | 0.000 | 15.103 | 17.105 |
acceleration | 5.0494 | 1.389 | 3.634 | 0.000 | 2.318 | 7.781 |
weight | -5.8764 | 0.282 | -20.831 | 0.000 | -6.431 | -5.322 |
orig_1 | 4.6566 | 0.363 | 12.839 | 0.000 | 3.944 | 5.370 |
orig_2 | 5.0690 | 0.454 | 11.176 | 0.000 | 4.177 | 5.961 |
orig_3 | 6.3785 | 0.430 | 14.829 | 0.000 | 5.533 | 7.224 |
Omnibus: | 37.427 | Durbin-Watson: | 0.840 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 55.989 |
Skew: | 0.648 | Prob(JB): | 6.95e-13 |
Kurtosis: | 4.322 | Cond. No. | 2.18e+15 |
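To see why the constant column matters, here is a small NumPy sketch on made-up, noise-free data. It mimics the idea behind adding a constant (a column of ones in the design matrix) rather than calling Statsmodels itself.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x  # made-up line with intercept 3 and slope 2

# Without a constant column, the fitted line is forced through the origin.
no_const, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)

# With a column of ones, the first coefficient plays the role of the intercept.
X_const = np.column_stack([np.ones_like(x), x])
with_const, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(no_const)    # slope only, distorted by the missing intercept
print(with_const)  # intercept and slope
```

With the column of ones the fit recovers the intercept 3 and slope 2; without it, the slope is pushed up to 3 to compensate for the missing intercept.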
Just like for simple linear regression, each coefficient should be interpreted as "how does Y change for each additional unit of X, holding the other predictors constant?" Note that because we transformed X, interpretation requires a little more attention. As the model is built on the transformed X, the actual relationship is "how does Y change for each additional unit of X'", where X' is the transformed (log-transformed, min-max scaled, standardized, ...) data matrix.
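As a rough illustration of what "one unit of X'" means, the sketch below converts the two fitted coefficients back toward the original scales. The raw acceleration range and the standard deviation of log(weight) are assumed values chosen for illustration; in practice you would compute them from your own data.

```python
import math

coef_acc = 5.0494      # acceleration coefficient from the summary above
coef_weight = -5.8764  # weight coefficient from the summary above

# ASSUMPTION: raw acceleration ranges from 8.0 to 24.8 (check against your data).
acc_min, acc_max = 8.0, 24.8
# Min-max scaling: one unit of X' spans the whole raw range,
# so the change in mpg per raw unit is the coefficient divided by the range.
mpg_per_raw_acc_unit = coef_acc / (acc_max - acc_min)

# ASSUMPTION: the standard deviation of log(weight) is 0.28 (illustrative only).
log_weight_sd = 0.28
# Standardized log: one unit of X' is one standard deviation of log(weight),
# which corresponds to multiplying the raw weight by exp(sd).
weight_multiplier = math.exp(log_weight_sd)

print(round(mpg_per_raw_acc_unit, 4))
print(round(weight_multiplier, 4), coef_weight)
```

Under these assumptions, the acceleration coefficient describes the change in mpg across the full acceleration range, and the weight coefficient describes the change in mpg when weight is multiplied by a fixed factor rather than increased by a fixed amount.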
You can also repeat this process using Scikit-learn; the code is shown below. Scikit-learn is best known for its machine learning functionality and is very popular for building a clear data science workflow. It is also commonly used by data scientists for regression. Its disadvantage compared to Statsmodels is that it doesn't make some statistical metrics, such as the p-values of the parameter estimates, readily available. For a fuller comparison of Scikit-learn and Statsmodels, you can read this blog post: https://blog.thedataincubator.com/2017/11/scikit-learn-vs-statsmodels/.
from sklearn.linear_model import LinearRegression
y = data_ols['mpg']
linreg = LinearRegression()
linreg.fit(predictors, y)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
# coefficients
linreg.coef_
array([ 5.04941007, -5.87640551, -0.71140721, -0.29903267, 1.01043987])
The intercept of the model is stored in the .intercept_ attribute.
# intercept
linreg.intercept_
21.472164286075383
You might have noticed that running our regression in Scikit-learn and Statsmodels returned (partially) different parameter estimates. Let's put them side by side:
 | Statsmodels | Scikit-learn |
---|---|---|
intercept | 16.1041 | 21.4722 |
acceleration | 5.0494 | 5.0494 |
weight | -5.8764 | -5.8764 |
orig_1 | 4.6566 | -0.7114 |
orig_2 | 5.0690 | -0.2990 |
orig_3 | 6.3785 | 1.0104 |
These models return equivalent results! We'll use an example to illustrate this. Remember that minmax-scaling was used on acceleration, and standardization on log(weight).
Let's assume a particular observation with a value of 0.5 for both acceleration and weight after transformation, and let's assume that the origin of the car is orig_3. The predicted value of mpg for this observation will then be equal to:
- 16.1041 + 5.0494 * 0.5 + (-5.8764) * 0.5 + 6.3785 = 22.0691 according to the Statsmodels model
- 21.4722 + 5.0494 * 0.5 + (-5.8764) * 0.5 + 1.0104 = 22.0691 according to the Scikit-learn model
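The end result is the same. The estimates for the categorical variables are the same "up to a constant": each Statsmodels origin coefficient is about 5.368 higher than its Scikit-learn counterpart, and the Scikit-learn intercept is higher by that same amount (21.4722 - 16.1041 = 5.3681). This is possible because the three origin dummies always sum to 1, which makes them perfectly collinear with the intercept, so a constant can be moved between the intercept and the origin coefficients without changing any prediction.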
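The arithmetic above can be checked with a few lines of plain Python. The coefficients are copied from the side-by-side table; the predict helper is made up for this check and is not a library function.

```python
# Coefficients copied from the side-by-side table above.
statsmodels_fit = {"intercept": 16.1041, "acceleration": 5.0494, "weight": -5.8764,
                   "orig_1": 4.6566, "orig_2": 5.0690, "orig_3": 6.3785}
sklearn_fit = {"intercept": 21.4722, "acceleration": 5.0494, "weight": -5.8764,
               "orig_1": -0.7114, "orig_2": -0.2990, "orig_3": 1.0104}

def predict(fit, acceleration, weight, origin):
    """Predicted mpg for one car; origin is 'orig_1', 'orig_2' or 'orig_3'."""
    return (fit["intercept"]
            + fit["acceleration"] * acceleration
            + fit["weight"] * weight
            + fit[origin])

pred_sm = predict(statsmodels_fit, 0.5, 0.5, "orig_3")
pred_sk = predict(sklearn_fit, 0.5, 0.5, "orig_3")

# How much each origin coefficient differs between the two fits.
shifts = [round(statsmodels_fit[k] - sklearn_fit[k], 4)
          for k in ("orig_1", "orig_2", "orig_3")]
# How much the intercept differs, in the opposite direction.
intercept_shift = round(sklearn_fit["intercept"] - statsmodels_fit["intercept"], 4)

print(round(pred_sm, 4), round(pred_sk, 4))
print(shifts, intercept_shift)
```

Both predictions agree, and the shift in the origin coefficients matches the shift in the intercept up to rounding in the printed summaries.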
You can make sure to get the same result in both Statsmodels and Scikit-learn by dropping one of the orig_ levels. This way, you're essentially forcing the coefficient of this level to be equal to zero, so it becomes the reference category, and the intercept and the other coefficients will be the same in both libraries.
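Why dropping a level fixes the ambiguity can be seen from the rank of the design matrix. The sketch below uses a handful of made-up origin values, not the auto-mpg data, and builds the dummy columns by hand with NumPy.

```python
import numpy as np

origin = np.array([1, 1, 2, 3, 2, 3, 1])  # made-up origin codes for seven cars

# One dummy column per level, like pd.get_dummies would produce.
dummies = (origin[:, None] == np.array([1, 2, 3])).astype(float)
ones = np.ones((len(origin), 1))

# Intercept plus all three dummies: 4 columns, but the dummies sum to the ones column.
full = np.hstack([ones, dummies])
# Intercept plus two dummies (the third level dropped): 3 independent columns.
reduced = np.hstack([ones, dummies[:, :2]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # more columns than independent directions
print(reduced.shape[1], rank_reduced)  # every column carries its own information
```

With all three dummies the matrix has four columns but only rank 3, so the least-squares solution is not unique; after dropping one level, the number of columns equals the rank and the coefficients are pinned down. Note that pd.get_dummies also accepts drop_first=True, which drops the first level (orig_1 here) rather than orig_3, so the reference category would differ from the one used below.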
This is how you do it in Scikit-learn:
predictors = predictors.drop("orig_3",axis=1)
linreg.fit(predictors, y)
linreg.coef_
array([ 5.04941007, -5.87640551, -1.72184708, -1.30947254])
linreg.intercept_
22.482604160455665
And Statsmodels:
pred_sum = "+".join(predictors.columns)
formula = outcome + "~" + pred_sum
model = ols(formula= formula, data=data_ols).fit()
model.summary()
Dep. Variable: | mpg | R-squared: | 0.726 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.723 |
Method: | Least Squares | F-statistic: | 256.7 |
Date: | Thu, 08 Nov 2018 | Prob (F-statistic): | 1.86e-107 |
Time: | 13:56:09 | Log-Likelihood: | -1107.2 |
No. Observations: | 392 | AIC: | 2224. |
Df Residuals: | 387 | BIC: | 2244. |
Df Model: | 4 | ||
Covariance Type: | nonrobust |
 | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
---|---|---|---|---|---|---|
Intercept | 22.4826 | 0.789 | 28.504 | 0.000 | 20.932 | 24.033 |
acceleration | 5.0494 | 1.389 | 3.634 | 0.000 | 2.318 | 7.781 |
weight | -5.8764 | 0.282 | -20.831 | 0.000 | -6.431 | -5.322 |
orig_1 | -1.7218 | 0.653 | -2.638 | 0.009 | -3.005 | -0.438 |
orig_2 | -1.3095 | 0.688 | -1.903 | 0.058 | -2.662 | 0.043 |
Omnibus: | 37.427 | Durbin-Watson: | 0.840 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 55.989 |
Skew: | 0.648 | Prob(JB): | 6.95e-13 |
Kurtosis: | 4.322 | Cond. No. | 9.59 |
Congrats! You now know how to build a linear regression model with multiple predictors in both Scikit-Learn and Statsmodels. Before we discuss the model metrics in detail, let's go ahead and try out this model on the Boston Housing Data Set!