In this lab you'll identify multicollinearity in the Boston Housing Data set.
You will be able to:
- Plot heatmaps for the predictors of the Boston dataset
- Understand and calculate correlation matrices
Let's reimport the Boston Housing data and let's use the data with the categorical variables for tax_dummy
and rad_dummy
import pandas as pd
from sklearn.datasets import load_boston
boston = load_boston()
boston_features = pd.DataFrame(boston.data, columns = boston.feature_names)
# first, create bins for RAD based on the values observed. 5 values will result in 4 bins
bins = [0, 3, 4 , 5, 24]
bins_rad = pd.cut(boston_features['RAD'], bins)
bins_rad = bins_rad.cat.as_unordered()
# first, create bins for TAX based on the values observed. 6 values will result in 5 bins
bins = [0, 250, 300, 360, 460, 712]
bins_tax = pd.cut(boston_features['TAX'], bins)
bins_tax = bins_tax.cat.as_unordered()
tax_dummy = pd.get_dummies(bins_tax, prefix="TAX")
rad_dummy = pd.get_dummies(bins_rad, prefix="RAD")
boston_features = boston_features.drop(["RAD","TAX"], axis=1)
boston_features = pd.concat([boston_features, rad_dummy, tax_dummy], axis=1)
boston_features.head()
Create the scatter matrix for the Boston Housing data.
This took a while to load. Not surprisingly, the categorical variables didn't really provide any meaningful result. remove the categorical columns associated with "RAD" and "TAX" from the data again and look at the scatter matrix again.
Next, let's look at the correlation matrix
Return "True" for positive or negative correlations that are bigger than 0.75.
Remove the most problematic feature from the data.
Good job! You've now edited the Boston Housing Data so highly correlated variables are removed.