import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

Predictive Regression Modeling Workflow

This dataset was downloaded from Kaggle and contains information about used car sale listings. We are trying to predict the price associated with the listing.

Features (as described on Kaggle)

Car_Name: The name of the car
Year: The year in which the car was bought
Selling_Price: The price the owner wants to sell the car at
Present_Price: The current ex-showroom price of the car
Kms_Driven: The distance completed by the car in km
Fuel_Type: The fuel type of the car (Petrol, Diesel, or Other)
Seller_Type: Whether the seller is a dealer or an individual
Transmission: Whether the car is manual or automatic
Owner: The number of owners the car has previously had

Looking at the original website, it looks like the prices are listed in lakhs, meaning hundreds of thousands of rupees.

df = pd.read_csv("cars.csv")

df.head()

	Car_Name	Year	Selling_Price	Present_Price	Kms_Driven	Fuel_Type	Seller_Type	Transmission
0	ritz	2014	3.35	5.59	27000	Petrol	Dealer	Manual
1	sx4	2013	4.75	9.54	43000	Diesel	Dealer	Manual
2	ciaz	2017	7.25	9.85	6900	Petrol	Dealer	Manual
3	wagon r	2011	2.85	4.15	5200	Petrol	Dealer	Manual
4	swift	2014	4.60	6.87	42450	Diesel	Dealer	Manual

df.describe()

	Year	Selling_Price	Present_Price	Kms_Driven	Owner
count	301.000000	301.000000	301.000000	301.000000	301.000000
mean	2013.627907	4.661296	7.628472	36947.205980	0.043189
std	2.891554	5.082812	8.644115	38886.883882	0.247915
min	2003.000000	0.100000	0.320000	500.000000	0.000000
25%	2012.000000	0.900000	1.200000	15000.000000	0.000000
50%	2014.000000	3.600000	6.400000	32000.000000	0.000000
75%	2016.000000	6.000000	9.900000	48767.000000	0.000000
max	2018.000000	35.000000	92.600000	500000.000000	3.000000

df.isna().sum()

Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64

sns.pairplot(df)

<seaborn.axisgrid.PairGrid at 0x11c9e6f28>

Train-Test Split

Before performing any preprocessing or modeling, set aside a holdout test set.

X = df.drop("Selling_Price", axis=1)
y = df["Selling_Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
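The call above does not pass `test_size`, so scikit-learn falls back to its default of holding out 25% of the rows. Here is a minimal sketch of what that means for a dataset of 301 rows; `X_demo` and `y_demo` are synthetic stand-ins, not the car listings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the listings data (301)
X_demo = np.arange(301).reshape(-1, 1)
y_demo = np.arange(301)

# With no test_size given, train_test_split holds out 25% of the rows
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))

# Asking for the same split explicitly gives the same rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
```

These sizes match the 225 training rows and 76 test rows that show up later in this notebook.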

Baseline Model: Linear Regression with Numeric Features Only

We have four numeric features (Year, Present_Price, Kms_Driven, and Owner) and four non-numeric features (Car_Name, Fuel_Type, Seller_Type, Transmission). Before doing any of the engineering work needed to use the non-numeric features, let's try using only the numeric ones.

lin_reg_model = LinearRegression()

X_train_numeric = X_train[["Year", "Present_Price", "Kms_Driven", "Owner"]].copy()

baseline_cross_val_score = cross_val_score(lin_reg_model, X_train_numeric, y_train, cv=5)
baseline_cross_val_score

array([0.67168559, 0.75626366, 0.88591659, 0.79241643, 0.84299344])

Ok, not too bad: across the five folds, r-squared ranges from 0.67 to 0.89 for a linear regression with just the numeric features.
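To make it clearer where those five numbers come from, here is a rough sketch of what `cross_val_score` does for a regressor when `cv=5`: it splits the rows into five consecutive folds, fits a fresh copy of the model on four of them, and scores (r-squared by default) on the fifth. The data here is synthetic (`make_regression`), so the scores themselves mean nothing for the car listings.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, only for illustration
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)
model = LinearRegression()

# Manual version: fit on four folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# Library version
library_scores = cross_val_score(model, X_demo, y_demo, cv=5)
print(manual_scores)
print(library_scores)
```

Each score is computed on rows the model did not see during fitting, which is why the values vary from fold to fold.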

Add One-Hot Encoded Features

Let's see if adding in some of those non-numeric features helps

# resetting the index so we can concatenate the one-hot encoded dfs more easily
# (drop=True discards the old index instead of adding it as a column)
X_train_all_features = X_train.reset_index(drop=True)

def encode_and_concat_feature_train(X_train_all_features, feature_name):
    """
    Helper function for transforming training data.  It takes in the full X dataframe and
    feature name, makes a one-hot encoder, and returns the encoder as well as the dataframe
    with that feature transformed into multiple columns of 1s and 0s
    """
    # make a one-hot encoder and fit it to the training data
    ohe = OneHotEncoder(categories="auto", handle_unknown="ignore")
    single_feature_df = X_train_all_features[[feature_name]]
    ohe.fit(single_feature_df)
    
    # call helper function that actually encodes the feature and concats it
    X_train_all_features = encode_and_concat_feature(X_train_all_features, feature_name, ohe)
    
    return ohe, X_train_all_features

def encode_and_concat_feature(X, feature_name, ohe):
    """
    Helper function for transforming a feature into multiple columns of 1s and 0s. Used
    in both training and testing steps.  Takes in the full X dataframe, feature name, 
    and encoder, and returns the dataframe with that feature transformed into multiple
    columns of 1s and 0s
    """
    # create new one-hot encoded df based on the feature
    single_feature_df = X[[feature_name]]
    feature_array = ohe.transform(single_feature_df).toarray()
    ohe_df = pd.DataFrame(feature_array, columns=ohe.categories_[0])
    
    # drop the old feature from X and concat the new one-hot encoded df
    X = X.drop(feature_name, axis=1)
    X = pd.concat([X, ohe_df], axis=1)
    
    return X

# we will need each of these encoders later for transforming the test data

fuel_type_ohe, X_train_all_features = encode_and_concat_feature_train(X_train_all_features, "Fuel_Type")
seller_type_ohe, X_train_all_features = encode_and_concat_feature_train(X_train_all_features, "Seller_Type")
transmission_ohe, X_train_all_features = encode_and_concat_feature_train(X_train_all_features, "Transmission")
# putting car name at the end just because there are the most categories
car_name_ohe, X_train_all_features = encode_and_concat_feature_train(X_train_all_features, "Car_Name")
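The encoders above are created with `handle_unknown="ignore"`. A small sketch of what that setting does when a category shows up at transform time that was not present when the encoder was fitted; the two tiny dataframes are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data: the encoder only ever sees Petrol and Diesel while fitting
train_demo = pd.DataFrame({"Fuel_Type": ["Petrol", "Diesel", "Petrol"]})
new_demo = pd.DataFrame({"Fuel_Type": ["Diesel", "CNG"]})

ohe_demo = OneHotEncoder(categories="auto", handle_unknown="ignore")
ohe_demo.fit(train_demo)

# Known category -> a single 1; unseen category (CNG) -> a row of all zeros
encoded = ohe_demo.transform(new_demo).toarray()
print(list(ohe_demo.categories_[0]))
print(encoded)
```

Without `handle_unknown="ignore"`, the encoder would raise an error on the unseen value instead of producing a row of zeros.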

X_train_all_features.columns

Index(['Year', 'Present_Price', 'Kms_Driven', 'Owner', 'CNG', 'Diesel',
       'Petrol', 'Dealer', 'Individual', 'Automatic', 'Manual', '800',
       'Activa 3g', 'Bajaj  ct 100', 'Bajaj Avenger 150',
       'Bajaj Avenger 150 street', 'Bajaj Avenger 220',
       'Bajaj Avenger 220 dtsi', 'Bajaj Avenger Street 220',
       'Bajaj Discover 100', 'Bajaj Discover 125', 'Bajaj Dominar 400',
       'Bajaj Pulsar 135 LS', 'Bajaj Pulsar 150', 'Bajaj Pulsar 220 F',
       'Bajaj Pulsar NS 200', 'Bajaj Pulsar RS200', 'Hero  Ignitor Disc',
       'Hero Extreme', 'Hero Glamour', 'Hero Honda Passion Pro', 'Hero Hunk',
       'Hero Passion Pro', 'Hero Passion X pro', 'Hero Splender Plus',
       'Hero Splender iSmart', 'Hero Super Splendor', 'Honda Activa 4G',
       'Honda CB Hornet 160R', 'Honda CB Shine', 'Honda CB Unicorn',
       'Honda CB twister', 'Honda CBR 150', 'Honda Karizma', 'Hyosung GT250R',
       'KTM 390 Duke ', 'KTM RC200', 'KTM RC390', 'Royal Enfield Bullet 350',
       'Royal Enfield Classic 350', 'Royal Enfield Classic 500',
       'Royal Enfield Thunder 350', 'Royal Enfield Thunder 500',
       'Suzuki Access 125', 'TVS Apache RTR 160', 'TVS Apache RTR 180',
       'TVS Jupyter', 'TVS Sport ', 'TVS Wego', 'Yamaha FZ  v 2.0',
       'Yamaha FZ 16', 'Yamaha FZ S ', 'Yamaha FZ S V 2.0', 'alto 800',
       'alto k10', 'amaze', 'baleno', 'brio', 'camry', 'ciaz', 'city',
       'corolla altis', 'creta', 'dzire', 'elantra', 'eon', 'ertiga',
       'etios cross', 'etios g', 'etios gd', 'etios liva', 'fortuner',
       'grand i10', 'i10', 'i20', 'ignis', 'innova', 'jazz', 'land cruiser',
       'omni', 'ritz', 'swift', 'sx4', 'verna', 'wagon r', 'xcent'],
      dtype='object')

X_train_all_features

	Year	Present_Price	Kms_Driven	Owner	CNG	Diesel	Petrol	Dealer	Individual	Automatic	...	innova	jazz	land cruiser	omni	ritz	swift	sx4	verna	wagon r	xcent
0	2017	0.84	5000	0	0.0	0.0	1.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
1	2015	14.79	12900	0	0.0	0.0	1.0	1.0	0.0	1.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2	2015	0.32	35000	0	0.0	0.0	1.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
3	2015	13.60	21780	0	0.0	0.0	1.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
4	2015	5.90	14465	0	0.0	0.0	1.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
220	2013	0.57	18000	0	0.0	0.0	1.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
221	2011	12.48	45000	0	0.0	1.0	0.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
222	2014	3.45	16500	1	0.0	0.0	1.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
223	2011	10.00	69341	0	0.0	0.0	1.0	1.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
224	2017	1.78	4000	0	0.0	0.0	1.0	0.0	1.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

225 rows × 96 columns

Linear Regression with More Features

lin_reg_model = LinearRegression()

print("Old:", baseline_cross_val_score)
print("New:", cross_val_score(lin_reg_model, X_train_all_features, y_train, cv=5))

Old: [0.67168559 0.75626366 0.88591659 0.79241643 0.84299344]
New: [ 5.37672694e-01 -7.38091761e+12  9.16586477e-01  7.58859065e-01
  7.52699829e-01]

! conda list scikit-learn

# packages in environment at /Users/ehoffman/.conda/envs/prework-labs:
#
# Name                    Version                   Build  Channel
scikit-learn              0.22             py37h3dc85bc_1    conda-forge

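That looks worse: one fold's r-squared is hugely negative (about -7.4e12), which means the predictions on that fold are far worse than simply predicting the mean. What if we don't use the car name, and just use the categorical features with 2-3 distinct values?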
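One possible contributor to the unstable fold (an assumption, not something verified in this notebook) is that Car_Name has a lot of categories relative to the 225 training rows, so some names may occur only once and never appear in the rows a given fold is fitted on. Here is a sketch of how to check how many categories are that rare; the `car_names` series is a made-up example, not the real column.

```python
import pandas as pd

# Hypothetical sample of car names, only for illustration
car_names = pd.Series(
    ["ritz", "swift", "swift", "city", "city", "city", "camry", "Hero Hunk"]
)

# How often does each name occur?
counts = car_names.value_counts()

# Names that occur exactly once cannot be in both the fitting rows
# and the validation rows of the same fold
singletons = counts[counts == 1]
share_singletons = len(singletons) / len(counts)
print(counts)
print(share_singletons)
```

On the real data, the same check would be `X_train["Car_Name"].value_counts()`.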

X_train_all_except_car_name = X_train_all_features[[
                    "Year",
                    "Present_Price",
                    "Kms_Driven",
                    "Owner",
                    "CNG",
                    "Diesel",
                    "Petrol",
                    "Dealer",
                    "Individual",
                    "Automatic",
                    "Manual"
                ]].copy()
X_train_all_except_car_name

	Year	Present_Price	Kms_Driven	Owner	CNG	Diesel	Petrol	Dealer	Individual	Automatic	Manual
0	2017	0.84	5000	0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
1	2015	14.79	12900	0	0.0	0.0	1.0	1.0	0.0	1.0	0.0
2	2015	0.32	35000	0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
3	2015	13.60	21780	0	0.0	0.0	1.0	1.0	0.0	0.0	1.0
4	2015	5.90	14465	0	0.0	0.0	1.0	1.0	0.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...
220	2013	0.57	18000	0	0.0	0.0	1.0	0.0	1.0	0.0	1.0
221	2011	12.48	45000	0	0.0	1.0	0.0	1.0	0.0	0.0	1.0
222	2014	3.45	16500	1	0.0	0.0	1.0	0.0	1.0	0.0	1.0
223	2011	10.00	69341	0	0.0	0.0	1.0	1.0	0.0	0.0	1.0
224	2017	1.78	4000	0	0.0	0.0	1.0	0.0	1.0	0.0	1.0

225 rows × 11 columns

lin_reg_model = LinearRegression()

print("Old:", baseline_cross_val_score)
print("New:", cross_val_score(lin_reg_model, X_train_all_except_car_name, y_train, cv=5))

Old: [0.67168559 0.75626366 0.88591659 0.79241643 0.84299344]
New: [0.71773298 0.65008625 0.92349676 0.81151078 0.88264199]

lin_reg_model.fit(X_train_all_except_car_name, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

lin_reg_model.coef_

array([ 3.66737867e-01,  4.24013125e-01, -5.41982063e-06, -1.01134981e+00,
       -1.09963446e+00,  1.42995678e+00, -3.30322320e-01,  6.31302952e-01,
       -6.31302952e-01,  7.77845019e-01, -7.77845019e-01])

lin_reg_model.intercept_

-736.3855028465609

lin_reg_model.rank_

lin_reg_model.get_params()

{'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'normalize': False}

Ok, adding these categories improved r-squared for 4 out of 5 subsamples compared to just having numeric features, so let's keep them for our linear regression model
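Comparing two arrays of fold scores by eye gets tedious, so here is a small sketch that summarizes them. The score values are copied from the "Old" and "New" output above; the `summarize` helper is just a convenience defined here, not part of scikit-learn.

```python
import numpy as np

# Fold scores copied from the printed output above
old_scores = np.array([0.67168559, 0.75626366, 0.88591659, 0.79241643, 0.84299344])
new_scores = np.array([0.71773298, 0.65008625, 0.92349676, 0.81151078, 0.88264199])

def summarize(scores):
    """Return the mean and standard deviation of the fold scores."""
    return scores.mean(), scores.std()

old_mean, old_std = summarize(old_scores)
new_mean, new_std = summarize(new_scores)

# Number of folds where the new feature set scored higher
folds_improved = int((new_scores > old_scores).sum())

print(f"Old: mean={old_mean:.3f}, std={old_std:.3f}")
print(f"New: mean={new_mean:.3f}, std={new_std:.3f}")
print("Folds improved:", folds_improved)
```

The mean only moves from about 0.790 to about 0.797, so the improvement is real on 4 of 5 folds but small overall.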

best_linreg_cross_val_score = cross_val_score(lin_reg_model, X_train_all_except_car_name, y_train, cv=5)

Try a More Advanced Model

Whether these numbers are sufficient depends on our business case. Across the five folds, the model explains between 65% and 92% of the variance in the sale price. Let's see whether a more complicated model does better.

First, just using the X_train values used in the linear regression:

random_forest_regressor_model_1 = RandomForestRegressor(n_estimators=10, random_state=42)

print("Old:", best_linreg_cross_val_score)
print("New:", cross_val_score(random_forest_regressor_model_1, X_train_all_except_car_name, y_train, cv=5))

Old: [0.71773298 0.65008625 0.92349676 0.81151078 0.88264199]
New: [0.83102658 0.66532476 0.90650579 0.81714334 0.92862009]

Ok, this more-sophisticated model is performing slightly better on 4 of 5 subsamples than the best linear regression score. Let's see what happens if we add the car names back in:

random_forest_regressor_model_2 = RandomForestRegressor(n_estimators=10, random_state=42)

print("Old:", cross_val_score(random_forest_regressor_model_1, X_train_all_except_car_name, y_train, cv=5))
print("New:", cross_val_score(random_forest_regressor_model_2, X_train_all_features, y_train, cv=5))

Old: [0.83102658 0.66532476 0.90650579 0.81714334 0.92862009]
New: [0.8120682  0.7237103  0.90434184 0.80154837 0.92665771]

Only one of the subsamples improved with adding this feature, and everything else got worse

Hyperparameter Tuning the More Advanced Model

Let's add some more "power" to the random forest regressor, since it's running reasonably quickly right now

random_forest_regressor_model_3 = RandomForestRegressor(n_estimators=1000, random_state=42)

print("Old:", cross_val_score(random_forest_regressor_model_1, X_train_all_except_car_name, y_train, cv=5))
print("New:", cross_val_score(random_forest_regressor_model_3, X_train_all_except_car_name, y_train, cv=5))

Old: [0.83102658 0.66532476 0.90650579 0.81714334 0.92862009]
New: [0.85347083 0.72854389 0.90506631 0.83936817 0.93520377]

That marginally improved 4 of the 5 subsamples (but was significantly slower to run). Let's try including the car name again:

random_forest_regressor_model_4 = RandomForestRegressor(n_estimators=1000, random_state=42)

print("Old:", cross_val_score(random_forest_regressor_model_3, X_train_all_except_car_name, y_train, cv=5))
print("New:", cross_val_score(random_forest_regressor_model_4, X_train_all_features, y_train, cv=5))

Old: [0.85347083 0.72854389 0.90506631 0.83936817 0.93520377]
New: [0.82470832 0.71709424 0.89251424 0.8456903  0.93846238]

Again, that didn't really seem to help. So if we're stopping right now, we can say that the third random forest regressor is the best model.
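The tuning above was done by hand, one `n_estimators` value at a time. As a sketch of a more systematic alternative (not what this notebook actually ran), scikit-learn's `GridSearchCV` cross-validates every combination in a parameter grid. The data is synthetic and the grid values are arbitrary small choices so the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, only for illustration
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

# Hypothetical grid: 2 x 2 = 4 combinations, each scored with 5-fold CV
param_grid = {"n_estimators": [10, 50], "max_depth": [None, 5]}

search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
search.fit(X_demo, y_demo)

# Best combination and its mean cross-validated r-squared
print(search.best_params_)
print(search.best_score_)
```

Like the manual comparisons above, this only ever touches training data, so the holdout set stays untouched for the final evaluation.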

Model Evaluation

Now that we have chosen a best model, let's use the holdout set to see how well the final model does

First, perform all of the same transformations on the test X that were performed on the train X

X_test_all_except_car_name = X_test.reset_index(drop=True).drop("Car_Name", axis=1)

# fuel_type_ohe, seller_type_ohe, and transmission_ohe were fitted on the training data
X_test_all_except_car_name = encode_and_concat_feature(X_test_all_except_car_name, "Fuel_Type", fuel_type_ohe)
X_test_all_except_car_name = encode_and_concat_feature(X_test_all_except_car_name, "Seller_Type", seller_type_ohe)
X_test_all_except_car_name = encode_and_concat_feature(X_test_all_except_car_name, "Transmission", transmission_ohe)

X_test_all_except_car_name

	Year	Present_Price	Kms_Driven	Owner	CNG	Diesel	Petrol	Dealer	Individual	Automatic	Manual
0	2016	0.57	24000	0	0.0	0.0	1.0	0.0	1.0	1.0	0.0
1	2016	13.60	10980	0	0.0	0.0	1.0	1.0	0.0	0.0	1.0
2	2012	9.40	60000	0	0.0	1.0	0.0	1.0	0.0	0.0	1.0
3	2011	0.57	35000	1	0.0	0.0	1.0	0.0	1.0	0.0	1.0
4	2013	18.61	40001	0	0.0	0.0	1.0	1.0	0.0	0.0	1.0
...	...	...	...	...	...	...	...	...	...	...	...
71	2011	8.01	50000	0	0.0	0.0	1.0	1.0	0.0	1.0	0.0
72	2016	7.90	28569	0	0.0	0.0	1.0	1.0	0.0	0.0	1.0
73	2015	7.27	40534	0	0.0	0.0	1.0	1.0	0.0	0.0	1.0
74	2012	4.43	23709	0	0.0	0.0	1.0	1.0	0.0	0.0	1.0
75	2016	1.40	35000	0	0.0	0.0	1.0	0.0	1.0	0.0	1.0

76 rows × 11 columns
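Keeping track of each fitted encoder by hand works, but it is easy to forget a step when transforming the test data. As a sketch of an alternative (not the approach used in this notebook), a `Pipeline` with a `ColumnTransformer` bundles the encoding and the model so the same fitted transformations are applied automatically at prediction time. The small dataframes and target values are toy data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training and test rows, only for illustration
train_demo = pd.DataFrame({
    "Year": [2014, 2013, 2017, 2011, 2014, 2016],
    "Present_Price": [5.59, 9.54, 9.85, 4.15, 6.87, 0.57],
    "Fuel_Type": ["Petrol", "Diesel", "Petrol", "Petrol", "Diesel", "Petrol"],
    "Transmission": ["Manual", "Manual", "Manual", "Manual", "Manual", "Automatic"],
})
y_train_demo = [3.35, 4.75, 7.25, 2.85, 4.60, 0.40]
test_demo = pd.DataFrame({
    "Year": [2015, 2012],
    "Present_Price": [7.0, 3.0],
    "Fuel_Type": ["Diesel", "CNG"],  # CNG was never seen during fitting
    "Transmission": ["Manual", "Automatic"],
})

# One-hot encode the categorical columns, pass numeric columns through unchanged
preprocess = ColumnTransformer(
    transformers=[
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["Fuel_Type", "Transmission"]),
    ],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
])

# Encoders are fitted on the training rows only, then reused for the test rows
pipeline.fit(train_demo, y_train_demo)
predictions = pipeline.predict(test_demo)
print(predictions)
```

A pipeline like this can also be passed straight to `cross_val_score`, so the encoders get refitted inside each fold.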

Fit our best model on all of the training data

random_forest_regressor_model_3.fit(X_train_all_except_car_name, y_train)

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=1000, n_jobs=None, oob_score=False,
                      random_state=42, verbose=0, warm_start=False)

Score our best model on the test data

random_forest_regressor_model_3.score(X_test_all_except_car_name, y_test)

0.9706072528266274

That's pretty good! On the holdout set, the model explains about 97% of the variance in the car sale list prices.

To report something more applicable to a business audience, let's calculate the root mean square error

from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, random_forest_regressor_model_3.predict(X_test_all_except_car_name))

0.8075452595118476

np.sqrt(0.8075452595118476)

0.8986352204937483

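To interpret this: our predictions of Selling_Price are typically off (either too high or too low) by about 0.9 lakh, i.e. about 90,000 rupees (about 1200 USD). Because RMSE squares the errors before averaging them, large misses count more heavily than they would in a plain average of the errors.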
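For reference, here is a short sketch of the RMSE calculation written out with NumPy, plus the lakh-to-rupee conversion. The `rmse` helper and the three-value toy arrays are made up for illustration; the MSE value is the one printed above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square the errors, average them, take the square root."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy check: squared errors are 0, 0, and 4, so MSE = 4/3
toy_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])

# Converting the reported test error from lakhs to rupees (1 lakh = 100,000 rupees)
RUPEES_PER_LAKH = 100_000
reported_mse = 0.8075452595118476  # value printed by mean_squared_error above
reported_rmse_lakhs = float(np.sqrt(reported_mse))
reported_rmse_rupees = reported_rmse_lakhs * RUPEES_PER_LAKH

print(toy_rmse)
print(reported_rmse_lakhs, reported_rmse_rupees)
```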

Also, here is a plot that shows the actual vs. predicted prices:

fig, ax = plt.subplots()

ax.set_xlabel("True Price (Lakhs)")
ax.set_ylabel("Predicted Price (Lakhs)")

ax.scatter(y_test, random_forest_regressor_model_3.predict(X_test_all_except_car_name), alpha=0.5);

You can see that this model performs quite well on the lower end of price, then makes more mistakes as price increases, particularly beyond 10 lakhs.
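To put a number on what the plot shows, one option is to compare the average absolute error below and above the 10 lakh mark. This is only a sketch: the `actual` and `predicted` values are invented to show the calculation and are not this model's predictions.

```python
import numpy as np
import pandas as pd

# Made-up actual and predicted prices (in lakhs), only for illustration
results = pd.DataFrame({
    "actual": [0.5, 1.2, 3.0, 4.5, 8.0, 12.0, 20.0, 33.0],
    "predicted": [0.6, 1.1, 3.2, 4.4, 7.5, 10.5, 23.0, 28.0],
})

# Absolute error per listing
results["abs_error"] = (results["actual"] - results["predicted"]).abs()

# Split listings at the 10 lakh mark mentioned above
results["band"] = np.where(results["actual"] > 10, "above 10 lakhs", "10 lakhs or less")

# Mean absolute error within each price band
mae_by_band = results.groupby("band")["abs_error"].mean()
print(mae_by_band)
```

On the real data, `actual` would be `y_test` and `predicted` would be the output of `random_forest_regressor_model_3.predict(X_test_all_except_car_name)`.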

If we had made this plot before the evaluation phase, maybe we could go back to figure out if there is any additional feature engineering that would help improve the predictions for higher priced cars (e.g. extracting the make from the car name). But since this is the final evaluation phase, that would be "cheating" since it would incorporate information from the test data in the modeling phase. Later we'll discuss overfitting and why this is a problem.

Similarities and Differences with Inferential Regression Workflow

Same

Data understanding process (looking at what kinds of features we have, trying to understand the relationship to the target)
Starting with a baseline model, then building more and checking how well they performed
Preprocessing needed to prepare categorical variables (all features need to be numeric for the model to understand them)
Doing a regression analysis and looking at r-squared to see how much of the variance is explained
Evaluate model at the end

Different

Checking model performance along the way
- For predictive, we only looked at the metric of interest (r-squared in this case) and not anything else
- Assumptions of linear regression (linearity, normality, homoscedasticity, independence)
  - Checked for these in inferential but not in predictive
  - For inferential, our FSM had only one feature because we were doing these checks at each phase. For predictive, we added all numeric features into the FSM.
- Coefficients and their p-values
  - For inferential, checked these repeatedly in the modeling and evaluation process
  - For predictive, didn't really need to check coefficients at all (although we did towards the end out of curiosity), and never checked p-values
Models used
- Different packages (sklearn for predictive, statsmodels for inferential) to run the linear regressions
  - Statsmodels gives a lot more information in the model summary that we don't get from the sklearn's linear regression
- For predictive, we used models other than a linear regression (RandomForestRegressor), inferential only used linear regression
  - More advanced models like RandomForestRegressor have hyperparameters we can tune
  - Have to specify a random_state if we want the same behavior each time
Train-test split
- For inferential, we just used all available rows every time we fit the model
- For predictive, we had a test data set that was used for evaluation at the end
- For predictive, we also used cross_val_score every time instead of just fitting to everything in X_train
One-hot encoding (I added this after the lesson ended)
- For inferential, dropped the first one of any given category
- For predictive, didn't drop any
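To illustrate the last point about one-hot encoding, here is a sketch of the difference between keeping every category column and dropping the first one. With every column kept, the dummy columns for a feature always sum to 1, which duplicates the intercept; that matters when interpreting coefficients, and less when only predictions matter. The `seller_demo` dataframe is toy data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, only for illustration
seller_demo = pd.DataFrame({"Seller_Type": ["Dealer", "Individual", "Dealer", "Individual"]})

# Keep every category (as in the predictive workflow above)
full = OneHotEncoder().fit_transform(seller_demo).toarray()

# Drop the first category (as in the inferential workflow)
dropped = OneHotEncoder(drop="first").fit_transform(seller_demo).toarray()

# Add an intercept column and compare the rank with the number of columns
intercept = np.ones((len(seller_demo), 1))
design_full = np.hstack([intercept, full])
design_dropped = np.hstack([intercept, dropped])
rank_full = np.linalg.matrix_rank(design_full)
rank_dropped = np.linalg.matrix_rank(design_dropped)

print(design_full.shape[1], rank_full)        # more columns than rank: redundant
print(design_dropped.shape[1], rank_dropped)  # columns equal rank: no redundancy
```

This redundancy is also why the Dealer and Individual coefficients printed earlier are exact mirror images of each other.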

kelliexmachina / predictive-regression-workflow-seattle-ds-012720
