In this project I apply and compare logistic regression models to an imbalanced dataset of historical lending activity in order to predict healthy and high-risk loans.
lending_data.csv - labeled (0 - healthy, 1 - high-risk) historical lending activity from a peer-to-peer lending services company
This analysis compares two logistic regression models, one trained on the original imbalanced data and one trained on randomly oversampled data, to see how their predictive performance differs. The dataset is labeled loan data with the features loan_size, interest_rate, borrower_income, debt_to_income, num_of_accounts, derogatory_marks, and total_debt.
The loan_status column is the label that distinguishes healthy loans (0) from high-risk loans (1). The original data is heavily imbalanced, with 75,036 healthy loans and 2,500 high-risk loans (about 3.2% of the 77,536 loans).
Both models are scikit-learn LogisticRegression models. They differ only in their training data: after the data is split into training and testing sets, one model is trained on the original (imbalanced) training data and the other on randomly oversampled training data, which contains an even 56,271 samples each of healthy and high-risk loans. After training, both models predict on the same testing data, and the results are analyzed using scikit-learn's balanced_accuracy_score and confusion_matrix and imbalanced-learn's classification_report_imbalanced.
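As a rough illustration of what random oversampling does to the training data, here is a minimal plain-Python sketch. It is not the notebook's code, which relies on imbalanced-learn for this step; the function name random_oversample, the seed, and the toy loan values are all made up for the example.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of the smaller classes until
    every class has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap (k is 0 for the largest class).
        extra = rng.choices(members, k=target - len(members))
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy training set (made-up values): 12 healthy (0) and 3 high-risk (1) loans.
rows = [[10000 + 100 * i, 7.0 + 0.1 * i] for i in range(15)]
labels = [0] * 12 + [1] * 3

resampled_rows, resampled_labels = random_oversample(rows, labels)
print(Counter(labels))            # class counts before resampling
print(Counter(resampled_labels))  # class counts after resampling
```

Only the training split is resampled. The testing data keeps its original class balance so that both models are scored on the same realistic distribution.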
LogisticRegression model trained on original, imbalanced, data:
- Balanced accuracy score = 0.9520479254722232
- Precision scores:
  - Healthy loans = 1.0: of the loans that the model predicted to be healthy, about 100% were actually healthy
  - High-risk loans = 0.85: of the loans that the model predicted to be high-risk, about 85% were actually high-risk
- Recall scores:
  - Healthy loans = 0.99: of all the actually healthy loans, the model correctly predicted about 99% to be healthy
  - High-risk loans = 0.91: of all the actually high-risk loans, the model correctly predicted about 91% to be high-risk
LogisticRegression model trained on randomly oversampled data:
- Balanced accuracy score = 0.9936781215845847
- Precision scores:
  - Healthy loans = 1.0: of the loans that the model predicted to be healthy, about 100% were actually healthy
  - High-risk loans = 0.84: of the loans that the model predicted to be high-risk, about 84% were actually high-risk
- Recall scores:
  - Healthy loans = 0.99: of all the actually healthy loans, the model correctly predicted about 99% to be healthy
  - High-risk loans = 0.99: of all the actually high-risk loans, the model correctly predicted about 99% to be high-risk
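To make the scores above concrete, the sketch below computes per-class precision, per-class recall, and balanced accuracy from the four cells of a binary confusion matrix. Balanced accuracy is the mean of the two recall scores, which is why it sits near 0.95 for the first model (recalls of 0.99 and 0.91) and near 0.99 for the second. The function name binary_metrics and the counts in the example are made up for illustration; they are not the notebook's results.

```python
def binary_metrics(tn, fp, fn, tp):
    """Per-class precision and recall plus balanced accuracy.

    Class 0 = healthy, class 1 = high-risk (treated as the positive class).
    With labels ordered [0, 1], scikit-learn's confusion_matrix is laid out
    as [[tn, fp], [fn, tp]].
    """
    recall_healthy = tn / (tn + fp)
    recall_high_risk = tp / (tp + fn)
    return {
        "precision_healthy": tn / (tn + fn),
        "precision_high_risk": tp / (tp + fp),
        "recall_healthy": recall_healthy,
        "recall_high_risk": recall_high_risk,
        # Balanced accuracy is the unweighted mean of the per-class recalls.
        "balanced_accuracy": (recall_healthy + recall_high_risk) / 2,
    }


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
scores = binary_metrics(tn=95, fp=5, fn=2, tp=8)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because each class contributes equally to the average, a model that labels every loan as healthy would score only 0.5 on balanced accuracy, even though its plain accuracy on imbalanced data would look high.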
Since the goal is to identify high-risk loans, I recommend the model trained on randomly oversampled data: its recall for high-risk loans is 0.99, compared with 0.91 for the model trained on the original data, and its balanced accuracy rises from about 0.952 to about 0.994. This gain costs only a 0.01 reduction in precision for high-risk loans (from 0.85 to 0.84), which is negligible next to the 0.08 improvement in recall.
One caveat to these results and recommendations is that the models have not been checked for overfitting, so it would be a good idea to rerun this analysis with a separate validation set. Assuming the models learned well and are not highly overfit to the dataset, oversampling improves performance for the purpose of predicting high-risk loans.
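One way to set up the validation check mentioned above is a stratified three-way split, so that the rare high-risk class appears in the training, validation, and testing sets in roughly the same proportion. The following plain-Python sketch is only an assumption about how such a split could look; the function name stratified_three_way_split, the 15% fractions, and the toy labels are hypothetical and not part of the notebook.

```python
import random


def stratified_three_way_split(labels, val_frac=0.15, test_frac=0.15, seed=1):
    """Return (train, val, test) index lists that preserve class proportions."""
    rng = random.Random(seed)

    # Collect the row indices belonging to each class.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_frac)
        n_test = round(len(indices) * test_frac)
        val.extend(indices[:n_val])
        test.extend(indices[n_val:n_val + n_test])
        train.extend(indices[n_val + n_test:])
    return train, val, test


# Toy labels (made up): 300 healthy (0) and 20 high-risk (1) loans.
labels = [0] * 300 + [1] * 20
train_idx, val_idx, test_idx = stratified_three_way_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a split like this, any oversampling would be applied to the training indices only, and a large gap between training and validation scores would be a sign of overfitting.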
This is a Python 3.7 project run in JupyterLab using a Conda dev environment.
The following dependencies are used:
- Jupyter - Running code
- Conda (4.13.0) - Dev environment
- Pandas (1.3.5) - Data analysis
- Numpy (1.21.5) - Data calculations + Pandas support
- Scikit-learn (1.0.2) - Machine learning models and tools
- Imbalanced-learn (0.10.1) - Imbalanced classification dataset tools
If you would like to run the program in JupyterLab, install the Anaconda distribution and run jupyter lab in a conda dev environment.
To ensure that your notebook runs properly, you can use the requirements.txt file to create an exact copy of the conda dev environment used in the development of this project.
Create a copy of the conda dev environment with: conda create --name myenv --file requirements.txt
Then, if any packages are still missing from the environment, install the requirements with: conda install --name myenv --file requirements.txt
The Jupyter notebook credit_risk_resampling.ipynb walks through all steps of the data collection, preparation, and analysis. Data visualizations are shown inline, and accompanying analysis responses are provided.
This project uses the GNU General Public License.