ethansilvas / credit-fraud-logistic-regression

Comparison of logistic regression models trained on imbalanced historical loan data to predict healthy and high-risk loans

License: GNU General Public License v3.0

Jupyter Notebook 100.00%
imbalanced-learning logistic-regression machine-learning pandas scikit-learn

Credit Fraud Logistic Regression - UW Fintech Bootcamp Module 12 Challenge

In this project I apply logistic regression models to an imbalanced dataset of historical lending activity and compare how well they predict healthy and high-risk loans.

Data Used

lending_data.csv - labeled (0 - healthy, 1 - high-risk) historical lending activity from a peer-to-peer lending services company


Overview of the Analysis

This analysis compares two logistic regression models, one trained on the original imbalanced data and one trained on randomly oversampled data, to see how their predictive performance differs. The dataset is labeled loan data with the features loan_size, interest_rate, borrower_income, debt_to_income, num_of_accounts, derogatory_marks, and total_debt as shown below:

DataFrame head showing features loan_size, interest_rate, borrower_income, debt_to_income, num_of_accounts, derogatory_marks, and total_debt

The loan_status column is the label that distinguishes healthy loans (0) from high-risk loans (1). The original data is heavily imbalanced, with 75,036 healthy loans and 2,500 high-risk loans.
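To make the imbalance concrete, the short sketch below works through the class counts quoted above. It also shows, as an illustrative assumption rather than anything taken from the notebook, what a trivial classifier that labels every loan as healthy would score. This is one way to see why balanced accuracy is a more informative metric than plain accuracy on this data.

```python
# Class counts quoted in the text above.
healthy, high_risk = 75_036, 2_500
total = healthy + high_risk

# Share of the dataset that belongs to the minority (high-risk) class.
minority_share = high_risk / total

# Hypothetical baseline: a classifier that always predicts "healthy".
# It gets every healthy loan right and every high-risk loan wrong.
accuracy = healthy / total
recall_healthy = 1.0
recall_high_risk = 0.0

# Balanced accuracy is the mean of the per-class recalls.
balanced_accuracy = (recall_healthy + recall_high_risk) / 2

print(f"High-risk share of data:   {minority_share:.2%}")
print(f"Always-healthy accuracy:   {accuracy:.2%}")
print(f"Always-healthy balanced accuracy: {balanced_accuracy:.2f}")
```

Plain accuracy rewards the trivial baseline for the sheer number of healthy loans, while balanced accuracy stays at 0.5 because the baseline never identifies a high-risk loan.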

Both models are scikit-learn LogisticRegression models. They differ only in their training data: after the data is split into training and testing sets, one model is trained on the original (imbalanced) training data and the other on randomly oversampled training data, which ends up with an even 56,271 samples each for healthy and high-risk loans. After training, both models predict on the same testing data, and the results are analyzed with scikit-learn's balanced_accuracy_score and confusion_matrix and imbalanced-learn's classification_report_imbalanced.
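The workflow described above can be sketched roughly as follows. This is a minimal illustration and not the notebook's code. It assumes a synthetic dataset from make_classification as a stand-in for lending_data.csv, a stratified split, and a hand-written NumPy helper (oversample_minority, a hypothetical name) in place of imbalanced-learn's random oversampler, so that only NumPy and scikit-learn are required.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data (assumption: roughly 3% positive class).
X, y = make_classification(
    n_samples=5000, n_features=7, n_informative=4,
    weights=[0.97, 0.03], random_state=1,
)

# Split first, so the test set keeps the original class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=1
)


def oversample_minority(X, y, seed=1):
    """Duplicate randomly chosen minority rows until both classes match in size."""
    rng = np.random.default_rng(seed)
    counts = np.bincount(y)
    minority = int(np.argmin(counts))
    minority_idx = np.flatnonzero(y == minority)
    # Draw with replacement from the minority rows to close the gap.
    extra = rng.choice(minority_idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


# Only the training data is resampled.
X_res, y_res = oversample_minority(X_train, y_train)

baseline = LogisticRegression(max_iter=1000, random_state=1).fit(X_train, y_train)
resampled = LogisticRegression(max_iter=1000, random_state=1).fit(X_res, y_res)

# Both models are evaluated on the same untouched test set.
for name, model in [("original", baseline), ("oversampled", resampled)]:
    pred = model.predict(X_test)
    print(name, "balanced accuracy:", balanced_accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
```

Resampling after the split matters: if rows were duplicated before splitting, copies of the same loan could land in both the training and testing sets and inflate the scores.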

Results

  • LogisticRegression model trained on the original, imbalanced data:

    • Balanced accuracy score: 0.9520479254722232
    • Precision scores:
      • Healthy loans: 1.0. Of the loans that the model predicted to be healthy, about 100% were actually healthy loans
      • High-risk loans: 0.85. Of the loans that the model predicted to be high-risk, about 85% were actually high-risk loans
    • Recall scores:
      • Healthy loans: 0.99. Of all the actually healthy loans, the model correctly predicted about 99% to be healthy
      • High-risk loans: 0.91. Of all the actually high-risk loans, the model correctly predicted about 91% to be high-risk

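    Confusion matrix:
    Of the 18,765 healthy loans in the test set, 18,663 were predicted healthy and 102 were predicted high-risk. Of the 619 high-risk loans, 56 were predicted healthy and 563 were predicted high-risk.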

  • LogisticRegression model trained on randomly oversampled data:

    • Balanced accuracy score: 0.9936781215845847
    • Precision scores:
      • Healthy loans: 1.0. Of the loans that the model predicted to be healthy, about 100% were actually healthy loans
      • High-risk loans: 0.84. Of the loans that the model predicted to be high-risk, about 84% were actually high-risk loans
    • Recall scores:
      • Healthy loans: 0.99. Of all the actually healthy loans, the model correctly predicted about 99% to be healthy
      • High-risk loans: 0.99. Of all the actually high-risk loans, the model correctly predicted about 99% to be high-risk

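    Confusion matrix:
    Of the 18,765 healthy loans in the test set, 18,649 were predicted healthy and 116 were predicted high-risk. Of the 619 high-risk loans, 4 were predicted healthy and 615 were predicted high-risk.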
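The scores listed above can be reproduced directly from the two confusion matrices. The sketch below uses a small helper (binary_metrics, a hypothetical name, not part of scikit-learn) and assumes the usual layout in which rows are actual classes and columns are predicted classes, with high-risk as the positive class.

```python
def binary_metrics(tn, fp, fn, tp):
    """Per-class precision and recall plus balanced accuracy from raw counts."""
    recall_healthy = tn / (tn + fp)      # healthy loans correctly kept as healthy
    recall_high_risk = tp / (tp + fn)    # high-risk loans correctly flagged
    return {
        "precision_healthy": tn / (tn + fn),
        "precision_high_risk": tp / (tp + fp),
        "recall_healthy": recall_healthy,
        "recall_high_risk": recall_high_risk,
        # Balanced accuracy is the mean of the two recalls.
        "balanced_accuracy": (recall_healthy + recall_high_risk) / 2,
    }


# Counts taken from the confusion matrices reported in the Results.
original = binary_metrics(tn=18663, fp=102, fn=56, tp=563)
oversampled = binary_metrics(tn=18649, fp=116, fn=4, tp=615)

for name, metrics in [("original", original), ("oversampled", oversampled)]:
    print(name)
    for key, value in metrics.items():
        print(f"  {key}: {value:.4f}")
```

Working from the counts also shows that the healthy-loan precision of 1.0 is a rounded value: both models still let a few high-risk loans through as healthy.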

Summary

Since the goal is to predict high-risk loans, I would recommend the randomly oversampled model: its recall for high-risk loans is 0.99, compared to 0.91 for the model trained on the original data. This gain in recall costs only a 0.01 reduction in precision for high-risk loans (0.85 to 0.84), which is negligible because precision remains reasonably high at 0.84.
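The trade-off behind this recommendation can also be read in raw loan counts. The sketch below is simple arithmetic on the false-positive, false-negative, and true-positive counts from the two confusion matrices in the Results; the dictionary layout and variable names are illustrative only.

```python
# High-risk class counts from the confusion matrices in the Results.
original = {"fp": 102, "fn": 56, "tp": 563}
oversampled = {"fp": 116, "fn": 4, "tp": 615}

# High-risk loans that the original model missed but the oversampled model caught.
fewer_missed = original["fn"] - oversampled["fn"]

# Healthy loans that are additionally flagged as high-risk after oversampling.
extra_false_alarms = oversampled["fp"] - original["fp"]


def precision(counts):
    """Share of loans flagged as high-risk that really are high-risk."""
    return counts["tp"] / (counts["tp"] + counts["fp"])


# Unrounded change in high-risk precision between the two models.
precision_drop = precision(original) - precision(oversampled)

print(f"High-risk loans no longer missed: {fewer_missed}")
print(f"Additional healthy loans flagged: {extra_false_alarms}")
print(f"Unrounded precision drop: {precision_drop:.4f}")
```

On this test set the oversampled model catches 52 more high-risk loans at the price of 14 more healthy loans being flagged, and the unrounded precision drop is smaller than the rounded scores suggest.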

A few caveats apply to these results and recommendations: the models should be checked for overfitting, and it would be a good idea to repeat this analysis with a validation set as well. However, assuming that the models learned well and are not highly overfit to the dataset, oversampling improves performance for the purpose of predicting high-risk loans.
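One possible way to run such a check is stratified k-fold cross-validation, comparing training and validation scores. The sketch below is an assumption about how this could be done and is not part of the original notebook: it uses a synthetic dataset in place of lending_data.csv and scikit-learn's cross_validate with balanced accuracy as the scoring metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the loan data (assumption: roughly 3% positive class).
X, y = make_classification(
    n_samples=3000, n_features=7, n_informative=4,
    weights=[0.97, 0.03], random_state=1,
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=cv,
    scoring="balanced_accuracy",
    return_train_score=True,
)

train_mean = scores["train_score"].mean()
test_mean = scores["test_score"].mean()

# A large gap between training and validation scores would hint at overfitting.
print(f"Mean train balanced accuracy:      {train_mean:.3f}")
print(f"Mean validation balanced accuracy: {test_mean:.3f}")
print(f"Gap: {train_mean - test_mean:.3f}")
```

If oversampling is added to a check like this, the resampling should be applied only to the training portion of each fold, so that duplicated rows never appear in the validation portion.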


Technologies

This is a Python 3.7 project run in JupyterLab using a Conda dev environment.

The following dependencies are used:

  1. Jupyter - Running code
  2. Conda (4.13.0) - Dev environment
  3. Pandas (1.3.5) - Data analysis
  4. Numpy (1.21.5) - Data calculations + Pandas support
  5. Scikit-learn (1.0.2) - Machine learning models and tools
  6. Imbalanced-learn (0.10.1) - Imbalanced classification dataset tools

Installation Guide

If you would like to run the program in JupyterLab, install the Anaconda distribution and run jupyter lab in a conda dev environment.

To ensure that your notebook runs properly you can use the requirements.txt file to create an exact copy of the conda dev environment used in development of this project.

Create a copy of the conda dev environment with:

    conda create --name myenv --file requirements.txt

Then install the requirements with:

    conda install --name myenv --file requirements.txt


Usage

The Jupyter notebook credit_risk_resampling.ipynb provides all steps of the data collection, preparation, and analysis. Data visualizations are shown inline and accompanying analysis responses are provided.


Contributors

Ethan Silvas


License

This project uses the GNU General Public License v3.0.
