Smart Lending Default Prediction

Loan default prediction using PySpark, with jobs scheduled by Apache Airflow and integration with Spark via Apache Livy. License: MIT.

Screenshot

(Source of icon: https://www.flaticon.com/free-icon/car-loan_2505972)

How to use this repo?


  • This repo can serve as an example of how to structure PySpark code to train a machine learning model.
  • Data transformations are modularized into Transformers (app/src/transformers) and then composed into a pipeline in the preprocessing script (app/src/jobs/preprocess_data.py); a minimal sketch of this pattern follows this list.
  • After preprocessing, we fit our Estimators and compose a specialized pipeline that can be fitted on the training data and then used to transform/predict on external data (see app/src/jobs/train_model.py).
  • See the Set-up sections below for how to set up the environment and run the scripts.
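
The sketch below illustrates the Transformer-composition pattern described above. It is a minimal example, not the repo's actual code: the transformer bodies are simplified stand-ins for app/src/transformers, and "MOBILENO_AVL_FLAG" is a hypothetical column name.

    from pyspark.ml import Pipeline, Transformer
    from pyspark.sql import DataFrame


    class RemoveDuplicates(Transformer):
        """Drop duplicate rows (in the spirit of remove_duplicates.py)."""

        def _transform(self, df: DataFrame) -> DataFrame:
            return df.dropDuplicates()


    class DropColumns(Transformer):
        """Drop the given columns (in the spirit of drop_columns.py)."""

        def __init__(self, cols):
            super().__init__()
            self.cols = cols

        def _transform(self, df: DataFrame) -> DataFrame:
            return df.drop(*self.cols)


    # Compose the transformers into a single preprocessing pipeline,
    # as app/src/jobs/preprocess_data.py does with the real transformers.
    preprocess_pipeline = Pipeline(stages=[
        RemoveDuplicates(),
        DropColumns(cols=["MOBILENO_AVL_FLAG"]),  # hypothetical column
    ])
    # preprocessed_df = preprocess_pipeline.fit(raw_df).transform(raw_df)

Because each transformation is its own Transformer class, every step can be unit-tested in isolation (see app/tests/transformers) and reused across the training and inference jobs.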

Domain Background


  • Financial institutions incur significant losses every year due to defaults on vehicle loans.

Problem Statement


  • Goal: build a machine learning model that accurately predicts which borrowers are likely to default on their first EMI (Equated Monthly Instalment), so that problematic loans can be pre-empted.

Data


  • The input dataset is stored at app/src/data/inputs/loan_default.csv (see the folder structure below).
Technical Architecture


TechArch

Architecture of Model Training Pipeline


Screenshot

Folder Structure


.
├── LICENSE
├── Makefile
├── README.md
├── airflow
│   ├── __init__.py
│   ├── config
│   │   └── airflow.cfg
│   ├── dags
│   │   ├── __init__.py
│   │   ├── model_inference_dag.py          # Airflow DAG for model inferencing
│   │   └── model_train_dag.py              # Airflow DAG for model training
│   ├── logs
│   └── plugins
│       ├── __init__.py
│       └── airflow_livy
│           ├── __init__.py
│           ├── batch.py                    # Custom Airflow Livy operator (for batch)
│           └── session.py                  # Custom Airflow Livy operator (for session)
├── app
│   ├── Makefile                            
│   ├── config.json
│   ├── dist
│   │   ├── config.json
│   │   ├── main.py
│   │   └── src.zip
│   ├── main.py
│   ├── src
│   │   ├── __init__.py
│   │   ├── conftest.py
│   │   ├── data
│   │   │   └── inputs
│   │   │   │   └── loan_default.csv
│   │   ├── jobs
│   │   │   ├── __init__.py
│   │   │   ├── inference.py                # Spark Job defined for inferencing
│   │   │   ├── preprocess_data.py          # Spark Job defined for data preprocessing
│   │   │   └── train_model.py              # Spark Job defined for training pyspark.ml GBT model
│   │   ├── models
│   │   │   ├── logs
│   │   │   └── models
│   │   ├── pipe
│   │   │   ├── IF.py
│   │   │   ├── __init__.py
│   │   │   └── pipe.py
│   │   ├── shared
│   │   │   ├── __init__.py
│   │   │   └── utils.py
│   │   └── transformers
│   │       ├── __init__.py
│   │       ├── convert_str_to_date.py
│   │       ├── drop_columns.py
│   │       ├── extract_time_period_mths.py
│   │       ├── get_age.py
│   │       ├── impute_cat_missing_vals.py
│   │       ├── remove_duplicates.py
│   │       └── replace_str_regex.py
│   └── tests
│       └── transformers
│           ├── test_convert_str_to_date.py
│           ├── test_drop_columns.py
│           ├── test_extract_time_period_mths.py
│           ├── test_get_age.py
│           ├── test_impute_cat_missing_vals.py
│           ├── test_remove_duplicates.py
│           └── test_replace_str_regex.py
├── data
├── docker
│   ├── airflow
│   │   ├── Dockerfile
│   │   ├── conf
│   │   │   ├── hadoop
│   │   │   │   ├── core-site.xml
│   │   │   │   ├── hadoop-env.sh
│   │   │   │   ├── hdfs-site.xml
│   │   │   │   ├── mapred-site.xml
│   │   │   │   ├── workers
│   │   │   │   └── yarn-site.xml
│   │   │   └── spark
│   │   │       └── spark-defaults.conf
│   │   └── entrypoint.sh
│   ├── hive
│   │   └── init.sql
│   ├── livy
│   │   └── Dockerfile
│   ├── master
│   │   ├── Dockerfile
│   │   └── master.sh
│   └── worker
│       ├── Dockerfile
│       └── worker.sh
├── docker-compose.yml
├── docs
│   └── images
│       ├── modelling_architecture.PNG
│       ├── project_icon.PNG
│       └── technical_architecture.PNG
├── notebooks
│   └── Loan Default Classification.ipynb			# Jupyter Notebook containing exploration
├── requirements.txt
└── setup.py

Results


  • AUC-ROC: 0.625
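
A hedged sketch of how an AUC-ROC like the one above can be computed with pyspark.ml. This is not the repo's exact training code: it assumes a preprocessed DataFrame (preprocessed_df) with a vector "features" column and a binary "label" column, and the maxIter value is arbitrary.

    from pyspark.ml.classification import GBTClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Hold out part of the (assumed) preprocessed data for evaluation.
    train_df, test_df = preprocessed_df.randomSplit([0.8, 0.2], seed=42)

    gbt = GBTClassifier(featuresCol="features", labelCol="label", maxIter=20)
    model = gbt.fit(train_df)

    evaluator = BinaryClassificationEvaluator(labelCol="label",
                                              metricName="areaUnderROC")
    auc = evaluator.evaluate(model.transform(test_df))
    print(f"AUC-ROC: {auc:.3f}")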

Set-up (Local)


Note that this project requires Spark to be installed on your local system.

  • Run the following command for the initial set-up of the virtual environment:

    • Run make setup in the home dir (outside the app dir)
  • Run the following commands to submit Spark jobs (in the app dir); a sketch of what these targets wrap follows this list:

    1. Run make build to move all dependencies for the Spark job to the dist/ folder
    2. Run make preprocess_train to submit a Spark job that preprocesses the training data
    3. Run make train_model to submit a Spark job that trains the model on the preprocessed training data
    4. Run make preprocess_predict to submit a Spark job that preprocesses the data to be inferred on by the model
    5. Run make model_predict to submit a Spark job in which the model infers on the preprocessed inference data
  • Running pytest:

    • Run make test (in the app dir)
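
The dist/ layout (main.py alongside src.zip and config.json) suggests that main.py is a thin entry point that spark-submit runs, dispatching to a job module shipped inside src.zip. The sketch below is an assumption about that interface, not the repo's confirmed CLI: the --job flag and the run(spark, config) entry point are hypothetical.

    import argparse
    import importlib
    import json

    from pyspark.sql import SparkSession


    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--job", required=True)            # e.g. "preprocess_data"
        parser.add_argument("--config", default="config.json")
        args = parser.parse_args()

        with open(args.config) as f:
            config = json.load(f)

        spark = SparkSession.builder.appName(args.job).getOrCreate()

        # Job modules live in src/jobs and are shipped to the cluster in src.zip.
        job = importlib.import_module(f"src.jobs.{args.job}")
        job.run(spark, config)  # run(spark, config) is an assumed interface


    if __name__ == "__main__":
        main()

Under that assumption, a target like make preprocess_train would expand to something like: spark-submit --py-files dist/src.zip --files dist/config.json dist/main.py --job preprocess_data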

Set-up (AWS)


Note that this project requires Spark to be installed on your local system.

  • Prerequisites:

    • An AWS account, with an S3 bucket set up
    • An EC2 instance (with IAM permissions to access S3)
    • The AWS CLI installed locally
  • Run the following commands:

    1. Run make build to move all dependencies for the Spark job to the dist/ folder
    2. Run make package_s3 to upload the packaged dependencies to the dist folder of your S3 bucket
  • Run jobs on EC2 using Airflow and Livy:

    • To set up Spark, Airflow and Livy, run docker-compose build in the home dir (on the EC2 SSH terminal).
    • After docker-compose build runs successfully, run docker-compose up.
    • Visit the EC2 address http://ec2-X-XXX-X-XXX.compute-1.amazonaws.com:9001/; Airflow should be up and running. Change the port to 8998; Livy should be up and running.
  • Variables configuration:

    • On the Airflow UI, set the following variables:

      1. aws_config:

        {
            "awsKey": "Input your user's aws key",
            "awsSecretKey": "Input your user's aws secret key",
            "s3Bucket": "Input your s3 bucket for this project"
        }
      2. main_file_path: "s3a://{YOUR_S3_BUCKET}/dist/main.py"

      3. pyfiles_path: "s3a://{YOUR_S3_BUCKET}/dist/src.zip"

      4. config_file_path: "s3a://{YOUR_S3_BUCKET}/dist/config.json"

    • On the Airflow UI, also set up an HTTP connection to Livy (used by the HttpHook in the custom operators):

      "conn_id": "livy", "host": "http://ec2-X-XXX-X-XXX.compute-1.amazonaws.com", "port": 8998

    • The Airflow DAGs should now be able to run from the UI. A sketch of what the custom Livy operators submit is shown below.
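
The custom operators in airflow/plugins/airflow_livy submit Spark jobs through Livy's REST API. The sketch below shows the equivalent raw HTTP calls (POST /batches and GET /batches/{id}/state are standard Livy endpoints); the exact payload the repo's operators send is an assumption, and the values mirror the Airflow variables configured above.

    import requests

    # The Livy endpoint configured in the "livy" Airflow connection above.
    LIVY_URL = "http://ec2-X-XXX-X-XXX.compute-1.amazonaws.com:8998"

    payload = {
        "file": "s3a://{YOUR_S3_BUCKET}/dist/main.py",         # main_file_path
        "pyFiles": ["s3a://{YOUR_S3_BUCKET}/dist/src.zip"],    # pyfiles_path
        "files": ["s3a://{YOUR_S3_BUCKET}/dist/config.json"],  # config_file_path
        "conf": {
            # Credentials from the aws_config Airflow variable, passed to s3a.
            "spark.hadoop.fs.s3a.access.key": "YOUR_AWS_KEY",
            "spark.hadoop.fs.s3a.secret.key": "YOUR_AWS_SECRET_KEY",
        },
    }

    # Submit the batch job to Livy.
    batch = requests.post(f"{LIVY_URL}/batches", json=payload).json()

    # Poll the batch state until it reaches "success" or "dead",
    # which is what the custom batch operator does for the DAG task.
    state = requests.get(f"{LIVY_URL}/batches/{batch['id']}/state").json()
    print(state["state"])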


Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.