The default-modeling from nvlinhvn

Problem Definition

predict the probability of default for each user id in risk modeling
default = 1 means defaulted users, default = 0 means otherwise
Imbalance binary classification problem

Expected Workflow

Variables (total = 43):

uuid: text User Id
default: (or target) boolean (0, or 1)
Categorical, and numerical features are defined in default_modeling.utils.preproc.pyx (function feature_definition)

Adjustment:

If you want to run the experiment with your data for the purpose of binary classification:
- Replace csv in both train_data and test_data by your csv. (Optional: also change test file test_sample_1.csv in default_modeling/default_modeling/tests/data/ for unit test). Each row of your csv should correspond to unique User ID .
- Redefine categorical, numerical features in default_modeling/default_modeling/utils/preproc.pyx (function feature_definition) based on your definition
- Change TARGET=default in Dockerfile to TARGET={your target variable}
- Data example can be seen below

UUID (User id)	Feature 1	...	Feature N	Target (binary)
001	100	...	"AAA"	0
002	300	...	"BBB"	1

Package Requirements:

pandas, numpy, category_encoders, sklearn, scipy, joblib, Cython

Folder Structure

.
├── Dockerfile
├── default_modeling
│   ├── __init__.py
│   ├── default_modeling
│   │   ├── __init__.py
│   │   ├── interface
│   │   │   ├── __init__.py
│   │   │   ├── launch_predictor.py
│   │   │   ├── launch_trainer.py
│   │   │   ├── predictor.c
│   │   │   ├── predictor.pyx
│   │   │   ├── trainer.c
│   │   │   └── trainer.pyx
│   │   └── utils
│   │       ├── __init__.py
│   │       ├── load.c
│   │       ├── load.pyx 
│   │       ├── preproc.c
│   │       └── preproc.pyx
│   ├── setup.py
│   ├── tests
│   │   ├── __init__.py
│   │   ├── data
│   │   │   └── test_sample_1.csv
│   │   ├── test_case_base.py
│   │   └── test_data_handling.py
├── model
│   └── risk_model.joblib
├── prototype
│   ├── prototype_cython.ipynb
│   └── prototype_python.ipynb
├── requirements.txt
├── test_data
│   ├── test_set_1.csv
│   └── test_set_2.csv
└── train_data
    ├── train_set_1.csv
    └── train_set_2.csv

Run Locally

After cloning this repository, SETUP

!python3 -m default_modeling.setup build

UNIT TEST

!python3 -m unittest discover default_modeling

Arguments Explaination

model-dir: folder to store trained model (model as seen in this repo)
model-name: name of trained .joblib model (risk_model saved in folder model in this case)
train-folder: folder contains train csv (train_data in this repo)
train-file: selected file in train-folder (train_set_1.csv in this case)
target: target columns from data
test-folder: folder contains test csv (test_data in this repo)
test-file: selected file in test-folder (test_set_1.csv in this case)
Random forest parameters as sklearn.RandomForestClassifier:
- n-estimators
- max-depth
- min-samples-leaf
- random-state

TRAIN (train file `train_data/train_set_1.csv`)

!python3 -m default_modeling.default_modeling.interface.launch_trainer \
                                                --model-dir ./model \
                                                --model-name risk_model \
                                                --train-folder train_data \
                                                --train-file train_set_1.csv \
                                                --target default

Now if we would like to tune or modify random forest hyperparameters in training.

!python3 -m default_modeling.default_modeling.interface.launch_trainer \
                                                --model-dir ./model \
                                                --model-name risk_model \
                                                --train-folder train_data \
                                                --train-file train_set_1.csv \
                                                --target default \
                                                --n-estimators 200 \
                                                --max-depth 15 \
                                                --min-samples-leaf 20

PREDICT (predict file `test_data/test_set_1.csv`)

!python3 -m default_modeling.default_modeling.interface.launch_predictor \
                                               --test-file test_set_1.csv \
                                               --model-dir ./model \
                                               --model-name risk_model \
                                               --test-folder test_data \
                                               --test-file test_set_1.csv \
                                               --target default

DockerFile Contents

My Local Working Directory named /home/jupyter. In this local working directory:
- train_data folder contains different files for training random forest classifers
- model folder store the trained .joblib random forest, and the model will be loaded in this folder for prediction
- test_data folder contains new data coming and waiting for prediction, prediction result will be locally stored inside the same file in this folder.
Container will mount to those local folders: train_data, test_data and model
With this approach, we can conveniently play with every new data coming, by replacing the files inside train_data and/or test_data
Container is built both in pure Python and Cython

FROM python:3.8
WORKDIR /app/

RUN mkdir model

ENV TRAIN_FOLDER=./train_data
ENV TEST_FOLDER=./test_data
ENV TRAIN_FILE=train_set.csv
ENV TEST_FILE=test_set.csv
ENV MODEL_DIR=./model
ENV MODEL_NAME=risk_model
ENV TARGET=default

COPY requirements.txt .
COPY default_modeling default_modeling

RUN pip install -r requirements.txt
RUN python3 -m default_modeling.setup build

ENTRYPOINT ["python3"]

Build Image from Dockerfile

!docker build -t default_model -f Dockerfile .

Run unit test in Image

!docker run -t default_model:latest -m unittest discover default_modeling

Found the following test data
default_modeling/tests/data/test_sample_1.csv
..
----------------------------------------------------------------------
Ran 2 tests in 0.772s

OK

Train with the selected file, e.g. `train_data/TRAIN_SET_1.csv`. If no hyperparameters are declared (like n_estimators, max_depth, ...), the file will train with default hyper parameters. Remember to mount to local `train_data`, and `model`

!docker run -v /home/jupyter/train_data:/app/train_data \
            -v /home/jupyter/model:/app/model \
            default_model:latest -m default_modeling.default_modeling.interface.launch_trainer \
            --train-file train_set_1.csv \
            --n-estimators 200 \
            --max-depth 15 \
            --min-samples-leaf 20

extracting arguments
Namespace(max_depth=15, min_samples_leaf=20, model_dir='./model', model_name='risk_model', n_estimators=200, random_state=1234, target='default', train_file='train_set_1.csv', train_folder='./train_data')
Training Data at ./train_data/train_set_1.csv
('Total Input Features', 39)
('class weight', {0: 0.5074062934696794, 1: 34.255076142131976})
Found existing model at: ./model/risk_model.joblib.
Overwriting ...
Congratulation! Saving model at ./model/risk_model.joblib. Finish after 3.684312582015991 s

And predict selected file, e.g: `test_data/test_set_1.csv`. Now, mount to local `test_data`, and `model`

!docker run -v /home/jupyter/test_data:/app/test_data \
            -v /home/jupyter/model:/app/model \
            default_model:latest -m default_modeling.default_modeling.interface.launch_predictor \
            --test-file test_set_1.csv

extracting arguments
Namespace(model_dir='./model', model_name='risk_model', target='default', test_file='test_set_1.csv', test_folder='./test_data')
Found model at: ./model/risk_model.joblib
Predicting test_set_1.csv ....
Finish after 0.549715518951416 s
...to csv ./test_data/test_set_1.csv

We have prediction in local folder test_data. Evaluate with Metrics

Decision threshold on the probability of default would probably depend on credit policy. There could be several cutoff points or a mathematical cost function rather than a fixed decision threshold. Therefore, binary metrics like F1, Recall, or Precision is not meaningful in this situation. And the output should be a prediction in probability.
KS-statistic (between P(prediction|truth = 1) and P(prediction|truth = 0) to quantify the distance between 2 classes) are used to evaluate model.
Left plot: ROC AUC Curve
Right plot: Normalized KS Distribution of 2 types of users:
- class 0: non-default
- class 1: default

Conclusions & Future Work

With KS score = 0.66 and small p-value, this means the predictor can properly distinguish between default and non-default users (test is significant)
Visually, we can observe the clear gap in the KS distribution plot between 2 classes
In the future, host with AWS Sagemeker endpoint

nvlinhvn / default-modeling Goto Github PK

default-modeling's Introduction

Problem Definition

Expected Workflow

Variables (total = 43):

Adjustment:

Package Requirements:

Folder Structure

Run Locally

After cloning this repository, SETUP

UNIT TEST

Arguments Explaination

TRAIN (train file train_data/train_set_1.csv)

Now if we would like to tune or modify random forest hyperparameters in training.

PREDICT (predict file test_data/test_set_1.csv)

DockerFile Contents

Build Image from Dockerfile

Run unit test in Image

Train with the selected file, e.g. train_data/TRAIN_SET_1.csv. If no hyperparameters are declared (like n_estimators, max_depth, ...), the file will train with default hyper parameters. Remember to mount to local train_data, and model

And predict selected file, e.g: test_data/test_set_1.csv. Now, mount to local test_data, and model

We have prediction in local folder test_data. Evaluate with Metrics

Conclusions & Future Work

default-modeling's People

Contributors

Stargazers

Watchers

Recommend Projects

Recommend Topics

Recommend Org

Jobs

TRAIN (train file `train_data/train_set_1.csv`)

PREDICT (predict file `test_data/test_set_1.csv`)

Train with the selected file, e.g. `train_data/TRAIN_SET_1.csv`. If no hyperparameters are declared (like n_estimators, max_depth, ...), the file will train with default hyper parameters. Remember to mount to local `train_data`, and `model`

And predict selected file, e.g: `test_data/test_set_1.csv`. Now, mount to local `test_data`, and `model`