Predict the probability of default for each user id (risk modeling).
default = 1 means the user defaulted, default = 0 otherwise.
This is an imbalanced binary classification problem.
Expected Workflow
Variables (total = 43):
uuid: text, unique user id
default (the target): boolean (0 or 1)
Categorical and numerical features are defined in default_modeling/default_modeling/utils/preproc.pyx (function feature_definition)
Adjustment:
If you want to run the experiment on your own data for binary classification:
Replace the csv files in both train_data and test_data with your own csv. (Optional: also replace the test file test_sample_1.csv in default_modeling/default_modeling/tests/data/ for the unit tests.) Each row of your csv should correspond to a unique user id.
Redefine the categorical and numerical features in default_modeling/default_modeling/utils/preproc.pyx (function feature_definition) according to your data; a sketch is shown after this list.
Change TARGET=default in the Dockerfile to TARGET={your target variable}
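As a rough guide, feature_definition might look like the sketch below. The column names here are hypothetical placeholders; use the ones from your own csv.

```python
# Hypothetical sketch of feature_definition in preproc.pyx.
# Replace the placeholder column names with the ones in your csv.
def feature_definition():
    """Return the categorical and numerical feature names used by the model."""
    categorical_features = [
        "merchant_category",   # placeholder
        "merchant_group",      # placeholder
        "has_paid",            # placeholder
    ]
    numerical_features = [
        "age",                 # placeholder
        "avg_payment_span",    # placeholder
        "num_active_inv",      # placeholder
    ]
    return categorical_features, numerical_features
```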
Found the following test data
default_modeling/tests/data/test_sample_1.csv
..
----------------------------------------------------------------------
Ran 2 tests in 0.772s
OK
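For reference, a unit test over the bundled sample file might look roughly like this (a sketch only; the actual tests in the repository may check different things):

```python
import unittest

import pandas as pd


class TestSampleData(unittest.TestCase):
    """Sanity checks on the bundled sample csv (sketch, not the repository's actual tests)."""

    def setUp(self):
        self.df = pd.read_csv("default_modeling/tests/data/test_sample_1.csv")

    def test_sample_is_not_empty(self):
        self.assertGreater(len(self.df), 0)

    def test_each_row_is_a_unique_user_id(self):
        self.assertEqual(self.df["uuid"].nunique(), len(self.df))


if __name__ == "__main__":
    unittest.main()
```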
Train with the selected file, e.g. train_data/TRAIN_SET_1.csv. If no hyperparameters are declared (such as n_estimators, max_depth, ...), training falls back to the default hyperparameters. Remember to mount the local train_data and model folders.
extracting arguments
Namespace(max_depth=15, min_samples_leaf=20, model_dir='./model', model_name='risk_model', n_estimators=200, random_state=1234, target='default', train_file='train_set_1.csv', train_folder='./train_data')
Training Data at ./train_data/train_set_1.csv
('Total Input Features', 39)
('class weight', {0: 0.5074062934696794, 1: 34.255076142131976})
Found existing model at: ./model/risk_model.joblib.
Overwriting ...
Congratulation! Saving model at ./model/risk_model.joblib. Finish after 3.684312582015991 s
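A minimal sketch of a training entry point consistent with the arguments above, assuming a scikit-learn RandomForestClassifier persisted with joblib (the actual training script in the repository may differ):

```python
import argparse
import os
import time

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-folder", default="./train_data")
    parser.add_argument("--train-file", default="train_set_1.csv")
    parser.add_argument("--model-dir", default="./model")
    parser.add_argument("--model-name", default="risk_model")
    parser.add_argument("--target", default="default")
    parser.add_argument("--n-estimators", type=int, default=200)
    parser.add_argument("--max-depth", type=int, default=15)
    parser.add_argument("--min-samples-leaf", type=int, default=20)
    parser.add_argument("--random-state", type=int, default=1234)
    return parser.parse_args()


def main():
    args = parse_args()
    start = time.time()

    data = pd.read_csv(os.path.join(args.train_folder, args.train_file))
    y = data[args.target]
    # Feature preprocessing (categorical encoding via preproc.feature_definition)
    # is omitted in this sketch.
    X = data.drop(columns=[args.target, "uuid"])

    # class_weight="balanced" reweights classes inversely to their frequency,
    # which is what yields weights like {0: 0.51, 1: 34.26} on a heavily
    # imbalanced training set.
    model = RandomForestClassifier(
        n_estimators=args.n_estimators,
        max_depth=args.max_depth,
        min_samples_leaf=args.min_samples_leaf,
        class_weight="balanced",
        random_state=args.random_state,
    )
    model.fit(X, y)

    os.makedirs(args.model_dir, exist_ok=True)
    joblib.dump(model, os.path.join(args.model_dir, args.model_name + ".joblib"))
    print("Finished after", time.time() - start, "s")


if __name__ == "__main__":
    main()
```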
Then predict the selected file, e.g. test_data/test_set_1.csv. This time, mount the local test_data and model folders.
extracting arguments
Namespace(model_dir='./model', model_name='risk_model', target='default', test_file='test_set_1.csv', test_folder='./test_data')
Found model at: ./model/risk_model.joblib
Predicting test_set_1.csv ....
Finish after 0.549715518951416 s
...to csv ./test_data/test_set_1.csv
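For reference, the prediction step might boil down to something like the sketch below (assuming the joblib model from the training step; the prediction column name pd is hypothetical):

```python
import joblib
import pandas as pd

MODEL_PATH = "./model/risk_model.joblib"
TEST_PATH = "./test_data/test_set_1.csv"

model = joblib.load(MODEL_PATH)
data = pd.read_csv(TEST_PATH)

# Feature preprocessing (preproc.feature_definition) is omitted in this sketch.
features = data.drop(columns=["uuid", "default"], errors="ignore")

# Probability of the positive class (default = 1) for each user id.
data["pd"] = model.predict_proba(features)[:, 1]

# Write the predictions back to the csv in test_data.
data.to_csv(TEST_PATH, index=False)
```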
The predictions are now in the local folder test_data. Evaluate with metrics:
The decision threshold on the probability of default will likely depend on the credit policy. There could be several cutoff points, or a cost function, rather than a single fixed decision threshold. Binary metrics such as F1, recall, or precision are therefore not meaningful here, and the output should be a probability prediction rather than a hard label.
The KS statistic (between P(prediction | truth = 1) and P(prediction | truth = 0), quantifying the distance between the two class distributions) is used to evaluate the model.
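The two-sample KS statistic and its p-value can be computed directly from the saved predictions, for example with scipy.stats.ks_2samp (a sketch; the pd column follows the earlier hypothetical naming):

```python
import pandas as pd
from scipy.stats import ks_2samp

results = pd.read_csv("./test_data/test_set_1.csv")

# Compare the predicted probabilities of the two true classes:
# P(prediction | truth = 1) vs P(prediction | truth = 0).
pred_default = results.loc[results["default"] == 1, "pd"]
pred_non_default = results.loc[results["default"] == 0, "pd"]

statistic, p_value = ks_2samp(pred_default, pred_non_default)
print("KS statistic =", round(statistic, 2), "p-value =", p_value)
```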
Left plot: ROC AUC Curve
Right plot: Normalized KS Distribution of 2 types of users:
class 0: non-default
class 1: default
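Both plots can be reproduced with scikit-learn and matplotlib along the following lines (a sketch, again assuming the hypothetical pd prediction column):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

results = pd.read_csv("./test_data/test_set_1.csv")
y_true = results["default"]
y_score = results["pd"]

fig, (ax_roc, ax_ks) = plt.subplots(1, 2, figsize=(12, 5))

# Left plot: ROC curve with the AUC in the legend.
fpr, tpr, _ = roc_curve(y_true, y_score)
ax_roc.plot(fpr, tpr, label="AUC = %.2f" % roc_auc_score(y_true, y_score))
ax_roc.plot([0, 1], [0, 1], linestyle="--", color="grey")
ax_roc.set_xlabel("False positive rate")
ax_roc.set_ylabel("True positive rate")
ax_roc.legend()

# Right plot: normalized distributions of the predicted probabilities per class.
ax_ks.hist(y_score[y_true == 0], bins=50, density=True, alpha=0.5, label="class 0: non-default")
ax_ks.hist(y_score[y_true == 1], bins=50, density=True, alpha=0.5, label="class 1: default")
ax_ks.set_xlabel("Predicted probability of default")
ax_ks.legend()

plt.tight_layout()
plt.show()
```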
Conclusions & Future Work
With a KS score of 0.66 and a small p-value, the predictor can properly distinguish between default and non-default users (the test is significant).
Visually, we can observe a clear gap between the two classes in the KS distribution plot.