ing-bank / skorecard

scikit-learn compatible tools for building credit risk acceptance models

Home Page: https://ing-bank.github.io/skorecard/

License: MIT License

Python 100.00%
scikit-learn scorecard scorecard-model scorecards credit-risk creditrisk machine-learning logistic-regression

skorecard's Introduction

skorecard

skorecard is a scikit-learn compatible Python package that helps streamline the development of credit risk acceptance models (scorecards).

Scorecards are ‘traditional’ models used by banks in the credit decision process. Internally, scorecards are Logistic Regression models that use features binned into different groups. The binning is usually done manually by experts, and skorecard provides tools to make this process easier. skorecard is built on top of scikit-learn as well as other excellent open source projects like optbinning, dash and plotly.

👉 Read the blogpost introducing skorecard

Features ⭐

  • Automate bucketing of features inside scikit-learn pipelines.
  • Dash webapp to help manually tweak bucketing of features with business knowledge
  • Extension to sklearn.linear_model.LogisticRegression that is also able to report p-values
  • Plots and reports to speed up analysis and writing technical documentation.

Quick demo

skorecard offers a range of bucketers:

import pandas as pd
from skorecard.bucketers import EqualWidthBucketer

df = pd.DataFrame({'column' : range(100)})

ewb = EqualWidthBucketer(n_bins=5)
ewb.fit_transform(df)

ewb.bucket_table('column')
#>    bucket                       label  Count  Count (%)
#> 0      -1                     Missing    0.0        0.0
#> 1       0                (-inf, 19.8]   20.0       20.0
#> 2       1                (19.8, 39.6]   20.0       20.0
#> 3       2  (39.6, 59.400000000000006]   20.0       20.0
#> 4       3  (59.400000000000006, 79.2]   20.0       20.0
#> 5       4                 (79.2, inf]   20.0       20.0
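
The boundaries in the table above are consistent with splitting the observed range into equally wide, right-inclusive intervals. The following is a minimal plain-Python sketch of that idea, not skorecard's actual implementation; `equal_width_boundaries` and `assign_bucket` are hypothetical helper names.

```python
from bisect import bisect_left
from collections import Counter


def equal_width_boundaries(values, n_bins):
    """Return the n_bins - 1 inner boundaries of equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * i for i in range(1, n_bins)]


def assign_bucket(value, boundaries):
    """Right-inclusive intervals: a value equal to a boundary stays in the lower bucket."""
    return bisect_left(boundaries, value)


values = list(range(100))
boundaries = equal_width_boundaries(values, n_bins=5)
counts = Counter(assign_bucket(v, boundaries) for v in values)

print(boundaries)
print(sorted(counts.items()))
```

With 100 consecutive integers and 5 bins, every bucket ends up with the same number of rows, which matches the Count column shown above.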

Bucketers also support a Dash app to explore and update bucket boundaries:

ewb.fit_interactive(df)
#> Dash app running on http://127.0.0.1:8050/

Installation

pip3 install skorecard

Documentation

See ing-bank.github.io/skorecard/.

Presentations

Title | Host | Date | Speaker(s)
Skorecard: Making logistic regressions great again | ING Data Science Meetup | 10 June 2021 | Daniel Timbrell, Sandro Bjelogrlic, Tim Vink

skorecard's People

Contributors

anilkumarpanda, dronakurl, lorenjan, orchardbirds, reinierkoops, satya-pattnaik, sbjelogr, timvink

skorecard's Issues

missing value neutral

Add a missing-value option 'neutral': put missing values in the bucket where the WoE is closest to 0, i.e. min(abs(WoE)).
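
A minimal sketch of how the 'neutral' bucket could be selected, assuming the WoE per bucket is already available as a dictionary. `neutral_bucket` is a hypothetical helper and the WoE values are made up for illustration.

```python
def neutral_bucket(woe_per_bucket):
    """Return the bucket whose WoE is closest to zero, i.e. min(abs(WoE))."""
    return min(woe_per_bucket, key=lambda bucket: abs(woe_per_bucket[bucket]))


# Hypothetical WoE values per bucket id.
woe_per_bucket = {0: -0.42, 1: 0.05, 2: 0.31}

missing_goes_to = neutral_bucket(woe_per_bucket)
print(missing_goes_to)
```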

Mismatches between ScoreCardPoints object and calibrate_to_master_scale scores

Please excuse the way that I reported this issue. This is my first time reporting a GitHub issue. I get different results from the ScoreCardPoints object. The scores using calibrate_to_master_scale on the proba_train are different from the score using scp.transform(X_train). I believe the calibrate_to_master_scale scores were right.

EDIT: I tried following the last tutorial example 'Scorecard Model' and I encounter the same problem. Going through the example, I noticed that the coefficients from scorecard.get_stats() are negative and the scorecard.woe_transform(X_test) are positive values but I get positive coefficients and negative scorecard.woe_transform(X_test).

Check the following images. In this example, I used a single categorical variable educational attainment versus default rate. Thank you!
[images 0-5]

Missing dependency

Skorecard relies on the optbinning package, but optbinning is not in the base_packages list of setup.py. It should be.

Error when using with `cross_val_score`

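from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer
from skorecard.pipeline import BucketingProcess

# X, y and specials are defined elsewhere
bucketing_process = BucketingProcess(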
        prebucketing_pipeline=make_pipeline(
                DecisionTreeBucketer(max_n_bins=100, min_bin_size=0.05),
        ),
        bucketing_pipeline=make_pipeline(
                OptimalBucketer(max_n_bins=10, min_bin_size=0.05),
        ),
        specials=specials
)

pipe = make_pipeline(
    bucketing_process,
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=0)
)

cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')

Gives me an error:

ValueError: Specials should be defined on the BucketingProcess level, remove the specials from DecisionTreeBucketer

Make the package more visible online

Currently it is very difficult to find this page online. It does not show up if you google:

  • skorecard
  • skorecard ing bank
  • skorecard github

I can only find this repo by first finding the pypi page - at the bottom :( .
image

It might be an indexing issue, since Google takes some time to index new pages, but I think the visibility of this package can be improved by:

  • modifying text of the page to include more of the relevant keywords that would be picked up by google
  • Adding more links between pages e.g. docs -> github, github -> docs, but also other websites pointing to this repo
  • writing a blogpost about it :)

OrdinalCategoricalBucketer test fails

================================== FAILURES ==================================
____________________________ test_encoding_method ____________________________

df =       EDUCATION  MARRIAGE  LIMIT_BAL  BILL_AMT1  default pet_ownership
0             1         2   400000.0   201800.0...        0       no pets
5999          2         1   410000.0    71532.0        1     dog lover

[6000 rows x 6 columns]

    def test_encoding_method(df):
        """Test the encoding method."""
        X = df[["EDUCATION", "default"]]
        y = df["default"]

        ocb = OrdinalCategoricalBucketer(tol=0.03, variables=["EDUCATION"], encoding_method="frequency")
        ocb.fit(X, y)

        assert ocb.features_bucket_mapping_.get("EDUCATION").map == {1: 1, 2: 0, 3: 2}

        ocb = OrdinalCategoricalBucketer(tol=0.03, variables=["EDUCATION"], encoding_method="ordered")
        ocb.fit(X, y)

>       assert ocb.features_bucket_mapping_.get("EDUCATION").map == {1: 2, 2: 0, 3: 1}
E       assert {1: 0, 2: 2, 3: 1} == {1: 2, 2: 0, 3: 1}
E         Omitting 1 identical items, use -vv to show
E         Differing items:
E         {1: 0} != {1: 2}
E         {2: 2} != {2: 0}
E         Use -v to get the full diff

tests/test_bucketer_OrdinalCategoricalBucketer.py:66: AssertionError

Feature co-efficient signs are inverted between version 0.7.1 and 1.4.0

When I try to run the following code, I see that the model co-efficients are inverted between version 0.7.1 and 1.4.0

from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer
from skorecard.pipeline import BucketingProcess
from sklearn.pipeline import make_pipeline
from skorecard.datasets import load_credit_card
from sklearn.model_selection import train_test_split
from skorecard import Skorecard
from sklearn.metrics import roc_auc_score

#Load data
data = load_credit_card(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['y'], axis=1),
    data['y'], 
    test_size=0.25, 
    random_state=42
)
#Select features
selected_feats = ['x6','x8','x10','x18','x1','x19','x20','x21','x23','x22','x3','x17','x16']
even_cat_cols = ['x6','x8']
odd_cat_cols = ['x19']
num_cols = ['x10','x18','x1','x20','x21','x23','x22','x3','x17','x16']
#Create bucketing process
prebucketing_pipeline=make_pipeline( 
    DecisionTreeBucketer(variables=selected_feats, max_n_bins=40, #loose requirements
                        min_bin_size=0.03
                        ),
   
)

bucketing_pipeline=make_pipeline(
    OptimalBucketer(variables=selected_feats, max_n_bins=6,min_bin_size=0.05,missing_treatment='most_risky'),
)
bucketing_process = BucketingProcess(
    prebucketing_pipeline=prebucketing_pipeline,
    bucketing_pipeline=bucketing_pipeline,
)

bucketing_process = bucketing_process.fit(X_train[selected_feats], y_train)

# Train skorecard version 0.7.1
# scorecard = Skorecard(
#     bucketing_process,
#     selected_features=selected_feats
# )

# Train skorecard version 1.4.0
scorecard = Skorecard(
    bucketing_process,
    variables=selected_feats,
    calculate_stats=True

)
scorecard.fit(X_train[selected_feats], y_train)

# Results
proba_train = scorecard.predict_proba(X_train[selected_feats])[:,1]
proba_test = scorecard.predict_proba(X_test[selected_feats])[:,1]
print(f"AUC train:{round(roc_auc_score(y_train, proba_train),4)}")
print(f"AUC test :{round(roc_auc_score(y_test, proba_test),4)}\n")
print(scorecard.get_stats())

Results version 0.7.1

AUC train:0.7679
AUC test :0.7626

Features Coef. Std.Err z P>z
const -1.24442 0.0182161 -68.3145 0
x6 -0.753524 0.0208574 -36.1274 8.4108e-286
x8 -0.277921 0.034096 -8.15111 3.60589e-16
x10 -0.31137 0.0381928 -8.1526 3.56189e-16
x18 -0.277205 0.0510278 -5.43243 5.55932e-08
x1 -0.350234 0.0484375 -7.23063 4.80774e-13
x19 -0.210958 0.0573138 -3.68075 0.000232548
x20 -0.279283 0.0665349 -4.19754 2.69828e-05
x21 -0.138776 0.0806698 -1.7203 0.0853775
x23 -0.176563 0.0754207 -2.34104 0.0192301
x22 -0.146805 0.0807149 -1.81882 0.0689397
x3 -0.0831014 0.136821 -0.607372 0.543604
x17 -0.294193 0.334516 -0.879459 0.379153
x16 0.56809 0.407881 1.39278 0.163685

Results version 1.4.0

AUC train:0.7679
AUC test :0.7626

Features Coef. Std.Err z P>z
const -1.24575 0.0182208 -68.3695 0
x6 0.753233 0.0208504 36.1256 8.98644e-286
x8 0.277778 0.0340932 8.14762 3.71169e-16
x10 0.311173 0.0381806 8.15003 3.63845e-16
x18 0.277392 0.0510811 5.43042 5.62211e-08
x1 0.350658 0.0484606 7.23593 4.62344e-13
x19 0.211326 0.057416 3.68061 0.000232678
x20 0.279587 0.0665905 4.1986 2.68564e-05
x21 0.138937 0.0808025 1.71946 0.0855306
x23 0.176326 0.0755183 2.33488 0.0195497
x22 0.146274 0.080873 1.80869 0.0704997
x3 0.0832047 0.136923 0.607673 0.543404
x17 0.295899 0.334592 0.884356 0.376504
x16 -0.562766 0.408206 -1.37863 0.168008

Maybe this is also related to #68 .
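
The identical AUC values combined with mirrored coefficients would be consistent with a change in the sign convention of the WoE encoding between the two versions, although the issue does not confirm this. The sketch below only illustrates why such a change would leave predictions untouched: negating both the WoE values and the coefficients gives the same linear predictor. The numbers are rounded or hypothetical.

```python
import math


def predict_proba(intercept, coefs, woe_row):
    """Logistic regression probability for one row of WoE-encoded features."""
    z = intercept + sum(c * w for c, w in zip(coefs, woe_row))
    return 1.0 / (1.0 + math.exp(-z))


intercept = -1.245
coefs = [-0.75, -0.28, 0.57]   # rounded, illustrative coefficients
woe_row = [0.40, -0.10, 0.05]  # hypothetical WoE values for one observation

p_original = predict_proba(intercept, coefs, woe_row)
# Flip the sign of every WoE value and every coefficient.
p_flipped = predict_proba(intercept, [-c for c in coefs], [-w for w in woe_row])

print(p_original, p_flipped)
```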

Align WoE

The bucket_table function is reporting a WoE different to the woe_1d function. Which is correct?

The WoE calculation should be implemented in one place.

make_pipeline with BucketingProcess and Bucketer

via Anil:

If you need to pre-bucket and post-bucket numerical columns, but only bucket categoricals, how to achieve this?

BucketingPipeline raises an error:

msg = f"The following columns are bucketed but have not been pre-bucketed: {', '.join(not_prebucketed)}.\n"
            msg += "Consider adding an AsIsNumericalBucketer or AsIsCategoricalBucketer to the prebucketing pipeline.\n"
            msg += "Or add an additional bucketing step after the BucketingProcess:\n"
            msg += "make_pipeline(BucketingProcess(..), Bucketer())"

Then calling:

make_pipeline(
BucketingProcess(*args), 
OptimalBucketer(*args)
)

Gives:

NotBucketObjectError: All bucketing steps must be skorecard bucketers. Remove BucketingProcess

OrdinalCategoricalBucketer number of buckets

Everything by default is in 1 bucket, 'Other'

bucketer = OrdinalCategoricalBucketer(max_n_categories=3, variables=['EDUCATION'], encoding_method='ordered')
bucketer.fit_transform(X, y).head()['EDUCATION'].unique()
bucketer.bucket_table('EDUCATION')

As opposed to:

bucketer = OrdinalCategoricalBucketer(max_n_categories=3, variables=['EDUCATION'], encoding_method='ordered', tol=0)
bucketer.fit_transform(X, y).head()['EDUCATION'].unique()
bucketer.bucket_table('EDUCATION')

Unusable kwargs should throw an error

These ob_kwargs should throw an error, but they don't currently

O = OptimalBucketer(variables=num_cols, ob_kwargs={'uwotm7': 9, "LOOK_AT_ME":4})
O.fit(X_train[num_cols], y_train)
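
One possible way to reject unusable keyword arguments early is to compare them with the signature of the callable they are forwarded to. This is a sketch using only the standard library; `validate_kwargs` and `fake_binner` are hypothetical names, and `fake_binner` merely stands in for the wrapped binning constructor.

```python
import inspect


def validate_kwargs(func, kwargs):
    """Raise TypeError if kwargs contains names that func does not accept."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter on func would accept anything, so skip the check.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return
    unknown = sorted(set(kwargs) - set(params))
    if unknown:
        raise TypeError(f"Unexpected keyword arguments: {', '.join(unknown)}")


def fake_binner(max_n_bins=10, solver="cp"):
    """Hypothetical stand-in for the wrapped binning class constructor."""
    return max_n_bins, solver


try:
    validate_kwargs(fake_binner, {"uwotm7": 9, "LOOK_AT_ME": 4})
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

Running the check in `__init__` or at the start of `fit` would surface the typo before any binning work is done.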

most_risky missing treatment not working

import numpy as np
import pandas as pd
from skorecard.bucketers import EqualFrequencyBucketer

X = pd.DataFrame({'counts': [1, 2, 2, 1, 4, 2, np.nan, 1, 3]})
y = pd.DataFrame({'target': [0, 0, 1, 0, 1, 0, 1, 0, 1]})
EqualFrequencyBucketer(n_bins=3, missing_treatment='most_risky').fit_transform(X, y)

The output puts the missing values into bucket -1, which it shouldn't do
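
For reference, a plain-Python sketch of the expected behaviour, assuming 'most_risky' means "assign missing values to the bucket with the highest share of target == 1". The bucket assignment used here is hypothetical and is not what EqualFrequencyBucketer would necessarily produce.

```python
import math
from collections import defaultdict


def most_risky_bucket(buckets, targets):
    """Return the non-missing bucket with the highest share of target == 1."""
    events, totals = defaultdict(int), defaultdict(int)
    for bucket, target in zip(buckets, targets):
        if bucket is None:  # None marks a missing value
            continue
        events[bucket] += target
        totals[bucket] += 1
    return max(totals, key=lambda b: events[b] / totals[b])


counts = [1, 2, 2, 1, 4, 2, math.nan, 1, 3]
targets = [0, 0, 1, 0, 1, 0, 1, 0, 1]
# Hypothetical bucketing: 1 -> bucket 0, 2 -> bucket 1, 3 and 4 -> bucket 2.
value_to_bucket = {1: 0, 2: 1, 3: 2, 4: 2}
buckets = [None if math.isnan(c) else value_to_bucket[c] for c in counts]

risky = most_risky_bucket(buckets, targets)
filled = [risky if b is None else b for b in buckets]
print(filled)
```

Under this reading, no row should remain in bucket -1 after the transform.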

Missing Treatment Propagation

When missing_treatment is 'separate', the missing bucket is not kept separate in the bucketing process, but is merged into a normal bucket.

Reproducing the error :

from skorecard.bucketers import DecisionTreeBucketer, OrdinalCategoricalBucketer, OptimalBucketer
from skorecard.pipeline import BucketingProcess
from sklearn.pipeline import make_pipeline
from skorecard.datasets import load_credit_card
from sklearn.model_selection import train_test_split
import numpy as np
from skorecard import Skorecard

data = load_credit_card(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['y'], axis=1),
    data['y'], 
    test_size=0.25, 
    random_state=42
)

# Add random null values 
for col in X_train.columns:
    X_train.loc[X_train.sample(frac=0.1).index, col] = np.nan
    X_test.loc[X_test.sample(frac=0.1).index, col] = np.nan

# Select the features
selected_feats = ['x1','x5','x6','x2','x3','x4']
cat_cols = ['x2','x3','x4']
num_cols = ['x1','x5','x6']
specials = {'x1':{'special_demo':[50000]}}

# Create a bucketing pipeline
# For the categorical features, the missing treatment is kept as 'separate'

prebucketing_pipeline=make_pipeline( 
    DecisionTreeBucketer(variables=num_cols, max_n_bins=40, #loose requirements
                        min_bin_size=0.03,missing_treatment='most_risky'
                        ),
    OrdinalCategoricalBucketer(variables=cat_cols, tol = 0.02,missing_treatment='separate')
)

bucketing_pipeline=make_pipeline(
    OptimalBucketer(variables=num_cols, max_n_bins=6,min_bin_size=0.05,missing_treatment='most_risky'),
    OptimalBucketer(
        variables=cat_cols,
        variables_type='categorical',
        max_n_bins=10,
        min_bin_size=0.05,missing_treatment='separate')
)


bucketing_process = BucketingProcess(
    prebucketing_pipeline=prebucketing_pipeline,
    bucketing_pipeline=bucketing_pipeline,
    specials=specials,
)

bucketing_process = bucketing_process.fit(X_train[selected_feats], y_train)

# Create a skorecard model.

scorecard = Skorecard(
    bucketing_process,
    selected_features=selected_feats

)
scorecard = scorecard.fit(X_train[selected_feats], y_train)

# Look at the buckets
scorecard.prebucket_table("x3")

scorecard.bucket_table("x3")

Output of prebucket stage :
image

Output of the bucket stage :
image

As we can see, in the pre-bucketing stage bucket -1 contains 2250 missing values, but in the bucketing stage they are put into bucket 1 instead of bucket -1 (the missing bucket). The missing bucket also shows 0 values, which is not the desired output.

Add Weight Plot

From Satya:

"Also a suggestion, the coefficients have their own intervals(usually 5% and 95%), it would be great to have a weight plot. I would like to contribute in this issue."

Fix inconsistencies with WoE

After the changes related to the WoEEncoder in #53 , there are still parts of the package that rely on the "old" calculation of WoE.
Namely, this refers to the functions in

scorecard.metrics.metrics

that are also used in the reporting functionality for IV, PSI etc.

Ideally, one would like that the WoE is calculated via the WOEEncoder everywhere, to avoid potential inconsistencies.

Update to the documentation might be needed as well. Check:

  • howto/psi_and_iv.ipynb
  • howto/Optimizations.ipynb

OptimalBucketer constraint setting

skorecard.OptimalBucketer has a kwargs argument which passes keyword arguments to optbinning.OptimalBinning.

This means you can set constraints like this:

from skorecard.bucketers import OptimalBucketer
b = OptimalBucketer(monotonic_trend='concave')

Going through the list of arguments in optbinning.OptimalBinning, some of them make sense to expose directly to users:

  • monotonic_trend
  • solver
  • gamma

Note that we have already hard-coded some of these values (bad!)

solver="cp",
monotonic_trend="auto_asc_desc",
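
One way to avoid hard-coding these values is to treat them as defaults that user-supplied keyword arguments can override. A minimal sketch, assuming a plain dictionary merge is sufficient; `DEFAULT_BINNING_KWARGS` and `resolve_binning_kwargs` are hypothetical names, and the `gamma` value is only an example.

```python
# Defaults taken from the hard-coded values mentioned above.
DEFAULT_BINNING_KWARGS = {"solver": "cp", "monotonic_trend": "auto_asc_desc"}


def resolve_binning_kwargs(user_kwargs=None):
    """Merge user-supplied keyword arguments over the defaults."""
    return {**DEFAULT_BINNING_KWARGS, **(user_kwargs or {})}


resolved = resolve_binning_kwargs({"monotonic_trend": "concave", "gamma": 0.5})
print(resolved)
```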

Improve training time performance of Skorecard()

If you look at the benchmarks, the training time of Skorecard is quite terrible:

image

We would have to profile it, but I suspect there are two main reasons:

  1. skorecard/linear_model/linear_model.py always calculating & saving the covariance matrix
  2. detecting numerical and categorical columns in _setup , for example this line:
    low_uniques = X.nunique()[X.nunique() < 10]

For 1), we might add a flag to be able to turn this off? Especially when grid searching or doing some CV benchmarks, these statistics are unnecessary to calculate.
For 2) we might find better tricks and defaults

Docs: Optimizations

Create / update docs/howto/Optimizations.ipynb for grid searching
We should discuss what the options are + when to use which.

Implement `.run_interactive()` on bucketers

Two observations:

  • we are starting to have a consistent design across bucketers, bucketing_process and skorecard (.summary(), .bucket_table(), .plot_bucket())
  • Skorecard() design is evolving nicely, where you input any bucketing pipeline and then get an estimator and all the reports with it.

The current dash app design works like this:

tweaker = BucketTweakerApp(pipeline, X, y)
tweaker.run_server()

Instead, I propose we use the following API:

bucketer.run_interactive(X, y)
bucketing_process.run_interactive(X, y)
skorecard.run_interactive(X, y)

Design

Should be a very simple UX, run inline in a notebook by default:

image

bucketing_process.run_interactive(X, y) will be a slightly more complex app (the one we have now) with pre-bucketing and bucketing tables & plots shown.

Besides setting bucket numbers, we'll also need to allow for setting missing_bucket and other_bucket (only for categoricals). specials will need to be defined manually in the code.

Superfluous dependency

Since this commit (https://github.com/ing-bank/skorecard/commit/e70f8fcc11743da93064695e7de7c65d8599b375), explicit support for python 3.6 is dropped. However, setup.py still installs the dataclasses package, which is deprecated for python >= 3.7. It causes some conflicts for other packages. Could we remove this package from setup?

bucket_process save in yml

Many thanks for your useful skorecard package. It helped me a lot.
But I ran into an issue when saving and loading my_bucket_process to/from a yaml file, as shown below. Could you help me check and resolve this issue? Thank you.

Decision Tree Value Error

from the hackathon

import pandas as pd
from sklearn.model_selection import train_test_split
from skorecard.bucketers import DecisionTreeBucketer

train = pd.read_csv("train.csv").drop("RiskPerformance", axis=1)

target = ['target']
features = [f for f in train.columns if f not in target]

X = train[features]
y = train[target]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=411114)

dt_bucketer = DecisionTreeBucketer(variables=features)
dt_bucketer.fit(X_train, y_train)

outputs:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-0bb9739818a8> in <module>
      2 
      3 dt_bucketer = DecisionTreeBucketer(variables=features)
----> 4 dt_bucketer.fit(X_train, y_train)

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/bucketers/base_bucketer.py in fit(self, X, y)
    241         self.features_bucket_mapping_ = FeaturesBucketMapping(features_bucket_mapping_)
    242 
--> 243         self._generate_summary(X, y)
    244 
    245         return self

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/reporting/report.py in _generate_summary(self, X, y)
    208         # Calculate information value
    209         if y is not None:
--> 210             iv_scores = iv(self.transform(X), y)
    211         else:
    212             iv_scores = {}

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/reporting/report.py in iv(X, y, epsilon, digits)
    357         IVs (dict): Keys are feature names, values are the IV values
    358     """  # noqa
--> 359     return {col: _IV_score(y, X[col], epsilon=epsilon, digits=digits) for col in X.columns}

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/reporting/report.py in <dictcomp>(.0)
    357         IVs (dict): Keys are feature names, values are the IV values
    358     """  # noqa
--> 359     return {col: _IV_score(y, X[col], epsilon=epsilon, digits=digits) for col in X.columns}

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/metrics/metrics.py in _IV_score(y_test, y_pred, epsilon, digits)
     66 
     67     """
---> 68     df = woe_1d(y_pred, y_test, epsilon=epsilon)
     69 
     70     iv = ((df["non_target"] - df["target"]) * df["woe"]).sum()

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/metrics/metrics.py in woe_1d(X, y, epsilon)
     23     if not isinstance(y, pd.Series):
     24         if y.shape[0] == X.shape[0]:
---> 25             y = pd.Series(y, index=X.index)
     26         else:
     27             raise ValueError(f"y has {y.shape[0]}, but expected {X.shape[0]}")

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    353             name = ibase.maybe_extract_name(name, data, type(self))
    354 
--> 355             if is_empty_data(data) and dtype is None:
    356                 # gh-17261
    357                 warnings.warn(

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/pandas/core/construction.py in is_empty_data(data)
    792     is_none = data is None
    793     is_list_like_without_dtype = is_list_like(data) and not hasattr(data, "dtype")
--> 794     is_simple_empty = is_list_like_without_dtype and not data
    795     return is_none or is_simple_empty
    796 

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1533     def __nonzero__(self):
   1534         raise ValueError(
-> 1535             f"The truth value of a {type(self).__name__} is ambiguous. "
   1536             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1537         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().


bucket_table()-method WoE-inconsistency

The Skorecard object has a bucket_table() method which showcases WoE-values per bucket. This feature is quite useful for reporting. Unfortunately, the WoE-values in the table are inconsistent with the WoE-values that are obtained by applying the Skorecard's woe_transform() method to the feature.

The latter values, which are also used when predicting, are obtained from category_encoders WoE-transform. The former are implemented by hand in line 126 of skorecard.reporting.report.py. The difference lies in the way that the regularisation-factor epsilon is implemented in both cases (not just the value of epsilon, but also where it shows up in the equation for the WoE).
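
To illustrate how the placement of epsilon alone can change the result, here are two hypothetical WoE variants. Neither is claimed to reproduce the formula in skorecard or in category_encoders; the sign convention (non-events over events) and all counts are assumptions made for the example.

```python
import math


def woe_eps_on_rates(n_nonevent, n_event, total_nonevent, total_event, epsilon):
    """Variant A: epsilon is added to the class shares."""
    return math.log(
        (n_nonevent / total_nonevent + epsilon) / (n_event / total_event + epsilon)
    )


def woe_eps_on_counts(n_nonevent, n_event, total_nonevent, total_event, epsilon):
    """Variant B: epsilon is added to the raw counts before normalising."""
    return math.log(
        ((n_nonevent + epsilon) / total_nonevent) / ((n_event + epsilon) / total_event)
    )


# Hypothetical counts for a single bucket.
args = dict(n_nonevent=30, n_event=5, total_nonevent=800, total_event=200)
woe_a = woe_eps_on_rates(**args, epsilon=0.5)
woe_b = woe_eps_on_counts(**args, epsilon=0.5)
print(woe_a, woe_b)
```

With epsilon set to 0 both variants coincide, so any difference comes purely from where the regularisation term enters.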

Add benchmarks in discussion

The UCI creditcard dataset is quite common. We can compare default skorecard.Skorecard() with more modern approaches like keras (https://keras.io/examples/structured_data/imbalanced_classification/), lightGBM, RandomForest and a basic LogisticRegression.

We can also include comparable packages like

Also interesting is to have a look at the benchmark from OptBinning, which also includes some timings.

missing value similar

Add a missing-value option 'similar': put missing values in the bucket whose WoE is closest to the WoE of the missing bucket.
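
A minimal sketch of the selection rule, assuming the WoE of the missing values and of each regular bucket is already known. `similar_bucket` is a hypothetical helper and the WoE values are made up.

```python
def similar_bucket(missing_woe, woe_per_bucket):
    """Return the bucket whose WoE is closest to the WoE of the missing values."""
    return min(
        woe_per_bucket, key=lambda bucket: abs(woe_per_bucket[bucket] - missing_woe)
    )


# Hypothetical WoE values per regular bucket.
woe_per_bucket = {0: -0.42, 1: 0.05, 2: 0.31}

target_bucket = similar_bucket(missing_woe=0.25, woe_per_bucket=woe_per_bucket)
print(target_bucket)
```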

scikitlearn compatibility

Not all our estimators are compatible with scikit learn. For example, the usage of **kwargs in DecisionTreeClassifier:

from sklearn.utils.estimator_checks import check_estimator
from skorecard.bucketers import DecisionTreeBucketer

check_estimator(DecisionTreeBucketer())

We should:

  • Update unit tests to run check_estimator on all estimators
  • Fix any issues that arise

I found out about this one because using Skorecard() inside cross_validate produced an error.

add feature selection method

Hi skorecard team,
1. How should I deal with logistic regression feature coefficients that are positive (the features are WoE-transformed)?
2. What is a good way to select scorecard features so that all feature coefficients are positive?
Thanks

OptBucketer Index error

From the hackathon, trying to simply fit on the data

from sklearn.model_selection import train_test_split
from skorecard.bucketers import OptimalBucketer

target = ['target']
features = [f for f in train.columns if f not in target]

X = train[features]
y = train[target]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=411114)

bucketer = OptimalBucketer(max_n_bins=20, variables=features)
bucketer.fit(X_train, y_train)

outputs:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-27fef8132e80> in <module>
     10 
     11 bucketer = OptimalBucketer(max_n_bins=20, variables=features, min_bin_size=0.001)
---> 12 bucketer.fit(X_train, y_train)

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/bucketers/base_bucketer.py in fit(self, X, y)
    187             # This method is implemented by each bucketer
    188             assert isinstance(X_flt, pd.Series)
--> 189             splits, right = self._get_feature_splits(feature, X=X_flt, y=y_flt, X_unfiltered=X)
    190 
    191             # Deal with missing values

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/bucketers/bucketers.py in _get_feature_splits(self, feature, X, y, X_unfiltered)
    154             **self.kwargs,
    155         )
--> 156         binner.fit(X.values, y)
    157 
    158         # Extract fitted boundaries

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/optbinning/binning/binning.py in fit(self, x, y, sample_weight, check_input)
    526             Fitted optimal binning.
    527         """
--> 528         return self._fit(x, y, sample_weight, check_input)
    529 
    530     def fit_transform(self, x, y, sample_weight=None, metric="woe",

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/optbinning/binning/binning.py in _fit(self, x, y, sample_weight, check_input)
    693             self.dtype, x, y, self.special_codes, self.cat_cutoff,
    694             self.user_splits, check_input, self.outlier_detector,
--> 695             self.outlier_params, None, None, self.class_weight, sample_weight)
    696 
    697         self._time_preprocessing = time.perf_counter() - time_preprocessing

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/optbinning/preprocessing.py in split_data(dtype, x, y, special_codes, cat_cutoff, user_splits, check_input, outlier_detector, outlier_params, fix_lb, fix_ub, class_weight, sample_weight)
    190         clean_mask = ~missing_mask
    191 
--> 192         x_clean = x[clean_mask]
    193         y_clean = y[clean_mask]
    194         x_missing = x[missing_mask]

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Add feature importance method to Skorecard class

The feature importance in the context of the skorecard model is the feature IV*coef of the logistic regression.

Let's do this calculation within the Skorecard class.
In terms of code, it's similar to this:

X_train_bins = scorecard.bucket_transform(X_train)
iv_dict = iv(X_train_bins, y_train)

iv_values = pd.Series(iv_dict).sort_values(ascending=False)
iv_values.name="IV"

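# model_stats is assumed to be the coefficient table from scorecard.get_stats()
model_stats = scorecard.get_stats()
feat_importance = model_stats[['Coef.']].join(iv_values)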
feat_importance['importance'] = -1.*feat_importance['Coef.']*feat_importance['IV']
feat_importance.sort_values(by='importance', ascending=False)

Support sample weights when bucketing

Dear skorecard team, are there bucketers that support weights?

E.g. we undersampled our negative target for modelling, and now we need to apply a weight in this class to get proper default rates in each bucket.

I looked at the DecisionTreeBucketer but it doesn’t seem to support that.
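
To make the request concrete, this is a sketch of the weighted default rate per bucket, computed as sum(w * y) / sum(w). It is plain Python and independent of any bucketer; the 1-in-10 undersampling ratio and the helper name `weighted_event_rate` are assumptions for illustration.

```python
from collections import defaultdict


def weighted_event_rate(buckets, targets, weights):
    """Return {bucket: sum(w * y) / sum(w)} for binary targets y."""
    weighted_events = defaultdict(float)
    weight_totals = defaultdict(float)
    for bucket, target, weight in zip(buckets, targets, weights):
        weighted_events[bucket] += weight * target
        weight_totals[bucket] += weight
    return {b: weighted_events[b] / weight_totals[b] for b in weight_totals}


buckets = [0, 0, 0, 1, 1, 1]
targets = [0, 0, 1, 0, 1, 1]
# Hypothetical setup: negatives were undersampled 1-in-10,
# so each retained negative represents 10 original rows.
weights = [10.0 if t == 0 else 1.0 for t in targets]

rates = weighted_event_rate(buckets, targets, weights)
print(rates)
```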

random_state

skorecard = Skorecard(random_state=random_state)

Will throw a TypeError: __init__() got an unexpected keyword argument 'random_state'.

Let's:

  • make sure this argument is supported explicitly in all bucketers, SkorecardPipeline, BucketingProcess and Skorecard,
  • and passed along properly when used in SkorecardPipeline, BucketingProcess and Skorecard

Identify Suppressor Effect

Sometimes the following scenario is observed:
A feature has a positive co-efficient when used in a univariate setting, but the co-efficient sign is inverted when it is used in a multivariate model. This can happen even when the features are only mildly correlated.

Proposed Solution :

Create a function to validate the co-efficients used in the final model.
This can be done by fitting a univariate model and checking the co-efficient signs against those in the final model.
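
The comparison step of the proposed check could look like the sketch below. It assumes the univariate and multivariate co-efficients have already been fitted and collected into dictionaries; `sign_flips` is a hypothetical helper and the numbers are invented.

```python
def sign_flips(univariate_coefs, multivariate_coefs):
    """Return features whose coefficient sign differs between the two fits."""
    flipped = []
    for feature, multi in multivariate_coefs.items():
        uni = univariate_coefs[feature]
        if uni * multi < 0:  # strictly opposite signs
            flipped.append(feature)
    return flipped


# Hypothetical coefficients: one univariate fit per feature vs. one joint fit.
univariate = {"x6": 0.91, "x16": 0.12, "x17": 0.30}
multivariate = {"x6": 0.75, "x16": -0.56, "x17": 0.29}

suspects = sign_flips(univariate, multivariate)
print(suspects)
```

Features returned by the check are candidates for a suppressor effect and could be reported as a warning rather than an error.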

WoeEncoder can produce wrong value

I am struggling with ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). and managed to trace the cause to WoeEncoder.

We should test that transformer more thoroughly.

`save_yml` to accept `fout` as string

Now we have to supply a file reference:

bucketing_process.save_yml(open('buckets.yml','w'))

Let's also allow a string path:

bucketing_process.save_yml('buckets.yml')

Where internally we do something like:

from pathlib import Path

if isinstance(fout, str):
    fout = Path(fout)
...
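
A self-contained sketch of the dispatch on the argument type, using only the standard library. `save_text` is a hypothetical stand-in for the writing part of save_yml, and the YAML string is hard-coded instead of being generated from a bucketing process.

```python
import io
import tempfile
from pathlib import Path


def save_text(fout, text):
    """Write text to fout, which may be a str path, a Path, or an open file object."""
    if isinstance(fout, (str, Path)):
        with open(fout, "w", encoding="utf-8") as handle:
            handle.write(text)
    else:
        fout.write(text)  # assume a file-like object opened for writing


# Existing behaviour: pass an open file-like object.
buffer = io.StringIO()
save_text(buffer, "x1: [1.0, 2.0]\n")

# Proposed behaviour: pass a plain string path.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = str(Path(tmp_dir) / "buckets.yml")
    save_text(target, "x1: [1.0, 2.0]\n")
    written = Path(target).read_text(encoding="utf-8")

print(written)
```

Opening the file inside the function also ensures it is closed again, which the `open('buckets.yml','w')` call at the call site does not guarantee.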
