ing-bank / skorecard

scikit-learn compatible tools for building credit risk acceptance models

Home Page: https://ing-bank.github.io/skorecard/

License: MIT License

Python 100.00%

scikit-learn scorecard scorecard-model scorecards credit-risk creditrisk machine-learning logistic-regression

skorecard's Introduction

skorecard

skorecard is a scikit-learn compatible python package that helps streamline the development of credit risk acceptance models (scorecards).

Scorecards are ‘traditional’ models used by banks in the credit decision process. Internally, scorecards are Logistic Regression models that make use of features that are binned into different groups. The process of binning is usually done manually by experts, and skorecard provides tools to make this process easier. skorecard is built on top of scikit-learn as well as other excellent open source projects like optbinning, dash and plotly.

👉 Read the blogpost introducing skorecard

Features ⭐

Automate bucketing of features inside scikit-learn pipelines.
Dash webapp to help manually tweak bucketing of features with business knowledge
Extension to sklearn.linear_model.LogisticRegression that is also able to report p-values
Plots and reports to speed up analysis and writing technical documentation.

Quick demo

skorecard offers a range of bucketers:

import pandas as pd
from skorecard.bucketers import EqualWidthBucketer

df = pd.DataFrame({'column' : range(100)})

ewb = EqualWidthBucketer(n_bins=5)
ewb.fit_transform(df)

ewb.bucket_table('column')
#>    bucket                       label  Count  Count (%)
#> 0      -1                     Missing    0.0        0.0
#> 1       0                (-inf, 19.8]   20.0       20.0
#> 2       1                (19.8, 39.6]   20.0       20.0
#> 3       2  (39.6, 59.400000000000006]   20.0       20.0
#> 4       3  (59.400000000000006, 79.2]   20.0       20.0
#> 5       4                 (79.2, inf]   20.0       20.0

Bucketers also support a Dash app to explore and update bucket boundaries:

ewb.fit_interactive(df)
#> Dash app running on http://127.0.0.1:8050/

Installation

pip3 install skorecard

Documentation

See ing-bank.github.io/skorecard/.

Presentations

Title	Host	Date	Speaker(s)
Skorecard: Making logistic regressions great again	ING Data Science Meetup	10 June 2021	Daniel Timbrell, Sandro Bjelogrlic, Tim Vink

skorecard's Issues

missing value neutral

Add a missing-value option 'neutral' that places missing values in the bucket whose WoE is closest to 0, i.e. min(abs(WoE)).
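A minimal sketch of the selection rule, assuming the per-bucket WoE values are already available as a plain dict. The function name `neutral_bucket` is hypothetical and not part of skorecard:

```python
def neutral_bucket(woe_per_bucket):
    """Return the bucket id whose WoE is closest to zero."""
    if not woe_per_bucket:
        raise ValueError("woe_per_bucket must not be empty")
    # min(abs(WoE)) over all buckets, as described in the issue
    return min(woe_per_bucket, key=lambda bucket: abs(woe_per_bucket[bucket]))


# Illustrative WoE values, not taken from a real dataset
woe = {0: -0.82, 1: 0.05, 2: 0.41}
chosen = neutral_bucket(woe)
print(chosen)  # 1
```

Missing values would then be mapped to the returned bucket instead of the separate missing bucket.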

Mismatches between ScoreCardPoints object and calibrate_to_master_scale scores

Please excuse the way that I reported this issue. This is my first time reporting a GitHub issue. I get different results from the ScoreCardPoints object. The scores using calibrate_to_master_scale on the proba_train are different from the score using scp.transform(X_train). I believe the calibrate_to_master_scale scores were right.

EDIT: I tried following the last tutorial example 'Scorecard Model' and I encounter the same problem. Going through the example, I noticed that the coefficients from scorecard.get_stats() are negative and the scorecard.woe_transform(X_test) are positive values but I get positive coefficients and negative scorecard.woe_transform(X_test).

Check the following images. In this example, I used a single categorical variable educational attainment versus default rate. Thank you!

Missing dependency

Skorecard relies on the optbinning package, but optbinning is not in the base_packages list of setup.py. It should be.

Error when using with `cross_val_score`

bucketing_process = BucketingProcess(
        prebucketing_pipeline=make_pipeline(
                DecisionTreeBucketer(max_n_bins=100, min_bin_size=0.05),
        ),
        bucketing_pipeline=make_pipeline(
                OptimalBucketer(max_n_bins=10, min_bin_size=0.05),
        ),
        specials=specials
)

pipe = make_pipeline(
    bucketing_process,
    StandardScaler(),
    LogisticRegression(solver="liblinear", random_state=0)
)

cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')

Gives me an error:

ValueError: Specials should be defined on the BucketingProcess level, remove the specials from DecisionTreeBucketer

Make the package more visible online

Currently it is very difficult to find this page online, you cannot find it if you google:

skorecard
skorecard ing bank
skorecard github

I can only find this repo by first finding the pypi page - at the bottom :( .

It might be an indexing issue at Google, since it takes some time for new pages to be indexed, but I think the visibility of this package can be improved by:

modifying text of the page to include more of the relevant keywords that would be picked up by google
Adding more links between pages e.g. docs -> github, github -> docs, but also other websites pointing to this repo
writing a blogpost about it :)

OrdinalCategoricalBucketer test fails

================================== FAILURES ==================================
____________________________ test_encoding_method ____________________________

df =       EDUCATION  MARRIAGE  LIMIT_BAL  BILL_AMT1  default pet_ownership
0             1         2   400000.0   201800.0...        0       no pets
5999          2         1   410000.0    71532.0        1     dog lover

[6000 rows x 6 columns]

    def test_encoding_method(df):
        """Test the encoding method."""
        X = df[["EDUCATION", "default"]]
        y = df["default"]

        ocb = OrdinalCategoricalBucketer(tol=0.03, variables=["EDUCATION"], encoding_method="frequency")
        ocb.fit(X, y)

        assert ocb.features_bucket_mapping_.get("EDUCATION").map == {1: 1, 2: 0, 3: 2}

        ocb = OrdinalCategoricalBucketer(tol=0.03, variables=["EDUCATION"], encoding_method="ordered")
        ocb.fit(X, y)

>       assert ocb.features_bucket_mapping_.get("EDUCATION").map == {1: 2, 2: 0, 3: 1}
E       assert {1: 0, 2: 2, 3: 1} == {1: 2, 2: 0, 3: 1}
E         Omitting 1 identical items, use -vv to show
E         Differing items:
E         {1: 0} != {1: 2}
E         {2: 2} != {2: 0}
E         Use -v to get the full diff

tests/test_bucketer_OrdinalCategoricalBucketer.py:66: AssertionError

Feature co-efficient signs are inverted between version 0.7.1 and 1.4.0

When I try to run the following code, I see that the model co-efficients are inverted between version 0.7.1 and 1.4.0

from skorecard.bucketers import DecisionTreeBucketer, OptimalBucketer
from skorecard.pipeline import BucketingProcess
from sklearn.pipeline import make_pipeline
from skorecard.datasets import load_credit_card
from sklearn.model_selection import train_test_split
from skorecard import Skorecard
from sklearn.metrics import roc_auc_score

#Load data
data = load_credit_card(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['y'], axis=1),
    data['y'], 
    test_size=0.25, 
    random_state=42
)
#Select features
selected_feats = ['x6','x8','x10','x18','x1','x19','x20','x21','x23','x22','x3','x17','x16']
even_cat_cols = ['x6','x8']
odd_cat_cols = ['x19']
num_cols = ['x10','x18','x1','x20','x21','x23','x22','x3','x17','x16']
#Create bucketing process
prebucketing_pipeline=make_pipeline( 
    DecisionTreeBucketer(variables=selected_feats, max_n_bins=40, #loose requirements
                        min_bin_size=0.03
                        ),
   
)

bucketing_pipeline=make_pipeline(
    OptimalBucketer(variables=selected_feats, max_n_bins=6,min_bin_size=0.05,missing_treatment='most_risky'),
)
bucketing_process = BucketingProcess(
    prebucketing_pipeline=prebucketing_pipeline,
    bucketing_pipeline=bucketing_pipeline,
)

bucketing_process = bucketing_process.fit(X_train[selected_feats], y_train)

# Train skorecard version 0.7.1
# scorecard = Skorecard(
#     bucketing_process,
#     selected_features=selected_feats
# )

# Train skorecard version 1.4.0
scorecard = Skorecard(
    bucketing_process,
    variables=selected_feats,
    calculate_stats=True

)
scorecard.fit(X_train[selected_feats], y_train)

# Results
proba_train = scorecard.predict_proba(X_train[selected_feats])[:,1]
proba_test = scorecard.predict_proba(X_test[selected_feats])[:,1]
print(f"AUC train:{round(roc_auc_score(y_train, proba_train),4)}")
print(f"AUC test :{round(roc_auc_score(y_test, proba_test),4)}\n")
print(scorecard.get_stats())

Results version 0.7.1

AUC train:0.7679
AUC test :0.7626

Features	Coef.	Std.Err	z	P>z
const	-1.24442	0.0182161	-68.3145	0
x6	-0.753524	0.0208574	-36.1274	8.4108e-286
x8	-0.277921	0.034096	-8.15111	3.60589e-16
x10	-0.31137	0.0381928	-8.1526	3.56189e-16
x18	-0.277205	0.0510278	-5.43243	5.55932e-08
x1	-0.350234	0.0484375	-7.23063	4.80774e-13
x19	-0.210958	0.0573138	-3.68075	0.000232548
x20	-0.279283	0.0665349	-4.19754	2.69828e-05
x21	-0.138776	0.0806698	-1.7203	0.0853775
x23	-0.176563	0.0754207	-2.34104	0.0192301
x22	-0.146805	0.0807149	-1.81882	0.0689397
x3	-0.0831014	0.136821	-0.607372	0.543604
x17	-0.294193	0.334516	-0.879459	0.379153
x16	0.56809	0.407881	1.39278	0.163685

Results version 1.4.0

AUC train:0.7679
AUC test :0.7626

Features	Coef.	Std.Err	z	P>z
const	-1.24575	0.0182208	-68.3695	0
x6	0.753233	0.0208504	36.1256	8.98644e-286
x8	0.277778	0.0340932	8.14762	3.71169e-16
x10	0.311173	0.0381806	8.15003	3.63845e-16
x18	0.277392	0.0510811	5.43042	5.62211e-08
x1	0.350658	0.0484606	7.23593	4.62344e-13
x19	0.211326	0.057416	3.68061	0.000232678
x20	0.279587	0.0665905	4.1986	2.68564e-05
x21	0.138937	0.0808025	1.71946	0.0855306
x23	0.176326	0.0755183	2.33488	0.0195497
x22	0.146274	0.080873	1.80869	0.0704997
x3	0.0832047	0.136923	0.607673	0.543404
x17	0.295899	0.334592	0.884356	0.376504
x16	-0.562766	0.408206	-1.37863	0.168008

Maybe this is also related to #68 .

Add remainder argument to the UserInputBucketer

The UserInputBucketer should have a remainder argument as well, like all the other bucketers.

Align WoE

The bucket_table function is reporting a WoE different to the woe_1d function. Which is correct?

The WoE function should be done in one place.

make_pipeline with BucketingProcess and Bucketer

via Anil:

If you need to pre-bucket and post-bucket numerical columns, but only bucket categoricals, how to achieve this?

BucketingPipeline raises an error:

msg = f"The following columns are bucketed but have not been pre-bucketed: {', '.join(not_prebucketed)}.\n"
            msg += "Consider adding an AsIsNumericalBucketer or AsIsCategoricalBucketer to the prebucketing pipeline.\n"
            msg += "Or add an additional bucketing step after the BucketingProcess:\n"
            msg += "make_pipeline(BucketingProcess(..), Bucketer())"

Then calling:

make_pipeline(
BucketingProcess(*args), 
OptimalBucketer(*args)
)

Gives:

NotBucketObjectError: All bucketing steps must be skorecard bucketers. Remove BucketingProcess

Docs: Update docs page for missing values

Create / update docs/tutorials/missing_values.ipynb
We should discuss what the options are + when to use which.

Create a new minor release for skorecard

Update the setup.py file for creating a new minor release for skorecard.

OrdinalCategoricalBucketer number of buckets

Everything by default is in 1 bucket, 'Other'

bucketer = OrdinalCategoricalBucketer(max_n_categories=3, variables=['EDUCATION'], encoding_method='ordered')
bucketer.fit_transform(X, y).head()['EDUCATION'].unique()
bucketer.bucket_table('EDUCATION')

As opposed to:

bucketer = OrdinalCategoricalBucketer(max_n_categories=3, variables=['EDUCATION'], encoding_method='ordered', tol=0)
bucketer.fit_transform(X, y).head()['EDUCATION'].unique()
bucketer.bucket_table('EDUCATION')

Unusable kwargs should throw an error

These ob_kwargs should throw an error, but they don't currently

O = OptimalBucketer(variables=num_cols, ob_kwargs={'uwotm7': 9, "LOOK_AT_ME":4})
O.fit(X_train[num_cols], y_train)
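One possible way to reject unknown keyword arguments early is to compare them against the signature of the wrapped callable. This is only a sketch: `check_kwargs` and `fake_binner` are hypothetical names, and `fake_binner` merely stands in for the binning class that receives the kwargs.

```python
import inspect


def check_kwargs(func, kwargs):
    """Raise TypeError if kwargs contains names that func does not accept."""
    params = inspect.signature(func).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return  # func accepts **kwargs itself, so nothing can be validated
    unknown = sorted(set(kwargs) - set(params))
    if unknown:
        raise TypeError(f"Unexpected keyword arguments: {', '.join(unknown)}")


def fake_binner(solver="cp", monotonic_trend="auto", gamma=0.0):
    """Hypothetical stand-in for the wrapped binning class."""
    return solver, monotonic_trend, gamma


check_kwargs(fake_binner, {"solver": "mip"})  # valid name, passes silently

message = None
try:
    check_kwargs(fake_binner, {"uwotm7": 9, "LOOK_AT_ME": 4})
except TypeError as err:
    message = str(err)
print(message)
```

Running the check in the constructor or at the start of `.fit()` would surface the typo before any binning work starts.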

Add remainder parameter to BucketingProcess

most_risky missing treatment not working

import numpy as np
import pandas as pd
from skorecard.bucketers import EqualFrequencyBucketer
X = pd.DataFrame({'counts': [1, 2, 2, 1, 4, 2, np.nan, 1, 3]})
y = pd.DataFrame({'target': [0, 0, 1, 0, 1, 0, 1, 0, 1]})
EqualFrequencyBucketer(n_bins=3, missing_treatment='most_risky').fit_transform(X, y)

The output puts the missing values into bucket -1, which it shouldn't do

Refactor bucketers `.fit()` to reduce code duplication

missing value - riskiest bucket option

Add the missing option 'riskiest' such that the missing values are placed into the riskiest bucket
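A rough sketch of how the riskiest bucket could be picked, assuming "riskiest" means the highest event rate and that event and non-event counts per bucket are known. `riskiest_bucket` is a hypothetical helper, not an existing skorecard function:

```python
def riskiest_bucket(counts_per_bucket):
    """Return the bucket id with the highest event rate.

    counts_per_bucket maps bucket id -> (n_events, n_non_events);
    every bucket is assumed to contain at least one observation.
    """
    def event_rate(bucket):
        events, non_events = counts_per_bucket[bucket]
        return events / (events + non_events)

    return max(counts_per_bucket, key=event_rate)


# Illustrative counts, not taken from a real dataset
counts = {0: (5, 95), 1: (20, 80), 2: (12, 88)}
riskiest = riskiest_bucket(counts)
print(riskiest)  # 1
```

Replacing `max` with `min` would give the bucket with the lowest event rate instead.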

Missing Treatment Propagation

In case the missing_treatment is 'separate', the missing bucket is not kept separate in the bucketing process, but is merged into a normal bucket.

Reproducing the error :

from skorecard.bucketers import DecisionTreeBucketer, OrdinalCategoricalBucketer, OptimalBucketer
from skorecard.pipeline import BucketingProcess
from sklearn.pipeline import make_pipeline
from skorecard.datasets import load_credit_card
from sklearn.model_selection import train_test_split
import numpy as np
from skorecard import Skorecard

data = load_credit_card(as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['y'], axis=1),
    data['y'], 
    test_size=0.25, 
    random_state=42
)

# Add random null values 
for col in X_train.columns:
    X_train.loc[X_train.sample(frac=0.1).index, col] = np.nan
    X_test.loc[X_test.sample(frac=0.1).index, col] = np.nan

# Select the features
selected_feats = ['x1','x5','x6','x2','x3','x4']
cat_cols = ['x2','x3','x4']
num_cols = ['x1','x5','x6']
specials = {'x1':{'special_demo':[50000]}}

# Create a bucketing pipeline
# For the categorical features, the missing treatment is kept as 'separate'

prebucketing_pipeline=make_pipeline( 
    DecisionTreeBucketer(variables=num_cols, max_n_bins=40, #loose requirements
                        min_bin_size=0.03,missing_treatment='most_risky'
                        ),
    OrdinalCategoricalBucketer(variables=cat_cols, tol = 0.02,missing_treatment='separate')
)

bucketing_pipeline=make_pipeline(
    OptimalBucketer(variables=num_cols, max_n_bins=6,min_bin_size=0.05,missing_treatment='most_risky'),
    OptimalBucketer(
        variables=cat_cols,
        variables_type='categorical',
        max_n_bins=10,
        min_bin_size=0.05,missing_treatment='separate')
)


bucketing_process = BucketingProcess(
    prebucketing_pipeline=prebucketing_pipeline,
    bucketing_pipeline=bucketing_pipeline,
    specials=specials,
)

bucketing_process = bucketing_process.fit(X_train[selected_feats], y_train)

# Create a skorecard model.

scorecard = Skorecard(
    bucketing_process,
    selected_features=selected_feats

)
scorecard = scorecard.fit(X_train[selected_feats], y_train)

# Look at the buckets
scorecard.prebucket_table("x3")

scorecard.bucket_table("x3")

Output of prebucket stage :

Output of the bucket stage :

As we can see, in the pre-bucketing stage bucket -1 contains 2250 missing values, but in the bucketing stage they are put into bucket 1 instead of bucket -1 (the missing bucket). The missing bucket also shows 0 values, which is not the desired output.

Display correlation and VIF for the model features

Add correlation and VIF values for the Scorecard Model. These can be a part of model reporting.

Add Weight Plot

From Satya:

"Also a suggestion, the coefficients have their own intervals(usually 5% and 95%), it would be great to have a weight plot. I would like to contribute in this issue."

Fix inconsistencies with WoE

After the changes related to the WoEEncoder in #53 , there are still parts of the package that rely on the "old" calculation of WoE.
Namely, this refers to the functions in

scorecard.metrics.metrics

that are also used in the reporting functionality for IV, PSI etc.

Ideally, one would like that the WoE is calculated via the WOEEncoder everywhere, to avoid potential inconsistencies.

Update to the documentation might be needed as well. Check:

howto/psi_and_iv.ipynb
howto/Optimizations.ipynb

OptimalBucketer constraint setting

skorecard.OptimalBucketer has a kwargs argument which passes keyword arguments to optbinning.OptimalBinning.

This means you can set constraints like this:

from skorecard.bucketers import OptimalBucketer
b = OptimalBucketer(monotonic_trend='concave')

Going through the list of arguments in optbinning.OptimalBinning, some of them make sense to expose directly to users:

monotonic_trend
solver
gamma

Note that we have already hard-coded some of these values (bad!)

skorecard/skorecard/bucketers/bucketers.py

Lines 145 to 146 in a1cc593

 solver="cp", 

 monotonic_trend="auto_asc_desc",

Improve training time performance of Skorecard()

If you look at the benchmarks, the training time of Skorecard is quite terrible:

We would have to profile it, but I suspect there are two main reasons:

1) skorecard/linear_model/linear_model.py always calculating & saving the covariance matrix
2) detecting numerical and categorical columns in _setup, for example this line:

skorecard/skorecard/skorecard.py

Line 227 in ccd6f0c

low_uniques = X.nunique()[X.nunique() < 10]

For 1), we might add a flag to be able to turn this off? Especially when grid searching or doing some CV benchmarks, these statistics are unnecessary to calculate.
For 2) we might find better tricks and defaults

Docs: Optimizations

Create / update docs/howto/Optimizations.ipynb for grid searching
We should discuss what the options are + when to use which.

Table methods report the wrong IV

When calling .table(), all the information values are negative.
They should normally be positive quantities.
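For reference, a small sketch of why IV is expected to be non-negative, and how a flipped WoE sign convention would make every term negative. It assumes IV = sum((non_event - event) * WoE) with WoE = ln(non_event / event); the `woe_sign` parameter is hypothetical and only there to illustrate a convention mismatch, which is one possible cause and not a confirmed diagnosis:

```python
import math


def information_value(event_share, non_event_share, woe_sign=1):
    """IV = sum((non_event - event) * WoE), WoE = woe_sign * ln(non_event / event)."""
    total = 0.0
    for event, non_event in zip(event_share, non_event_share):
        woe = woe_sign * math.log(non_event / event)
        total += (non_event - event) * woe
    return total


# Illustrative shares per bucket (each list sums to 1)
event_share = [0.5, 0.3, 0.2]      # share of all events in each bucket
non_event_share = [0.2, 0.3, 0.5]  # share of all non-events in each bucket

iv_consistent = information_value(event_share, non_event_share)
iv_flipped = information_value(event_share, non_event_share, woe_sign=-1)
print(iv_consistent > 0, iv_flipped < 0)  # True True
```

With a consistent convention, (non_event - event) and ln(non_event / event) always share the same sign, so each term is non-negative.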

Implement `.run_interactive()` on bucketers

Two observations:

we are starting to have a consistent design across bucketers, bucketing_process and skorecard (.summary(), .bucket_table(), .plot_bucket())
Skorecard() design is evolving nicely, where you input any bucketing pipeline and then get an estimator and all the reports with it.

The current dash app design works like this:

tweaker = BucketTweakerApp(pipeline, X, y)
tweaker.run_server()

Instead, I propose we use the following API:

bucketer.run_interactive(X, y)
bucketing_process.run_interactive(X, y)
skorecard.run_interactive(X, y)

Design

Should be a very simple UX, run inline in a notebook by default:

bucketing_process.run_interactive(X, y) will be a slightly more complex app (the one we have now) with pre-bucketing and bucketing tables & plots shown.

Next to setting buckets numbers, we'll also need to allow for setting missing_bucket and other_bucket (only for categoricals). specials will need to be defined manually in the code.

Skorecard() class is slow

From the benchmarks notebook, Skorecard is now 'very' slow:

Superfluous dependency

Since this commit (https://github.com/ing-bank/skorecard/commit/e70f8fcc11743da93064695e7de7c65d8599b375), explicit support for Python 3.6 is dropped. However, setup.py still installs the dataclasses package, which is only needed on Python 3.6 (dataclasses is part of the standard library from Python 3.7 onwards). It causes some conflicts with other packages. Could we remove this package from setup.py?

bucket_process save in yml

Many thanks for your useful skorecard package. It helped me a lot.
But I ran into an issue when saving and loading my_bucket_process to/from a YAML file, as shown below. Could you help me check and resolve this issue? Thank you.

Decision Tree Value Error

from the hackathon

from skorecard.bucketers import DecisionTreeBucketer

train = pd.read_csv("train.csv").drop("RiskPerformance", axis=1)

target = ['target']
features = [f for f in train.columns if f not in target]

X = train[features]
y = train[target]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=411114)

dt_bucketer = DecisionTreeBucketer(variables=features)
dt_bucketer.fit(X_train, y_train)

outputs:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-0bb9739818a8> in <module>
      2 
      3 dt_bucketer = DecisionTreeBucketer(variables=features)
----> 4 dt_bucketer.fit(X_train, y_train)

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/bucketers/base_bucketer.py in fit(self, X, y)
    241         self.features_bucket_mapping_ = FeaturesBucketMapping(features_bucket_mapping_)
    242 
--> 243         self._generate_summary(X, y)
    244 
    245         return self

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/reporting/report.py in _generate_summary(self, X, y)
    208         # Calculate information value
    209         if y is not None:
--> 210             iv_scores = iv(self.transform(X), y)
    211         else:
    212             iv_scores = {}

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/reporting/report.py in iv(X, y, epsilon, digits)
    357         IVs (dict): Keys are feature names, values are the IV values
    358     """  # noqa
--> 359     return {col: _IV_score(y, X[col], epsilon=epsilon, digits=digits) for col in X.columns}

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/reporting/report.py in <dictcomp>(.0)
    357         IVs (dict): Keys are feature names, values are the IV values
    358     """  # noqa
--> 359     return {col: _IV_score(y, X[col], epsilon=epsilon, digits=digits) for col in X.columns}

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/metrics/metrics.py in _IV_score(y_test, y_pred, epsilon, digits)
     66 
     67     """
---> 68     df = woe_1d(y_pred, y_test, epsilon=epsilon)
     69 
     70     iv = ((df["non_target"] - df["target"]) * df["woe"]).sum()

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/metrics/metrics.py in woe_1d(X, y, epsilon)
     23     if not isinstance(y, pd.Series):
     24         if y.shape[0] == X.shape[0]:
---> 25             y = pd.Series(y, index=X.index)
     26         else:
     27             raise ValueError(f"y has {y.shape[0]}, but expected {X.shape[0]}")

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    353             name = ibase.maybe_extract_name(name, data, type(self))
    354 
--> 355             if is_empty_data(data) and dtype is None:
    356                 # gh-17261
    357                 warnings.warn(

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/pandas/core/construction.py in is_empty_data(data)
    792     is_none = data is None
    793     is_list_like_without_dtype = is_list_like(data) and not hasattr(data, "dtype")
--> 794     is_simple_empty = is_list_like_without_dtype and not data
    795     return is_none or is_simple_empty
    796 

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
   1533     def __nonzero__(self):
   1534         raise ValueError(
-> 1535             f"The truth value of a {type(self).__name__} is ambiguous. "
   1536             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1537         )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

bucket_table()-method WoE-inconsistency

The Skorecard object has a bucket_table() method which showcases WoE-values per bucket. This feature is quite useful for reporting. Unfortunately, the WoE-values in the table are inconsistent with the WoE-values that are obtained by applying the Skorecard's woe_transform() method to the feature.

The latter values, which are also used when predicting, are obtained from category_encoders WoE-transform. The former are implemented by hand in line 126 of skorecard.reporting.report.py. The difference lies in the way that the regularisation-factor epsilon is implemented in both cases (not just the value of epsilon, but also where it shows up in the equation for the WoE).
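To illustrate how the placement of epsilon alone can change the result, here is a sketch with two regularised WoE variants. Both formulas are assumptions made for illustration; neither is claimed to be the exact formula used in skorecard or category_encoders:

```python
import math


def woe_eps_on_shares(events, non_events, total_events, total_non_events, epsilon=1e-4):
    """Variant A (assumed): epsilon is added to each distribution share."""
    p_event = events / total_events
    p_non_event = non_events / total_non_events
    return math.log((p_non_event + epsilon) / (p_event + epsilon))


def woe_eps_on_counts(events, non_events, total_events, total_non_events, epsilon=0.5):
    """Variant B (assumed): epsilon is added to the raw counts before normalising."""
    p_event = (events + epsilon) / (total_events + 2 * epsilon)
    p_non_event = (non_events + epsilon) / (total_non_events + 2 * epsilon)
    return math.log(p_non_event / p_event)


# Same bucket, same counts, two different regularisation schemes
woe_a = woe_eps_on_shares(5, 95, 100, 900)
woe_b = woe_eps_on_counts(5, 95, 100, 900)
print(round(woe_a, 4), round(woe_b, 4))
```

The two values differ for the same bucket, which is the kind of mismatch a single shared WoE implementation would avoid.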

Add benchmarks in discussion

The UCI creditcard dataset is quite common. We can compare default skorecard.Skorecard() with more modern approaches like keras (https://keras.io/examples/structured_data/imbalanced_classification/), lightGBM, RandomForest and a basic LogisticRegression.

We can also include comparable packages like

Also interesting is to have a look at the benchmark from OptBinning, which also includes some timings.

Add the option to plot either default rate or WoE

Add a parameter to .plot_buckets() to use default rate or WoE.

Default rate should be the default option.

missing value similar

Add a missing-value option 'similar' that places missing values in the bucket whose WoE is closest to the WoE of the missing bucket.

scikit-learn compatibility

Not all our estimators are compatible with scikit-learn. For example, the usage of **kwargs in DecisionTreeClassifier:

from sklearn.utils.estimator_checks import check_estimator
from skorecard.bucketers import DecisionTreeBucketer

check_estimator(DecisionTreeBucketer())

We should:

Update unit tests to run check_estimator on all estimators
Fix any issues that arise

I found out about this one because using Skorecard() inside cross_validate produced an error.

add feature selection method

Hi skorecard team:
1. How to deal with a logistic regression feature coefficient that is positive (features are transformed by WoE)?
2. What is a good way to select scorecard features to ensure that all feature coefficients are positive?
Thanks

add percentage event and non-event to bucket table

Helps keep track of the relative sizes of 1s and 0s
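A sketch of the two extra columns, assuming the percentages are meant relative to the column totals (share of all events and share of all non-events that fall into each bucket). `event_shares` is a hypothetical helper name:

```python
def event_shares(counts_per_bucket):
    """Return {bucket: (event %, non-event %)} relative to the column totals.

    counts_per_bucket maps bucket id -> (n_events, n_non_events).
    """
    total_events = sum(e for e, _ in counts_per_bucket.values())
    total_non_events = sum(n for _, n in counts_per_bucket.values())
    return {
        bucket: (100 * e / total_events, 100 * n / total_non_events)
        for bucket, (e, n) in counts_per_bucket.items()
    }


# Illustrative counts: 100 events and 400 non-events in total
shares = event_shares({0: (10, 190), 1: (30, 110), 2: (60, 100)})
print(shares[0])  # (10.0, 47.5)
```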

OptBucketer Index error

From the hackathon, trying to simply fit on the data

from skorecard.bucketers import OptimalBucketer

target = ['target']
features = [f for f in train.columns if f not in target]

X = train[features]
y = train[target]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=411114)

bucketer = OptimalBucketer(max_n_bins=20, variables=features)
bucketer.fit(X_train, y_train)

outputs:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-6-27fef8132e80> in <module>
     10 
     11 bucketer = OptimalBucketer(max_n_bins=20, variables=features, min_bin_size=0.001)
---> 12 bucketer.fit(X_train, y_train)

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/bucketers/base_bucketer.py in fit(self, X, y)
    187             # This method is implemented by each bucketer
    188             assert isinstance(X_flt, pd.Series)
--> 189             splits, right = self._get_feature_splits(feature, X=X_flt, y=y_flt, X_unfiltered=X)
    190 
    191             # Deal with missing values

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/skorecard/bucketers/bucketers.py in _get_feature_splits(self, feature, X, y, X_unfiltered)
    154             **self.kwargs,
    155         )
--> 156         binner.fit(X.values, y)
    157 
    158         # Extract fitted boundaries

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/optbinning/binning/binning.py in fit(self, x, y, sample_weight, check_input)
    526             Fitted optimal binning.
    527         """
--> 528         return self._fit(x, y, sample_weight, check_input)
    529 
    530     def fit_transform(self, x, y, sample_weight=None, metric="woe",

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/optbinning/binning/binning.py in _fit(self, x, y, sample_weight, check_input)
    693             self.dtype, x, y, self.special_codes, self.cat_cutoff,
    694             self.user_splits, check_input, self.outlier_detector,
--> 695             self.outlier_params, None, None, self.class_weight, sample_weight)
    696 
    697         self._time_preprocessing = time.perf_counter() - time_preprocessing

~/miniconda3/envs/skorecard_py37/lib/python3.7/site-packages/optbinning/preprocessing.py in split_data(dtype, x, y, special_codes, cat_cutoff, user_splits, check_input, outlier_detector, outlier_params, fix_lb, fix_ub, class_weight, sample_weight)
    190         clean_mask = ~missing_mask
    191 
--> 192         x_clean = x[clean_mask]
    193         y_clean = y[clean_mask]
    194         x_missing = x[missing_mask]

IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed

Add feature importance method to Skorecard class

The feature importance in the context of the skorecard model is the feature IV*coef of the logistic regression.

Let's make this calculation within the Skorecard class.
In terms of code, it's similar to this:

X_train_bins = scorecard.bucket_transform(X_train)
iv_dict = iv(X_train_bins, y_train)

iv_values = pd.Series(iv_dict).sort_values(ascending=False)
iv_values.name="IV"

feat_importance = model_stats[['Coef.']].join(iv_values)
feat_importance['importance'] = -1.*feat_importance['Coef.']*feat_importance['IV']
feat_importance.sort_values(by='importance', ascending=False)

Check missing value bucket assignment

Notice missing bucket is put together with bucket 0. Let's check to make sure this is correct.

@sbjelogr Do you have the code that generated this table?

Support sample weights when bucketing

Dear skorecard team, are there bucketers that support weights?

E.g. we undersampled our negative target for modelling, and now we need to apply a weight in this class to get proper default rates in each bucket.

I looked at the DecisionTreeBucketer but it doesn’t seem to support that.
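As a sketch of what weight support would need to produce, the weighted default rate per bucket is sum(w * y) / sum(w). The helper below is hypothetical and works on already-bucketed values; it does not use any skorecard API:

```python
from collections import defaultdict


def weighted_event_rates(buckets, targets, weights):
    """Weighted event rate per bucket: sum(w * y) / sum(w)."""
    weighted_events = defaultdict(float)
    weight_totals = defaultdict(float)
    for bucket, target, weight in zip(buckets, targets, weights):
        weighted_events[bucket] += weight * target
        weight_totals[bucket] += weight
    return {b: weighted_events[b] / weight_totals[b] for b in weight_totals}


# Illustrative data: negatives were undersampled, so each one carries weight 4
buckets = [0, 0, 0, 1, 1, 1]
targets = [1, 0, 0, 1, 1, 0]
weights = [1.0, 4.0, 4.0, 1.0, 1.0, 4.0]

rates = weighted_event_rates(buckets, targets, weights)
print(rates)
```

In this toy example the unweighted rate of bucket 0 would be 1/3, while the weighted rate is 1/9.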

plot bucket is separate in skorecard

We still need the plot_bucket(line='woe') to work in Skorecard.py

random_state

skorecard = Skorecard(random_state=random_state)

Will throw a TypeError: __init__() got an unexpected keyword argument 'random_state'.

Let's:

make sure this argument is supported explicitly in all bucketers, SkorecardPipeline, BucketingProcess and Skorecard,
and passed along properly when used in SkorecardPipeline, BucketingProcess and Skorecard

Identify Suppressor Effect

Sometimes the following scenario is observed:
A feature has a positive coefficient when used in a univariate setting, but the coefficient sign is inverted when the feature is used in a multivariate model. This can happen even when the features are only mildly correlated.

Proposed Solution :

Create a function to validate the co-efficients used in the final model.
This can be done by fitting a univariate model and checking the co-efficient signs against those in the final model.
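The comparison step of the proposal could look roughly like this, assuming the univariate and multivariate coefficients have already been fitted and collected into dicts. `find_sign_flips` is a hypothetical name and the numbers are made up for illustration:

```python
def find_sign_flips(univariate_coefs, multivariate_coefs):
    """Return features whose coefficient sign differs between the two fits."""
    flipped = []
    for feature, uni in univariate_coefs.items():
        multi = multivariate_coefs.get(feature)
        if multi is None:
            continue  # feature is not part of the final model
        if uni * multi < 0:
            flipped.append(feature)
    return flipped


# Made-up coefficients for three hypothetical features
univariate = {"age": 0.35, "income": 0.12, "tenure": 0.29}
multivariate = {"age": 0.33, "income": -0.56, "tenure": 0.30}

flips = find_sign_flips(univariate, multivariate)
print(flips)  # ['income']
```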

Update docs with other category encoding package usage

Just found this one:

https://github.com/scikit-learn-contrib/category_encoders

And of course there are others, like base scikit learn:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html#sklearn.preprocessing.KBinsDiscretizer

and:

https://feature-engine.readthedocs.io/en/latest/

WoeEncoder can produce wrong value

I am struggling with ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). and managed to get the cause down to WoeEncoder.

We should test that transformer more thoroughly.

Add rich repl

New feature in rich we can add to our classes.

This makes the output in Jupyter notebooks and ipy look much better. At the cost of an extra dependency.

https://rich.readthedocs.io/en/latest/pretty.html#rich-repr-protocol

`save_yml` to accept `fout` as string

Now we have to supply a file reference:

bucketing_process.save_yml(open('buckets.yml','w'))

Let's also allow a string path:

bucketing_process.save_yml('buckets.yml')

Where internally we do something like:

from pathlib import Path

if isinstance(fout, str):
    fout = Path(fout)
....
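A slightly fuller sketch of the same idea, written as a standalone helper so it can be run without skorecard. `write_text_to` is a hypothetical name; the real `save_yml` would serialise the bucket mapping instead of writing a fixed string:

```python
import io
import tempfile
from pathlib import Path


def write_text_to(fout, text):
    """Write text to fout, which may be a path (str or Path) or an open file handle."""
    if isinstance(fout, (str, Path)):
        # A path was given: open and close the file ourselves
        with open(fout, "w") as handle:
            handle.write(text)
    else:
        # Assume a file-like object; the caller stays responsible for closing it
        fout.write(text)


content = "x1: [1, 2]\n"

# File-handle style, as supported today
buffer = io.StringIO()
write_text_to(buffer, content)

# String-path style, as proposed
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "buckets.yml")
    write_text_to(target, content)
    from_path = Path(target).read_text()

print(from_path == buffer.getvalue())  # True
```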

missing value - safest bucket option

See #9 but here we want the safest option