pythonpredictions / cobra

A Python package to build predictive linear and logistic regression models focused on performance and interpretation

Home Page: https://pythonpredictions.github.io/cobra.io

License: MIT License

Languages: Python 20.46%, Jupyter Notebook 79.48%, Makefile 0.06%
Topics: linear-regression, logistic-regression, predictive-analytics-for-business

cobra's People

Contributors

hendrikdewinter8, janbenisek, matthiasroelspython, patrickleonardy, pythongeert, sandervh14, sborms, zlatansky


cobra's Issues

Model_building - univariate_selection

  • Since AUC cannot be applied to a continuous target, we need another metric. Since we want univariate selection, RMSE could be used instead. Review this before continuing.
  • Afterwards, extend the function with a model_type parameter which can be [classification, regression]; the function then returns either RMSE or AUC. The ordering has to be correct (we want the highest AUC, but the lowest RMSE) - see the sketch below. The function in question:
    def compute_univariate_preselection(target_enc_train_data: pd.DataFrame,
  • Document and write/modify unit tests
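
A hedged sketch of the proposed metric switch, assuming scikit-learn metrics are acceptable; the helper name and signature are illustrative, not the final compute_univariate_preselection API:

    from sklearn.metrics import mean_squared_error, roc_auc_score

    def univariate_metric(y_true, y_score, model_type: str = "classification") -> float:
        # AUC for classification (higher is better), RMSE for regression (lower is better)
        if model_type == "classification":
            return roc_auc_score(y_true, y_score)
        return mean_squared_error(y_true, y_score) ** 0.5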

train/selection/validation split error

Floating point issue when separating the data: in floating point arithmetic, 0.7 + 0.2 + 0.1 evaluates to 0.9999999999999999 rather than exactly 1.0,

which raises the error:

if train_prop + selection_prop + validation_prop != 1.0:
--> 385             raise ValueError("The sum of train_prop, selection_prop and "
    386                              "validation_prop cannot differ from 1.0")

The code is located here:

if train_prop + selection_prop + validation_prop != 1.0:
    raise ValueError("The sum of train_prop, selection_prop and "
                     "validation_prop cannot differ from 1.0")

Improve documentation

Make the documentation more precise. Based on feedback from our users, these are the points they were missing:

  • if X happens, do Y or Z (for example, a variable being dropped when it ends up with only 6 bins even though we set 10)
  • be more explicit in what Cobra does and does not do
    • AUC is being optimized
    • no interaction variables
    • if model fails, try this or that
    • ...
  • why it works
    • why logistic regression (and not decision tree)
    • why we bin continuous variables (we could keep them as they are)
    • why we do incidence replacement
    • link to papers/more fundamental research
  • how some steps work
    • show on example how regrouping works
  • add link to Geert's paper
  • explain how preprocessing works
    • regrouping
    • binning

Discretizer gives error with NaNs

(reported by user, to be investigated)

When fitting the preprocessor, some continuous variables gave a ValueError.
I suspect this happens if a continuous variable of type np.float64 has only NaN values (in the full dataset or in one of the splits).
The error is probably raised because pandas cannot set up the interval index properly.

To be investigated (a screenshot and traceback.txt are attached to the issue).

Reorganise evaluation

In a previous phase, model evaluation was decoupled from the model building phase. In the latter, only one metric is used for optimisation, whereas in the former many metrics can be used to evaluate the final model. After adding the option to use different algorithms (SVM, linear regression with a continuous target, ...) for building your model, there should also be different options for evaluation, as there are different types of models:

  • Binary target with binary scoring: a basic evaluator class for binary models (accuracy, F1, precision, recall, ...)
  • Binary target with probabilistic scoring: the basic evaluator (with a threshold on the probability) plus AUC, lift, gains and lift curves, ROC curve, ...
  • Multi class model
  • Ordinal regression models
  • continuous target
  • ...

All of these types of models can have different Evaluator classes, so it is important to use inheritance as much as possible, as well as splitting off the proper utility functions for plotting. A rough sketch of such a hierarchy is given below.
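
A minimal sketch of how the hierarchy could look, assuming scikit-learn metrics; all class and method names are illustrative, not the package's actual API:

    from abc import ABC, abstractmethod

    import numpy as np
    from sklearn.metrics import (f1_score, mean_squared_error, r2_score,
                                 roc_auc_score)

    class BaseEvaluator(ABC):

        @abstractmethod
        def scalar_metrics(self, y_true, y_score) -> dict:
            """Return a dict mapping metric name to value."""

    class BinaryClassificationEvaluator(BaseEvaluator):

        def __init__(self, probability_cutoff: float = 0.5):
            self.probability_cutoff = probability_cutoff

        def scalar_metrics(self, y_true, y_score) -> dict:
            # threshold the probabilistic scores for the class-based metrics
            y_pred = (np.asarray(y_score) >= self.probability_cutoff).astype(int)
            return {"AUC": roc_auc_score(y_true, y_score),
                    "F1": f1_score(y_true, y_pred)}

    class RegressionEvaluator(BaseEvaluator):

        def scalar_metrics(self, y_true, y_score) -> dict:
            return {"RMSE": mean_squared_error(y_true, y_score) ** 0.5,
                    "R2": r2_score(y_true, y_score)}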

Improve PIGs

After some discussion with Geert, below are his suggestions:

Incidence graph:

  • the colors don't really match our slide template AND I believe the frequency bars should be less visually attractive (e.g. grey), while the incidence line should really stand out (blue).
  • from this graph it is impossible to guess which Y-axis (left vs right) applies to the incidence, and which applies to the proportion of cases. Usually, I would align the color of each axis with its graph color - i.e. grey for population size on the left and blue for incidence on the right.
  • I would personally put incidence on the left instead of the right, as that is the most important information and we are typically used to looking at the left axis first. I would also clean up the graph labels a little so they don't contain .0 (left, right and bottom)...
  • make sure we always refer to them as Predictor Insights Graphs!

Attached is an example of how the plot could look; a rough styling sketch also follows below.

(attachment: PIG_age_v2)
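
A hypothetical matplotlib sketch of the suggested styling (incidence on the left axis in blue, frequency bars on the right axis in grey, each axis coloured to match its series); this is an illustration, not the package's actual plotting code:

    import matplotlib.pyplot as plt

    def plot_pig(bin_labels, proportions, incidences):
        fig, ax_incidence = plt.subplots(figsize=(10, 5))
        ax_frequency = ax_incidence.twinx()

        # grey bars for proportion of cases (right axis), blue line for incidence (left axis)
        ax_frequency.bar(bin_labels, proportions, color="lightgrey", label="proportion of cases")
        ax_incidence.plot(bin_labels, incidences, color="tab:blue", marker="o", label="incidence")

        # colour each axis like the series it belongs to
        ax_incidence.set_ylabel("incidence", color="tab:blue")
        ax_frequency.set_ylabel("proportion of cases", color="grey")
        ax_incidence.tick_params(axis="y", colors="tab:blue")
        ax_frequency.tick_params(axis="y", colors="grey")

        # draw the incidence line on top of the bars
        ax_incidence.set_zorder(ax_frequency.get_zorder() + 1)
        ax_incidence.patch.set_visible(False)

        ax_incidence.set_title("Predictor Insights Graph")
        plt.show()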

Preprocessing - target_encoder

  • Discuss the methodology - how do we do incidence replacement for a continuous target? Currently we calculate the average per group and the size of the group and do the replacement; this should work for a continuous target as well (see the sketch after this list).
  • Document and write/modify unit tests
  • If time permits, this is a good opportunity to implement more sophisticated encoding - see #24 and #61
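
A minimal sketch of mean target encoding, which applies equally to binary (incidence) and continuous targets; the function and column names are illustrative only:

    import pandas as pd

    def fit_target_encoding(data: pd.DataFrame, column: str, target: str) -> pd.Series:
        # average target value per category, used later as the replacement value
        return data.groupby(column)[target].mean()

    def apply_target_encoding(data: pd.DataFrame, column: str, mapping: pd.Series) -> pd.Series:
        return data[column].map(mapping)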

Unexpected bin limit rounding

While reviewing a pull request, I tested the below method with some extra things.
I noticed this:

@pytest.mark.parametrize(
    "n_bins, auto_adapt_bins, data, expected",
    [
        (2,
         False,
         # Variable contains floats, nans and inf (e.g. "available resources" variable): WORKS
         pd.DataFrame({"variable": [7.5, 8.2, 14.9, np.inf]}),
         [(7.0, 12.0), (12.0, np.inf)])
    ],
    ids=["unexpected bin limit"])
def test_fit_column(self, n_bins, auto_adapt_bins, data, expected):
    discretizer = KBinsDiscretizer(n_bins=n_bins,
                                   auto_adapt_bins=auto_adapt_bins)

    actual = discretizer._fit_column(data, column_name="variable")

    assert actual == expected

Result:

[(8.0, 12.0), (12.0, inf)] != [(7.0, 12.0), (12.0, inf)]

Expected :[(7.0, 12.0), (12.0, inf)]
Actual :[(8.0, 12.0), (12.0, inf)]

The column minimum is 7.5, which is rounded up (!) in KBinsDiscretizer._compute_bins_from_edges() to format the lower (!) bin limit, although it should have been rounded down: column value 7.5 falls in a bin that starts at 7.0, not 8.0.

I thought of just changing

        for a, b in zip(bin_edges, bin_edges[1:]):
            fmt_a = round(a, precision)
            fmt_b = round(b, precision)

to just

        for a, b in zip(bin_edges, bin_edges[1:]):
            fmt_a = floor(a, precision)  # <====
            fmt_b = round(b, precision)

but that would break the handling of the precision, which apparently can even be negative:

compute the minimal precision of the bin_edges. This can be a negative number, which then rounds numbers to the nearest 10, 100, ...

So it is a minor bug, but floor() is not a good enough solution - I won't fix this right away with a commit in the pull request (it is not really related anyway), so I will log it here and continue with that pull request.
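
A hedged sketch of rounding a lower bin edge down at a given decimal precision, including negative precisions (rounding towards the nearest 10, 100, ...); this illustrates the idea, it is not a committed fix:

    import math

    def floor_to_precision(value: float, precision: int) -> float:
        factor = 10 ** precision
        return math.floor(value * factor) / factor

    floor_to_precision(7.5, 0)    # 7.0  -> the lower bin limit is no longer rounded up
    floor_to_precision(7.5, -1)   # 0.0  -> negative precision floors to the nearest 10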

Evaluation - pigs_tables

  • Add parameter model_type which can be [classification, regression]
  • Based on the parameter, change names (we no longer talk about incidence rate)
  • Document and write/modify unit tests

regroup_name does not work

When giving a custom name to the regrouped category (currently set to "Other"), the change does not take effect and "Other" keeps appearing.

It looks like the parameter is set in the class constructor:

def __init__(self, regroup: bool=True, regroup_name: str="Other",

but the value in the body is actually hard-coded:

return data.apply(lambda x: str(x) if x in categories else "Other")
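
A minimal sketch of the likely fix, assuming the value from __init__ is stored on the instance as self.regroup_name:

    return data.apply(lambda x: str(x) if x in categories else self.regroup_name)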

Custom number of bins per variable

Currently, the number of bins is always the same across all variables.
It would be useful to make this configurable per variable, e.g. 10 bins for variable A and 6 bins for variable B (a possible interface is sketched below).

However, this requires more complex changes - let's first assess the complexity.
Additionally, we need to assess how this fits into our methodology.
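
A hypothetical sketch of the requested interface: n_bins could accept either a single int (current behaviour) or a per-variable dict; the names are illustrative only:

    from typing import Union

    def resolve_n_bins(n_bins: Union[int, dict], column_name: str, default: int = 10) -> int:
        # per-variable override when a dict is passed, otherwise one global value
        if isinstance(n_bins, dict):
            return n_bins.get(column_name, default)
        return n_bins

    n_bins = {"variable_A": 10, "variable_B": 6}
    resolve_n_bins(n_bins, "variable_B")  # 6
    resolve_n_bins(n_bins, "variable_C")  # 10 (falls back to the default)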

Analyze and improve speed and memory consumption

We had a use case at Argenta where we worked with a table of about 300 columns and ~2 million rows.
There, the preprocessing took a lot of time and, especially, a lot of memory.

What we'd need is to find a dataset of similar size that is close to reality (a mixture of categorical, flag and continuous variables, with missing values) and measure how much memory Cobra uses and how slow it is (a rough measurement sketch is given below).

The issue occurs in preprocessor.fit() and preprocessor.transform() - but these do a lot behind the scenes, so I am trying to pinpoint the cause (is it the binning? The incidence replacement? Maybe the data types of the intermediate tables are not efficient and take too much memory ...).

Once we find the cause, we can figure out how to fix it.
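
A minimal measurement sketch using only the standard library; the fit/transform call signatures here are assumptions, not necessarily cobra's actual API:

    import time
    import tracemalloc

    def measure_fit_transform(preprocessor, basetable, continuous_vars, discrete_vars):
        """Report wall-clock time and peak memory of one fit + transform run."""
        tracemalloc.start()
        start = time.perf_counter()

        # hypothetical call signatures; adapt to the actual PreProcessor API
        preprocessor.fit(basetable, continuous_vars, discrete_vars, target_column_name="target")
        transformed = preprocessor.transform(basetable, continuous_vars, discrete_vars)

        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"elapsed: {elapsed:.1f} s, peak memory: {peak / 1e9:.2f} GB")
        return transformed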

Index error with categorical variables

(issue submitted by user)

During preprocessing, when a flag variable is constant (either in the full dataset or just in the train set), the following error occurs.
It needs to be investigated, reproduced and fixed.
It occurs when we call preprocessor.fit().
(error screenshot attached)

Preprocessing - TargetEncoder is dangerous

Congrats with the release of this package! I thought I'd contribute back a little with this issue.

The TargetEncoder strikes me as a dangerous transformation. While the docstring does openly say that it suffers from leakage, it gives the impression that this isn't a problem if you apply regularisation or cross-validation. I find that somewhat misleading and think the encoder should probably be avoided in general.

To illustrate the danger: imagine you have a dataset with only one data point x with corresponding label y, then it's clear that the TargetEncoder will encode x as the exact label y, even when applying regularisation! The issue is that each example x's target value y is used to encode x, and that remains true even as you increase the number of examples.
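
A tiny demonstration of that argument with plain mean target encoding (illustrative pandas code, not cobra's TargetEncoder itself):

    import pandas as pd

    # one example per category: the encoding reproduces each label exactly
    df = pd.DataFrame({"category": ["a", "b"], "target": [0, 1]})
    encoding = df.groupby("category")["target"].mean()
    print(df["category"].map(encoding))  # a -> 0.0, b -> 1.0: the labels leak through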

Let's say you want to deal with that issue by implementing a "LeaveOneOutTargetEncoder", which replaces each example's categorical value with the average target of the other examples that share the same categorical value (see e.g. [1]). That sounds a bit better because none of the examples are allowed to use their own target value to encode their features. But even this encoder suffers from leakage! To see this, imagine that the encoder encodes a category as the leave-one-out sum (instead of the average). The model could then learn the per-category target sums, and simply subtract an example x's leave-one-out sum from the per-category sum to predict the exact label y for the example x.

In general, any transformation that "inserts y into X" should be treated with a lot of scrutiny.

[1] https://contrib.scikit-learn.org/category_encoders/leaveoneout.html

Variable importance is empty

Sometimes the variable importance plot has empty fields - this occurred on the Titanic dataset.
We need to reproduce and locate this error.

Binning not constant

The binning in the current version differs from the original.

When we calculate the AUC in univariate selection, we get a list of variables and their corresponding AUC.
Now, look at this (left is the output from your code, right is from my code):

(comparison screenshot attached)

You can see that capital-loss has an AUC of 0.53, hence it is not chosen for forward selection, whilst in your version it has an AUC of 0.534, hence it is chosen.
Also, the order of the variables is slightly different (capital-gain and workclass).

That results in a slightly different model after forward selection. Although the difference is small, I can't find the reason for it.

I set the same seed, so the partition is the same in both versions.

I see that when I run the two programs, the bins are different. Below is a count of scont_1 grouped by its bins in the train partition. Even though both versions are somewhat similar, I believe this is the reason why the AUCs show these small differences.
(screenshot attached)

Hence the question: in the eqfreq function, is there any way to get different bins from the same data? Maybe some randomness or rounding?

Test if class is fitted on another attribute in categorical_data_processor.py

Currently, we test whether an instance has been fitted based on two attributes:

if self.regroup and len(self._cleaned_categories_by_column) == 0:
    msg = ("{} instance is not fitted yet. Call 'fit' with "
           "appropriate arguments before using this method.")
    raise NotFittedError(msg.format(self.__class__.__name__))

However, this will break if there is, for example, just one categorical variable and it has been skipped: fit() will run but will not populate self._cleaned_categories_by_column, and thus the error will be raised anyway.

Define the fitted state based on another attribute instead (for example like in the PreProcessor class, where we have a separate flag):

self._is_fitted = True # set fitted boolean to True
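
A minimal sketch of that approach; the method signatures here are simplified assumptions, not the actual ones:

    def fit(self, data, column_names):
        # ... existing fitting logic ...
        self._is_fitted = True  # set regardless of how many columns were kept

    def transform(self, data, column_names):
        if not self._is_fitted:
            msg = ("{} instance is not fitted yet. Call 'fit' with "
                   "appropriate arguments before using this method.")
            raise NotFittedError(msg.format(self.__class__.__name__))
        # ... existing transform logic ...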

Evaluation - evaluator

  • I'd say split into classification_evaluator and regression_evaluator classes.
  • Define metrics for regression case (inspiration here). As for plots, maybe QQ or residuals here. Some extra inspiration also here.
  • Document and write/modify unit tests

Improve import

Currently we must write cobra.cobra to import it; fix this so that a simple import cobra as c is enough.
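
A hedged sketch of the usual fix: re-export the public API from cobra/__init__.py (the class name below is hypothetical; list whatever the actual public classes are):

    # cobra/__init__.py
    from cobra.cobra import COBRA  # hypothetical class name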

Keep size of categories in JSON

(suggestion from a user)
The output JSON file could optionally store the size of each category - this allows data scientists to debug and verify the preprocessing.
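
A hypothetical example of what such an entry could look like, written as a Python dict (the keys are illustrative, not the actual export schema):

    {
        "variable": "education",
        "regrouped_categories": {"bachelor": 1053, "master": 402, "Other": 87}
    }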

Value is trying to set on a copy

During preprocessing, pandas raises the following warning:

A value is trying to be set on a copy of a slice from a DataFrame

Locate which function is causing the trouble and rewrite it so the warning no longer fires.
This is probably happening in multiple places!
(one instance was found, see the attached screenshot image001)
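
For reference, a minimal illustration of the pattern that typically triggers the warning and the usual fix (the DataFrame and column names are made up):

    import pandas as pd

    data = pd.DataFrame({"age": [25, 35, 45]})

    subset = data[data["age"] > 30]
    subset["age_bin"] = "30+"                # may raise SettingWithCopyWarning (chained assignment)

    subset = data[data["age"] > 30].copy()   # take an explicit copy instead
    subset["age_bin"] = "30+"                # no warning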

Speed improvement

  1. Find which parts of the code are the slowest
  2. Try to vectorize most functions
  3. Use more numpy
  4. We keep lots of DataFrames in memory, and the main DataFrame holds both the binned and the incidence-replaced columns - inefficient. Drop as much as possible

Train/selection/validation split is slow

In the current design, the train/sel/val split is slow because it creates many sub-dataframes, which are then concatenated.
In cases with larger tables (millions of rows, hundreds of columns), this takes too long and uses too much memory.

We could instead create a single column with three categories - train / selection / validation - and append it to the dataframe, without creating any new dataframe or concatenation (see the sketch below).

Before the solution is merged, it would be useful to measure the impact (for example with %timeit).

One challenge might be the implementation of the stratified split in this simplified version.
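
A minimal sketch of the idea, assuming a simple non-stratified random split; the function name and defaults are illustrative:

    import numpy as np
    import pandas as pd

    def add_split_column(data: pd.DataFrame, train_prop: float = 0.7,
                         selection_prop: float = 0.2,
                         random_state: int = 42) -> pd.DataFrame:
        rng = np.random.default_rng(random_state)
        u = rng.random(len(data))
        # label rows in place instead of building and concatenating sub-dataframes
        data["split"] = np.select(
            [u < train_prop, u < train_prop + selection_prop],
            ["train", "selection"],
            default="validation")
        return data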

Refactoring

Refactor eqfreq and incidenceReplacement - the code is messy.

AUC sorting inconsistency

We found a small inconsistency - in plot_univariate_predictor_quality() we sort by 'AUC train':

df = (df_auc[df_auc["preselection"]]
      .sort_values(by='AUC train', ascending=False))

while in compute_univariate_preselection() we sort by 'AUC selection':

return (df_auc.sort_values(by='AUC selection', ascending=False)
        .reset_index(drop=True))

It does not have any effect on the modelling, but the plot and preselected_predictors then list the variables in a different order, which is confusing.

Both should be sorted by 'AUC selection'.
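
The suggested change in plot_univariate_predictor_quality() would then simply mirror the other function's sort column:

    df = (df_auc[df_auc["preselection"]]
          .sort_values(by='AUC selection', ascending=False))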

Improve .csv import with C engine

Currently, the Python engine is used to import the .csv file. That is nice because it can infer datatypes, but it is slower. I believe we can gain some speed by using the C engine, but this has to be tested.
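
A minimal sketch of switching the parser; note that pandas can only use the C engine with an explicit single-character separator (the file name is made up):

    import pandas as pd

    df = pd.read_csv("basetable.csv", sep=",", engine="c")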

Finish unit testing

Right now, there are some unit tests in place. However, code coverage is still suboptimal, as several crucial parts of the code are not tested. This should be mitigated as soon as possible.

Variable importance - optional metrics

We compute variable importance by calculating Pearson's correlation between the model scores and the target-encoded variables:

importance_by_variable = {
    utils.clean_predictor_name(predictor): stats.pearsonr(
        data[predictor],
        y_pred
    )[0]
    for predictor in self.predictors
}

It would be nice to allow choosing a different correlation measure (like Kendall): Pearson assumes normality, which does not always hold for the variables considered. A sketch of a configurable metric follows below the link.

https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall
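
A minimal sketch of making the metric configurable via scipy.stats; the parameter name is hypothetical and the predictor-name cleaning from the snippet above is omitted for brevity:

    from scipy import stats

    correlation_functions = {
        "pearson": stats.pearsonr,
        "spearman": stats.spearmanr,
        "kendall": stats.kendalltau,
    }

    def compute_importance(data, predictors, y_pred, metric: str = "pearson") -> dict:
        corr = correlation_functions[metric]
        # each scipy function returns (statistic, p-value); keep the statistic
        return {predictor: corr(data[predictor], y_pred)[0]
                for predictor in predictors}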

Preprocessing should have option verbose

In cases where preprocessing takes longer, it would be useful to see what is happening inside.
We can implement a verbose boolean parameter which, when set to True, prints what is happening ("processing variable X ...").

Optionally, verbose can be set up in the same way as in sklearn - it accepts an integer, and the bigger the integer, the more details are printed. A rough sketch follows.
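
A minimal sketch following the sklearn-style integer convention (0 = silent); the class and method names here are simplified assumptions:

    class PreProcessor:

        def __init__(self, verbose: int = 0):
            self.verbose = verbose

        def fit(self, data, column_names):
            for column in column_names:
                if self.verbose > 0:
                    print(f"processing variable {column} ...")
                # ... existing per-column fitting logic ...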

Number of deciles in evaluator plots

Cumulative lift and cumulative response plots currently always show 10 bins (deciles); change this to be optional:

  • plot_cumulative_gains() should get an optional number of bins

  • plot_lift_curve() should get an optional number of bins

  • plot_cumulative_response_curve() should get an optional number of bins

Incidence replacement - Missing value imputation of previously unknown categories

In some cases, a rare category of a categorical variable may not be present in the train set but only in the selection/validation set. In such cases, the incidence-replaced variable will contain missing values. To score/evaluate the model, it is important to impute these missing values, either with:

  • the average incidence
  • the incidence of the category with the lowest (resp. highest) incidence rate

At the moment this is not implemented in the TargetEncoder; it should be fixed! A sketch of the first option is given below.
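
A minimal sketch of the first option (imputing unseen categories with the overall average incidence); the names are illustrative only:

    import pandas as pd

    def apply_encoding(column: pd.Series, mapping: pd.Series, global_incidence: float) -> pd.Series:
        # categories unseen during fit map to NaN and receive the overall incidence
        return column.map(mapping).fillna(global_incidence)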

Preprocessing - categorical_data_preprocessor

Provide functionality for scoring

Provide users with functions that allow easy scoring with a trained model. Currently, we export the data prep and model pipeline as a dict, but in order to produce scores, users need to calculate them manually.

How this will be implemented is to be discussed (perhaps as a predict_proba method on the model class? see the sketch after this list).

  • Discuss implementation
  • Implement
  • update documentation
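
A purely hypothetical sketch of what such a method could look like; the attribute names (preprocessor, logit, predictors) are assumptions about the model class, not the actual implementation:

    import pandas as pd

    def predict_proba(self, data: pd.DataFrame) -> pd.Series:
        # apply the stored preprocessing, then score with the fitted logistic model
        preprocessed = self.preprocessor.transform(data)
        scores = self.logit.predict_proba(preprocessed[self.predictors])[:, 1]
        return pd.Series(scores, index=data.index)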

Forward selection crashes sometimes

Forward selection sometimes throws an error when there is no positive coefficient - it happens when we force two weak variables (like scont1 and scont2) and they get negative coefficients. It happens seemingly at random - find the cause and fix it!

Comparison of multiple models in evaluator plots & metrics

When the user builds multiple different models, it would be handy to evaluate them at the same time. This means that the plotting functions:

  • evaluator.plot_roc_curve()
  • evaluator.plot_confusion_matrix()
  • evaluator.plot_cumulative_gains()
  • evaluator.plot_lift_curve()
  • evaluator.plot_cumulative_response_curve()

would have the option to overlay results from multiple models.

Preprocessing - preprocessor

First implement changes in target_encoder and categorical_data_processor before starting this.

Parameter sample_1/sample_0

Merge the sample_1/sample_0 parameters into one parameter - we always take all 1's and then sample the 0's.
The parameter can simply be a ratio of how many 0's to keep for each 1 (see the sketch below).
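
A hypothetical sketch of the merged parameter; the function name and defaults are illustrative:

    import pandas as pd

    def sample_basetable(data: pd.DataFrame, target: str,
                         sampling_ratio: float = 1.0,
                         random_state: int = 42) -> pd.DataFrame:
        ones = data[data[target] == 1]
        zeros = data[data[target] == 0]
        # keep all 1's, sample `sampling_ratio` 0's per 1 (capped at what is available)
        n_zeros = min(int(len(ones) * sampling_ratio), len(zeros))
        return pd.concat([ones, zeros.sample(n=n_zeros, random_state=random_state)])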

Add option to use different algorithms

In some of our projects, it is necessary to have a library in which you can deviate from ordinary regression models with a binary target. Our current methodology should work fine in those cases, so it is only natural to add these options to COBRA.

As the model is now encapsulated in a class and heavily used in the ForwardFeatureSelection class, it makes sense to implement these models by means of an (abstract) factory design pattern to make them easy to use. A rough sketch follows.
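
A rough sketch of the factory idea using scikit-learn estimators as the concrete backends; all class, method and key names here are illustrative, not cobra's actual API:

    from abc import ABC, abstractmethod

    from sklearn.linear_model import LinearRegression, LogisticRegression

    class Model(ABC):
        """Interface that ForwardFeatureSelection would depend on."""

        @abstractmethod
        def fit(self, X, y) -> None: ...

        @abstractmethod
        def score_model(self, X): ...

    class LogisticRegressionModel(Model):

        def __init__(self):
            self._estimator = LogisticRegression()

        def fit(self, X, y) -> None:
            self._estimator.fit(X, y)

        def score_model(self, X):
            return self._estimator.predict_proba(X)[:, 1]

    class LinearRegressionModel(Model):

        def __init__(self):
            self._estimator = LinearRegression()

        def fit(self, X, y) -> None:
            self._estimator.fit(X, y)

        def score_model(self, X):
            return self._estimator.predict(X)

    def create_model(model_type: str) -> Model:
        # simple factory: map a model_type string to a concrete Model class
        registry = {"classification": LogisticRegressionModel,
                    "regression": LinearRegressionModel}
        return registry[model_type]()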
