pythonpredictions / cobra

A Python package to build predictive linear and logistic regression models focused on performance and interpretation

Home Page: https://pythonpredictions.github.io/cobra.io

License: MIT License

Languages: Python 20.46%, Jupyter Notebook 79.48%, Makefile 0.06%
Topics: linear-regression, logistic-regression, predictive-analytics-for-business

cobra's People

Contributors

hendrikdewinter8, janbenisek, matthiasroelspython, patrickleonardy, pythongeert, sandervh14, sborms, zlatansky


cobra's Issues

Model_building - univariate_selection

  • Since AUC cannot be applied to a continuous target, we need another metric. Since we want univariate selection, RMSE could be used instead. Review this before continuing.
  • Afterwards, extend the function with a model_type parameter which can be [classification, regression]; the function then returns either RMSE or AUC. The ordering has to be correct (we want the highest AUC, but the lowest RMSE) - see the sketch below. The function in question:
    def compute_univariate_preselection(target_enc_train_data: pd.DataFrame,
  • Document and write/modify unit tests
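
A hedged sketch of the proposed metric switch, assuming scikit-learn metrics are acceptable; the helper name and signature are illustrative, not the final compute_univariate_preselection API:

    from sklearn.metrics import mean_squared_error, roc_auc_score

    def univariate_metric(y_true, y_score, model_type: str = "classification") -> float:
        # AUC for classification (higher is better), RMSE for regression (lower is better)
        if model_type == "classification":
            return roc_auc_score(y_true, y_score)
        return mean_squared_error(y_true, y_score) ** 0.5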

train/selection/validation split error

Floating point issue when separating the data: in floating point arithmetic, 0.7 + 0.2 + 0.1 evaluates to 0.9999999999999999 rather than exactly 1.0,

which raises the error:

if train_prop + selection_prop + validation_prop != 1.0:
--> 385             raise ValueError("The sum of train_prop, selection_prop and "
    386                              "validation_prop cannot differ from 1.0")

The code is located here:

if train_prop + selection_prop + validation_prop != 1.0:
    raise ValueError("The sum of train_prop, selection_prop and "
                     "validation_prop cannot differ from 1.0")

Improve documentation

Make the documentation more precise. Based on feedback from our users, these are the points they were missing:

  • if X happens, do Y or Z (for example, a variable being dropped when it ends up with only 6 bins even though we set 10)
  • be more explicit in what Cobra does and does not do
    • AUC is being optimized
    • no interaction variables
    • if model fails, try this or that
    • ...
  • why it works
    • why logistic regression (and not decision tree)
    • why we bin continuous variables (we could keep them as they are)
    • why we do incidence replacement
    • link to papers/more fundamental research
  • how some steps work
    • show on example how regrouping works
  • add link to Geert's paper
  • explain how preprocessing works
    • regrouping
    • binning

Discretizer gives error with NaNs

(reported by user, to be investigated)

When fitting the preprocessor, some continuous variables gave a ValueError.
I suspect this happens if a continuous variable of type np.float64 has only NaN values (in the full dataset or in one of the splits).
The error is probably raised because pandas cannot set up the interval index properly.

To be investigated (a screenshot and traceback.txt are attached to the issue).

Reorganise evaluation

In a previous phase, model evaluation was decoupled from the model building phase. In the latter, only one metric is used for optimisation, whereas in the former many metrics can be used to evaluate the final model. After adding the option to use different algorithms (SVM, linear regression with a continuous target, ...) for building your model, there should also be different options for evaluation, as there are different types of models:

  • Binary target with binary scoring: a basic evaluator class for binary models (accuracy, F1, precision, recall, ...)
  • Binary target with probabilistic scoring: the basic evaluator (with a threshold on the probability) plus AUC, lift, gains and lift curves, ROC curve, ...
  • Multi class model
  • Ordinal regression models
  • continuous target
  • ...

All of these types of models can have different Evaluator classes, so it is important to use inheritance as much as possible, as well as splitting off the proper utility functions for plotting. A rough sketch of such a hierarchy is given below.
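
A minimal sketch of how the hierarchy could look, assuming scikit-learn metrics; all class and method names are illustrative, not the package's actual API:

    from abc import ABC, abstractmethod

    import numpy as np
    from sklearn.metrics import (f1_score, mean_squared_error, r2_score,
                                 roc_auc_score)

    class BaseEvaluator(ABC):

        @abstractmethod
        def scalar_metrics(self, y_true, y_score) -> dict:
            """Return a dict mapping metric name to value."""

    class BinaryClassificationEvaluator(BaseEvaluator):

        def __init__(self, probability_cutoff: float = 0.5):
            self.probability_cutoff = probability_cutoff

        def scalar_metrics(self, y_true, y_score) -> dict:
            # threshold the probabilistic scores for the class-based metrics
            y_pred = (np.asarray(y_score) >= self.probability_cutoff).astype(int)
            return {"AUC": roc_auc_score(y_true, y_score),
                    "F1": f1_score(y_true, y_pred)}

    class RegressionEvaluator(BaseEvaluator):

        def scalar_metrics(self, y_true, y_score) -> dict:
            return {"RMSE": mean_squared_error(y_true, y_score) ** 0.5,
                    "R2": r2_score(y_true, y_score)}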

Improve PIGs

After some discussion with Geert, below are his suggestions:

Incidence graph:

  • the colors don't really match our slide template AND I believe the frequency bars should be less visually attractive (e.g. grey), while the incidence line should really stand out (blue).
  • from this graph it is impossible to guess which Y-axis (left vs right) applies to the incidence, and which applies to the proportion of cases. Usually, I would align the color of each axis with its graph color - i.e. grey for population size on the left and blue for incidence on the right.
  • I would personally put incidence on the left instead of the right, as that is the most important information and we are typically used to looking at the left axis first. I would also clean up the graph labels a little so they don't contain .0 (left, right and bottom)...
  • make sure we always refer to them as Predictor Insights Graphs!

Attached is an example of how the plot could look; a rough styling sketch also follows below.

(attachment: PIG_age_v2)
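
A hypothetical matplotlib sketch of the suggested styling (incidence on the left axis in blue, frequency bars on the right axis in grey, each axis coloured to match its series); this is an illustration, not the package's actual plotting code:

    import matplotlib.pyplot as plt

    def plot_pig(bin_labels, proportions, incidences):
        fig, ax_incidence = plt.subplots(figsize=(10, 5))
        ax_frequency = ax_incidence.twinx()

        # grey bars for proportion of cases (right axis), blue line for incidence (left axis)
        ax_frequency.bar(bin_labels, proportions, color="lightgrey", label="proportion of cases")
        ax_incidence.plot(bin_labels, incidences, color="tab:blue", marker="o", label="incidence")

        # colour each axis like the series it belongs to
        ax_incidence.set_ylabel("incidence", color="tab:blue")
        ax_frequency.set_ylabel("proportion of cases", color="grey")
        ax_incidence.tick_params(axis="y", colors="tab:blue")
        ax_frequency.tick_params(axis="y", colors="grey")

        # draw the incidence line on top of the bars
        ax_incidence.set_zorder(ax_frequency.get_zorder() + 1)
        ax_incidence.patch.set_visible(False)

        ax_incidence.set_title("Predictor Insights Graph")
        plt.show()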

Preprocessing - target_encoder

  • Discuss the methodology - how do we do incidence replacement for a continuous target? Currently we calculate the average per group and the size of the group and do the replacement; this should work for a continuous target as well (see the sketch after this list).
  • Document and write/modify unit tests
  • If time permits, this is a good opportunity to implement more sophisticated encoding - see #24 and #61
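
A minimal sketch of mean target encoding, which applies equally to binary (incidence) and continuous targets; the function and column names are illustrative only:

    import pandas as pd

    def fit_target_encoding(data: pd.DataFrame, column: str, target: str) -> pd.Series:
        # average target value per category, used later as the replacement value
        return data.groupby(column)[target].mean()

    def apply_target_encoding(data: pd.DataFrame, column: str, mapping: pd.Series) -> pd.Series:
        return data[column].map(mapping)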

Unexpected bin limit rounding

While reviewing a pull request, I tested the below method with some extra things.
I noticed this:

@pytest.mark.parametrize(
    "n_bins, auto_adapt_bins, data, expected",
    [
        (2,
         False,
         # Variable contains floats, nans and inf (e.g. "available resources" variable): WORKS
         pd.DataFrame({"variable": [7.5, 8.2, 14.9, np.inf]}),
         [(7.0, 12.0), (12.0, np.inf)])
    ],
    ids=["unexpected bin limit"])
def test_fit_column(self, n_bins, auto_adapt_bins, data, expected):
    discretizer = KBinsDiscretizer(n_bins=n_bins,
                                   auto_adapt_bins=auto_adapt_bins)

    actual = discretizer._fit_column(data, column_name="variable")

    assert actual == expected

Result:

[(8.0, 12.0), (12.0, inf)] != [(7.0, 12.0), (12.0, inf)]

Expected :[(7.0, 12.0), (12.0, inf)]
Actual :[(8.0, 12.0), (12.0, inf)]

The column minimum is 7.5, which is rounded up (!) in KBinsDiscretizer._compute_bins_from_edges() to format the lower (!) bin limit, although it should have been rounded down: column value 7.5 falls in a bin that starts at 7.0, not 8.0.

I thought of just changing

        for a, b in zip(bin_edges, bin_edges[1:]):
            fmt_a = round(a, precision)
            fmt_b = round(b, precision)

to just

        for a, b in zip(bin_edges, bin_edges[1:]):
            fmt_a = floor(a, precision)  # <====
            fmt_b = round(b, precision)

but that would break the handling of the precision, which apparently can even be negative:

compute the minimal precision of the bin_edges. This can be a negative number, which then rounds numbers to the nearest 10, 100, ...

So it is a minor bug, but floor() is not a good enough solution - I won't fix this right away with a commit in the pull request (it is not really related anyway), so I will log it here and continue with that pull request.
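
A hedged sketch of rounding a lower bin edge down at a given decimal precision, including negative precisions (rounding towards the nearest 10, 100, ...); this illustrates the idea, it is not a committed fix:

    import math

    def floor_to_precision(value: float, precision: int) -> float:
        factor = 10 ** precision
        return math.floor(value * factor) / factor

    floor_to_precision(7.5, 0)    # 7.0  -> the lower bin limit is no longer rounded up
    floor_to_precision(7.5, -1)   # 0.0  -> negative precision floors to the nearest 10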

Evaluation - pigs_tables

  • Add parameter model_type which can be [classification, regression]
  • Based on the parameter, change names (we no longer talk about incidence rate)
  • Document and write/modify unit tests

regroup_name does not work

When giving a custom name to the regrouped category (currently set to "Other"), the change does not take effect and "Other" keeps appearing.

It looks like the parameter is set in the class constructor:

def __init__(self, regroup: bool=True, regroup_name: str="Other",

but the value in the body is actually hard-coded:

return data.apply(lambda x: str(x) if x in categories else "Other")
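
A minimal sketch of the likely fix, assuming the value from __init__ is stored on the instance as self.regroup_name:

    return data.apply(lambda x: str(x) if x in categories else self.regroup_name)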

Custom number of bins per variable

Currently, the number of bins is always the same across all variables.
It would be useful to make this configurable per variable, e.g. 10 bins for variable A and 6 bins for variable B (a possible interface is sketched below).

However, this requires more complex changes - let's first assess the complexity.
Additionally, we need to assess how this fits into our methodology.
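
A hypothetical sketch of the requested interface: n_bins could accept either a single int (current behaviour) or a per-variable dict; the names are illustrative only:

    from typing import Union

    def resolve_n_bins(n_bins: Union[int, dict], column_name: str, default: int = 10) -> int:
        # per-variable override when a dict is passed, otherwise one global value
        if isinstance(n_bins, dict):
            return n_bins.get(column_name, default)
        return n_bins

    n_bins = {"variable_A": 10, "variable_B": 6}
    resolve_n_bins(n_bins, "variable_B")  # 6
    resolve_n_bins(n_bins, "variable_C")  # 10 (falls back to the default)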

Analyze and improve speed and memory consumption

We had a use case at Argenta where we worked with a table of about 300 columns and ~2 million rows.
There, the preprocessing took a lot of time and, especially, a lot of memory.

What we'd need is to find a dataset of similar size that is close to reality (a mixture of categorical, flag and continuous variables, with missing values) and measure how much memory Cobra uses and how slow it is (a rough measurement sketch is given below).

The issue occurs in preprocessor.fit() and preprocessor.transform() - but these do a lot behind the scenes, so I am trying to pinpoint the cause (is it the binning? The incidence replacement? Maybe the data types of the intermediate tables are not efficient and take too much memory ...).

Once we find the cause, we can figure out how to fix it.
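
A minimal measurement sketch using only the standard library; the fit/transform call signatures here are assumptions, not necessarily cobra's actual API:

    import time
    import tracemalloc

    def measure_fit_transform(preprocessor, basetable, continuous_vars, discrete_vars):
        """Report wall-clock time and peak memory of one fit + transform run."""
        tracemalloc.start()
        start = time.perf_counter()

        # hypothetical call signatures; adapt to the actual PreProcessor API
        preprocessor.fit(basetable, continuous_vars, discrete_vars, target_column_name="target")
        transformed = preprocessor.transform(basetable, continuous_vars, discrete_vars)

        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"elapsed: {elapsed:.1f} s, peak memory: {peak / 1e9:.2f} GB")
        return transformed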

Index error with categorical variables

(issue submitted by user)

During preprocessing, when a flag variable is constant (either in the full dataset or just in the train set), the following error occurs.
It needs to be investigated, reproduced and fixed.
It occurs when we call preprocessor.fit().
(error screenshot attached)

Preprocessing - TargetEncoder is dangerous

Congrats with the release of this package! I thought I'd contribute back a little with this issue.

The TargetEncoder strikes me as a dangerous transformation. While the docstring does openly say that it suffers from leakage, it gives the impression that this isn't a problem if you apply regularisation or cross-validation. I find that somewhat misleading and think the encoder should probably be avoided in general.

To illustrate the danger: imagine you have a dataset with only one data point x with corresponding label y, then it's clear that the TargetEncoder will encode x as the exact label y, even when applying regularisation! The issue is that each example x's target value y is used to encode x, and that remains true even as you increase the number of examples.
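
A tiny demonstration of that argument with plain mean target encoding (illustrative pandas code, not cobra's TargetEncoder itself):

    import pandas as pd

    # one example per category: the encoding reproduces each label exactly
    df = pd.DataFrame({"category": ["a", "b"], "target": [0, 1]})
    encoding = df.groupby("category")["target"].mean()
    print(df["category"].map(encoding))  # a -> 0.0, b -> 1.0: the labels leak through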

Let's say you want to deal with that issue by implementing a "LeaveOneOutTargetEncoder", which replaces each example's categorical value with the average target of the other examples that share the same categorical value (see e.g. [1]). That sounds a bit better because none of the examples are allowed to use their own target value to encode their features. But even this encoder suffers from leakage! To see this, imagine that the encoder encodes a category as the leave-one-out sum (instead of the average). The model could then learn the per-category target sums, and simply subtract an example x's leave-one-out sum from the per-category sum to predict the exact label y for the example x.

In general, any transformation that "inserts y into X" should be treated with a lot of scrutiny.

[1] https://contrib.scikit-learn.org/category_encoders/leaveoneout.html

Variable importance is empty

Sometimes the variable importance plot has empty fields - this occurred on the Titanic dataset.
We need to reproduce and locate this error.

Binning not constant

The binning in the current version differs from the original.

When we calculate the AUC in univariate selection, we get a list of variables and their corresponding AUC.
Now, look at this (left is the output from your code, right is from my code):

(comparison screenshot attached)

You can see that capital-loss has an AUC of 0.53, hence it is not chosen for forward selection, whilst in your version it has an AUC of 0.534, hence it is chosen.
Also, the order of the variables is slightly different (capital-gain and workclass).

That results in a slightly different model after forward selection. Although the difference is small, I can't find the reason for it.

I set the same seed, so the partition is the same in both versions.

I see that when I run the two programs, the bins are different. Below is a count of scont_1 grouped by its bins in the train partition. Even though both versions are somewhat similar, I believe this is the reason why the AUCs show these small differences.
(screenshot attached)

Hence the question: in the eqfreq function, is there any way to get different bins from the same data? Maybe some randomness or rounding?

Test if class is fitted on another attribute in categorical_data_processor.py

Currently, we test whether an instance has been fitted based on two attributes:

if self.regroup and len(self._cleaned_categories_by_column) == 0:
    msg = ("{} instance is not fitted yet. Call 'fit' with "
           "appropriate arguments before using this method.")
    raise NotFittedError(msg.format(self.__class__.__name__))

However, this will break if there is, for example, just one categorical variable and it has been skipped: fit() will run but will not populate self._cleaned_categories_by_column, and thus the error will be raised anyway.

Define the fitted state based on another attribute instead (for example like in the PreProcessor class, where we have a separate flag):

self._is_fitted = True # set fitted boolean to True
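
A minimal sketch of that approach; the method signatures here are simplified assumptions, not the actual ones:

    def fit(self, data, column_names):
        # ... existing fitting logic ...
        self._is_fitted = True  # set regardless of how many columns were kept

    def transform(self, data, column_names):
        if not self._is_fitted:
            msg = ("{} instance is not fitted yet. Call 'fit' with "
                   "appropriate arguments before using this method.")
            raise NotFittedError(msg.format(self.__class__.__name__))
        # ... existing transform logic ...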

Evaluation - evaluator

  • I'd say split into classification_evaluator and regression_evaluator classes.
  • Define metrics for regression case (inspiration here). As for plots, maybe QQ or residuals here. Some extra inspiration also here.
  • Document and write/modify unit tests

Improve import

Currently we must write cobra.cobra to import it; fix this so that a simple import cobra as c is enough.
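
A hedged sketch of the usual fix: re-export the public API from cobra/__init__.py (the class name below is hypothetical; list whatever the actual public classes are):

    # cobra/__init__.py
    from cobra.cobra import COBRA  # hypothetical class name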

Keep size of categories in JSON

(suggestion from a user)
The output JSON file could optionally store the size of each category - this allows data scientists to debug and verify the preprocessing.
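
A hypothetical example of what such an entry could look like, written as a Python dict (the keys are illustrative, not the actual export schema):

    {
        "variable": "education",
        "regrouped_categories": {"bachelor": 1053, "master": 402, "Other": 87}
    }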

Value is trying to set on a copy

During preprocessing, pandas raises the following warning:

A value is trying to be set on a copy of a slice from a DataFrame

Locate which function is causing the trouble and rewrite it so the warning no longer fires.
This is probably happening in multiple places!
(one instance was found, see the attached screenshot image001)
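
For reference, a minimal illustration of the pattern that typically triggers the warning and the usual fix (the DataFrame and column names are made up):

    import pandas as pd

    data = pd.DataFrame({"age": [25, 35, 45]})

    subset = data[data["age"] > 30]
    subset["age_bin"] = "30+"                # may raise SettingWithCopyWarning (chained assignment)

    subset = data[data["age"] > 30].copy()   # take an explicit copy instead
    subset["age_bin"] = "30+"                # no warning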

Speed improvement

  1. Find which parts of the code are the slowest
  2. Try to vectorize most functions
  3. Use more numpy
  4. We keep lots of DataFrames in memory, and the main DataFrame holds both the binned and the incidence-replaced columns - inefficient. Drop as much as possible

Train/selection/validation split is slow

In the current design, the train/sel/val split is slow because it creates many sub-dataframes, which are then concatenated.
In cases with larger tables (millions of rows, hundreds of columns), this takes too long and uses too much memory.

We could instead create a single column with three categories - train / selection / validation - and append it to the dataframe, without creating any new dataframe or concatenation (see the sketch below).

Before the solution is merged, it would be useful to measure the impact (for example with %timeit).

One challenge might be the implementation of the stratified split in this simplified version.
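
A minimal sketch of the idea, assuming a simple non-stratified random split; the function name and defaults are illustrative:

    import numpy as np
    import pandas as pd

    def add_split_column(data: pd.DataFrame, train_prop: float = 0.7,
                         selection_prop: float = 0.2,
                         random_state: int = 42) -> pd.DataFrame:
        rng = np.random.default_rng(random_state)
        u = rng.random(len(data))
        # label rows in place instead of building and concatenating sub-dataframes
        data["split"] = np.select(
            [u < train_prop, u < train_prop + selection_prop],
            ["train", "selection"],
            default="validation")
        return data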

Refactoring

Refactor eqfreq and incidenceReplacement - the code is messy.

AUC sorting inconsistency

We found a small inconsistency - in plot_univariate_predictor_quality() we sort by 'AUC train':

df = (df_auc[df_auc["preselection"]]
      .sort_values(by='AUC train', ascending=False))

while in compute_univariate_preselection() we sort by 'AUC selection':

return (df_auc.sort_values(by='AUC selection', ascending=False)
        .reset_index(drop=True))

It does not have any effect on the modelling, but the plot and preselected_predictors then list the variables in a different order, which is confusing.

Both should be sorted by 'AUC selection'.
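
The suggested change in plot_univariate_predictor_quality() would then simply mirror the other function's sort column:

    df = (df_auc[df_auc["preselection"]]
          .sort_values(by='AUC selection', ascending=False))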

Improve .csv import with C engine

Currently, the Python engine is used to import the .csv file. That is nice because it can infer datatypes, but it is slower. I believe we can gain some speed by using the C engine, but this has to be tested.
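
A minimal sketch of switching the parser; note that pandas can only use the C engine with an explicit single-character separator (the file name is made up):

    import pandas as pd

    df = pd.read_csv("basetable.csv", sep=",", engine="c")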

Finish unit testing

Right now, there are some unit tests in place. However, code coverage is still suboptimal, as several crucial parts of the code are not tested. This should be mitigated as soon as possible.

Variable importance - optional metrics

We compute variable importance by calculating Pearson's correlation between the model scores and the target-encoded variables:

importance_by_variable = {
    utils.clean_predictor_name(predictor): stats.pearsonr(
        data[predictor],
        y_pred
    )[0]
    for predictor in self.predictors
}

It would be nice to allow choosing a different correlation measure (like Kendall): Pearson assumes normality, which does not always hold for the variables considered. A sketch of a configurable metric follows below the link.

https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall
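
A minimal sketch of making the metric configurable via scipy.stats; the parameter name is hypothetical and the predictor-name cleaning from the snippet above is omitted for brevity:

    from scipy import stats

    correlation_functions = {
        "pearson": stats.pearsonr,
        "spearman": stats.spearmanr,
        "kendall": stats.kendalltau,
    }

    def compute_importance(data, predictors, y_pred, metric: str = "pearson") -> dict:
        corr = correlation_functions[metric]
        # each scipy function returns (statistic, p-value); keep the statistic
        return {predictor: corr(data[predictor], y_pred)[0]
                for predictor in predictors}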

Preprocessing should have option verbose

In cases where preprocessing takes longer, it would be useful to see what is happening inside.
We can implement a verbose boolean parameter which, when set to True, prints what is happening ("processing variable X ...").

Optionally, verbose can be set up in the same way as in sklearn - it accepts an integer, and the bigger the integer, the more details are printed. A rough sketch follows.
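
A minimal sketch following the sklearn-style integer convention (0 = silent); the class and method names here are simplified assumptions:

    class PreProcessor:

        def __init__(self, verbose: int = 0):
            self.verbose = verbose

        def fit(self, data, column_names):
            for column in column_names:
                if self.verbose > 0:
                    print(f"processing variable {column} ...")
                # ... existing per-column fitting logic ...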

Number of deciles in evaluator plots

Cumulative lift and cumulative response plots currently always show 10 bins (deciles); change this to be optional:

  • plot_cumulative_gains() should get an optional number of bins

  • plot_lift_curve() should get an optional number of bins

  • plot_cumulative_response_curve() should get an optional number of bins

Incidence replacement - Missing value imputation of previously unknown categories

In some cases, a rare category of a categorical variable may not be present in the train set but only in the selection/validation set. In such cases, the incidence-replaced variable will contain missing values. To score/evaluate the model, it is important to impute these missing values, either with:

  • the average incidence
  • the incidence of the category with the lowest (resp. highest) incidence rate

At the moment this is not implemented in the TargetEncoder; it should be fixed! A sketch of the first option is given below.
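
A minimal sketch of the first option (imputing unseen categories with the overall average incidence); the names are illustrative only:

    import pandas as pd

    def apply_encoding(column: pd.Series, mapping: pd.Series, global_incidence: float) -> pd.Series:
        # categories unseen during fit map to NaN and receive the overall incidence
        return column.map(mapping).fillna(global_incidence)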

Preprocessing - categorical_data_preprocessor

Provide functionality for scoring

Provide users with functions that allow easy scoring with a trained model. Currently, we export the data prep and model pipeline as a dict, but in order to produce scores, users need to calculate them manually.

How this will be implemented is to be discussed (perhaps as a predict_proba method on the model class? see the sketch after this list).

  • Discuss implementation
  • Implement
  • update documentation
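
A purely hypothetical sketch of what such a method could look like; the attribute names (preprocessor, logit, predictors) are assumptions about the model class, not the actual implementation:

    import pandas as pd

    def predict_proba(self, data: pd.DataFrame) -> pd.Series:
        # apply the stored preprocessing, then score with the fitted logistic model
        preprocessed = self.preprocessor.transform(data)
        scores = self.logit.predict_proba(preprocessed[self.predictors])[:, 1]
        return pd.Series(scores, index=data.index)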

Forward selection crashes sometimes

Forward selection sometimes throws an error when there is no positive coefficient - it happens when we force two weak variables (like scont1 and scont2) and they get negative coefficients. It happens seemingly at random - find the cause and fix it!

Comparison of multiple models in evaluator plots & metrics

When the user builds multiple different models, it would be handy to evaluate them at the same time. This means that the plotting functions:

  • evaluator.plot_roc_curve()
  • evaluator.plot_confusion_matrix()
  • evaluator.plot_cumulative_gains()
  • evaluator.plot_lift_curve()
  • evaluator.plot_cumulative_response_curve()

would have the option to overlay results from multiple models.

Preprocessing - preprocessor

First implement changes in target_encoder and categorical_data_processor before starting this.

Parameter sample_1/sample_0

Merge the sample_1/sample_0 parameters into one parameter - we always take all 1's and then sample the 0's.
The parameter can simply be a ratio of how many 0's to keep for each 1 (see the sketch below).
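
A hypothetical sketch of the merged parameter; the function name and defaults are illustrative:

    import pandas as pd

    def sample_basetable(data: pd.DataFrame, target: str,
                         sampling_ratio: float = 1.0,
                         random_state: int = 42) -> pd.DataFrame:
        ones = data[data[target] == 1]
        zeros = data[data[target] == 0]
        # keep all 1's, sample `sampling_ratio` 0's per 1 (capped at what is available)
        n_zeros = min(int(len(ones) * sampling_ratio), len(zeros))
        return pd.concat([ones, zeros.sample(n=n_zeros, random_state=random_state)])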

Add option to use different algorithms

In some of our projects, it is necessary to have a library in which you can deviate from ordinary regression models with a binary target. Our current methodology should work fine in those cases, so it is only natural to add these options to COBRA.

As the model is now encapsulated in a class and heavily used in the ForwardFeatureSelection class, it makes sense to implement these models by means of an (abstract) factory design pattern to make them easy to use. A rough sketch follows.
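
A rough sketch of the factory idea using scikit-learn estimators as the concrete backends; all class, method and key names here are illustrative, not cobra's actual API:

    from abc import ABC, abstractmethod

    from sklearn.linear_model import LinearRegression, LogisticRegression

    class Model(ABC):
        """Interface that ForwardFeatureSelection would depend on."""

        @abstractmethod
        def fit(self, X, y) -> None: ...

        @abstractmethod
        def score_model(self, X): ...

    class LogisticRegressionModel(Model):

        def __init__(self):
            self._estimator = LogisticRegression()

        def fit(self, X, y) -> None:
            self._estimator.fit(X, y)

        def score_model(self, X):
            return self._estimator.predict_proba(X)[:, 1]

    class LinearRegressionModel(Model):

        def __init__(self):
            self._estimator = LinearRegression()

        def fit(self, X, y) -> None:
            self._estimator.fit(X, y)

        def score_model(self, X):
            return self._estimator.predict(X)

    def create_model(model_type: str) -> Model:
        # simple factory: map a model_type string to a concrete Model class
        registry = {"classification": LogisticRegressionModel,
                    "regression": LinearRegressionModel}
        return registry[model_type]()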
