winvector / pyvtreat

vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.

Home Page: https://winvector.github.io/pyvtreat/

License: BSD-3-Clause

Python 99.78% Shell 0.22%
data-science machine-learning pydata python

pyvtreat's Issues

recommended variables in `MultinomialOutcomeTreatment`

I noticed that if you use MultinomialOutcomeTreatment.fit_transform and then follow the recipe from the examples for selecting variables with

good_variables = plan.score_frame_.variable[plan.score_frame_.recommended].values

you'll end up with duplicated entries in good_variables (because you're doing separate tests for each output class, I think?)

Would you suggest just calling

good_variables = plan.score_frame_.variable[plan.score_frame_.recommended].unique()

instead? Or is there some other recommended way to do variable selection in the MultinomialOutcomeTreatment case?

Thanks!
~ Ben

Snippet:

import vtreat
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

X, y = pd.DataFrame(iris['data']), iris['target']

plan  = vtreat.BinomialOutcomeTreatment(outcome_target=True)
X_new = plan.fit_transform(X, y == 0)
plan.score_frame_

#    variable  orig_variable   treatment  y_aware  has_range  PearsonR  significance  vcount  default_threshold  recommended
# 0         0              0  clean_copy    False       True  0.079396  3.341524e-01     4.0               0.25        False
# 1         1              1  clean_copy    False       True -0.467703  1.595624e-09     4.0               0.25         True
# 2         2              2  clean_copy    False       True  0.201754  1.329302e-02     4.0               0.25         True
# 3         3              3  clean_copy    False       True  0.117899  1.507473e-01     4.0               0.25         True

plan  = vtreat.MultinomialOutcomeTreatment()
X_new = plan.fit_transform(X, y)
plan.score_frame_
#     variable  orig_variable   treatment  y_aware  has_range  PearsonR  significance  vcount  default_threshold  recommended  outcome_target
# 0          0              0  clean_copy    False       True -0.717416  5.288768e-25     4.0               0.25         True               0
# 1          1              1  clean_copy    False       True  0.603348  3.054699e-16     4.0               0.25         True               0
# 2          2              2  clean_copy    False       True -0.922765  3.623379e-63     4.0               0.25         True               0
# 3          3              3  clean_copy    False       True -0.887344  1.288504e-51     4.0               0.25         True               0
# 4          0              0  clean_copy    False       True  0.079396  3.341524e-01     4.0               0.25        False               1
# 5          1              1  clean_copy    False       True -0.467703  1.595624e-09     4.0               0.25         True               1
# 6          2              2  clean_copy    False       True  0.201754  1.329302e-02     4.0               0.25         True               1
# 7          3              3  clean_copy    False       True  0.117899  1.507473e-01     4.0               0.25         True               1
# 8          0              0  clean_copy    False       True  0.638020  1.619533e-18     4.0               0.25         True               2
# 9          1              1  clean_copy    False       True -0.135645  9.791170e-02     4.0               0.25         True               2
# 10         2              2  clean_copy    False       True  0.721011  2.381987e-25     4.0               0.25         True               2
# 11         3              3  clean_copy    False       True  0.769445  1.297773e-30     4.0               0.25         True               2
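
For reference, a minimal sketch (not an official vtreat recipe) of two ways to deduplicate, given that the multinomial score frame contains one row per variable per outcome_target:

sf = plan.score_frame_

# Union: keep variables recommended for at least one outcome class.
good_variables = sf.variable[sf.recommended].unique()

# Stricter: keep only variables recommended for every outcome class.
per_class = sf.groupby('variable')['recommended'].all()
strict_variables = per_class[per_class].index.values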

Research future warning from Pandas

Write a test that isolates the FutureWarning coming from Pandas, research how to future-proof the code, and then remove the suppression of these warnings currently done at package init.
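
A minimal sketch of how such a test could look, using the standard library's warning controls; the pandas call here is a hypothetical stand-in, not the actual source of the warning:

import warnings
import pandas as pd

def test_no_future_warning_from_treatment():
    # Hypothetical stand-in: the real test should exercise the vtreat code
    # path that currently emits the Pandas FutureWarning.
    d = pd.DataFrame({'x': [1.0, 2.0, None], 'y': [0, 1, 0]})
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=FutureWarning)  # escalate to a failure
        d.groupby('y')['x'].mean()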

Use with PySpark

With the rise of Spark usage in the ML community, it would be interesting for vtreat to be able to work with Spark RDDs.

error on DataFrame

When I tried to fit_transform on a DataFrame I got this:


  File "<ipython-input-34-4749a25525c1>", line 2, in <module>
    train_labeled_vtreat.append(plan.fit_transform(pd.DataFrame(train_labeled[column]).all(), target))

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 297, in fit_transform
    raise Exception("X should be a Pandas DataFrame")

Exception: X should be a Pandas DataFrame

or this:

cross_frame = plan.fit_transform(train_labeled, target)
Traceback (most recent call last):

  File "<ipython-input-35-e8d2bce0ab6b>", line 1, in <module>
    cross_frame = plan.fit_transform(train_labeled, target)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 315, in fit_transform
    params=self.params_,

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 460, in fit_multinomial_outcome_treatment
    params=params,

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 176, in fit_binomial_impact_code
    sf = vtreat.util.grouped_by_x_statistics(x, y)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\util.py", line 31, in grouped_by_x_statistics
    if n != len(y):

TypeError: len() of unsized object
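
For reference, both tracebacks point at the call contract: fit_transform expects X to be a pandas DataFrame and y to be a sized sequence (list, numpy array, or pandas Series) with one entry per row of X. A minimal sketch using the reporter's names (train_labeled, target), assuming they hold the raw features and outcome:

import pandas as pd
import vtreat

plan = vtreat.MultinomialOutcomeTreatment()

X = pd.DataFrame(train_labeled)  # must be a DataFrame, not a Series or scalar
y = pd.Series(target)            # must be sized: len(y) == X.shape[0]
assert len(y) == X.shape[0]

cross_frame = plan.fit_transform(X, y)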

Reproducibility

Hi, after applying fit_transform and finding the set of useful features, I would like to be able to apply the same treatment to another dataset composed of the original features but different observations.
Is there a specific way to achieve this reproducibility? I suppose that reapplying fit_transform can lead to a different set of features. I tried calling fit and transform separately, but maybe there is a purpose-built function for this (and the UserWarning "possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead)" tells me that it is probably not the correct approach).

Thanks in advance and congratulations on your excellent work.
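
For reference, a minimal sketch of the usual pattern (frame names are hypothetical): fit once on the training data with fit_transform, then call transform on new observations. The fitted plan reuses the same derived variables, and the UserWarning only concerns calling transform on the very rows used for fitting.

import vtreat

plan = vtreat.NumericOutcomeTreatment(outcome_name='y')
train_treated = plan.fit_transform(train_df, train_df['y'])  # fit once on training data

new_treated = plan.transform(new_df)  # same derived features on new observations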

vtreat and sklearn pipeline

First of all, a really interesting project that could save a lot of repetitive work and provide a good baseline.
I tried to find an example in the docs that uses Pipeline from scikit-learn, but didn't, so this is my quick and dirty attempt based on yours:

import pandas as pd
import numpy as np
import numpy.random
import vtreat
import vtreat.util
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

numpy.random.seed(2019)

def make_data(nrows):
    d = pd.DataFrame({'x': 5*numpy.random.normal(size=nrows)})
    d['y'] = numpy.sin(d['x']) + 0.1*numpy.random.normal(size=nrows)
    d.loc[numpy.arange(3, 10), 'x'] = numpy.nan                           # introduce a nan level
    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]
    d['x2'] = np.random.normal(size=nrows)
    d.loc[d['xc']=='level_-1.0', 'xc'] = numpy.nan  # introduce a nan level
    d['yc'] = d['y']>0.5
    return d

df = make_data(500)

df = df.drop(columns=['y'])

transform = vtreat.BinomialOutcomeTreatment(outcome_target=True)

clf = Pipeline(steps=[
    ('preprocessor', transform),
    ('classifier', LogisticRegression())]
)

X, y = df, df.pop('yc')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)

print("model score: %.3f" % clf.score(X_test, y_test))

In general, it seems to work, but:

  • it'd be great to have __repr__, get_params, etc., for a nice representation inside a Pipeline
  • similarly, a get_feature_names method, so that clf['preprocessor'].get_feature_names() works
  • I don't use the cols_to_copy parameter and instead drop y manually, to avoid leaking y
  • I'm not sure whether vtreat.cross_plan... could be replaced by validation schemes from scikit-learn like GridSearchCV

categorical variables

If a categorical column happens to contain only numeric values (like 5, 7, 8, 1), what is the way to tell vtreat.NumericOutcomeTreatment to treat it as categorical?

Or is the simplest way to convert the numeric values in the categorical column to some kind of strings?
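
For reference, a minimal sketch of the string-conversion workaround mentioned above (frame and column names are hypothetical); vtreat treats string-valued columns as categorical:

import pandas as pd

d = pd.DataFrame({'code': [5, 7, 8, 1, 5, 7]})  # numeric values, categorical meaning
d['code'] = d['code'].astype(str)               # now seen as categorical levels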

Unspecified upper version limits for dependencies

Issue
The setup.py file does not specify upper version limits for any dependencies which can lead to major version upgrades with breaking changes.
Impact
This can currently be seen via pandas, which recently got a major version bump. Upon release, vtreat==1.2.8 did not pull in pandas>=2.0.0; however, it currently does.
Potential Fix
Specify approximate version numbers using ~= or upper limits using <

Note: Other than an unspecified upper limit in vtreat, pandas also indirectly gets a version bump through data_algebra which now requires pandas>=2.0.0
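
A minimal sketch of what such pinning could look like in setup.py; the package list and version numbers are illustrative placeholders, not recommendations:

from setuptools import setup

setup(
    name="vtreat",
    install_requires=[
        "numpy~=1.24",         # approximate pin: >=1.24, <2.0
        "pandas>=1.5,<2.0",    # explicit upper limit
        "data_algebra~=1.6",   # illustrative placeholder pin
    ],
)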

documentation for score_frame_ ?

First, thank you so much for vtreat, it has definitely changed how I approach pre-processing data.
I am trying to understand the different columns of the score_frame_ attribute for a BinomialOutcomeTreatment. I've looked through the Python examples, the Python API code, and the original paper, but I can't seem to find any information on 'has_range' and 'vcount'. What are the definitions of those columns, and/or where can I find more documentation on score_frame_?

Bug: import Error due to statsmodels' Appender

Hello,

I am experiencing the following error when importing pyvtreat

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-...> in <module>
      2 import pandas as pd
      3 import numpy as np
----> 4 import vtreat
      5 import vtreat.util
      6 import wvpy.util

~/anaconda3/envs/python3/lib/python3.6/site-packages/vtreat/__init__.py in <module>
      5 import numpy
      6
----> 7 from vtreat.vtreat_api import *
      8
      9 __docformat__ = "restructuredtext"

~/anaconda3/envs/python3/lib/python3.6/site-packages/vtreat/vtreat_api.py in <module>
      4 import numpy
      5
----> 6 import vtreat.vtreat_impl as vtreat_impl
      7 import vtreat.util
      8 import vtreat.cross_plan

~/anaconda3/envs/python3/lib/python3.6/site-packages/vtreat/vtreat_impl.py in <module>
     13 import pandas
     14
---> 15 import vtreat.util
     16 import vtreat.transform
     17

~/anaconda3/envs/python3/lib/python3.6/site-packages/vtreat/util.py in <module>
     16 with warnings.catch_warnings():
     17     warnings.filterwarnings("ignore", category=DeprecationWarning)
---> 18     import statsmodels.api
     19     import statsmodels.formula.api
     20

~/anaconda3/envs/python3/lib/python3.6/site-packages/statsmodels/api.py in <module>
     18 from . import robust
     19 from .robust.robust_linear_model import RLM
---> 20 from .discrete.discrete_model import (Poisson, Logit, Probit,
     21                                       MNLogit, NegativeBinomial,
     22                                       GeneralizedPoisson,

~/anaconda3/envs/python3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py in <module>
     26 from scipy.stats import nbinom
     27
---> 28 from statsmodels.compat.pandas import Appender
     29
     30 import statsmodels.tools.tools as tools

ImportError: cannot import name 'Appender'

I looked at statsmodels' issues, and it seems like multiple packages are having trouble importing Appender, but statsmodels has closed all of those issues without changing their package. I've also tried using different versions of Pandas (0.23.0 - 1.0.1) and that didn't do anything. The statsmodels issues seem to suggest using

from pandas.util._decorators import Appender

I really like vtreat, so any help/suggestions on how to import it would be appreciated.
Thanks!

Question - How to encode high-cardinality variables

I came across this package while I was googling for a way to encode my high-cardinality variables.

My categorical variable has 15 levels and my dataset has only 900 rows.

I would like to encode them in a manner that will allow us to interpret the result later (unlike hashing etc.).

So, is there any tutorial or method for encoding a high-cardinality variable without losing interpretability?
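
For reference, vtreat's impact coding targets exactly this situation: each level is replaced by a cross-validated estimate of its effect on the outcome, which stays interpretable as "how much this level shifts the prediction". A minimal sketch on made-up data (the names and the synthetic relationship are illustrative):

import numpy as np
import pandas as pd
import vtreat

rng = np.random.default_rng(2019)
levels = rng.integers(0, 15, size=900)         # 15-level categorical, 900 rows
d = pd.DataFrame({'xc': ['level_' + str(i) for i in levels]})
d['y'] = (levels % 3) + rng.normal(size=900)   # outcome depends on the level

plan = vtreat.NumericOutcomeTreatment()
d_treated = plan.fit_transform(d[['xc']], d['y'])
print(plan.score_frame_)  # look for the 'xc_impact_code' row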

Indicator Code is False for has_range/example outdated?

Even running the example code, my prepared data frame doesn't include indicator_code variables. When I check transform.score_frame_, I see that the indicator_code variables have has_range set to False, which might be why they weren't created. Is this intended? The example data obviously does have distinct levels and does vary. And would it perhaps help to update the example to the latest behavior? Thank you!

ipdb> !d.head()
     x         y          xc        x2  x3
0  0.0 -0.111698  level_-0.0 -0.098463   1
1  0.1  0.270348   level_0.5  0.370653   1
2  0.2 -0.057853  level_-0.0  0.111180   1
3  0.3  0.412467   level_0.5  1.305242   1
4  0.4  0.469221   level_0.5  0.490332   1

ipdb> d_prepared.columns
Index(['y', 'xc_is_bad', 'x', 'x2', 'xc_prevalence_code'], dtype='object')

ipdb> transform.score_frame_
              variable orig_variable          treatment  y_aware  has_range  PearsonR  significance  recommended  vcount
0            xc_is_bad            xc  missing_indicator    False       True       NaN           NaN         True     1.0
1                    x             x         clean_copy    False       True       NaN           NaN         True     2.0
2                   x2            x2         clean_copy    False       True       NaN           NaN         True     2.0
3   xc_prevalence_code            xc    prevalence_code    False       True       NaN           NaN         True     1.0
4    xc_lev_level_-0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
5     xc_lev_level_1.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
6     xc_lev_level_0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
7    xc_lev_level_-0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
8          xc_lev__NA_            xc     indicator_code    False      False       NaN           NaN        False     7.0
9     xc_lev_level_1.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
10    xc_lev_level_0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0

Pass default value to clean_copy

Hi,

I'm using the unsupervised treatment module and noticed that clean_copy fills in any missing/bad values with the column mean. I was wondering: could I specify a default value per column, through something like a dictionary?

The reason is I might have a variable that has a long tail that few people ever reach. This might be something like money spent after reaching level 100 in a game or something. Now most people won't reach level 100, so their value will be missing/NA. However, of those that do, most people will spend 0, and a few might spend a large amount, maybe 1000.

By default, vtreat will fill in the clean copy of revenue (which is a hugely influential indicator) with something like 150, which ruins any signal from this long-tail distribution. Could I pass a dictionary with 0 as the default value in this case? Thank you!

This could be something like (cols_fill_values is a proposed parameter, not an existing one):

transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=[],          # columns to "carry along" but not treat as input variables
    cols_fill_values={        # proposed: per-column default fill values
        'col1': 0,
        'col2': 150.0,        # e.g. a precomputed mean for this column
        'col3': 1
    }
)
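
In the meantime, a workaround sketch (column names and values are hypothetical): pre-fill the affected columns with pandas before handing the frame to vtreat, so clean_copy never sees missing values there:

import numpy as np
import pandas as pd
import vtreat

d = pd.DataFrame({'revenue': [0.0, np.nan, 1000.0, np.nan, 0.0]})
d_prefilled = d.fillna(value={'revenue': 0.0})  # per-column default instead of the mean

transform = vtreat.UnsupervisedTreatment()
d_treated = transform.fit_transform(d_prefilled)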

Column name update/seeding tutorials

Hi. I didn't want to fork the repo for this, but under the Python classification example in the exploratory section, the notebook says:

'Find the mean value of yc'

I think 'yc' is a nominal column, so finding the mean wouldn't be possible. With that in mind, here are two friendly suggestions:

  1. Add something like numpy.random.seed(42) or another seed value at the top of the examples for reproducibility by those following the tutorial.
  2. Update the mean value sections. I could be wrong and may have misread the document, but I went through another of the tutorials and some of the stuff copied over could have been mislabeled.

Other than that, the package looks interesting so far.

Thanks!
