winvector / pyvtreat

vtreat is a data frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner. Distributed under a BSD-3-Clause license.

Home Page: https://winvector.github.io/pyvtreat/

License: Other

pyvtreat's Introduction

This is the Python version of the vtreat data preparation system (also available as an R package).

vtreat is a DataFrame processor/conditioner that prepares real-world data for supervised machine learning or predictive modeling in a statistically sound manner.

Installing

Install vtreat with either of:

pip install vtreat
pip install https://github.com/WinVector/pyvtreat/raw/master/pkg/dist/vtreat-0.4.6.tar.gz

Video Introduction

Our PyData LA 2019 talk on vtreat is a good video introduction to what problems vtreat can be used to solve. The slides can be found here.

Details

vtreat takes an input DataFrame that has a specified column called "the outcome variable" (or "y") that is the quantity to be predicted (and must not have missing values). Other input columns are possible explanatory variables (typically numeric or categorical/string-valued, these columns may have missing values) that the user later wants to use to predict "y". In practice such an input DataFrame may not be immediately suitable for machine learning procedures that often expect only numeric explanatory variables, and may not tolerate missing values.

To solve this, vtreat builds a transformed DataFrame where all explanatory variable columns have been transformed into a number of numeric explanatory variable columns, without missing values. The vtreat implementation produces derived numeric columns that capture most of the information relating the explanatory columns to the specified "y" or dependent/outcome column through a number of numeric transforms (indicator variables, impact codes, prevalence codes, and more). This transformed DataFrame is suitable for a wide range of supervised learning methods, from linear regression through gradient boosted machines.
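
As a rough illustration of the kinds of derived columns named above, the following sketch hand-computes indicator columns, a prevalence code, and a simple unsmoothed impact code for one categorical column using plain pandas. This is not vtreat's implementation: the toy data and the helper `derive_columns` are hypothetical, and vtreat's own codes additionally involve smoothing and cross-validation.

```python
import pandas as pd

def derive_columns(d, var, y):
    """Hand-rolled, illustrative versions of three derived column types.

    Hypothetical helper, not part of vtreat.
    """
    out = pd.DataFrame(index=d.index)
    x = d[var].fillna("_NA_")  # treat missing as its own level
    # indicator (dummy) columns: one 0/1 column per observed level
    for level in sorted(x.unique()):
        out[f"{var}_lev_{level}"] = (x == level).astype(float)
    # prevalence code: fraction of training rows sharing the row's level
    out[f"{var}_prevalence_code"] = x.map(x.value_counts(normalize=True))
    # naive impact code: per-level mean of y minus the grand mean of y
    y = pd.Series(y, index=d.index, dtype=float)
    out[f"{var}_impact_code"] = x.map(y.groupby(x).mean()) - y.mean()
    return out

d = pd.DataFrame({"xc": ["a", "a", "b", None, "b", "a"]})
y = [1.0, 3.0, 10.0, 5.0, 12.0, 2.0]
derived = derive_columns(d, "xc", y)
print(derived)
```

Every output column is numeric and free of missing values, which is the property the downstream modeling step relies on.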

The idea is: you can take a DataFrame of messy real world data and easily, faithfully, reliably, and repeatably prepare it for machine learning using vtreat's documented methods. Incorporating vtreat into your machine learning workflow lets you quickly work with very diverse structured data.

To get started with vtreat please check out our documentation:

Getting started using vtreat for classification.
Getting started using vtreat for regression.
Getting started using vtreat for multi-category classification.
Getting started using vtreat for unsupervised tasks.
The vtreat Score Frame (a table mapping new derived variables to original columns).
The original vtreat paper: this note describes the methodology and theory. (The article describes the R version; however, all of the examples can be found worked in Python here.)

Some vtreat common capabilities are documented here:

Score Frame score_frame_, using the score_frame_ information.
Cross Validation Customized Cross Plans, controlling the cross validation plan.

vtreat is available as a Python/Pandas package, and also as an R package.

(logo: Julie Mount, source: “The Harvest” by Boris Kustodiev 1914)

vtreat is used by instantiating one of the classes vtreat.NumericOutcomeTreatment, vtreat.BinomialOutcomeTreatment, vtreat.MultinomialOutcomeTreatment, or vtreat.UnsupervisedTreatment. Each of these implements the sklearn.pipeline.Pipeline interfaces expecting a Pandas DataFrame as input. The vtreat steps are intended to be a "one step fix" that works well with sklearn.preprocessing stages.

The vtreat Pipeline.fit_transform() method implements the powerful cross-frame ideas (allowing the same data to be used for vtreat fitting and for later model construction, while mitigating nested model bias issues).

Background

Even with modern machine learning techniques (random forests, support vector machines, neural nets, gradient boosted trees, and so on) or standard statistical methods (regression, generalized regression, generalized additive models) there are common data issues that can cause modeling to fail. vtreat deals with a number of these in a principled and automated fashion.

In particular vtreat emphasizes a concept called “y-aware pre-processing” and implements:

Treatment of missing values through safe replacement plus an indicator column (a simple but very powerful method when combined with downstream machine learning algorithms).
Treatment of novel levels (new values of categorical variable seen during test or application, but not seen during training) through sub-models (or impact/effects coding of pooled rare events).
Explicit coding of categorical variable levels as new indicator variables (with optional suppression of non-significant indicators).
Treatment of categorical variables with very large numbers of levels through sub-models (again impact/effects coding).
Correct treatment of nested models or sub-models through data split / cross-frame methods (please see here) or through the generation of “cross validated” data frames (see here); these are issues similar to what is required to build statistically efficient stacked models or super-learners.

The idea is: even with a sophisticated machine learning algorithm there are many ways messy real world data can defeat the modeling process, and vtreat helps with at least ten of them. We emphasize: these problems are already in your data, you simply build better and more reliable models if you attempt to mitigate them. Automated processing is no substitute for actually looking at the data, but vtreat supplies efficient, reliable, documented, and tested implementations of many of the commonly needed transforms.

To help explain the methods we have prepared some documentation:

The vtreat package overall.
Preparing data for analysis using R white-paper
The types of new variables introduced by vtreat processing (including how to limit down to domain appropriate variable types).
Statistically sound treatment of the nested modeling issue introduced by any sort of pre-processing (such as vtreat itself): nested over-fit issues and a general cross-frame solution.
Principled ways to pick significance based pruning levels.

Example

This is a supervised classification example taken from the KDD 2009 cup. A copy of the data and details can be found here: https://github.com/WinVector/PDSwR2/tree/master/KDD2009. The problem was to predict account cancellation ("churn") from very messy data (column names not given, numeric and categorical variables, many missing values, some categorical variables with a large number of possible levels). In this example we show how to quickly use vtreat to prepare the data for modeling. vtreat takes in Pandas DataFrames and returns both a treatment plan and a clean Pandas DataFrame ready for modeling.

To install:

!pip install vtreat
!pip install wvpy

Load our packages/modules.

import pandas
import xgboost
import vtreat
import vtreat.cross_plan
import numpy.random
import wvpy.util
import scipy.sparse

Read in explanatory variables.

# data from https://github.com/WinVector/PDSwR2/tree/master/KDD2009
dir = "../../../PracticalDataScienceWithR2nd/PDSwR2/KDD2009/"
d = pandas.read_csv(dir + 'orange_small_train.data.gz', sep='\t', header=0)
vars = [c for c in d.columns]
d.shape

(50000, 230)

Read in dependent variable we are trying to predict.

churn = pandas.read_csv(dir + 'orange_small_train_churn.labels.txt', header=None)
churn.columns = ["churn"]
churn.shape

(50000, 1)

churn["churn"].value_counts()

-1    46328
 1     3672
Name: churn, dtype: int64

Arrange test/train split.

numpy.random.seed(855885)
n = d.shape[0]
# https://github.com/WinVector/pyvtreat/blob/master/Examples/CustomizedCrossPlan/CustomizedCrossPlan.md
split1 = vtreat.cross_plan.KWayCrossPlanYStratified().split_plan(n_rows=n, k_folds=10, y=churn.iloc[:, 0])
train_idx = set(split1[0]['train'])
is_train = [i in train_idx for i in range(n)]
is_test = numpy.logical_not(is_train)

(The reported performance of runs of this example was sensitive to the prevalence of the churn variable in the test set; we are cutting down on this source of evaluation variance by using the stratified split.)
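
To make the effect of stratification concrete, here is a minimal NumPy sketch of a y-stratified fold assignment. It is only an illustration of the idea and makes no claim about how vtreat.cross_plan.KWayCrossPlanYStratified is implemented; the helper `stratified_fold_ids` and the toy outcome vector are hypothetical.

```python
import numpy as np

def stratified_fold_ids(y, k_folds, seed=0):
    """Assign each row a fold id in 0..k_folds-1 so that every fold
    gets nearly the same number of rows from each outcome class.

    Hypothetical helper for illustration only.
    """
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)  # random order within the class
        # deal the shuffled rows out to the folds round-robin
        fold[idx] = np.arange(len(idx)) % k_folds
    return fold

y = np.array([1] * 30 + [-1] * 370)  # rare positive class, as with churn
fold = stratified_fold_ids(y, k_folds=10)
is_test = fold == 0   # hold out one fold
is_train = ~is_test
print((y[is_test] == 1).mean(), (y[is_train] == 1).mean())
```

Because rows are dealt out class by class, the held-out fold and the training rows see the same prevalence of the rare class, which removes one source of run-to-run variation in the evaluation.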

d_train = d.loc[is_train, :].copy()
churn_train = numpy.asarray(churn.loc[is_train, :]["churn"]==1)
d_test = d.loc[is_test, :].copy()
churn_test = numpy.asarray(churn.loc[is_test, :]["churn"]==1)

Take a look at the explanatory variables. They are a mess: many missing values, and categorical variables that cannot be directly used without some re-encoding.

d_train.head()

	Var1	Var2	Var3	Var4	Var5	Var6	Var7	Var8	Var9	Var10	...	Var221	Var222	Var223	Var224	Var225	Var226	Var227	Var228	Var229	Var230
0	NaN	NaN	NaN	NaN	NaN	1526.0	7.0	NaN	NaN	NaN	...	oslk	fXVEsaq	jySVZNlOJy	NaN	NaN	xb3V	RAYp	F2FyR07IdsN7I	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	525.0	0.0	NaN	NaN	NaN	...	oslk	2Kb5FSF	LM8l689qOp	NaN	NaN	fKCe	RAYp	F2FyR07IdsN7I	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	5236.0	7.0	NaN	NaN	NaN	...	Al6ZaUT	NKv4yOc	jySVZNlOJy	NaN	kG3k	Qu4f	02N6s8f	ib5G6X1eUxUn6	am7c	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	0.0	NaN	NaN	NaN	...	oslk	CE7uk3u	LM8l689qOp	NaN	NaN	FSa2	RAYp	F2FyR07IdsN7I	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	1029.0	7.0	NaN	NaN	NaN	...	oslk	1J2cvxe	LM8l689qOp	NaN	kG3k	FSa2	RAYp	F2FyR07IdsN7I	mj86	NaN

5 rows × 230 columns

d_train.shape

(45000, 230)

Try building a model directly off this data (this will fail).

fitter = xgboost.XGBClassifier(n_estimators=10, max_depth=3, objective='binary:logistic')
try:
    fitter.fit(d_train, churn_train)
except Exception as ex:
    print(ex)

DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields Var191, Var192, Var193, Var194, Var195, Var196, Var197, Var198, Var199, Var200, Var201, Var202, Var203, Var204, Var205, Var206, Var207, Var208, Var210, Var211, Var212, Var213, Var214, Var215, Var216, Var217, Var218, Var219, Var220, Var221, Var222, Var223, Var224, Var225, Var226, Var227, Var228, Var229

Let's quickly prepare a data frame with none of these issues.

We start by building our treatment plan; this has the sklearn.pipeline.Pipeline interfaces.

plan = vtreat.BinomialOutcomeTreatment(outcome_target=True)

Use .fit_transform() to get a special copy of the treated training data that has cross-validated mitigations against nested model bias. We call this a "cross frame." .fit_transform() deliberately returns a different DataFrame than what would be returned by .fit().transform() (the .fit().transform() result would damage the modeling effort due to nested model bias; the .fit_transform() "cross frame" uses cross-validation techniques similar to "stacking" to mitigate these issues).
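
The following sketch illustrates why this matters, under simplifying assumptions. It impact-codes a high-cardinality categorical variable that is pure noise, once naively (estimate and apply on the same rows) and once out-of-fold (each row coded using only the other folds). The helpers `impact_code` and `cross_frame_code` are hypothetical and much simpler than what vtreat does internally; they only show the general cross-validation idea.

```python
import numpy as np
import pandas as pd

def impact_code(x_fit, y_fit, x_apply):
    """Per-level mean of y minus the grand mean, estimated on (x_fit, y_fit)
    and applied to x_apply; levels unseen in x_fit are coded as 0."""
    grand = y_fit.mean()
    effects = y_fit.groupby(x_fit).mean() - grand
    return x_apply.map(effects).fillna(0.0)

def cross_frame_code(x, y, k_folds=5, seed=0):
    """Code each row using only the rows in the other folds."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(x)) % k_folds
    coded = pd.Series(0.0, index=x.index)
    for f in range(k_folds):
        held_out = fold == f
        coded.loc[held_out] = impact_code(
            x[~held_out], y[~held_out], x[held_out]
        ).to_numpy()
    return coded

rng = np.random.default_rng(2019)
n = 2000
# a high-cardinality categorical variable that is pure noise with respect to y
x = pd.Series(rng.integers(0, 500, size=n)).map(lambda v: f"level_{v}")
y = pd.Series(rng.normal(size=n))

naive = impact_code(x, y, x)      # fit and apply on the same rows
crossed = cross_frame_code(x, y)  # out-of-fold estimates

naive_r = np.corrcoef(naive, y)[0, 1]
crossed_r = np.corrcoef(crossed, y)[0, 1]
print(f"naive correlation with y:       {naive_r:.3f}")
print(f"cross-frame correlation with y: {crossed_r:.3f}")
```

The naive code looks strongly predictive on the training rows even though the variable carries no signal, because each row's code was computed from that row's own outcome. The out-of-fold code does not show that spurious relationship.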

cross_frame = plan.fit_transform(d_train, churn_train)

Take a look at the new data. This frame is guaranteed to be all numeric with no missing values, with the rows in the same order as the training data.

cross_frame.head()

	Var2_is_bad	Var3_is_bad	Var4_is_bad	Var5_is_bad	Var6_is_bad	Var10_is_bad	Var11_is_bad	Var14_is_bad	...	Var227_lev_RAYp	Var228_logit_code	Var228_prevalence_code	Var228_lev_F2FyR07IdsN7I	Var229_logit_code	Var229_prevalence_code	Var229_lev__NA_	Var229_lev_am7c	Var229_lev_mj86
0	1.0	1.0	1.0	1.0	0.0	1.0	1.0	1.0	...	1.0	0.151682	0.653733	1.0	0.172744	0.567422	1.0	0.0	0.0
1	1.0	1.0	1.0	1.0	0.0	1.0	1.0	1.0	...	1.0	0.146119	0.653733	1.0	0.175707	0.567422	1.0	0.0	0.0
2	1.0	1.0	1.0	1.0	0.0	1.0	1.0	1.0	...	0.0	-0.629820	0.053956	0.0	-0.263504	0.234400	0.0	1.0	0.0
3	1.0	1.0	1.0	1.0	1.0	1.0	1.0	1.0	...	1.0	0.145871	0.653733	1.0	0.159486	0.567422	1.0	0.0	0.0
4	1.0	1.0	1.0	1.0	0.0	1.0	1.0	1.0	...	1.0	0.147432	0.653733	1.0	-0.286852	0.196600	0.0	0.0	1.0

5 rows × 216 columns

cross_frame.shape

(45000, 216)

Pick a recommended subset of the new derived variables.

plan.score_frame_.head()

	variable	orig_variable	treatment	y_aware	has_range	PearsonR	significance	vcount	default_threshold	recommended
0	Var1_is_bad	Var1	missing_indicator	False	True	0.003283	0.486212	193.0	0.001036	False
1	Var2_is_bad	Var2	missing_indicator	False	True	0.019270	0.000044	193.0	0.001036	True
2	Var3_is_bad	Var3	missing_indicator	False	True	0.019238	0.000045	193.0	0.001036	True
3	Var4_is_bad	Var4	missing_indicator	False	True	0.018744	0.000070	193.0	0.001036	True
4	Var5_is_bad	Var5	missing_indicator	False	True	0.017575	0.000193	193.0	0.001036	True

model_vars = numpy.asarray(plan.score_frame_["variable"][plan.score_frame_["recommended"]])
len(model_vars)

Fit the model

cross_frame.dtypes

Var2_is_bad                            float64
Var3_is_bad                            float64
Var4_is_bad                            float64
Var5_is_bad                            float64
Var6_is_bad                            float64
                                  ...         
Var229_logit_code                      float64
Var229_prevalence_code                 float64
Var229_lev__NA_           Sparse[float64, 0.0]
Var229_lev_am7c           Sparse[float64, 0.0]
Var229_lev_mj86           Sparse[float64, 0.0]
Length: 216, dtype: object

# fails due to sparse columns
# can also work around this by setting the vtreat parameter 'sparse_indicators' to False
try:
    cross_sparse = xgboost.DMatrix(data=cross_frame.loc[:, model_vars], label=churn_train)
except Exception as ex:
    print(ex)

DataFrame.dtypes for data must be int, float or bool.
                Did not expect the data types in fields Var193_lev_RO12, Var193_lev_2Knk1KF, Var194_lev__NA_, Var194_lev_SEuy, Var195_lev_taul, Var200_lev__NA_, Var201_lev__NA_, Var201_lev_smXZ, Var205_lev_VpdQ, Var206_lev_IYzP, Var206_lev_zm5i, Var206_lev__NA_, Var207_lev_me75fM6ugJ, Var207_lev_7M47J5GA0pTYIFxg5uy, Var210_lev_uKAI, Var211_lev_L84s, Var211_lev_Mtgm, Var212_lev_NhsEn4L, Var212_lev_XfqtO3UdzaXh_, Var213_lev__NA_, Var214_lev__NA_, Var218_lev_cJvF, Var218_lev_UYBR, Var221_lev_oslk, Var221_lev_zCkv, Var225_lev__NA_, Var225_lev_ELof, Var225_lev_kG3k, Var226_lev_FSa2, Var227_lev_RAYp, Var227_lev_ZI9m, Var228_lev_F2FyR07IdsN7I, Var229_lev__NA_, Var229_lev_am7c, Var229_lev_mj86

# also fails
try:
    cross_sparse = scipy.sparse.csc_matrix(cross_frame[model_vars])
except Exception as ex:
    print(ex)

no supported conversion for types: (dtype('O'),)

# works
cross_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(cross_frame[[vi]]) for vi in model_vars])
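
Besides the sparse_indicators setting mentioned in the comment above, another possible workaround is to convert the sparse columns to ordinary dense columns on the pandas side. The sketch below shows this on a small stand-in frame; the helper `densify` and the mock data are hypothetical, and densifying a wide frame can use considerably more memory than the scipy.sparse route.

```python
import numpy as np
import pandas as pd

def densify(frame):
    """Return a copy of frame in which every column is a plain dense array.

    Hypothetical helper; np.asarray() materializes a pandas sparse column
    as an ordinary NumPy array.
    """
    return pd.DataFrame(
        {c: np.asarray(frame[c], dtype=float) for c in frame.columns},
        index=frame.index,
    )

# small stand-in for a treated frame with mixed dense and sparse columns
mock = pd.DataFrame({
    "Var6_is_bad": [0.0, 0.0, 1.0],
    "Var229_lev_am7c": pd.arrays.SparseArray([0.0, 1.0, 0.0], fill_value=0.0),
})
dense = densify(mock)
print(mock.dtypes)
print(dense.dtypes)
```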

# https://xgboost.readthedocs.io/en/latest/python/python_intro.html
fd = xgboost.DMatrix(
    data=cross_sparse, 
    label=churn_train)

x_parameters = {"max_depth":3, "objective":'binary:logistic'}
cv = xgboost.cv(x_parameters, fd, num_boost_round=100, verbose_eval=False)

cv.head()

	train-error-mean	train-error-std	test-error-mean	test-error-std
0	0.073378	0.000322	0.073733	0.000669
1	0.073411	0.000257	0.073511	0.000529
2	0.073433	0.000268	0.073578	0.000514
3	0.073444	0.000283	0.073533	0.000525
4	0.073444	0.000283	0.073533	0.000525

best = cv.loc[cv["test-error-mean"]<= min(cv["test-error-mean"] + 1.0e-9), :]
best

	train-error-mean	train-error-std	test-error-mean	test-error-std
21	0.072756	0.000177	0.073267	0.000327

ntree = best.index.values[0]
ntree

fitter = xgboost.XGBClassifier(n_estimators=ntree, max_depth=3, objective='binary:logistic')
fitter

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=21, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

model = fitter.fit(cross_sparse, churn_train)

Apply the data transform to our held-out data.

test_processed = plan.transform(d_test)

Plot the quality of the model on training data (a biased measure of performance).

pf_train = pandas.DataFrame({"churn":churn_train})
pf_train["pred"] = model.predict_proba(cross_sparse)[:, 1]
wvpy.util.plot_roc(pf_train["pred"], pf_train["churn"], title="Model on Train")

0.7424056263753072

Plot the quality of the model score on the held-out data. This AUC is not great, but in the ballpark of the original contest winners.

test_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processed[[vi]]) for vi in model_vars])
pf = pandas.DataFrame({"churn":churn_test})
pf["pred"] = model.predict_proba(test_sparse)[:, 1]
wvpy.util.plot_roc(pf["pred"], pf["churn"], title="Model on Test")

0.7328696191869485

Notice we dealt with many problem columns at once, and in a statistically sound manner. More on the vtreat package for Python can be found here: https://github.com/WinVector/pyvtreat. Details on the R version can be found here: https://github.com/WinVector/vtreat.

We can compare this to the R solution (link).

We can compare the above cross-frame solution to a naive "design transform and model on the same data set" solution as we show below. Note we turn off filter_to_recommended as this is computed using cross-frame techniques (and hence is a non-naive estimate).

plan_naive = vtreat.BinomialOutcomeTreatment(
    outcome_target=True,              
    params=vtreat.vtreat_parameters({'filter_to_recommended':False}))
plan_naive.fit(d_train, churn_train)
naive_frame = plan_naive.transform(d_train)

naive_sparse = scipy.sparse.hstack([scipy.sparse.csc_matrix(naive_frame[[vi]]) for vi in model_vars])

fd_naive = xgboost.DMatrix(data=naive_sparse, label=churn_train)
x_parameters = {"max_depth":3, "objective":'binary:logistic'}
cvn = xgboost.cv(x_parameters, fd_naive, num_boost_round=100, verbose_eval=False)

bestn = cvn.loc[cvn["test-error-mean"]<= min(cvn["test-error-mean"] + 1.0e-9), :]
bestn

	train-error-mean	train-error-std	test-error-mean	test-error-std
94	0.0485	0.000438	0.058622	0.000545

ntreen = bestn.index.values[0]
ntreen

fittern = xgboost.XGBClassifier(n_estimators=ntreen, max_depth=3, objective='binary:logistic')
fittern

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=94, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)

modeln = fittern.fit(naive_sparse, churn_train)

test_processedn = plan_naive.transform(d_test)
test_processedn = scipy.sparse.hstack([scipy.sparse.csc_matrix(test_processedn[[vi]]) for vi in model_vars])

pfn_train = pandas.DataFrame({"churn":churn_train})
pfn_train["pred_naive"] = modeln.predict_proba(naive_sparse)[:, 1]
wvpy.util.plot_roc(pfn_train["pred_naive"], pfn_train["churn"], title="Overfit Model on Train")

0.9492686875296688

pfn = pandas.DataFrame({"churn":churn_test})
pfn["pred_naive"] = modeln.predict_proba(test_processedn)[:, 1]
wvpy.util.plot_roc(pfn["pred_naive"], pfn["churn"], title="Overfit Model on Test")

0.5960012412998182

Note the naive test performance is worse, despite its far better training performance. This is over-fit due to the nested model bias of using the same data to build the treatment plan and model without any cross-frame mitigations.

Solution Details

Some vtreat data treatments are “y-aware” (use distribution relations between independent variables and the dependent variable).

The purpose of the vtreat library is to reliably prepare data for supervised machine learning. We try to leave as much as possible to the machine learning algorithms themselves, but cover most of the truly necessary, typically ignored precautions. The library is designed to produce a DataFrame that is entirely numeric and takes common precautions to guard against the following real world data issues:

Categorical variables with very many levels.

We re-encode such variables as a family of indicator or dummy variables for common levels plus an additional impact code (also called “effects coded”). This allows principled use (including smoothing) of huge categorical variables (like zip-codes) when building models. This is critical for some libraries (such as randomForest, which has hard limits on the number of allowed levels).
Rare categorical levels.

Levels that do not occur often during training tend not to have reliable effect estimates and contribute to over-fit.
Novel categorical levels.

A common problem in deploying a classifier to production is new levels (levels not seen during training) encountered during model application. We deal with this by encoding categorical variables in a possibly redundant manner: reserving a dummy variable for all levels (not the more common "all but a reference level" scheme). This is in fact the correct representation for regularized modeling techniques and lets us code novel levels as all dummies simultaneously zero (which is a reasonable thing to try). This encoding, while limited, is cheaper than the fully Bayesian solution of computing a weighted sum over previously seen levels during model application.
Missing/invalid values NA, NaN, +-Inf.

Variables with these issues are re-coded as two columns. The first column is a clean copy of the variable (with missing/invalid values replaced with either zero or the grand mean, depending on the user's choice of the scale parameter). The second column is a dummy or indicator that marks if the replacement has been performed. This is simpler than imputation of missing values, and allows the downstream model to attempt to use missingness as a useful signal (which it often is in industrial data).
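
The missing-value and novel-level treatments described in the list above can be sketched in plain pandas as follows. This is an illustration of the ideas only, not vtreat's code: the helpers `fit_numeric`, `apply_numeric`, and `apply_indicators` are hypothetical, and the replacement value here is simply the training mean of the valid entries.

```python
import numpy as np
import pandas as pd

def fit_numeric(train_col):
    """Learn the replacement value (here the training mean of valid entries)."""
    valid = train_col[np.isfinite(train_col)]
    return float(valid.mean())

def apply_numeric(col, fill, name):
    """Two columns: a clean copy, and an indicator marking replaced entries."""
    bad = ~np.isfinite(col)  # True for NaN, +Inf, and -Inf
    return pd.DataFrame({
        name: col.where(~bad, fill),
        f"{name}_is_bad": bad.astype(float),
    })

def apply_indicators(col, levels, name):
    """One 0/1 column per training level; novel levels get all zeros."""
    return pd.DataFrame(
        {f"{name}_lev_{lv}": (col == lv).astype(float) for lv in levels}
    )

train = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.inf], "xc": ["a", "b", "a", "b"]})
test = pd.DataFrame({"x": [np.nan, 10.0], "xc": ["a", "zzz"]})  # "zzz" is novel

fill = fit_numeric(train["x"])
levels = sorted(train["xc"].unique())
treated_test = pd.concat(
    [apply_numeric(test["x"], fill, "x"), apply_indicators(test["xc"], levels, "xc")],
    axis=1,
)
print(treated_test)
```

Note that the replacement value and the list of levels are learned from the training data only and then reused at application time, so the novel level "zzz" in the test data is coded with every indicator equal to zero.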

The above are all awful things that often lurk in real world data. Automating mitigation steps ensures they are easy enough that you actually perform them and leaves the analyst time to look for additional data issues. For example this allowed us to essentially automate a number of the steps taught in chapters 4 and 6 of Practical Data Science with R (Zumel, Mount; Manning 2014) into a very short worksheet (though we think for understanding it is essential to work all the steps by hand as we did in the book). The 2nd edition of Practical Data Science with R covers using vtreat in R in chapter 8 "Advanced Data Preparation."

The idea is: DataFrames prepared with the vtreat library are somewhat safe to train on as some precaution has been taken against all of the above issues. Also of interest are the vtreat variable significances (help in initial variable pruning, a necessity when there are a large number of columns) and vtreat::prepare(scale=TRUE) which re-encodes all variables into effect units making them suitable for y-aware dimension reduction (variable clustering, or principal component analysis) and for geometry sensitive machine learning techniques (k-means, knn, linear SVM, and more). You may want to do more than the vtreat library does (such as Bayesian imputation, variable clustering, and more) but you certainly do not want to do less.

References

Some of our related articles (which should make clear some of our motivations, and design decisions):

A directory of worked examples can be found here.

We intend to add better Python documentation and a certification suite going forward.

Some notes on controlling vtreat cross-validation can be found here.

Note on data types.

.fit_transform() expects the first argument to be a pandas.DataFrame with trivial row-indexing (i.e. after .reset_index(inplace=True, drop=True)) and scalar column names, and the second to be a vector-like object with a len() equal to the number of rows of the first argument. We are working on supporting column types other than string and numeric at this time.
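
A small pre-flight check along these lines can catch shape and indexing problems before calling .fit_transform(). The helper `prepare_inputs` below is hypothetical (it is not part of vtreat) and only encodes the requirements stated in the note above.

```python
import pandas as pd

def prepare_inputs(d, y):
    """Return (frame, outcome) in the shape described above.

    Hypothetical helper: resets the row index to 0..n-1, checks the
    column names are scalars, and checks y has one entry per row.
    """
    if len(y) != d.shape[0]:
        raise ValueError(f"y has {len(y)} entries but frame has {d.shape[0]} rows")
    if isinstance(d.columns, pd.MultiIndex):
        raise ValueError("column names must be scalars, not tuples")
    d = d.reset_index(drop=True)  # returns a new frame with a trivial index
    y = pd.Series(list(y))        # positional copy aligned with the new index
    return d, y

raw = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[10, 20, 30])
outcome = pd.Series([True, False, True], index=[10, 20, 30])
d, y = prepare_inputs(raw, outcome)
print(d.index.tolist(), y.index.tolist())
```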

pyvtreat's Issues

Use with PySpark

With the rise of Spark usage in the ML community, it would be interesting for vtreat to be able to deal with Spark RDDs.

Pass default value to clean_copy

Hi,

I'm using the unsupervised treatment module and noticed that the clean_copy fills in any missing/bad values with the mean for the distribution. I was wondering if I could specify a default value based on column name through something like a dictionary?

The reason is I might have a variable that has a long tail that few people ever reach. This might be something like money spent after reaching level 100 in a game or something. Now most people won't reach level 100, so their value will be missing/NA. However, of those that do, most people will spend 0, and a few might spend a large amount, maybe 1000.

vtreat will default fill in the clean copy of revenue (which is hugely influential indicator) with maybe like 150 which kind of ruins any signal from this long tail distribution. Could I pass a dictionary with like 0 as a default value in this case? Thank you!

This could be something like

transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=[],          # columns to "carry along" but not treat as input variables
    cols_fill_values={
        'col1': 0,
        'col2': mean,
        'col3': 1
    }
)

recommended variables in `MultinomialOutcomeTreatment`

I noticed that if you use MultinomialOutcomeTreatment.fit_transform and then follow the recipe in the examples of selecting variables with

good_variables = plan.score_frame_.variable[plan.score_frame_.recommended].values

you'll end up with duplicated entries in good_variables (because you're doing separate tests for each output class, I think?)

Would you suggest just calling

good_variables = plan.score_frame_.variable[plan.score_frame_.recommended].unique()

instead? Or is there some other recommended way to do variable selection in the MultinomialOutcomeTreatment case?

Thanks!
~ Ben

Snippet:

import vtreat
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

X, y = pd.DataFrame(iris['data']), iris['target']

plan  = vtreat.BinomialOutcomeTreatment(outcome_target=True)
X_new = plan.fit_transform(X, y == 0)
plan.score_frame_

#    variable  orig_variable   treatment  y_aware  has_range  PearsonR  significance  vcount  default_threshold  recommended
# 0         0              0  clean_copy    False       True  0.079396  3.341524e-01     4.0               0.25        False
# 1         1              1  clean_copy    False       True -0.467703  1.595624e-09     4.0               0.25         True
# 2         2              2  clean_copy    False       True  0.201754  1.329302e-02     4.0               0.25         True
# 3         3              3  clean_copy    False       True  0.117899  1.507473e-01     4.0               0.25         True

plan  = vtreat.MultinomialOutcomeTreatment()
X_new = plan.fit_transform(X, y)
plan.score_frame_
#     variable  orig_variable   treatment  y_aware  has_range  PearsonR  significance  vcount  default_threshold  recommended  outcome_target
# 0          0              0  clean_copy    False       True -0.717416  5.288768e-25     4.0               0.25         True               0
# 1          1              1  clean_copy    False       True  0.603348  3.054699e-16     4.0               0.25         True               0
# 2          2              2  clean_copy    False       True -0.922765  3.623379e-63     4.0               0.25         True               0
# 3          3              3  clean_copy    False       True -0.887344  1.288504e-51     4.0               0.25         True               0
# 4          0              0  clean_copy    False       True  0.079396  3.341524e-01     4.0               0.25        False               1
# 5          1              1  clean_copy    False       True -0.467703  1.595624e-09     4.0               0.25         True               1
# 6          2              2  clean_copy    False       True  0.201754  1.329302e-02     4.0               0.25         True               1
# 7          3              3  clean_copy    False       True  0.117899  1.507473e-01     4.0               0.25         True               1
# 8          0              0  clean_copy    False       True  0.638020  1.619533e-18     4.0               0.25         True               2
# 9          1              1  clean_copy    False       True -0.135645  9.791170e-02     4.0               0.25         True               2
# 10         2              2  clean_copy    False       True  0.721011  2.381987e-25     4.0               0.25         True               2
# 11         3              3  clean_copy    False       True  0.769445  1.297773e-30     4.0               0.25         True               2
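
One way to frame the deduplication question raised in this issue is to decide explicitly whether a variable should be kept when it is recommended for any outcome class or only when it is recommended for all of them. The sketch below works on a mock score frame whose recommended flags mirror the table above; it is not a maintainer recommendation, just one possible reading.

```python
import pandas as pd

# mock score frame with one row per (variable, outcome_target) pair
score_frame = pd.DataFrame({
    "variable": [0, 1, 2, 3] * 3,
    "recommended": [True, True, True, True,
                    False, True, True, True,
                    True, True, True, True],
    "outcome_target": [0] * 4 + [1] * 4 + [2] * 4,
})

# keep a variable if it is recommended for at least one outcome class
any_rec = score_frame.groupby("variable")["recommended"].any()
good_variables = any_rec[any_rec].index.tolist()

# stricter alternative: recommended for every outcome class
all_rec = score_frame.groupby("variable")["recommended"].all()
strict_variables = all_rec[all_rec].index.tolist()
print(good_variables, strict_variables)
```

The "any" rule gives the same set as calling .unique() on the recommended rows, while the "all" rule drops variable 0, which is not recommended for outcome class 1.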

documentation for score_frame_?

First, thank you so much for vtreat, it has definitely changed how I approach pre-processing data.
I am trying to understand the different columns created by the method score_frame_ for a BinomialOutcomeTreatment. I've looked through the python examples, the python api code, and the original paper, but I can't seem to find any information on 'has_range' and 'vcount' . What are the definitions of those columns and/or where can I find more documentation on score_frame_ ?

Code up categorical variables with shared indicator space example

Make it easy for categorical variables to share indicator space additively, and same for derived impact columns.

Question - How to encode high cardinal variables

I came across this package while I was googling to find a way to encode my high cardinal variables.

My categorical variable has 15 levels and my dataset has only 900 rows.

I would like to encode them in a manner that will allow us to interpret later (unlike hash etc).

So, Is there any tutorial or method on how can we encode high cardinal variable without losing interpretability?

Unspecified upper version limits for dependencies

Issue
The setup.py file does not specify upper version limits for any dependencies which can lead to major version upgrades with breaking changes.
Impact
This can be currently seen via pandas which recently got a major version bump. Upon release vtreat==1.2.8 did not upgrade pandas to >=2.0.0, however it currently does.
Potential Fix
Specify approximate version numbers using ~= or upper limits using <

Note: Other than an unspecified upper limit in vtreat, pandas also indirectly gets a version bump through data_algebra which now requires pandas>=2.0.0

vtreat and sklearn pipeline

First of all really interesting project, that could save a lot of repetitive work and provide good baseline.
I've tried to find example in docs that uses Pipeline from scikit-learn but I didn't, so this is my quick and dirty attempt based on yours:

import numpy as np
import pandas as pd
import vtreat
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

np.random.seed(2019)

def make_data(nrows):
    d = pd.DataFrame({'x': 5*np.random.normal(size=nrows)})
    d['y'] = np.sin(d['x']) + 0.1*np.random.normal(size=nrows)
    d.loc[np.arange(3, 10), 'x'] = np.nan            # introduce missing values in x
    d['xc'] = ['level_' + str(5*np.round(yi/5, 1)) for yi in d['y']]
    d['x2'] = np.random.normal(size=nrows)
    d.loc[d['xc']=='level_-1.0', 'xc'] = np.nan      # introduce a nan level
    d['yc'] = d['y']>0.5
    return d

df = make_data(500)

# drop the numeric outcome so it cannot leak into the explanatory variables
df = df.drop(columns=['y'])

transform = vtreat.BinomialOutcomeTreatment(outcome_target=True)

clf = Pipeline(steps=[
    ('preprocessor', transform),
    ('classifier', LogisticRegression())]
)

# df.pop removes 'yc' in place, so X no longer contains the outcome
X, y = df, df.pop('yc')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf.fit(X_train, y_train)

print("model score: %.3f" % clf.score(X_test, y_test))

In general it seems to work, but:

It would be great to have __repr__, get_params, etc., to get a nice representation in a Pipeline.
Similarly, a get_feature_names method would allow clf['preprocessor'].get_feature_names().
I don't use the cols_to_copy parameter and instead drop y manually, to avoid leaking y.
I'm not sure whether vtreat.cross_plan... could be replaced by validation schemes from scikit-learn such as GridSearchCV.
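
For context on the first two requests, here is a minimal sketch of how a scikit-learn transformer usually obtains get_params and a readable __repr__: both are inherited from sklearn.base.BaseEstimator once every constructor argument is stored under its own name. FillMissing is a hypothetical toy class, not part of vtreat, and nothing here describes how vtreat is implemented. Current scikit-learn releases name the feature-name method get_feature_names_out.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class FillMissing(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: replaces NaN with a constant."""

    def __init__(self, fill_value=0.0):
        # Storing the argument under the same name is what lets
        # BaseEstimator provide get_params(), set_params() and __repr__.
        self.fill_value = fill_value

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.where(np.isnan(X), self.fill_value, X)

    def get_feature_names_out(self, input_features=None):
        # One output column per input column in this toy case.
        if input_features is None:
            input_features = [f"x{i}" for i in range(self.n_features_in_)]
        return np.asarray(input_features, dtype=object)


step = FillMissing(fill_value=-1.0)
params = step.get_params()
text = repr(step)
out = step.fit_transform(np.array([[1.0, np.nan], [np.nan, 2.0]]))
names = step.get_feature_names_out()
```

Because get_params exposes the constructor arguments, a step written this way can also be cloned and tuned by tools such as GridSearchCV.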

Pure data algebra training path

Implement a pure data algebra training path, or a Polars training shim

Reproducibility

Hi, after applying fit_transform and finding the set of useful features, I would like to apply the same set to another dataset that has the same original features but different observations.
Is there a specific way to achieve this reproducibility? I suppose that reapplying fit_transform could produce a different set of features. I tried applying fit and transform separately, but maybe there is a dedicated function for this (and the UserWarning: possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead) tells me that this is probably not the correct approach).

Thanks in advance and congratulations on your excellent work.
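
The general pattern behind this question can be sketched without vtreat: learn the encoding once from the training rows, keep the fitted object, and call only its transform step on rows it has not seen. MeanCoder below is a hypothetical stand-in written for illustration, not vtreat's implementation. The quoted warning concerns calling transform on the very rows used for fitting, which is a different situation from the one shown here.

```python
from collections import defaultdict


class MeanCoder:
    """Hypothetical stand-in: codes a category by its training-set mean outcome."""

    def fit(self, levels, y):
        sums, counts = defaultdict(float), defaultdict(int)
        for lev, yi in zip(levels, y):
            sums[lev] += yi
            counts[lev] += 1
        self.grand_mean_ = sum(y) / len(y)
        self.codes_ = {lev: sums[lev] / counts[lev] for lev in sums}
        return self

    def transform(self, levels):
        # Levels never seen during fit fall back to the grand mean.
        return [self.codes_.get(lev, self.grand_mean_) for lev in levels]


# Fit once on the training observations ...
coder = MeanCoder().fit(["a", "a", "b"], [1.0, 3.0, 10.0])

# ... then reuse the same fitted object on different observations.
new_codes = coder.transform(["b", "a", "c"])
```

Since nothing is re-estimated in transform, the derived columns for the new rows are defined exactly as they were for the training rows.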

Indicator Code is False for has_range/example outdated?

Even when running the example code, my prepared data frame doesn't include indicator_code variables. When I check transform.score_frame_, I see that the indicator_code variables are False under has_range, which might be why they weren't created. Is this intended? The example data obviously does have different levels and does vary. If so, would it help to update the example to the latest behavior? Thank you!

ipdb> !d.head()
     x         y          xc        x2  x3
0  0.0 -0.111698  level_-0.0 -0.098463   1
1  0.1  0.270348   level_0.5  0.370653   1
2  0.2 -0.057853  level_-0.0  0.111180   1
3  0.3  0.412467   level_0.5  1.305242   1
4  0.4  0.469221   level_0.5  0.490332   1

ipdb> d_prepared.columns
Index(['y', 'xc_is_bad', 'x', 'x2', 'xc_prevalence_code'], dtype='object')

ipdb> transform.score_frame_
              variable orig_variable          treatment  y_aware  has_range  PearsonR  significance  recommended  vcount
0            xc_is_bad            xc  missing_indicator    False       True       NaN           NaN         True     1.0
1                    x             x         clean_copy    False       True       NaN           NaN         True     2.0
2                   x2            x2         clean_copy    False       True       NaN           NaN         True     2.0
3   xc_prevalence_code            xc    prevalence_code    False       True       NaN           NaN         True     1.0
4    xc_lev_level_-0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
5     xc_lev_level_1.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
6     xc_lev_level_0.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
7    xc_lev_level_-0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0
8          xc_lev__NA_            xc     indicator_code    False      False       NaN           NaN        False     7.0
9     xc_lev_level_1.5            xc     indicator_code    False      False       NaN           NaN        False     7.0
10    xc_lev_level_0.0            xc     indicator_code    False      False       NaN           NaN        False     7.0

Column name update/seeding tutorials

Hi. I didn't want to fork the repo for this, but under the Python classification example in the exploratory section, the notebook says:

'Find the mean value of yc'

I think 'yc' is a nominal column, so finding its mean wouldn't be possible. With that in mind, here are two friendly suggestions:

Add something like numpy.random.seed(42) or another seed value at the top of the examples, so that those following the tutorial get reproducible results.
Update the mean value sections. I could be wrong and may have misread the document, but I went through another of the tutorials and some of the text copied over may have been mislabeled.

Other than that, the package looks interesting so far.

Thanks!

categorical variables

If a categorical column contains only numeric values (such as 5, 7, 8, 1), how can that be specified to vtreat.NumericOutcomeTreatment?

Or is the simplest way to convert the numeric values of the categorical column to strings?

Add distinct +/-infinity treatments

Add distinct +/-infinity treatments. May not be able to distinguish NA from NaN on pandas.
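
One possible shape for such a treatment, sketched in plain Python: emit a cleaned value plus separate indicators for NaN, +infinity and -infinity. The function name and the 0.0 replacement value are assumptions made for illustration, and the sketch does not address the NA versus NaN question on pandas.

```python
import math


def split_special_values(values, replacement=0.0):
    """Return (clean, is_nan, is_pos_inf, is_neg_inf) tuples for float inputs."""
    rows = []
    for v in values:
        is_nan = math.isnan(v)
        is_pos_inf = v == math.inf
        is_neg_inf = v == -math.inf
        special = is_nan or is_pos_inf or is_neg_inf
        clean = replacement if special else v
        rows.append((clean, int(is_nan), int(is_pos_inf), int(is_neg_inf)))
    return rows


rows = split_special_values([1.5, float("nan"), float("inf"), float("-inf")])
```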

Look into variations of transform implementation

Consider looking into pandas.DataFrame.applymap() (it may or may not be usable in the presence of novel levels and missing values) and also into a data_algebra version of .transform().

error on DataFrame

When I tried to call fit_transform on a DataFrame, I got this:


  File "<ipython-input-34-4749a25525c1>", line 2, in <module>
    train_labeled_vtreat.append(plan.fit_transform(pd.DataFrame(train_labeled[column]).all(), target))

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 297, in fit_transform
    raise Exception("X should be a Pandas DataFrame")

Exception: X should be a Pandas DataFrame

or this:

cross_frame = plan.fit_transform(train_labeled, target)
Traceback (most recent call last):

  File "<ipython-input-35-e8d2bce0ab6b>", line 1, in <module>
    cross_frame = plan.fit_transform(train_labeled, target)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\__init__.py", line 315, in fit_transform
    params=self.params_,

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 460, in fit_multinomial_outcome_treatment
    params=params,

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\vtreat_impl.py", line 176, in fit_binomial_impact_code
    sf = vtreat.util.grouped_by_x_statistics(x, y)

  File "C:\Users\TEMP.Main.001\.conda\envs\snakes\lib\site-packages\vtreat\util.py", line 31, in grouped_by_x_statistics
    if n != len(y):

TypeError: len() of unsized object
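
A likely cause of the first traceback can be reproduced with pandas alone: DataFrame.all() reduces the frame to a Series of booleans, so the object passed on is no longer a DataFrame. The data below is hypothetical. For the second traceback, "len() of unsized object" is the error NumPy raises for len() on a zero-dimensional array, which may indicate that target was a scalar or a column name rather than a sequence of outcomes, but the source does not show how target was built.

```python
import pandas as pd

# Hypothetical stand-in for the reporter's data.
train_labeled = pd.DataFrame({"a": [1, 0, 2], "b": ["x", "y", "z"]})
column = "a"

# .all() collapses each column to a single boolean, returning a Series.
reduced = pd.DataFrame(train_labeled[column]).all()

# Double brackets select one column while keeping a DataFrame.
one_column_frame = train_labeled[[column]]

print(type(reduced).__name__, type(one_column_frame).__name__)
```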

Deprecated dependency on the sklearn package

The correct package name for vtreat to depend on is scikit-learn

Bug: import Error due to statsmodels' Appender

Hello,

I am experiencing the following error when importing pyvtreat

---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
in
2 import pandas as pd
3 import numpy as np
----> 4 import vtreat
5 import vtreat.util
6 import wvpy.util

~/anaconda3/envs/python3/lib/python3.6/site-packages/vtreat/__init__.py in
5 import numpy
6
----> 7 from vtreat.vtreat_api import *
8
9 __docformat__ = "restructuredtext"

~/anaconda3/envs/python3/lib/python3.6/site-packages/vtreat/vtreat_api.py in
4 import numpy
5
----> 6 import vtreat.vtreat_impl as vtreat_impl
7 import vtreat.util
8 import vtreat.cross_plan

~/anaconda3/envs/python3/lib/python3.6/site-packages/vtreat/vtreat_impl.py in
13 import pandas
14
---> 15 import vtreat.util
16 import vtreat.transform
17

~/anaconda3/envs/python3/lib/python3.6/site-packages/vtreat/util.py in
16 with warnings.catch_warnings():
17 warnings.filterwarnings("ignore", category=DeprecationWarning)
---> 18 import statsmodels.api
19 import statsmodels.formula.api
20

~/anaconda3/envs/python3/lib/python3.6/site-packages/statsmodels/api.py in
18 from . import robust
19 from .robust.robust_linear_model import RLM
---> 20 from .discrete.discrete_model import (Poisson, Logit, Probit,
21 MNLogit, NegativeBinomial,
22 GeneralizedPoisson,

~/anaconda3/envs/python3/lib/python3.6/site-packages/statsmodels/discrete/discrete_model.py in
26 from scipy.stats import nbinom
27
---> 28 from statsmodels.compat.pandas import Appender
29
30 import statsmodels.tools.tools as tools

ImportError: cannot import name 'Appender'

I looked at SciPy's issues, and it seems that multiple packages are having trouble importing Appender, but SciPy has closed all of those issues without changing their package. I've also tried different versions of pandas (0.23.0 to 1.0.1), and that made no difference. SciPy seems to suggest using
from pandas.util._decorators import Appender

I really like vtreat, so any help/suggestions on how to import it would be appreciated.
Thanks!

Research future warning from Pandas

Get a test that isolates the future warning from Pandas, research how to future-proof the code, and remove the init suppression of these warnings.
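
A test that isolates a FutureWarning can be built on the standard warnings module, as in this sketch. The helper emits_future_warning and the two sample callables are hypothetical. In practice the callable would wrap the specific pandas operation under suspicion.

```python
import warnings


def emits_future_warning(fn, *args, **kwargs):
    """Run fn and report whether it raised any FutureWarning."""
    with warnings.catch_warnings(record=True) as caught:
        # "always" defeats the once-per-location default so nothing is hidden.
        warnings.simplefilter("always")
        fn(*args, **kwargs)
    return any(issubclass(w.category, FutureWarning) for w in caught)


def legacy_call():
    # Stand-in for a library call that announces a behaviour change.
    warnings.warn("this behaviour will change", FutureWarning)


def quiet_call():
    return 1


print(emits_future_warning(legacy_call), emits_future_warning(quiet_call))
```

Once each warning source is pinned down this way, a stricter variant can use warnings.simplefilter("error", FutureWarning) so the test suite fails as soon as a new one appears.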

Add and test Polars path

Add and test Polars path. At least on the data algebra pipeline, and then later on training.
