Comments (9)
Thanks,
I haven't finished the Pipeline integration, but you have given me some good pointers on steps to get there. I'll close this after I add some of the suggestions.
from pyvtreat.
In version 0.3.4 of `vtreat` the transforms implement much more of the sklearn step interface. We also have a really neat new feature that warns if `.fit(X, y).transform(X)` is ever called (it looks like sklearn `Pipeline` calls `.fit_transform(X, y)`, which is what `vtreat` wants, to prevent over-fit issues via its cross-frame methodology). I have re-run your example here: https://github.com/WinVector/pyvtreat/blob/master/Examples/Pipeline/Pipeline_Example.md
Thanks for helping with the package.
Thanks!
- Minor issue: a few classes lack a `__repr__` method, e.g. `'cross_validation_plan': <vtreat.cross_plan.KWayCrossPlanYStratified object at 0x10fa81b50>`
- So, as far as I can see, it would be hard to put `vtreat` directly in a `Pipeline` like in my example, but I'll think about it. `Pipeline`s are nice, and `GridSearch` over `vtreat` parameters would also be cool.
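A readable `__repr__` usually just echoes the constructor arguments, so the printed form of a nested plan object becomes informative instead of a bare memory address. A minimal sketch (the class here is a hypothetical stand-in, not the actual `vtreat.cross_plan` implementation):

```python
# Sketch: make a nested helper object print readably by echoing its
# constructor arguments. Hypothetical class, not the real vtreat code.

class KWayCrossPlan:
    """Toy stand-in for a cross-validation plan object."""

    def __init__(self, k=5, stratified=True):
        self.k = k
        self.stratified = stratified

    def __repr__(self):
        # echo constructor arguments so the printed form is informative
        return f"{type(self).__name__}(k={self.k}, stratified={self.stratified})"

plan = KWayCrossPlan(k=3)
print(repr(plan))  # KWayCrossPlan(k=3, stratified=True)
```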
- Ah, yes, I'll add the `__repr__()` to the support classes.
- Not sure if grid-searching over `vtreat` parameters is that much of a benefit. Would it be better if `vtreat` hid its parameters from the pipeline?

Also, `vtreat` isn't using the cross-validation effects to search for hyper-parameter values. It is using them to try to avoid the nested model bias seen in non-cross-validated stacked models. So there may be less of a connection to `GridSearchCV` than it first appears.
- Regarding `get_params` and `set_params`, I see the following cases:
  i) when you extend a base class, `get_params` with `super` in the child `__init__` is often used to reduce boiler-plate code
  ii) just for display
  iii) just for compatibility with `sklearn.base.BaseEstimator`, to avoid monkey-patching
- I was thinking about `fit(X_tr, y_tr).transform(X_tr)` vs `fit_transform(X_tr, y_tr)`, and correct me if I'm wrong:
  i) the mismatch appears when target-encoding is used for high-cardinality categorical variables, so e.g. zipcode `36104` from the training set could be target-encoded to 0.7 by `fit_transform`, but the same zipcode in the test set could be target-encoded to 0.8. So basically there are two "mappings"
  ii) in the `fit` method there is an internal call to `self.fit_transform(X=X, y=y)`, e.g. line 216: https://github.com/WinVector/pyvtreat/blob/master/pkg/build/lib/vtreat/vtreat_api.py As a result, X is transformed anyway, but the result is not stored. So, here is an idea:
- add an attribute `.X_fit_tr` and store the result of the internal `.fit_transform` from `.fit` in it
- add an attribute `.X_fit` and store the input X (or some hash-id of it, to save memory)
- next, modify `.transform(X)` by adding the condition `if X == self.X_fit: return self.X_fit_tr`

In this way `fit(X_tr, y_tr).transform(X_tr)` == `fit_transform(X_tr, y_tr)`. Alternatively, instead of storing the dataframe in `.X_fit_tr`, just store the "mapping" needed to reproduce it during transform (if possible). This alternative is more memory efficient, and `fit` stays separated from `transform`.
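A plain-Python sketch of this caching idea, under stated assumptions: the class, attribute names, and the doubling "encoding" are all placeholders, and the hash helper stands in for something like `pandas.util.hash_pandas_object` on real frames:

```python
# Sketch of the proposal above: remember a hash of the X seen in fit()
# and hand back the stored cross-frame when transform() later receives
# the same data. Hypothetical class; not the actual vtreat code.
import hashlib

def _data_hash(rows):
    # cheap stand-in for hashing a data frame
    return hashlib.sha256(repr(rows).encode("utf8")).hexdigest()

class CachingTransform:
    def fit_transform(self, X, y):
        self.X_fit_hash_ = _data_hash(X)                       # the ".X_fit" idea
        self.X_fit_tr_ = [[v * 2 for v in row] for row in X]   # placeholder encoding
        return self.X_fit_tr_

    def fit(self, X, y):
        self.fit_transform(X, y)   # fit internally calls fit_transform
        return self

    def transform(self, X):
        if getattr(self, "X_fit_hash_", None) == _data_hash(X):
            return self.X_fit_tr_  # same data as fit: return cached cross-frame
        return [[v * 2 for v in row] for row in X]             # ordinary mapping

t = CachingTransform()
X = [[1], [2]]
print(t.fit(X, [0, 1]).transform(X) == t.fit_transform(X, [0, 1]))  # True
```

Note the trade-off the sketch makes explicit: the object keeps a reference to the full transformed frame until the next fit.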
- Regarding `GridSearchCV`, I was thinking about e.g. the `indicator_min_fraction` parameter and checking the values 0.05, 0.1, 0.2. Within a pipeline this should be completely independent, aside from the issues in point 2.

Thanks for the explanations!
I am going to stub-out the get/set parameters until I have some specific use-case/applications to code them to (are they tuned over during cross-validation, are they used to build new pipelines, are they just for display, are they used to simulate pickling?). I've added some more pretty-printing, but a lot of these objects are too complicated to be re-built from their printable form.
First, thank you very much for spending so much time giving useful and productive advice. I've tried to incorporate a lot of it into `vtreat`. It is very important to me that `vtreat` be Pythonic and sklearn-idiomatic.

Back to your points.

Yes, `vtreat` uses cross-validated out-of-sample methods in `fit_transform()` to actually implement `fit`, and then throws the transform frame away. The out-of-sample frame is needed to get accurate estimates of out-of-sample performance for the score frame.
I've decided not to cache the result for use in a later `transform()` step. My concerns are that this would be a reference leak to a large object, and I feel I should not paper over the differences between simulated out-of-sample methods and split methods (using different data for `.fit()` and `.transform()`). It is indeed not sklearn-like to have `.fit_transform(X, y)` return a different answer than `.fit(X, y).transform(X)`. However, it is also not safe to hand the user the `.fit_transform(X, y)` result when they call `.fit(X, y).transform(X)`, as the cross-validation gets rid of the very strong nested model bias of `.fit(X, y).transform(X)`, but exposes a bit of negative nested model bias. So I want to encourage users who want to call `.fit(X, y).transform(X)` to instead call `.fit(X1, y1).transform(X2)`, where `X1`, `X2` are a random disjoint partition of `X` (and `y1` the matching rows of `y`).
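The recommended pattern can be sketched in plain Python; with real frames one would typically use `sklearn.model_selection.train_test_split`, and the row indices below are illustrative stand-ins:

```python
# Sketch of the recommended pattern: split the data into random disjoint
# halves so the rows used to fit the encoder are never the rows being
# transformed. Row indices stand in for a real data frame.
import random

rows = list(range(10))                  # stand-in row indices of X
rng = random.Random(0)                  # fixed seed for reproducibility
rng.shuffle(rows)
half = len(rows) // 2
idx1, idx2 = rows[:half], rows[half:]   # random disjoint partition

assert set(idx1).isdisjoint(idx2)
assert sorted(idx1 + idx2) == list(range(10))
# usage idea: treatment.fit(X.iloc[idx1], y.iloc[idx1]).transform(X.iloc[idx2])
```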
Overall, in the cross-validated mode, not only do the `impact_code` variables code to something different than through the `.transform()` method, they are also not functions of the input variable alone, even in the `.fit_transform()` step: they are functions of both the input variable values and the cross-fold ids.
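A toy two-fold version makes this concrete: the same category level receives different encoded values in different folds, because each row's code is computed from the other fold's rows. This is an illustrative sketch, not vtreat's actual (more involved) cross plan:

```python
# Sketch: out-of-fold impact coding depends on fold ids. The same level
# "a" codes to different values in different folds, since each fold's
# encoding uses only the OTHER fold's rows. Toy 2-fold illustration.

levels = ["a", "a", "a", "a"]
y      = [1.0, 0.0, 1.0, 1.0]
fold   = [0, 0, 1, 1]           # cross-fold ids

def mean(vals):
    return sum(vals) / len(vals)

encoded = []
for i, lev in enumerate(levels):
    # use only rows with the same level drawn from the other folds
    other = [y[j] for j in range(len(y))
             if levels[j] == lev and fold[j] != fold[i]]
    encoded.append(mean(other))

print(encoded)  # [1.0, 1.0, 0.5, 0.5] -- same level, two different codes
```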
I did add warnings based on caching the id of the data used in `.fit()`, so I have made the issue more visible to the user.
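The mechanism can be sketched in a few lines: remember the `id()` of the fit data and warn when the very same object comes back to `transform()`. The class below is a hypothetical stand-in; the real vtreat check is more careful:

```python
# Sketch of the warning mechanism: store id() of the fit data and warn
# if transform() later sees the identical object. Hypothetical class.
import warnings

class Treatment:
    def fit(self, X, y=None):
        self._fit_data_id = id(X)
        return self

    def transform(self, X):
        if getattr(self, "_fit_data_id", None) == id(X):
            warnings.warn(
                "transform() called on the data used in fit(); "
                "consider fit_transform() to get the cross-frame instead")
        return X  # identity placeholder for the real transform

X = [[1], [2]]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    Treatment().fit(X).transform(X)
print(len(caught))  # 1 warning raised
```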
I've spent some more time researching `sklearn` objects and added a lot more methods to make the `vtreat` steps duck-type to these structures.
Regarding parameters, I am still not exposing them. You correctly identified the most interesting one: `indicator_min_fraction`. My intent there is: one would set `indicator_min_fraction` to the smallest value you are willing to work with, and then use a later `sklearn` stage to throw away columns you do not want, or even leave this to the modeling step. I think this is fairly compatible with `sklearn`; it is a bit more inconvenient, but leaving column filtering to a later step is a good approach.
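The suggested workflow can be sketched as a plain-Python prevalence filter over indicator columns; in a real pipeline a stage like `sklearn.feature_selection.VarianceThreshold` would play a similar role. Column names and data here are illustrative only:

```python
# Sketch: treat indicator_min_fraction as a floor in the treatment step,
# then prune rare indicator columns in a later stage. Illustrative data.

def prune_rare_indicators(columns, min_fraction=0.1):
    """Keep indicator columns whose positive fraction is at least min_fraction."""
    kept = {}
    for name, values in columns.items():
        if sum(values) / len(values) >= min_fraction:
            kept[name] = values
    return kept

cols = {
    "zip_is_36104": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # 10% prevalence: kept
    "zip_is_99999": [1] + [0] * 19,                   #  5% prevalence: dropped
}
print(sorted(prune_rare_indicators(cols, min_fraction=0.1)))  # ['zip_is_36104']
```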
If you strongly disagree, or have new ideas, or I have missed something, please do re-open this issue or file another one. If anything is unclear open an issue and I will be happy to build up more documentation.
I've got it: configure which parameters are exposed to the pipeline controls during construction. I am going to work on that a bit.
I've worked out an example of `vtreat` in a pipeline used in a hyper-parameter search, using an adapter: https://github.com/WinVector/pyvtreat/blob/main/Examples/Pipeline/Pipeline_Example.ipynb . Overall I don't find the combination that helpful, so unless I get a specific request with a good example I am not going to integrate it further.
The issues include:
- The grid search clones objects in addition to using get/set parameters. This means I would have to uglify the constructors to match the parameters (which is what I do in the adapter).
- It is really slow, as we are paying for a lot of nested, instead of sequential, cross-validation.
- The vtreat parameters essentially get masked by other regularization parameters in the pipeline. This is also confirmation that we are not too sensitive to these parameters, allowing us to leave more of them out.
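The first issue comes from `GridSearchCV` calling `clone()`, which rebuilds an estimator from `type(est)(**est.get_params())`; so every tunable must appear verbatim as a constructor argument. A minimal sketch of what such an adapter must satisfy (class and parameter names are hypothetical; the real adapter lives in the linked notebook):

```python
# Sketch of the adapter constraint: clone() rebuilds an estimator from
# its get_params(), so constructor arguments and get_params() must match
# exactly. Hypothetical adapter class.

class VtreatAdapter:
    def __init__(self, indicator_min_fraction=0.05):
        # every tunable must be a constructor argument ...
        self.indicator_min_fraction = indicator_min_fraction

    def get_params(self, deep=True):
        # ... and be echoed back here so clone() can rebuild the object
        return {"indicator_min_fraction": self.indicator_min_fraction}

    def set_params(self, **params):
        for k, v in params.items():
            setattr(self, k, v)
        return self

# what sklearn.base.clone effectively does:
a = VtreatAdapter(indicator_min_fraction=0.2)
b = type(a)(**a.get_params())
print(b.indicator_min_fraction)  # 0.2
```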