
vtreat and sklearn pipeline (pyvtreat) [CLOSED]

winvector commented on June 19, 2024
vtreat and sklearn pipeline

from pyvtreat.

Comments (9)

JohnMount commented on June 19, 2024

Thanks,

I haven't finished the Pipeline integration, but you have given me some good pointers on steps to get there. I'll close this after I add some of the suggestions.


JohnMount commented on June 19, 2024

In version 0.3.4 of vtreat the transforms implement much more of the sklearn step interface. We also have a really neat new feature that warns if .fit(X, y).transform(X) is ever called (it looks like sklearn's Pipeline calls .fit_transform(X, y), which is what vtreat wants, to prevent over-fit issues using its cross-frame methodology). I have re-run your example here: https://github.com/WinVector/pyvtreat/blob/master/Examples/Pipeline/Pipeline_Example.md
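A minimal sketch of the dispatch convention at work here (this mirrors the sklearn Pipeline behavior described above, but is not sklearn source; `LoggingStep` and `apply_step` are hypothetical names):

```python
# Sketch: sklearn-style pipelines prefer a step's fit_transform(X, y) when it
# exists, rather than calling fit(X, y).transform(X). This is why vtreat can
# route pipeline use through its cross-frame fit_transform path.

class LoggingStep:
    """Hypothetical step that records which code path a pipeline takes."""
    def __init__(self):
        self.calls = []

    def fit(self, X, y=None):
        self.calls.append("fit")
        return self

    def transform(self, X):
        self.calls.append("transform")
        return X

    def fit_transform(self, X, y=None):
        self.calls.append("fit_transform")
        return X

def apply_step(step, X, y=None):
    # The preference sklearn-style pipelines use during fitting.
    if hasattr(step, "fit_transform"):
        return step.fit_transform(X, y)
    return step.fit(X, y).transform(X)

step = LoggingStep()
apply_step(step, [[1], [2]], [0, 1])
print(step.calls)  # ['fit_transform']
```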

Thanks for helping with the package.


mglowacki100 commented on June 19, 2024

Thanks!

  1. Minor issue: a few classes lack a __repr__ method, e.g.
    'cross_validation_plan': <vtreat.cross_plan.KWayCrossPlanYStratified object at 0x10fa81b50>
  2. So, as far as I can see, it'd be hard to put vtreat directly into a Pipeline as in my example, but I'll think about it. Pipelines are nice, and GridSearch over vtreat parameters would also be cool.
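The __repr__ fix in point 1 is a one-method change. A minimal sketch (the class name is from the printout above, but the `k` field is a hypothetical stand-in for whatever the real cross-plan stores):

```python
# Minimal sketch: give a cross-plan class a readable __repr__ so it does not
# print as "<... object at 0x10fa81b50>". The field shown is hypothetical.

class KWayCrossPlanYStratified:
    def __init__(self, k=5):
        self.k = k

    def __repr__(self):
        return f"{self.__class__.__name__}(k={self.k})"

print(repr(KWayCrossPlanYStratified(k=3)))  # KWayCrossPlanYStratified(k=3)
```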


JohnMount commented on June 19, 2024
  1. Ah yes, I'll add __repr__() to the support classes.
  2. I'm not sure grid-searching over vtreat parameters is that much of a benefit. Would it be better if vtreat hid its parameters from the pipeline?

Also, vtreat isn't using the cross-validation effects to search for hyper-parameter values. It is using them to try to avoid the nested model bias issue seen in non-cross-validated stacked models. So there may be less of a connection to GridSearchCV than it first appears.


mglowacki100 commented on June 19, 2024
  1. Regarding get_params and set_params, I see the following cases:
    i) when you extend a base class, get_params with super in the child __init__ is often used to reduce boilerplate code
    ii) just for display
    iii) just for compatibility with sklearn.base.BaseEstimator, to avoid monkey-patching

  2. I was thinking about fit(X_tr, y_tr).transform(X_tr) vs fit_transform(X_tr, y_tr), and correct me if I'm wrong:
    i) the mismatch arises when target-encoding is used for high-cardinality categorical variables: e.g. zipcode 36104 from train could be target-encoded to 0.7 by fit_transform, but the same zipcode in the test set could be target-encoded to 0.8. So, basically, there are two "mappings"
    ii) in the fit method there is an internal call to self.fit_transform(X=X, y=y),
    e.g. line 216: https://github.com/WinVector/pyvtreat/blob/master/pkg/build/lib/vtreat/vtreat_api.py
    As a result, X is transformed anyway, but the result is not stored. So, here is an idea:

  • add an attribute .X_fit_tr and store the result of the internal .fit_transform from .fit in it
  • add an attribute .X_fit and store the input X, or some hash/id of it to save memory
  • then modify .transform(X) by adding the condition: if X == self.X_fit: return self.X_fit_tr
    That way fit(X_tr, y_tr).transform(X_tr) == fit_transform(X_tr, y_tr).
    Alternatively, instead of storing the dataframe in .X_fit_tr, just store the "mapping" needed to reproduce it during transform (if possible). This alternative is more memory efficient, and fit stays separated from transform.
  3. Regarding GridSearchCV, I was thinking about e.g. the indicator_min_fraction parameter and checking values 0.05, 0.1, 0.2. Within a pipeline this should be completely independent of the issues in point 2.
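The caching proposal in point 2 can be sketched in a few lines. This is hypothetical code, not vtreat's actual API: the class name, attribute names, and the doubling "encoding" are all illustrative stand-ins (a real implementation would also have the cached cross-frame differ from a plain transform, which is the whole point of the proposal):

```python
# Sketch of the proposed caching: .fit() remembers the identity of the
# training frame and the frame produced by the internal fit_transform;
# .transform() returns the cached result when handed the exact same object.

class CachingTreatment:
    def fit(self, X, y):
        self._fit_id = id(X)                 # cheap identity check, not a hash
        self._cross_frame = self.fit_transform(X, y)
        return self

    def fit_transform(self, X, y):
        # stand-in for the real cross-validated encoding
        return [[v * 2] for row in X for v in row]

    def transform(self, X):
        if id(X) == getattr(self, "_fit_id", None):
            return self._cross_frame         # same data as .fit() saw
        return [[v * 2] for row in X for v in row]

t = CachingTreatment()
X = [[1], [2]]
t.fit(X, [0, 1])
print(t.transform(X) is t._cross_frame)  # True
```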

Thanks for explanations!


JohnMount commented on June 19, 2024

I am going to stub-out the get/set parameters until I have some specific use-case/applications to code them to (are they tuned over during cross-validation, are they used to build new pipelines, are they just for display, are they used to simulate pickling?). I've added some more pretty-printing, but a lot of these objects are too complicated to be re-built from their printable form.


JohnMount commented on June 19, 2024

First, thank you very much for spending so much time to give useful and productive advice. I've tried to incorporate a lot of it into vtreat. It is very important to me that vtreat be Pythonic and sklearn-idiomatic.

Back to your points.

Yes, vtreat uses cross-validated out-of-sample methods in fit_transform() to actually implement fit, and then throws the transformed frame away. The out-of-sample frame is needed to get accurate estimates of out-of-sample performance for the score frame.

I've decided not to cache the result for use in a later transform() step. My concerns are that this is a reference leak to a large object, and I feel I should not paper over the differences between simulated out-of-sample methods and split methods (using different data for .fit() and .transform()). It is indeed not sklearn-like to have .fit_transform(X, y) return a different answer than .fit(X, y).transform(X). However, it is also not safe to supply the user with .fit_transform(X, y) when they call .fit(X, y).transform(X): the cross-validation gets rid of the very strong nested model bias in .fit(X, y).transform(X), but exposes a bit of negative nested model bias. So I want to encourage users who want to call .fit(X, y).transform(X) to instead call .fit(X1, y).transform(X2), where X1, X2 are a random disjoint partition of X.
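The recommended pattern above can be sketched with a stdlib-only split (in practice sklearn's train_test_split does the same job; the index variables here are illustrative):

```python
# Sketch: build a random disjoint partition X1, X2 of the training rows,
# then fit on one half and transform the other, avoiding
# fit(X, y).transform(X) on the same data.

import random

rows = list(range(10))            # stand-in for the row indices of X
rng = random.Random(0)
rng.shuffle(rows)
half = len(rows) // 2
X1_idx, X2_idx = sorted(rows[:half]), sorted(rows[half:])

assert set(X1_idx).isdisjoint(X2_idx)              # disjoint...
assert sorted(X1_idx + X2_idx) == list(range(10))  # ...partition of X

# then, with a real treatment object:
#   treatment.fit(X.iloc[X1_idx], y.iloc[X1_idx]).transform(X.iloc[X2_idx])
```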

Overall, in the cross-validated mode, not only do the impact_code variables code to something different than they would through the .transform() method, they are also not functions of the input variable alone, even in the .fit_transform() step: they are functions of both the input variable values and the cross-fold ids.

I did add warnings based on caching the id of the data used in .fit(). So I have made the issue more visible at the surface for the user.
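A hypothetical sketch of such an id-based warning (not vtreat's actual implementation; class and attribute names are made up):

```python
# Sketch: .fit() remembers which frame it saw, and .transform() warns if
# handed that same frame back, surfacing the nested-model-bias risk.

import warnings

class WarningTreatment:
    def fit(self, X, y=None):
        self._fit_data_id = id(X)
        return self

    def transform(self, X):
        if id(X) == getattr(self, "_fit_data_id", None):
            warnings.warn(
                "transform() called on the same data used in fit(); "
                "this can introduce nested model bias"
            )
        return X

t = WarningTreatment()
X = [[1], [2]]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    t.fit(X).transform(X)
print(len(caught))  # 1
```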

I've spent some more time researching sklearn objects and have added a lot more methods to make the vtreat steps duck-type to these structures.

Regarding parameters, I am still not exposing them. You correctly identified the most interesting one: indicator_min_fraction. My intent there is that you would set indicator_min_fraction to the smallest value you are willing to work with, and then use a later sklearn stage to throw away columns you do not want, or even leave this to the modeling step. I think this is fairly compatible with sklearn; it is a bit more inconvenient, but leaving column filtering to a later step is a good approach.
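The "filter later" approach can be sketched without vtreat at all. The function, column names, and threshold below are purely illustrative (a pure-Python stand-in for a later pipeline stage such as a variance or frequency filter):

```python
# Sketch: set indicator_min_fraction low in the treatment, then drop rarely
# firing indicator columns in a later step instead of tuning the parameter.

def drop_rare_indicators(columns, min_fraction=0.1):
    """Keep only indicator columns that fire often enough.

    columns: dict mapping column name -> list of 0/1 indicator values.
    """
    kept = {}
    for name, values in columns.items():
        if sum(values) / len(values) >= min_fraction:
            kept[name] = values
    return kept

cols = {
    "zip_is_36104": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # fires 10% of the time
    "zip_is_99999": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # never fires
}
print(sorted(drop_rare_indicators(cols)))  # ['zip_is_36104']
```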

If you strongly disagree, or have new ideas, or I have missed something, please do re-open this issue or file another one. If anything is unclear open an issue and I will be happy to build up more documentation.


JohnMount commented on June 19, 2024

I've got it: configure which parameters are exposed to the pipeline controls during construction. I am going to work on that a bit.


JohnMount commented on June 19, 2024

I've worked out an example of vtreat in a pipeline used in a hyper-parameter search, via an adapter: https://github.com/WinVector/pyvtreat/blob/main/Examples/Pipeline/Pipeline_Example.ipynb . Overall I don't find the combination that helpful, so unless I get a specific request with a good example I am not going to integrate further.

The issues include:

  • The grid search clones objects in addition to using get/set parameters. This means I would have to uglify the constructors to match the parameters (which is what I do in the adapter).
  • It is really slow as we are paying for a lot of nested instead of sequential cross-validation.
  • The vtreat parameters essentially get masked by other regularization parameters in the pipeline. This is also confirmation that we are not too sensitive to these parameters, allowing us to leave more of them out.
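The first bullet, that grid search clones steps by reading get_params() and re-calling the constructor, is why the constructor has to mirror the exposed parameters. A sketch of that constraint (hypothetical names; `clone_like_sklearn` is a simplified stand-in for sklearn.base.clone, not its actual source):

```python
# Sketch: an adapter whose constructor arguments match its get_params() keys,
# so a clone built from those params reproduces the configuration.

class TreatmentAdapter:
    def __init__(self, indicator_min_fraction=0.1):
        self.indicator_min_fraction = indicator_min_fraction

    def get_params(self, deep=True):
        return {"indicator_min_fraction": self.indicator_min_fraction}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

def clone_like_sklearn(est):
    # sklearn.base.clone does roughly this: rebuild from constructor params.
    return type(est)(**est.get_params())

a = TreatmentAdapter(indicator_min_fraction=0.05)
b = clone_like_sklearn(a)
print(b.indicator_min_fraction)  # 0.05
```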

