Comments (12)

vruusmann commented on August 20, 2024

Most Scikit-Learn transformers and estimators assume dense input matrices without missing values. Therefore, missing values need to be handled during data pre-processing, using the Imputer transformer type.

Currently, JPMML-SkLearn encodes the Imputer transformer type as a PMML DerivedField element. Ideally, if the Imputer transformer is the first element in the workflow (i.e. it operates on the "raw" input column, and all other requirements permit it), then it should be encoded as the MiningField@missingValueReplacement attribute instead.

As for enhancing DataField and MiningField elements in general, I have been thinking about defining a couple of new special-purpose transformer types in the sklearn2pmml package. They would be called something like CategoricalDomain, ContinuousDomain etc., and they would also take care of computing the minimum and maximum values of a particular column, and of keeping track of missing value and invalid value specifications.

CodeKiller48 commented on August 20, 2024

Hi Villu,

Something like CategoricalDomain/ContinuousDomain is definitely useful, especially when the training set is small and does not cover all possible feature values.

Just wondering, when do you think these features will be released?

Thanks,
wei

vruusmann commented on August 20, 2024

This issue has been sitting in the middle of my private TODO file for several months. The main obstacle is that my Python programming skills are not so good.

Anyway, if you provided the initial versions of the Python classes sklearn2pmml.(preprocessing.)ContinuousDomain and sklearn2pmml.(preprocessing.)CategoricalDomain (e.g. as a PR), then I could do the Java side of things in a couple of hours.
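
A rough starting point could look something like the following (an untested sketch only, not a final API; the class and attribute names are illustrative):

from sklearn.base import BaseEstimator, TransformerMixin

import numpy

class ContinuousDomain(BaseEstimator, TransformerMixin):

    def fit(self, X, y = None):
        # Record the observed value range, ignoring missing values
        self.data_min_ = numpy.nanmin(X)
        self.data_max_ = numpy.nanmax(X)
        return self

    def transform(self, X):
        # Pass the data through unchanged; only the recorded metadata matters
        return X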

vruusmann commented on August 20, 2024

There's a whole new "field decoration" framework in place now. In addition to sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain transformation types, it also applies special treatment to the sklearn.preprocessing.Imputer transformation type.

If you want to take advantage of this framework, then simply insert those decorators at the beginning of your transformers list:

from sklearn.decomposition import PCA
from sklearn.preprocessing import Imputer
from sklearn2pmml.decoration import ContinuousDomain
import sklearn_pandas
iris_mapper = sklearn_pandas.DataFrameMapper([
    # Decorate the raw input columns first, then impute and transform
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), Imputer(), PCA(n_components = 3)]),
    ("Species", None)
])

CodeKiller48 commented on August 20, 2024

Fantastic news. I will give it a try.

jcollingosch commented on August 20, 2024

Can you provide an example of how this is supposed to work with CategoricalDomain? I see that in the source code the fit method uses self.data_ = numpy.unique(y[~numpy.isnan(y)]), and I keep getting an error from numpy.isnan(y) since y is a string... but that is precisely why I am using the CategoricalDomain, since I am dealing with strings. Is there a way around this?

vruusmann commented on August 20, 2024

The CategoricalDomain class is definitely under-utilized, so logic or programming errors cannot be ruled out at this point.

I'm not a Python expert, so I'd happily incorporate code from people who are. How would you approach this task - getting the list of unique category levels for some column?

The pandas (or sklearn_pandas) package should contain some utility for this. We can fall back to that option if necessary.
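
For example, something along these lines might do (an untested sketch; the helper name is just for illustration):

import pandas

def unique_levels(y):
    # Drop missing values first, then collect the distinct category levels;
    # this works for both numeric and string columns
    return pandas.Series(y).dropna().unique()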

jcollingosch commented on August 20, 2024

@vruusmann I think the way you are currently doing it works just fine; the only problem is that numpy.isnan fails when passing strings. Maybe we can use pandas.isnull instead?

I think in CategoricalDomain if you changed
self.data_ = numpy.unique(y[~numpy.isnan(y)])
and instead used:
self.data_ = numpy.unique(y[~pandas.isnull(y)])

since pandas.isnull will work for both numeric and object arrays.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html
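
A quick illustration of the difference (a sketch):

import numpy
import pandas

y = numpy.array(["a", "b", None], dtype = object)

# numpy.isnan(y) raises a TypeError for object/string arrays,
# whereas pandas.isnull handles them fine:
print(pandas.isnull(y))  # [False False True]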

vruusmann commented on August 20, 2024

@jcollingosch Thanks for the pandas.isnull() suggestion. It will be implemented in the next release (that should be out by the end of this week).

The idea is that CategoricalDomain can be used to cast a feature into the appropriate operational type (complementary to its data type). For example, if you apply ContinuousDomain to some numeric column, then its valid value space will be encoded using an Interval element. However, if you apply CategoricalDomain to the same column, then its valid value space will be encoded using a list of Value elements. This Interval vs. Value distinction is significant when validating new input data. By default, you will not be allowed to score the model with invalid data.

jcollingosch commented on August 20, 2024

@vruusmann differentiating between Interval vs. Value totally makes sense. The main reason I got excited about this decoration framework was that I was hoping to use the parameter invalid_value_treatment and set it to "as_missing" with something like:

mapper = sklearn_pandas.DataFrameMapper([ ("string_feature", [CategoricalDomain(invalid_value_treatment = "as_missing"), LabelBinarizer()]), ("target", None) ])

Here I am trying to handle categorical variables, one-hot encode them, and convert "unseen" levels of a categorical to "missing" rather than "return_invalid", so that in production the API call will not fail but will treat unseen data as "missing". Since the modeling framework I am using (XGBoost) handles missing data in the algorithm, passing "as_missing" through should work, I think.
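
Spelled out end to end, the setup I have in mind is roughly the following (a sketch; it assumes sklearn2pmml's PMMLPipeline, the xgboost scikit-learn wrapper, and that invalid_value_treatment behaves as described above; the data frame and file names are placeholders):

import pandas
import sklearn_pandas

from sklearn.preprocessing import LabelBinarizer
from sklearn2pmml import PMMLPipeline, sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain
from xgboost import XGBClassifier

df = pandas.DataFrame({"string_feature": ["a", "b", "a", "b"], "target": [0, 1, 0, 1]})

mapper = sklearn_pandas.DataFrameMapper([
    # Convert unseen category levels to missing values instead of raising an error
    ("string_feature", [CategoricalDomain(invalid_value_treatment = "as_missing"), LabelBinarizer()])
])

pipeline = PMMLPipeline([
    ("mapper", mapper),
    ("classifier", XGBClassifier())
])
pipeline.fit(df, df["target"])

sklearn2pmml(pipeline, "model.pmml")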

vruusmann commented on August 20, 2024

@jcollingosch Your reasoning about converting unseen values to missing values is correct.

Alternatively, you might consider deleting all Value elements from the DataField@name="string_feature" element altogether. The logic of this is that if the valid value space is unspecified, then any value will be treated as a valid value, and the MiningField@invalidValueTreatment="return_invalid" is not triggered. XGBoost should be sending unseen values in the majority's direction:

if(string_feature == "known_category"){
  // A matching value
} else {
  // All other values, including invalid and missing values
}

If you simply want to modify this single MiningField@invalidValueTreatment attribute value, then you could just load the newly generated PMML document in Python, apply the change, and save it back. Or if you're familiar with XSL transforms, then this modification can be performed using command-line tools (PMML input file + XSL stylesheet -> PMML output file).
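
For the Python route, something like this should work (a rough sketch; the PMML namespace URI must match the one in your generated file, and the file and field names are placeholders):

import xml.etree.ElementTree as ET

# Adjust the namespace to match the PMML version of the generated document
ns = "http://www.dmg.org/PMML-4_3"
ET.register_namespace("", ns)

tree = ET.parse("model.pmml")
for mining_field in tree.iter("{%s}MiningField" % ns):
    if mining_field.get("name") == "string_feature":
        mining_field.set("invalidValueTreatment", "asMissing")
tree.write("model.pmml", xml_declaration = True, encoding = "UTF-8")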

jcollingosch commented on August 20, 2024

@vruusmann I saw, reading some PMML documentation, that I could edit MiningField@invalidValueTreatment, and I explored editing the raw PMML manually, which worked. However, I never explored doing this in an automated way as you suggested, via loading into Python or using XSL; I will have to look into that.

In the ideal scenario, I will have a Python script to train and retrain models that produces these PMML files ready to go and shipped to a production app, with limited post-processing (but if I learn about using XSL transforms, that might be a good route, as you suggested).

In any case, I really appreciate all your help and advice! Big props on all the work integrating many great tools!!
