vruusmann commented on August 20, 2024

A lookup table (between two categorical value spaces) is represented in PMML using the MapValues element. The MapValues transformation can work with external data sources (such as CSV files or SQL databases), but most of the time it's easier if the data is inlined. The choice between those two options (i.e. external source vs. inline source) depends on how frequently and to what extent this lookup table needs to be updated, and by whom.

In your case, if those ten features are independent of one another, then you should define ten independent MapValues transformations (each with one input column and one output column). Otherwise, you should define one big MapValues transformation.

The fact that you're dealing with a 100,000-element lookup table is a bit concerning, because it would probably require some special tweaking/configuration on the PMML consumer side (such as the JPMML-Evaluator library) in order to ensure the best performance.

As for the conversion from the Scikit-Learn representation to the PMML representation, it's a low-level technicality. Which Scikit-Learn classes do you use for that in your application code? Can you provide a working example using some toy dataset?

from sklearn2pmml.

jibybabu commented on August 20, 2024

Hi Villu,
Thanks for the quick turnaround! Excited to see your quick response!

So here is my story.
I am building a model to solve a classification problem and need to convert it to PMML using JPMML for deployment.
One of the model features is a string, say the name of a person, and my model looks up 10 additional numerical (decimal) features based on that name. I keep this map in a CSV file (Name, Col1, Col2, Col3, ..., Col10), with rows such as ("ABC", 1.534346, 5.1232343, ...).
So whenever an observation comes in for prediction, the model needs to look up the respective 10 numerical values and use them for scoring.

Right now, I haven't used any scikit-learn transformations for this mapping; I just populate the column values in Python by reading the lookup.csv file. Converting this to PMML for deployment looks like a challenge now. I am ready to change the implementation.

Also, the lookup table doesn't need to be updated frequently. That is why I would like to embed everything into one file.

Any suggestions or advice are much appreciated!

Thanks,
Jiby


vruusmann commented on August 20, 2024

The MapValues transformation is designed for mapping many input values to one output value. You're interested in achieving exactly the opposite: mapping one input value (e.g. an entity's identifier) to many output values (e.g. the entity's descriptors).

You could achieve this by defining many MapValues transformations, one for each output dimension. This could work if the set of input values is relatively small and fixed.
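A minimal sketch of the one-MapValues-per-output idea, assuming a small toy table in place of the real lookup.csv. The function name `map_values_per_column` and the inline-table column names `input` and `output` are made up for illustration; only the element names (DerivedField, MapValues, FieldColumnPair, InlineTable, row) come from the PMML specification. The generated elements would still need to be placed into the transformation section of a complete PMML document.

```python
import csv
import io
import xml.etree.ElementTree as ET


def map_values_per_column(rows, key_column, input_field):
    """Build one DerivedField/MapValues element per non-key column.

    rows: list of dicts, e.g. as produced by csv.DictReader.
    key_column: name of the lookup key column in the table.
    input_field: name of the PMML input field that holds the key.
    """
    output_columns = [c for c in rows[0] if c != key_column]
    derived_fields = []
    for column in output_columns:
        derived = ET.Element(
            "DerivedField", name=column, optype="continuous", dataType="double"
        )
        map_values = ET.SubElement(derived, "MapValues", outputColumn="output")
        # Bind the PMML input field to the "input" column of the inline table.
        ET.SubElement(map_values, "FieldColumnPair", field=input_field, column="input")
        table = ET.SubElement(map_values, "InlineTable")
        for row in rows:
            row_el = ET.SubElement(table, "row")
            ET.SubElement(row_el, "input").text = row[key_column]
            ET.SubElement(row_el, "output").text = row[column]
        derived_fields.append(derived)
    return derived_fields


# Toy stand-in for the real lookup table (hypothetical values).
LOOKUP_CSV = "Name,Col1,Col2\nABC,1.5,5.25\nDEF,2.0,0.75\n"
rows = list(csv.DictReader(io.StringIO(LOOKUP_CSV)))
fields = map_values_per_column(rows, key_column="Name", input_field="Name")
print(ET.tostring(fields[0], encoding="unicode"))
```

Every output column repeats the full set of keys, so the document grows with (number of keys) x (number of output columns). That is why this only stays practical for a small, fixed key set.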

Is the "input value set is fixed" requirement met in your use case? In other words, are you going to make predictions only about persons that were known at the time when the model was trained (and converted to PMML)? Aren't you interested in making predictions about new persons?

Otherwise, I'd recommend using a layered approach, where the lookup table functionality is moved outside of the PMML document, to a specialized service. The latter could be some REST web service, or an SQL handler. You would be performing this call using a custom Java user-defined function (UDF):

<Apply function="com.mycompany.rest.PersonService">
  <FieldRef name="Id"/> <!-- Entity identifier -->
  <Constant>Col1, Col2, Col3, .., Col10</Constant> <!-- Entity attributes to select -->
</Apply>


jibybabu commented on August 20, 2024

Hi Villu,

Thanks a lot for the quick response!

> Is the "input value set is fixed" requirement met in your use case? Aren't you interested in making predictions about new persons?

Yes. As in this example, when a new person comes in in the future, I need to look into the lookup table and get the respective 10 values. I am taking care of the scenarios where the person's name is not in the lookup table.

I was thinking of a workaround; let me know your thoughts on it.
How about creating a custom transformation function, something like http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, for each of the 10 mappings, putting them into an sklearn pipeline, and converting the pipeline into a DataFrameMapper?

By the way, do you have any example of converting an sklearn pipeline to a DataFrameMapper using sklearn2pmml?

Thanks,
Jiby


vruusmann commented on August 20, 2024

> How about creating a custom transformation function, something like http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html, for each of the 10 mappings, putting them into an sklearn pipeline, and converting the pipeline into a DataFrameMapper?

That would be an unnecessarily complex solution. It's possible to define a new Scikit-Learn transformer class and map it directly to any number of columns:

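from sklearn_pandas import DataFrameMapper

# CSVLookupTable is the proposed transformer class described below
person_mapper = DataFrameMapper([
  (["Gender", "Age", "Height"], CSVLookupTable("persons.csv", "ID"))
])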

Class CSVLookupTable would be defined in the sklearn2pmml.preprocessing module. It takes the name of the CSV file and the name of the primary key column as arguments.

> Do you have any example of converting an sklearn pipeline to a DataFrameMapper using sklearn2pmml?

The conversion of Scikit-Learn pipelines is discussed in a separate issue:
jpmml/jpmml-sklearn#3


jibybabu commented on August 20, 2024

Thanks Villu, that's a great approach! Let me try that out.

By the way, just to make sure: the "ID" argument should be the primary key column name of the lookup table rather than the corresponding data set column name, correct?
And this will make sure that we handle names that are outside of the data set, right?

Thanks,
Jiby


vruusmann commented on August 20, 2024

You can't try it out, because class sklearn2pmml.preprocessing.CSVLookupTable is fictional.

But the solution to your problem could be implemented like this. If you can implement the Python side of this class, then I can do the rest.


jibybabu commented on August 20, 2024

I implemented that in Python just now. But the problem is: do the variables in the mapper and in the estimator need to be in sync when calling sklearn2pmml? The model does not use the input variable (for example, the name), only the output variables (for example, Gender, Age, Height).
If the input variable is not present in the PMML, how can the Java side map it? Also, I won't be able to handle new input values, say names, that are not in the data set but are in the lookup table.

And it's returning an error like this:

SEVERE: Failed to convert Estimator
java.lang.IllegalArgumentException
    at org.jpmml.sklearn.FeatureMapper.updateActiveFields(FeatureMapper.java:236)
    at sklearn.Classifier.createSchema(Classifier.java:59)
    at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
    at org.jpmml.sklearn.Main.run(Main.java:189)
    at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" java.lang.IllegalArgumentException
    at org.jpmml.sklearn.FeatureMapper.updateActiveFields(FeatureMapper.java:236)
    at sklearn.Classifier.createSchema(Classifier.java:59)
    at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
    at org.jpmml.sklearn.Main.run(Main.java:189)
    at org.jpmml.sklearn.Main.main(Main.java:107)
Traceback (most recent call last):
  File "/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/IPython/core/interactiveshell.py", line 2869, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-48-dcba1c38aa0e>", line 1, in <module>
    sklearn2pmml(rf, jobs_mapper, "sample.pmml", with_repr = False)
  File "/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.py", line 65, in sklearn2pmml
    subprocess.check_call(cmd)
  File "/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['java', '-cp', '/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/guava-19.0.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-converter-1.1.1.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.1.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.1.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-agent-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-schema-1.3.3.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pyrolite-4.13.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/serpent-1.12.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:/Users/jibybabu/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/var/folders/7r/mhw1rgdd729cw3rvrm2mtcj40000gn/T/estimator-3qlFmc.pkl.z', '--pkl-mapper-input', '/var/folders/7r/mhw1rgdd729cw3rvrm2mtcj40000gn/T/mapper-iKqPdf.pkl.z', '--pmml-output', 'sample.pmml']' returned 
non-zero exit status 1


vruusmann commented on August 20, 2024

A typical workflow for implementing a custom transformer:

  1. Create a Python class. This should load a CSV file, and insert three columns (Gender, Age, Height) into the resulting Python data matrix that the estimator can see and use.
  2. Create a Java class. This should load the same CSV file, and generate three MapValues transformations. You can communicate the name of the future input column by defining an appropriate "helper attribute" in the above Python class.
  3. Update the sklearn2pmml package to include both the Python and Java classes.
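Step 1 could be sketched as follows. This is only an illustration of the Python side under several assumptions: the class is hypothetical (it is not part of sklearn2pmml), it follows the fit/transform convention without subclassing scikit-learn's base classes so that it runs with the standard library alone, it assumes all lookup columns are numeric, and it takes a flat sequence of keys as input. The `default` parameter is an invented way to handle keys that are missing from the table.

```python
import csv
import os
import tempfile


class CSVLookupTable:
    """Hypothetical transformer: replaces a key with the columns of a CSV row."""

    def __init__(self, path, key_column, default=float("nan")):
        self.path = path              # CSV file holding the lookup table
        self.key_column = key_column  # primary key column of the lookup table
        self.default = default        # value used when a key is not in the table

    def fit(self, X, y=None):
        # Load the whole table into a dict: key -> list of numeric values.
        with open(self.path, newline="") as handle:
            reader = csv.DictReader(handle)
            self.output_columns_ = [
                c for c in reader.fieldnames if c != self.key_column
            ]
            self.table_ = {
                row[self.key_column]: [float(row[c]) for c in self.output_columns_]
                for row in reader
            }
        return self

    def transform(self, X):
        # Unknown keys yield a row filled with the default value.
        missing = [self.default] * len(self.output_columns_)
        return [list(self.table_.get(key, missing)) for key in X]


# Usage with a toy table written to a temporary directory (made-up values).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "persons.csv")
    with open(path, "w", newline="") as handle:
        handle.write("ID,Age,Height\nalice,34,1.70\nbob,51,1.82\n")
    lookup = CSVLookupTable(path, "ID").fit(None)
    features = lookup.transform(["bob", "carol"])

print(lookup.output_columns_)
print(features)
```

The `output_columns_` attribute plays the role of a helper attribute here: a converter on the Java side would need the same information to name the generated MapValues outputs.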

The Java part might be tricky, especially considering that there's not much documentation about it. But you could take a look at earlier commits that implemented different Scikit-Learn transformers. For example: jpmml/jpmml-sklearn@5c4a181

I will be unable to work on this till the 1st of November.


jibybabu commented on August 20, 2024

Ok. Thanks a ton Villu! I can take care of the Python side.
Please let me know here when it is ready; I will check regularly!

Also, on a side note, it would be great if you could implement it in a generic way, as you always do, so that any crazy future transformations are handled as long as somebody implements the corresponding Python class.


le-vision commented on August 20, 2024

> The MapValues transformation is designed for mapping many input values to one output value.

Thanks for pointing me towards MapValues for representing a lookup table using PMML; the InlineTable type looks ideal.

In the application I'm working on, the PMML needs to represent only that lookup logic, i.e. map from all combinations of 2 input values (e.g. age, height) to some output/target/label value (e.g. isTall).

Having read the above, I now think I should create a MapValues-type DerivedField and simply assign the value of that DerivedField to the output of the PMML model.

Is there an existing/simple/recommended way to produce such a PMML document? I can always write some code to produce the InlineTable XML representation from a pandas DataFrame or something, but perhaps there's an existing or better solution. Although it feels unnecessary, are there any transformations in sklearn that would produce such a MapValues DerivedField?
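For illustration, a hand-rolled generator for such a many-to-one MapValues element might look like the sketch below. The function name `build_lookup_field` and the toy records are invented; the records are plain dicts, which is the shape that pandas' `DataFrame.to_dict("records")` would also give. MapValues matches exact values, so this assumes age and height have already been binned into categories, and that the column names are valid XML element names.

```python
import xml.etree.ElementTree as ET


def build_lookup_field(records, input_fields, output_field):
    """Build a DerivedField whose MapValues maps several inputs to one output."""
    derived = ET.Element(
        "DerivedField", name=output_field, optype="categorical", dataType="string"
    )
    map_values = ET.SubElement(derived, "MapValues", outputColumn=output_field)
    # One FieldColumnPair per input: the field and table column share a name.
    for field in input_fields:
        ET.SubElement(map_values, "FieldColumnPair", field=field, column=field)
    table = ET.SubElement(map_values, "InlineTable")
    for record in records:
        row = ET.SubElement(table, "row")
        for column in list(input_fields) + [output_field]:
            ET.SubElement(row, column).text = str(record[column])
    return derived


# Toy lookup table covering all combinations of two binned inputs.
records = [
    {"age": "child", "height": "short", "isTall": "false"},
    {"age": "child", "height": "tall", "isTall": "true"},
    {"age": "adult", "height": "short", "isTall": "false"},
    {"age": "adult", "height": "tall", "isTall": "true"},
]
derived = build_lookup_field(records, ["age", "height"], "isTall")
print(ET.tostring(derived, encoding="unicode"))
```

The resulting element is only a fragment; it would still have to be embedded in a full PMML document and wired to the model output.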

I opened this SO question just before I found this thread: http://stackoverflow.com/questions/40498703/how-to-generate-a-pmml-that-represents-a-simple-lookup-table-logic-using-python

Thanks!
