vruusmann avatar vruusmann commented on August 20, 2024

In the pmml file, those dummy variables which had all zero values do not appear in the datadictionary or miningmodel sections.

My PMML conversion tools, including SkLearn2PMML/JPMML-SkLearn, perform several PMML "tidying" actions. One of the tidying actions is deleting all unused DataDictionary/DataField, TransformationDictionary/DerivedField (and the corresponding MiningSchema/MiningField) elements. Why keep those dummy features, if they are not needed for model scoring? They only increase the size of the PMML document, and make it harder to read and analyze for humans.

However, when I use random forest in this package, all those zero value dummy variables would be included.

If a feature is included, then it is used by the model - your dummy feature wasn't so dummy after all (maybe it's 99% zero values, and 1% one values, and the latter are really helpful in explaining the variance).

The difference between selected feature sets between XGBoost and Random Forest comes from their algorithmic difference - XGBoost picks the next feature very carefully (prioritizing high-info features over low-info features), whereas Random Forest picks the next feature pretty much randomly (thereby increasing the chance that low-info features are selected).
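A quick way to test whether a dummy column is really all-zero in the training data is to count its non-zero entries. The following is a minimal sketch in plain Python; the rows and the column names (CAT1_a, CAT1_b, CAT1_z) are made up for illustration and do not come from the dataset discussed here.

```python
# Hypothetical training rows with dummy (0/1) features; names are illustrative only.
rows = [
    {"CAT1_a": 1, "CAT1_b": 0, "CAT1_z": 0},
    {"CAT1_a": 0, "CAT1_b": 1, "CAT1_z": 0},
    {"CAT1_a": 1, "CAT1_b": 0, "CAT1_z": 0},
]

def nonzero_counts(rows):
    """Count, per column, how many rows hold a non-zero value."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (1 if value != 0 else 0)
    return counts

counts = nonzero_counts(rows)
# Columns that never take a non-zero value cannot separate any two rows.
constant_zero = sorted(name for name, n in counts.items() if n == 0)
print(constant_zero)
```

A column that shows up in constant_zero carries no information for any split; a column with even a handful of non-zero rows can still be picked up by a tree.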

I am wondering if there is anything I can do to fix it (via adding lines in the pmml file).

This is not an error (it simply reflects the different logic of the XGBoost and Random Forest algorithms), and it does not need to be fixed - both PMML documents are valid/correct.

from sklearn2pmml.

s2chen avatar s2chen commented on August 20, 2024

Thanks for your response.
I am using the same dataset for XGBoost and Random Forest. Over 80% of the columns contain only zero values - I manually appended those zero dummy columns (see the reason below). In RF all columns were included, and in XGBoost those zero columns were dropped.

The reason I need to do this is as follows:
In the training data, there are four categorical variables. I am using the past month of data to train the model. Each of the categorical variables contains only 10-20 values, whereas the universe might contain 50-100 values. Therefore I need to create a dummy variable as a placeholder for those values for future predictions.

Couple more thoughts on your response, so based on your response, does that mean there is no way to add those zero dummy fields in the xgb conversion?

vruusmann avatar vruusmann commented on August 20, 2024

Each of the categorical variables contains only 10-20 values, whereas the universe might contain 50-100 values. Therefore I need to create a dummy variable as a placeholder for those values for future predictions.

You need to re-train your XGBoost or RF model periodically, as more data becomes available. Today's model does not contain "prediction logic" for handling future category levels, so it would be making nonsensical/bad predictions in that case.

does that mean there is no way to add those zero dummy fields in the xgb conversion?

A categorical feature is defined by a single /PMML/DataDictionary/DataField element. So, at the moment it contains 10-20 category levels (each represented as a DataField/Value element). If you input an unseen category level during model scoring, then by default the model will fail with an exception message "category level $value is not defined for field $x". However, you can "relax" this behaviour by changing the value of the corresponding MiningSchema/MiningField@invalidValueTreatment attribute from returnInvalid to asIs: http://dmg.org/pmml/v4-3/MiningSchema.html#xsdType_INVALID-VALUE-TREATMENT-METHOD

You can specify the "invalid value treatment method" for a column by prepending sklearn2pmml.CategoricalDomain to the list of its transformations:

mapper = DataFrameMapper([
  ("x", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()])
])
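To make the difference between returnInvalid and asIs concrete, here is a rough plain-Python sketch of the behaviour described above. The function prepare_value and the category levels are hypothetical; this is not how JPMML-Evaluator is implemented, only an illustration of the two treatments.

```python
def prepare_value(value, valid_levels, invalid_value_treatment="returnInvalid"):
    """Loosely mimic how a scoring engine might prepare a categorical input value."""
    if value in valid_levels:
        return value
    if invalid_value_treatment == "asIs":
        # The unseen level is passed through unchanged; every "equal"
        # predicate against a known level simply evaluates to false.
        return value
    if invalid_value_treatment == "returnInvalid":
        raise ValueError("category level %r is not defined" % (value,))
    raise NotImplementedError(invalid_value_treatment)

# Hypothetical levels seen during training.
levels = {"Clerical", "Executive", "Professional"}

print(prepare_value("Executive", levels))
print(prepare_value("Astronaut", levels, invalid_value_treatment="asIs"))
```

With the default treatment, the second call would raise instead of returning the unseen level.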

Also see this thread: https://groups.google.com/d/topic/jpmml/g01KjriBlcs/discussion

s2chen avatar s2chen commented on August 20, 2024

Hello Villu,

I tried to add a mapper before running sklearn2pmml and it returned this error:

('python: ', '2.7.12')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('pandas: ', u'0.18.0')
('sklearn_pandas: ', '1.3.0')
('sklearn2pmml: ', '0.20.2')
java -cp /Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/guava-20.0.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-converter-1.2.3.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.7.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.3.2.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.7.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-agent-1.3.6.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-1.3.6.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.6.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pmml-schema-1.3.6.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/pyrolite-4.19.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/serpent-1.18.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/Users/shchen/.local/lib/python2.7/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar org.jpmml.sklearn.Main --pkl-pipeline-input /var/folders/6f/bhmp_nxx3rx32lhhqhm3h6vcssf2ks/T/pipeline-Jh8VBG.pkl.z --pmml-output /Users/shchen/Desktop/ATO/model_output/pato1_1_xgb_pmml_20160920_20160925.pmml
('Preserved joblib dump file(s): ', '/var/folders/6f/bhmp_nxx3rx32lhhqhm3h6vcssf2ks/T/pipeline-Jh8VBG.pkl.z')

    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

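Here is my code:

import xgboost as xgb
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
from sklearn2pmml import sklearn2pmml, PMMLPipeline
from sklearn2pmml.decoration import CategoricalDomain

clf = xgb.XGBClassifier()

# Only the categorical variables are included in the mapper
mapper = DataFrameMapper([
	('CAT1', [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
	('CAT2', [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
	('CAT3', [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
])

pipeline = PMMLPipeline([
	("mapper", mapper),
	("estimator", clf)
])

# training_features, training_label and path are defined elsewhere
pipeline.fit(training_features, training_label)
sklearn2pmml(pipeline, path, with_repr = True)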

Could you please look into this issue?

Thanks!!

vruusmann avatar vruusmann commented on August 20, 2024

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

So, what did the Java process print to standard output and/or error streams? I need to know the Java exception; this Python exception "The JPMML-SkLearn conversion application has failed .." is completely useless to me.

vruusmann avatar vruusmann commented on August 20, 2024

XGBoost demo using the Audit dataset:

import pandas

df = pandas.read_csv("Audit.csv")

df["Deductions"] = df["Deductions"].replace(True, "TRUE").replace(False, "FALSE").astype(str)
df["Adjusted"] = df["Adjusted"].astype(int)

from sklearn2pmml import sklearn2pmml, PMMLPipeline
from sklearn2pmml.decoration import CategoricalDomain
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
from xgboost.sklearn import XGBClassifier

mapper = DataFrameMapper([
	("Occupation", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
	("Education", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()])
])
classifier = XGBClassifier()

pipeline = PMMLPipeline([
	("mapper", mapper),
	("classifier", classifier)
])
pipeline.fit(df, df["Adjusted"])

sklearn2pmml(pipeline, "pipeline.pmml", with_repr = True)

If the resulting pipeline.pmml file is opened in a text editor, then the following /PMML/MiningModel/MiningSchema declaration can be seen:

<MiningSchema>
	<MiningField name="Adjusted" usageType="target"/>
	<MiningField name="Education" missingValueTreatment="asIs" invalidValueTreatment="asIs"/>
	<MiningField name="Occupation" missingValueTreatment="asIs" invalidValueTreatment="asIs"/>
</MiningSchema>
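Instead of opening the file in a text editor, the same check can be scripted. The sketch below uses only the Python standard library and works on the MiningSchema fragment above, inlined as a string. Note the assumptions: a real PMML file declares an XML namespace on its root element, so tag lookups there need the namespace-qualified name, and the helper name invalid_value_treatments is made up.

```python
import xml.etree.ElementTree as ET

# The MiningSchema fragment from above, inlined so the sketch is self-contained.
MINING_SCHEMA = """
<MiningSchema>
	<MiningField name="Adjusted" usageType="target"/>
	<MiningField name="Education" missingValueTreatment="asIs" invalidValueTreatment="asIs"/>
	<MiningField name="Occupation" missingValueTreatment="asIs" invalidValueTreatment="asIs"/>
</MiningSchema>
"""

def invalid_value_treatments(xml_text):
    """Map each non-target MiningField name to its invalidValueTreatment."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for field in root.iter("MiningField"):
        if field.get("usageType") == "target":
            continue
        # returnInvalid is the PMML default when the attribute is absent.
        result[field.get("name")] = field.get("invalidValueTreatment", "returnInvalid")
    return result

treatments = invalid_value_treatments(MINING_SCHEMA)
print(treatments)
```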

s2chen avatar s2chen commented on August 20, 2024

I am not sure where to find the error messages from the Java process. Is there a way to specify the output stream for debugging in the sklearn2pmml function?

vruusmann avatar vruusmann commented on August 20, 2024

Step by step:

  1. Copy my code from #37 (comment), and paste it into a new file test.py. Then, download the Audit dataset file, and put it into the same directory as an Audit.csv file.
  2. Run the test.py script from command-line, and capture its standard output and error streams to a test.log.txt file: $ python test.py > test.log.txt 2>&1
  3. Paste the contents of the test.log.txt file here.
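If shell redirection is not convenient (for example, when launching from an IDE), the capture in step 2 can be approximated from Python itself. This is a sketch for Python 3 using only the standard library; the helper name run_and_log is made up, and a tiny inline child command stands in for test.py so that the example is self-contained.

```python
import os
import subprocess
import sys
import tempfile

def run_and_log(args, log_path):
    """Run a command, sending both stdout and stderr of the child to log_path."""
    with open(log_path, "w") as log_file:
        completed = subprocess.run(args, stdout=log_file, stderr=subprocess.STDOUT)
    return completed.returncode

# Stand-in for test.py: a child process that writes to both streams.
child = "import sys; print('to stdout'); sys.stderr.write('to stderr\\n')"
log_path = os.path.join(tempfile.mkdtemp(), "test.log.txt")

exit_code = run_and_log([sys.executable, "-c", child], log_path)
with open(log_path) as log_file:
    log_text = log_file.read()
print(exit_code)
print(log_text)
```

For the actual script, the argument list would be [sys.executable, "test.py"], with the log path pointing at test.log.txt. Output from the Java process launched by the script should land in the same file, since child processes normally inherit the redirected streams.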

s2chen avatar s2chen commented on August 20, 2024

Thanks Villu!!

Looks like I had a simple error; after fixing it, I am able to run sklearn2pmml and create a PMML file with missingValueTreatment="asIs" for those mining fields. However, when I used JPMML-Evaluator 1.3.3 to evaluate this PMML file, I got this error:

Global Error: com.ea.eadp.risk.pmml.PMMLException: Failed to create PMML evaluator
	at com.ea.eadp.risk.pmml.impl.PMMLServiceImpl.createEvaluator(PMMLServiceImpl.java:92)
	at com.ea.eadp.risk.pmml.impl.loadtest.PMMLLoadTest.createEvaluator(PMMLLoadTest.java:221)
	at com.ea.eadp.risk.pmml.impl.loadtest.PMMLLoadTest.evaluate(PMMLLoadTest.java:82)
	at com.ea.eadp.risk.pmml.impl.loadtest.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:20)
	at com.ea.eadp.test.jpmml.Program.runLoadTest(Program.java:138)
	at com.ea.eadp.test.jpmml.Program.main(Program.java:54)
Caused by: org.jpmml.evaluator.InvalidResultException
	at org.jpmml.evaluator.FieldValueUtil.performInvalidValueTreatment(FieldValueUtil.java:190)
	at org.jpmml.evaluator.FieldValueUtil.prepareInputValue(FieldValueUtil.java:94)
	at org.jpmml.evaluator.InputField.prepare(InputField.java:64)
	at org.jpmml.evaluator.EvaluatorUtil.prepare(EvaluatorUtil.java:123)
	at com.ea.eadp.risk.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:152)
	at com.ea.eadp.risk.pmml.impl.PMMLServiceImpl.createEvaluator(PMMLServiceImpl.java:118)
	at com.ea.eadp.risk.pmml.impl.PMMLServiceImpl.createEvaluator(PMMLServiceImpl.java:69)
	... 5 more

Using the audit dataset, I got the same error.

import pandas
df = pandas.read_csv("Audit.csv")

df["Deductions"] = df["Deductions"].replace(True, "TRUE").replace(False, "FALSE").astype(str)
df["Adjusted"] = df["Adjusted"].astype(int)

df.to_csv("Audit_changed.csv",index=False)

from sklearn2pmml import sklearn2pmml, PMMLPipeline
from sklearn2pmml.decoration import CategoricalDomain,ContinuousDomain
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
from xgboost.sklearn import XGBClassifier
from sklearn.ensemble import RandomForestClassifier


mapper = DataFrameMapper([
	(['Age', 'Income', 'Hours'], [ContinuousDomain()]),
	("Employment", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
	("Education", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
	("Marital", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
	("Gender", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
	("Deductions", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()]),
	("Occupation", [CategoricalDomain(invalid_value_treatment = "as_is"), LabelBinarizer()])
])
classifier = RandomForestClassifier()

pipeline = PMMLPipeline([
	("mapper", mapper),
	("classifier", classifier)
])
pipeline.fit(df, df["Adjusted"])
sklearn2pmml(pipeline, "pipeline_rf.pmml", with_repr = True)

Should I open a new ticket for this, since it is no longer related to the original topic of this thread?

vruusmann avatar vruusmann commented on August 20, 2024

However, when I used jpmml evaluator 1.3.3 to evaluate this pmml file, I am getting this error

I executed your Python script, and scored the pipeline_rf.pmml model file with Audit.csv input file using the org.jpmml.evaluator.EvaluationExample command-line application:

$ java -cp ~/Workspace/jpmml-evaluator/pmml-evaluator-example/target/example-1.3-SNAPSHOT.jar org.jpmml.evaluator.EvaluationExample --model pipeline_rf.pmml --input Audit.csv --output Audit-pred.csv --copy-columns false

The scoring completes successfully, both with JPMML-Evaluator 1.3.3 and the latest source checkout (1.3-SNAPSHOT).

So, if the scoring fails in your application, then you should analyze/troubleshoot your application code first:

  • Do you follow all the steps/guidelines as given in the JPMML-Evaluator project README file?
  • How does your application code differ from my org.jpmml.evaluator.EvaluationExample example application code? Specifically, pay attention to this code block.
  • What's the failing data record? Is it the first data record, or is it some later (say 131 of 1899 total) data record? What's the name of the field that cannot be prepared?

As for JPMML-Evaluator versioning, I always suggest keeping up with the latest release version (that would be 1.3.6 as of today) - minimum effort, maximum benefit.

s2chen avatar s2chen commented on August 20, 2024

Hi Villu

Thanks! I used 1.3.6 and it worked! A couple of other comments:

When I applied 1.3.6 to a dataset at work, I found that some input records caused the InvalidResultException mentioned above. After a couple of hours of debugging, I found that if I changed the mapper to include ('A', None), ('B', None) instead of (['A','B'],[ContinuousDomain()]), then the script would run without bugs.

vruusmann avatar vruusmann commented on August 20, 2024

I found out that if I changed the mapper to include ('A', None) ('B', None), instead of (['A','B'],[ContinuousDomain()]) then the script would run without bugs.

This is a feature, not a bug. It means that your testing data contains A and B values that are outside the range of training data values.

If you specify ("A", ContinuousDomain()), then the DataField element contains an Interval child element, which declares the range of valid input values:

<DataField name="A">
  <Interval closure="closedClosed" leftMargin="0" rightMargin="1"/>
</DataField>

If you specify ("A", None), then the Interval child element is not generated:

<DataField name="A"/>

The former throws an InvalidResultException if you attempt to score the model with an "A" value that is less than 0 or greater than 1. The latter will accept any A value.
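The check that the Interval element implies can be sketched in a few lines of standard-library Python. This is only an illustration of the rule stated above, not JPMML-Evaluator code; the helper name is_valid is made up, and only the closedClosed closure from the example is handled.

```python
import xml.etree.ElementTree as ET

WITH_INTERVAL = (
    '<DataField name="A">'
    '<Interval closure="closedClosed" leftMargin="0" rightMargin="1"/>'
    '</DataField>'
)
WITHOUT_INTERVAL = '<DataField name="A"/>'

def is_valid(data_field_xml, value):
    """Return True if value falls inside the declared Interval, or if none is declared."""
    field = ET.fromstring(data_field_xml)
    interval = field.find("Interval")
    if interval is None:
        return True  # no Interval child: any value is accepted
    if interval.get("closure") != "closedClosed":
        raise NotImplementedError("only closedClosed is handled in this sketch")
    left = float(interval.get("leftMargin"))
    right = float(interval.get("rightMargin"))
    return left <= value <= right

for value in (0.5, 1.0, 1.5):
    print(value, is_valid(WITH_INTERVAL, value), is_valid(WITHOUT_INTERVAL, value))
```

Running the same kind of range comparison over the testing data before scoring is one way to find the records that would otherwise surface as an InvalidResultException.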

chris7z avatar chris7z commented on August 20, 2024

Hello, I'm following this example to create a PMML file using Audit.csv. I now have the PMML file, and it contains a score for each node; a snippet is shown below. I'd like to add logic that says: if the score is <= 0.5 then 'low', and if the score is > 0.5 then 'high'. Is it possible to add this logic inside the PMML file through the pipeline in Python? Could you provide a code example? Thanks!

							<Node id="1">
								<True/>
								<Node id="2">
									<SimplePredicate field="Occupation" operator="notEqual" value="Executive"/>
									<Node id="3">
										<SimplePredicate field="Occupation" operator="notEqual" value="Professional"/>
										<Node id="4" score="-0.45040975411720585">
											<SimplePredicate field="Education" operator="notEqual" value="Bachelor"/>
										</Node>
										<Node id="5" score="0.4834200201761715">
											<SimplePredicate field="Education" operator="equal" value="Bachelor"/>
										</Node>
									</Node>
									<Node id="6">
										<SimplePredicate field="Occupation" operator="equal" value="Professional"/>
										<Node id="7" score="0.5173738670472605">
											<SimplePredicate field="Education" operator="notEqual" value="Professional"/>
										</Node>
										<Node id="8" score="2.055645282801034">
											<SimplePredicate field="Education" operator="equal" value="Professional"/>
										</Node>
									</Node>
								</Node>
