jpmml / sklearn2pmml

Python library for converting Scikit-Learn pipelines to PMML

License: GNU Affero General Public License v3.0

Python 98.95% Java 1.05%

sklearn2pmml's Introduction

SkLearn2PMML

Python package for converting Scikit-Learn pipelines to PMML.

Features

This package is a thin Python wrapper around the JPMML-SkLearn library.

News and Updates

The current version is 0.108.0 (20 May, 2024):

pip install sklearn2pmml==0.108.0

See the NEWS.md file.

Prerequisites

Java 1.8 or newer. The Java executable must be available on system path.
Python 2.7, 3.4 or newer.
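
The Java prerequisite can be checked from Python before attempting a conversion. The following is a minimal sketch that uses only the standard library; the helper names `find_java` and `java_version_banner` are made up for this illustration and are not part of sklearn2pmml.

```python
import shutil
import subprocess


def find_java():
    """Return the path of the `java` executable on the system path, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the text printed by `java -version`.

    JVMs commonly print this banner to standard error, so both streams are merged.
    """
    result = subprocess.run(
        [java_path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout.strip()


java_path = find_java()
if java_path is None:
    print("No `java` executable was found on the system path")
else:
    print(java_version_banner(java_path))
```

If the first branch is taken, the conversion step is likely to fail as well, because it launches a Java process.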

Installation

Installing a release version from PyPI:

pip install sklearn2pmml

Alternatively, installing the latest snapshot version from GitHub:

pip install --upgrade git+https://github.com/jpmml/sklearn2pmml.git

Usage

A typical workflow can be summarized as follows:

Create a PMMLPipeline object, and populate it with pipeline steps as usual. Class sklearn2pmml.pipeline.PMMLPipeline extends class sklearn.pipeline.Pipeline with the following functionality:

If the PMMLPipeline.fit(X, y) method is invoked with a pandas.DataFrame or pandas.Series object as the X argument, then its column names are used as feature names. Otherwise, feature names default to "x1", "x2", ..., "x{number_of_features}".
If the PMMLPipeline.fit(X, y) method is invoked with a pandas.Series object as the y argument, then its name is used as the target name (for supervised models). Otherwise, the target name defaults to "y".

Fit and validate the pipeline as usual.
Optionally, compute and embed verification data into the PMMLPipeline object by invoking the PMMLPipeline.verify(X) method with a small but representative subset of the training data.
Convert the PMMLPipeline object to a PMML file in the local filesystem by invoking the utility method sklearn2pmml.sklearn2pmml(pipeline, pmml_destination_path).
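
The naming rules described in the first step can be illustrated with a small standalone sketch. The helpers `resolve_feature_names` and `resolve_target_name` are hypothetical and only mimic the documented behaviour; they are not the library's implementation. Treating the name of a pandas.Series passed as X as its single feature name is an assumption made here.

```python
def resolve_feature_names(X, number_of_features):
    """Mimic the documented feature naming rule (illustrative only)."""
    columns = getattr(X, "columns", None)  # a pandas.DataFrame exposes `columns`
    if columns is not None:
        return [str(column) for column in columns]
    name = getattr(X, "name", None)  # assumption: a pandas.Series contributes its `name`
    if name is not None:
        return [str(name)]
    # Anything else (for example a plain list or a Numpy array) gets default names
    return ["x{}".format(index + 1) for index in range(number_of_features)]


def resolve_target_name(y):
    """Mimic the documented target naming rule (illustrative only)."""
    name = getattr(y, "name", None)
    return str(name) if name is not None else "y"


rows = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
print(resolve_feature_names(rows, 4))
print(resolve_target_name([0, 1]))
```

In practice this means that fitting on a DataFrame and a named Series tends to give a more readable PMML document than fitting on bare arrays.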

Developing a simple decision tree model for the classification of iris species:

import pandas

iris_df = pandas.read_csv("Iris.csv")

iris_X = iris_df[iris_df.columns.difference(["Species"])]
iris_y = iris_df["Species"]

from sklearn.tree import DecisionTreeClassifier
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
	("classifier", DecisionTreeClassifier())
])
pipeline.fit(iris_X, iris_y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(pipeline, "DecisionTreeIris.pmml", with_repr = True)

Developing a more elaborate logistic regression model for the same:

import pandas

iris_df = pandas.read_csv("Iris.csv")

iris_X = iris_df[iris_df.columns.difference(["Species"])]
iris_y = iris_df["Species"]

from sklearn_pandas import DataFrameMapper
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn2pmml.decoration import ContinuousDomain
from sklearn2pmml.pipeline import PMMLPipeline

pipeline = PMMLPipeline([
	("mapper", DataFrameMapper([
		(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), SimpleImputer()])
	])),
	("pca", PCA(n_components = 3)),
	("selector", SelectKBest(k = 2)),
	("classifier", LogisticRegression(multi_class = "ovr"))
])
pipeline.fit(iris_X, iris_y)
pipeline.verify(iris_X.sample(n = 15))

from sklearn2pmml import sklearn2pmml

sklearn2pmml(pipeline, "LogisticRegressionIris.pmml", with_repr = True)

Documentation

Archived:

Converting Scikit-Learn to PMML

De-installation

Uninstalling:

pip uninstall sklearn2pmml

License

SkLearn2PMML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use SkLearn2PMML in a proprietary software project, then it is possible to enter into a licensing agreement which makes SkLearn2PMML available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

SkLearn2PMML is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact [email protected]


sklearn2pmml's Issues

Does sklearn2pmml only support JDK 1.7 rather than 1.8?

Error: Registry key 'Software\JavaSoft\Java Runtime Environment'\CurrentVersion' has value '1.8', but '1.7' is required.
Error: could not find java.dll
Error: Could not find Java SE Runtime Environment.
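
These messages are printed by the Java launcher itself, which suggests a mismatch between the `java` executable being run and the installed runtime rather than a version limit in sklearn2pmml (the README lists Java 1.8 or newer as a prerequisite). One way to see which version the executable on the path reports is to parse the banner of `java -version`. The sketch below assumes the usual banner formats, and `parse_java_major_version` is a hypothetical helper name.

```python
import re


def parse_java_major_version(banner):
    """Extract the major Java version from a `java -version` banner.

    Returns None if no version string is found.
    """
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Legacy numbering: "1.8.0_131" denotes Java 8
    if first == 1 and second is not None:
        return int(second)
    return first


print(parse_java_major_version('java version "1.8.0_131"'))
print(parse_java_major_version('openjdk version "11.0.2" 2019-01-15'))
```

A reported major version below 8 would not meet the stated prerequisite.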

Not working jpmml-converter version with GradientBoostingClassifier

The jpmml-converter-1.2.3.jar included in the current release generates an incorrect PMML document.

The second segment of the multiple model does not read the output produced by the first one (the tree ensemble), so it generates a constant prediction.

This is an example of the segment generated with version 1.2.3:

<Segment id="2">
    <True/>
    <RegressionModel functionName="classification" normalizationMethod="none">
        <MiningSchema>
            <MiningField name="TARGET" usageType="target"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability(-1)" optype="continuous" dataType="double" feature="probability" value="-1"/>
            <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="1.0" targetCategory="1"/>
        <RegressionTable intercept="0.0" targetCategory="-1"/>
    </RegressionModel>
</Segment>

This is the same segment generated with version 1.2.4 or newer:

<Segment id="2">
    <True/>
    <RegressionModel functionName="classification" normalizationMethod="none">
        <MiningSchema>
            <MiningField name="TARGET" usageType="target"/>
            <MiningField name="transformedDecisionFunction(1)"/>
        </MiningSchema>
        <Output>
            <OutputField name="probability(-1)" optype="continuous" dataType="double" feature="probability" value="-1"/>
            <OutputField name="probability(1)" optype="continuous" dataType="double" feature="probability" value="1"/>
        </Output>
        <RegressionTable intercept="0.0" targetCategory="1">
            <NumericPredictor name="transformedDecisionFunction(1)" coefficient="1.0"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="-1"/>
    </RegressionModel>
</Segment>

To solve the issue, the static JAR files bundled with sklearn2pmml need to be updated.
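
The difference between the two segments can be detected programmatically. The following sketch uses the standard library XML parser on abbreviated copies of the fragments above; the helper names are hypothetical. A complete PMML document normally declares an XML namespace, in which case the tag names would need to be qualified accordingly.

```python
import xml.etree.ElementTree as ET

# Abbreviated versions of the two segments shown above
BROKEN_SEGMENT = """
<Segment id="2">
    <True/>
    <RegressionModel functionName="classification" normalizationMethod="none">
        <RegressionTable intercept="1.0" targetCategory="1"/>
        <RegressionTable intercept="0.0" targetCategory="-1"/>
    </RegressionModel>
</Segment>
"""

FIXED_SEGMENT = """
<Segment id="2">
    <True/>
    <RegressionModel functionName="classification" normalizationMethod="none">
        <RegressionTable intercept="0.0" targetCategory="1">
            <NumericPredictor name="transformedDecisionFunction(1)" coefficient="1.0"/>
        </RegressionTable>
        <RegressionTable intercept="0.0" targetCategory="-1"/>
    </RegressionModel>
</Segment>
"""


def predictors_by_category(segment_xml):
    """Map each target category to the predictor names of its RegressionTable."""
    segment = ET.fromstring(segment_xml.strip())
    result = {}
    for table in segment.iter("RegressionTable"):
        names = [p.get("name") for p in table.findall("NumericPredictor")]
        result[table.get("targetCategory")] = names
    return result


def reads_upstream_output(segment_xml):
    """Return True if at least one RegressionTable references a predictor."""
    return any(predictors_by_category(segment_xml).values())


print(reads_upstream_output(BROKEN_SEGMENT))
print(reads_upstream_output(FIXED_SEGMENT))
```

A segment in which no regression table references a predictor can only return its intercepts, which matches the constant prediction described in this issue.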

xgboost Issue with sklearn2pmml

Hello!
I am trying to use sklearn2pmml to convert an XGBoost model (default parameters) to PMML. The raw data contains many dummy variables whose value is always 0.

If I convert the XGBoost model with this call:

sklearn2pmml(ato_pipeline, model_output+"pato1_1_rf_pmml_20170301_20170430.pmml", with_repr = True)

then the dummy variables whose values are all zero do not appear in the DataDictionary or MiningModel sections of the PMML file. However, when I use a random forest with this package, all of those all-zero dummy variables are included in the DataDictionary and MiningModel sections.

I am wondering if there is anything I can do to fix this (for example, by adding lines to the PMML file).

Thank you very much!

FunctionTransformer should support all 1-parameter Numpy universal functions (ufuncs)

Trying to get FunctionTransformer to work with sklearn2pmml, and am getting the following error:

Aug 18, 2016 3:49:50 PM org.jpmml.sklearn.Main run
INFO: Parsing DataFrameMapper PKL..
Aug 18, 2016 3:49:50 PM org.jpmml.sklearn.Main run
INFO: Parsed DataFrameMapper PKL in 36 ms.
Aug 18, 2016 3:49:50 PM org.jpmml.sklearn.Main run
INFO: Converting DataFrameMapper..
Aug 18, 2016 3:49:50 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert DataFrameMapper
java.lang.IllegalArgumentException: The function object (Java class net.razorvine.pickle.objects.ClassDictConstructor) is not a Numpy universal function
    at sklearn.preprocessing.FunctionTransformer.encodeFeatures(FunctionTransformer.java:54)
    at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:70)
    at org.jpmml.sklearn.Main.run(Main.java:146)
    at org.jpmml.sklearn.Main.main(Main.java:107)
Caused by: java.lang.ClassCastException: net.razorvine.pickle.objects.ClassDictConstructor cannot be cast to numpy.core.UFunc
    at sklearn.preprocessing.FunctionTransformer.encodeFeatures(FunctionTransformer.java:52)
    ... 3 more

Exception in thread "main" java.lang.IllegalArgumentException: The function object (Java class net.razorvine.pickle.objects.ClassDictConstructor) is not a Numpy universal function
    at sklearn.preprocessing.FunctionTransformer.encodeFeatures(FunctionTransformer.java:54)
    at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:70)
    at org.jpmml.sklearn.Main.run(Main.java:146)
    at org.jpmml.sklearn.Main.main(Main.java:107)
Caused by: java.lang.ClassCastException: net.razorvine.pickle.objects.ClassDictConstructor cannot be cast to numpy.core.UFunc
    at sklearn.preprocessing.FunctionTransformer.encodeFeatures(FunctionTransformer.java:52)
    ... 3 more
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-213-25218f72d00f>", line 2, in <module>
    sklearn2pmml(iris_classifier, iris_mapper, "LogisticRegressionIris.pmml", with_repr = True)
  File "/Users/kwilliams/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.py", line 65, in sklearn2pmml
    subprocess.check_call(cmd)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)

As an example, I tried to apply a simple numpy universal function to the Iris data:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

from sklearn2pmml.decoration import ContinuousDomain

import pandas
import sklearn_pandas
import numpy as np

iris = load_iris()

iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pandas.DataFrame(iris.target, columns = ["Species"])), axis = 1)

#
# Modified from Iris Example
#
from sklearn.preprocessing import FunctionTransformer

def squarex(ary):
    return np.square(ary)

iris_mapper = sklearn_pandas.DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length"], [ContinuousDomain(), PCA(n_components = 2)]),
    (["Petal.Width"], [ContinuousDomain(), FunctionTransformer(squarex)]),
    ("Species", None)
])

iris = iris_mapper.fit_transform(iris_df)

#
# Step 2: training a logistic regression model
#

from sklearn.linear_model import LogisticRegressionCV

iris_X = iris[:, 0:3]
iris_y = iris[:, 3]

iris_classifier = LogisticRegressionCV()
iris_classifier.fit(iris_X, iris_y)

#
# Step 3: conversion to PMML
#

from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_classifier, iris_mapper, "LogisticRegressionIris.pmml", with_repr = True)
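
The error message says that the function object is not a Numpy universal function. That is consistent with the example above: `squarex` is an ordinary Python function that merely calls `np.square`, and only `np.square` itself is a ufunc. The check below shows the distinction. Passing `np.square` directly to FunctionTransformer might avoid the failure, but that is an assumption and is not confirmed in this thread.

```python
import numpy as np


def squarex(ary):
    # A plain Python wrapper; it is not itself a Numpy universal function
    return np.square(ary)


print(isinstance(np.square, np.ufunc))
print(isinstance(squarex, np.ufunc))

# Both produce the same values, so only the type of the function object differs
sample = np.array([1.0, 2.0, 3.0])
print(np.array_equal(squarex(sample), np.square(sample)))
```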

`SEVERE: Failed to convert` in pmml output

All of a sudden (it worked a month ago!) the PMML output for a GradientBoostingClassifier is broken:

Aug 24, 2017 10:42:14 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Aug 24, 2017 10:42:14 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 78 ms.
Aug 24, 2017 10:42:14 AM org.jpmml.sklearn.Main run
INFO: Converting..
Aug 24, 2017 10:42:14 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.NullPointerException
	at org.jpmml.converter.ValueUtil.asInt(ValueUtil.java:103)
	at sklearn.ensemble.gradient_boosting.GradientBoostingClassifier.getNumberOfFeatures(GradientBoostingClassifier.java:49)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:129)
	at org.jpmml.sklearn.Main.run(Main.java:144)
	at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.NullPointerException
	at org.jpmml.converter.ValueUtil.asInt(ValueUtil.java:103)
	at sklearn.ensemble.gradient_boosting.GradientBoostingClassifier.getNumberOfFeatures(GradientBoostingClassifier.java:49)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:129)
	at org.jpmml.sklearn.Main.run(Main.java:144)
	at org.jpmml.sklearn.Main.main(Main.java:93)

to reproduce:

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn2pmml import sklearn2pmml, PMMLPipeline
df=pd.DataFrame({"ta":[0,0,0,1,1,1],"a":[0,1,2,3,4,5],"b":[5,4,3,2,1,0]})
active_fields=["a","b"]
clf=GradientBoostingClassifier().fit(df.loc[:,active_fields],df.ta)
pipe=PMMLPipeline([("classifier", clf)])
pipe.target_field="ta"
pipe.active_fields=np.array(active_fields)
sklearn2pmml(pipe, "tmp.pmml", with_repr=True)

Note that I do not want to use sklearn2pmml to train models, only to save them as pmml.

Encoding the value space information (valid, invalid, missing values) for columns

Hi,

I am using sklearn2pmml to convert a Scikit-Learn model to PMML, and I wonder where and how I should define the following mining schema attributes in the PMML:

<xs:attribute name="missingValueReplacement" type="xs:string"/>
<xs:attribute name="missingValueTreatment" type="MISSING-VALUE-TREATMENT-METHOD"/>
<xs:attribute name="invalidValueTreatment" type="INVALID-VALUE-TREATMENT-METHOD" default="returnInvalid"/>

Thanks a lot for your help.

wei
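
In the PMML schema, the three attributes quoted above belong to the MiningField elements of a MiningSchema. The sketch below builds such an element with the standard library to show where the values end up; `make_mining_field` is a hypothetical helper, and the sample values ("asMean", 5.8) are chosen for illustration only. On the Python side, the domain decorators in sklearn2pmml.decoration (such as the ContinuousDomain used in the README example) are the likely place to configure this, although their exact parameters are not covered here.

```python
import xml.etree.ElementTree as ET


def make_mining_field(name, missing_value_replacement=None,
                      missing_value_treatment=None,
                      invalid_value_treatment="returnInvalid"):
    """Build a MiningField element carrying the value space attributes."""
    field = ET.Element("MiningField")
    field.set("name", name)
    if missing_value_replacement is not None:
        field.set("missingValueReplacement", str(missing_value_replacement))
    if missing_value_treatment is not None:
        field.set("missingValueTreatment", missing_value_treatment)
    # "returnInvalid" is the schema default quoted in the question
    field.set("invalidValueTreatment", invalid_value_treatment)
    return field


schema = ET.Element("MiningSchema")
schema.append(make_mining_field("Sepal.Length", 5.8, "asMean"))
print(ET.tostring(schema, encoding="unicode"))
```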

Advise for Neural Network

Good evening Mr Ruusman,

I would like to ask for your advice. My current goal is to export a neural network to PMML. My code is really basic, but when I try to execute the result with the standalone version there is a small problem: the target and the associated probability are null.

You can see it here. Could you explain to me what is wrong?

Thanks,

Error with DecisionTrees

Hi, I ran the examples presented here, but the line

iris_pipeline.fit(iris_df[iris_df.columns.difference(["Species"])], iris_df["Species"])

raises the following errors:

Traceback (most recent call last):
  File "/usr/local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2134, in get_loc
    return self._engine.get_loc(key)
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'Species'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python3.5/site-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib/python3.5/site-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'Species'

How can I fix these errors? Thanks in advance
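
A KeyError of this kind usually means that the loaded data has no column with exactly that name, for example because the CSV file uses different capitalization. The sketch below checks a CSV header with the standard library before any model code runs; `missing_columns` is a hypothetical helper and the sample header is invented for illustration.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# An invented file whose header uses lower-case names
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
)
print(missing_columns(sample, ["Species"]))
```

With pandas, printing `iris_df.columns` right after `read_csv` gives the same information.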

How to pass constant values to the PMML from PMMLPipeline in a custom transformer as an argument

I am writing an exponential function x^n as a custom transformer, where n is an argument that needs to be passed from the Python pipeline when generating the PMML. I am able to create the PMML manually with a constant integer:

<DerivedField name="com.project.ai.jpmml.function.Exponent(number)" optype="continuous" dataType="double">
			<Apply function="com.project.ai.jpmml.function.Exponent">
				<FieldRef field="number"/>
				<Constant dataType="integer">3</Constant>
			</Apply>		
</DerivedField>

"number" is the name of the column where we want to apply the transformation with (number)^3. The transformation is applied in the JPMML custom transformer function
successfully.
However, I need to pass 3 as an argument from the Python PMML pipeline so that the generated PMML contains the tag shown above.
I am not able to pass that argument through the pipeline, although I was able to create the custom transformer in the pipeline by passing the argument to the Python custom transformer:

exponent=3
mapper = DataFrameMapper([(['number'],[customtransform_add(function="Exponent",arguments=exponent)]),('amount',None)])
pipeline = PMMLPipeline(
    [( "mapper", mapper) ,("classifier",clf)]     
)

Please help me with a solution for sending the argument to the PMML side, so that the Python pipeline generates the tag.

Custom multivariate transformation and sklearn2pmml

Suppose that I have a data set with two features and one response variable, as follows:

"height", "width", "is_palm_tree"

where the "height" and "width" variables are predictors and "is_palm_tree" is the training label for our classifier. We'd like to compute a derived variable such as

ht_wt_ratio = height/width

and build a logistic regression classifier that uses "ht_wt_ratio" to predict "is_palm_tree". We'd like to output a PMML file containing both the custom transformation and the classifier. Is that something we can do with Scikit-Learn transformers and sklearn2pmml?

Mechanism for registering custom Estimator and Transformer converter classes

How can I integrate a custom transformer into sklearn2pmml?

I am thinking about some preprocessing code that cleans the data and handles the imputation of missing values.
Is it correct to assume that Pyrolite will handle any pickled transformer?

Error Converting PMMLPipeline to PMML file

I'm trying to convert a pipeline containing a DataFrameMapper, such as

DataFrameMapper (
[(col, [CategoricalDomain(), LabelEncoder()]) for col in df.columns] + [ (numerical columns), None]
)

and an estimator, and so the pipeline looks like

PMMLPipeline([
    ("Preprocess", DataFrameMapper),
    ("estimator", GradientBoostingRegressor())
])

This is how I output the PMML file:
sklearn2pmml(pipeline, 'path/GBRegression.pmml', with_repr=True, debug=True)

Here is the debug info:
Python: 3.6.2
sklearn: 0.19.0
sklearn.externals.joblib: 0.11
pandas: 0.20.3
sklearn_pandas: 1.5.0
sklearn2pmml: 0.22.0
java -cp (path to the modules above) output_pmml_path
Preserved job lib dump file(s): /tmp/pipeline-0vx5splj.pkl.z

I'm not sure what's wrong and cannot find a solution from previous issues. :/

sklearn2pmml: 0.17.4 - InvalidOpcodeException for IsolationForest

Hello,

First of all thank you for creating and maintaining this great library.
Secondly, I seem to be running into some issues when trying to create a PMML file from the following pipeline:

    IForest_pipeline = PMMLPipeline([("isolationforest", IsolationForest(n_estimators=100,random_state=0))])

The respective errors are :
SEVERE: Failed to parse PKL net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 248

Versions are

sklearn:  0.18.1
sklearn.externals.joblib: 0.10.3
pandas:  0.19.2
sklearn_pandas:  1.3.0
sklearn2pmml:  0.17.4
java:1.8.0
joblib==0.11
python==3.6.0

The weird thing is that this code (using kmeans) generates the PMML correctly, so it might be something model specific

IForest_pipeline = PMMLPipeline([("classifier", KMeans(n_clusters=2, random_state=0))])

This was run on Windows 7 SP1 64-bit.

Could you give me pointers on where I should look to solve this error?
Thanks in advance.

How do you use Dataframe mapper for LabelEncoding the target column

Hi,
I am trying to do multi-class text classification, and here is an excerpt of the classifier that I was experimenting with, where
X = a pandas Series that holds the sentences
y = a pandas Series that holds the classes as text

I am trying to make label encoding of the target values part of the pipeline. I have gone through main.py to try to understand how to tackle this, but apart from the sentiment text classification example I was unable to find anything. Any guidance on this would be very helpful.
Is this a limitation of sklearn_pandas?

def build_classifier(classifier, name, with_proba=True):
    mapper = DataFrameMapper([
        (ytl, LabelEncoder()),
        (X, None)
    ])
    pipeline = PMMLPipeline([
        ("mapper", mapper),
        ("tf-idf",
         TfidfVectorizer(analyzer="word", strip_accents=None, lowercase=True,
                         tokenizer=Splitter(), stop_words="english", norm=None, dtype=(
             np.float32 if isinstance(classifier, RandomForestClassifier) else np.float64))),
        ("classifier", classifier)
    ])
    pipeline.fit(X, ytl)
    store_pkl(pipeline, name + ".pkl")
    score = pd.DataFrame(pipeline.predict(X), columns=["Score"])
    if (with_proba == True):
        score_proba = pd.DataFrame(pipeline.predict_proba(X), columns=classifier.classes_)
        score = pd.concat((score, score_proba), axis=1)
    store_csv(score, name + ".csv")
    sklearn2pmml(pipeline, name +".pmml", with_repr=True)


build_classifier(RandomForestClassifier(random_state=13, n_jobs=-1), "RandomForest")

When I run this, I get the following error:

/home/mluser/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py:216: FutureWarning: numpy not_equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (is)) and will change.
flag = np.concatenate(([True], aux[1:] != aux[:-1]))
/home/mluser/anaconda3/lib/python3.6/site-packages/numpy/lib/arraysetops.py:275: FutureWarning: numpy equal will not check object identity in the future. The comparison did not return the same result as suggested by the identity (is)) and will change.
return aux[:-1][aux[1:] == aux[:-1]]
Traceback (most recent call last):
File "/home/mluser/Dropbox/Code/parallelEvaluators/temp/sklearn-pmml-multiout.py", line 229, in <module>
build_sentiment(RandomForestClassifier(random_state=13, n_jobs=-1), "RandomForestBrand")
File "/home/mluser/Dropbox/Code/parallelEvaluators/temp/sklearn-pmml-multiout.py", line 219, in build_sentiment
pipeline.fit(X, ytl)
File "/home/mluser/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py", line 268, in fit
Xt, fit_params = self._fit(X, y, **fit_params)
File "/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/__init__.py", line 39, in _fit
return Pipeline._fit(self, X, y, **fit_params)
File "/home/mluser/anaconda3/lib/python3.6/site-packages/sklearn/pipeline.py", line 234, in fit
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "/home/mluser/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 1352, in fit_transform
X = super(TfidfVectorizer, self).fit_transform(raw_documents)
File "/home/mluser/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 839, in fit_transform
self.fixed_vocabulary)
File "/home/mluser/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 762, in _count_vocab
for feature in analyze(doc):
File "/home/mluser/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 241, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/home/mluser/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py", line 207, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'

When I remove the DataFrameMapper, I get an error while generating the PMML file:

def build_classifier(classifier, name, with_proba=True):
    # mapper = DataFrameMapper([
    #     (ytl, LabelEncoder()),
    #     (X, None)
    # ])
    pipeline = PMMLPipeline([
        #("mapper", mapper),
        ("tf-idf",
         TfidfVectorizer(analyzer="word", strip_accents=None, lowercase=True,
                         tokenizer=Splitter(), stop_words="english", norm=None, dtype=(
             np.float32 if isinstance(classifier, RandomForestClassifier) else np.float64))),
        ("classifier", classifier)
    ])
    pipeline.fit(X, ytl)
    store_pkl(pipeline, name + ".pkl")
    score = pd.DataFrame(pipeline.predict(X), columns=["Score"])
    if (with_proba == True):
        score_proba = pd.DataFrame(pipeline.predict_proba(X), columns=classifier.classes_)
        score = pd.concat((score, score_proba), axis=1)
    store_csv(score, name + ".csv")
    sklearn2pmml(pipeline, name +".pmml", with_repr=True)


build_classifier(RandomForestClassifier(random_state=13, n_jobs=-1), "RandomForest")

I get the following error, in which the converter is not able to turn the text classes into integers:

/home/mluser/anaconda3/bin/python /home/mluser/Dropbox/Code/parallelEvaluators/temp/sklearn-pmml-multiout.py
Jul 21, 2017 11:55:23 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jul 21, 2017 11:55:36 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 12587 ms.
Jul 21, 2017 11:55:36 AM org.jpmml.sklearn.Main run
INFO: Converting..
Jul 21, 2017 11:55:36 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: 2s Factory
at sklearn2pmml.PMMLPipeline$1.apply(PMMLPipeline.java:248)
at sklearn2pmml.PMMLPipeline$1.apply(PMMLPipeline.java:241)
at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:638)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
at org.jpmml.converter.PMMLUtil.addValues(PMMLUtil.java:121)
at org.jpmml.converter.PMMLUtil.addValues(PMMLUtil.java:110)
at org.jpmml.converter.PMMLEncoder.createDataField(PMMLEncoder.java:173)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:96)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.IllegalArgumentException: 20s Factory
at sklearn2pmml.PMMLPipeline$1.apply(PMMLPipeline.java:248)
at sklearn2pmml.PMMLPipeline$1.apply(PMMLPipeline.java:241)
at com.google.common.collect.Lists$TransformingRandomAccessList$1.transform(Lists.java:638)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:47)
at org.jpmml.converter.PMMLUtil.addValues(PMMLUtil.java:121)
at org.jpmml.converter.PMMLUtil.addValues(PMMLUtil.java:110)
at org.jpmml.converter.PMMLEncoder.createDataField(PMMLEncoder.java:173)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:96)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)
Traceback (most recent call last):
File "/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/__init__.py", line 140, in sklearn2pmml
subprocess.check_call(cmd)
File "/home/mluser/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['java', '-cp', '/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-converter-1.2.3.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/guava-20.0.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/serpent-1.18.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.7.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-agent-1.3.6.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-schema-1.3.6.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pyrolite-4.19.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.3.3.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.7.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-model-1.3.6.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.6.jar:/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/resources/jcommander-1.48.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', '/tmp/pipeline-e9v7_63h.pkl.z', '--pmml-output', 'RandomForest.pmml']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/mluser/Dropbox/Code/parallelEvaluators/temp/sklearn-pmml-multiout.py", line 229, in <module>
build_classifier(RandomForestClassifier(random_state=13, n_jobs=-1), "RandomForest")
File "/home/mluser/Dropbox/Code/parallelEvaluators/temp/sklearn-pmml-multiout.py", line 226, in build_classifier
sklearn2pmml(pipeline, name +".pmml", with_repr=True)
File "/home/mluser/.local/lib/python3.6/site-packages/sklearn2pmml/__init__.py", line 142, in sklearn2pmml
raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

Process finished with exit code 1

Keeping the label encoding outside the pipeline makes the conversion succeed:

ytl = LabelEncoder().fit_transform(yt)


def build_classifier(classifier, name, with_proba=True):
    # mapper = DataFrameMapper([
    #     (ytl, LabelEncoder()),
    #     (X, None)
    # ])
    pipeline = PMMLPipeline([
        #("mapper", mapper),
        ("tf-idf",
         TfidfVectorizer(analyzer="word", strip_accents=None, lowercase=True,
                         tokenizer=Splitter(), stop_words="english", norm=None, dtype=(
             np.float32 if isinstance(classifier, RandomForestClassifier) else np.float64))),
        ("classifier", classifier)
    ])
    pipeline.fit(X, ytl)
    store_pkl(pipeline, name + ".pkl")
    score = pd.DataFrame(pipeline.predict(X), columns=["Score"])
    if (with_proba == True):
        score_proba = pd.DataFrame(pipeline.predict_proba(X), columns=classifier.classes_)
        score = pd.concat((score, score_proba), axis=1)
    store_csv(score, name + ".csv")
    sklearn2pmml(pipeline, name +".pmml", with_repr=True)


build_classifier(RandomForestClassifier(random_state=13, n_jobs=-1), "RandomForest")

No errors this time

Jul 21, 2017 11:58:30 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jul 21, 2017 11:58:43 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 12595 ms.
Jul 21, 2017 11:58:43 AM org.jpmml.sklearn.Main run
INFO: Converting..
Jul 21, 2017 11:58:43 AM sklearn2pmml.PMMLPipeline encodePMML
WARNING: The 'target_field' attribute is not set. Assuming y as the name of the target field
Jul 21, 2017 11:58:48 AM org.jpmml.sklearn.Main run
INFO: Converted in 4783 ms.
Jul 21, 2017 11:58:48 AM org.jpmml.sklearn.Main run
INFO: Marshalling PMML..
Jul 21, 2017 11:58:59 AM org.jpmml.sklearn.Main run
INFO: Marshalled PMML in 11532 ms.
Process finished with exit code 0
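
When the labels are encoded outside the pipeline, the fitted encoder has to be kept around to translate the integer predictions back into the original labels. A small sketch of that round trip follows; the string labels are made up for illustration, and `encoder` is a hypothetical variable name (the snippet above discards the fitted LabelEncoder):

```python
from sklearn.preprocessing import LabelEncoder

yt = ["spam", "ham", "spam", "eggs"]  # hypothetical string labels

encoder = LabelEncoder()
ytl = encoder.fit_transform(yt)  # integer codes; classes are sorted alphabetically
print(list(ytl))

# Map integer codes (e.g. pipeline predictions) back to the original labels.
decoded = encoder.inverse_transform(ytl)
print(list(decoded))
```

A model trained on `ytl` presumably predicts the integer codes rather than the original strings, so the mapping in `encoder.classes_` would need to be recorded alongside the exported file.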

pmml generation fails

Hello Villu,

I was updating my prediction model, and my script started failing where it worked before. I only added a couple of variables and transformations to the mapper, but the PMML generation keeps failing even when I leave them out.

pmml_data = mapper.fit_transform(data_pre)
print "numpy array shape: {}".format(pmml_data.shape)
print "dataset label: {} ".format(pmml_data[:,pmml_data.shape[1]-1])
clf = neural_network.MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(50,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.75,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)
clf.fit(pmml_data[:,0:pmml_data.shape[1]-1],pmml_data[:,pmml_data.shape[1]-1])
print "Shopping MAE: {:.2f}".format(mean_absolute_error(pmml_data[:,pmml_data.shape[1]-1],clf.predict(pmml_data[:,0:pmml_data.shape[1]-1])))
sklearn2pmml(clf, mapper, "Shopping.pmml",debug=True)

This generates the following output:

numpy array shape: (17658, 23)
dataset label: [ 33.7507249 26.24458863 13.42103125 ..., 46.73213882 77.65744095
114.22366247]
Shopping MAE: 8.58
('python: ', '2.7.12')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('sklearn_pandas: ', '1.2.0')
('sklearn2pmml: ', '0.12.1')
java -cp /Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/guava-19.0.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-converter-1.1.1.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.1.4.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.1.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-agent-1.3.3.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-1.3.3.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.3.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-schema-1.3.3.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pyrolite-4.14.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/serpent-1.15.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar org.jpmml.sklearn.Main --pkl-estimator-input /var/folders/6s/mbslfxzj6h90fn5h1xn1kxj80000gn/T/estimator-neXjJM.pkl.z --pkl-mapper-input /var/folders/6s/mbslfxzj6h90fn5h1xn1kxj80000gn/T/mapper-QPTKmO.pkl.z --pmml-output Shopping.pmml
('Preserved joblib dump file(s): ', '/var/folders/6s/mbslfxzj6h90fn5h1xn1kxj80000gn/T/estimator-neXjJM.pkl.z /var/folders/6s/mbslfxzj6h90fn5h1xn1kxj80000gn/T/mapper-QPTKmO.pkl.z')

CalledProcessError Traceback (most recent call last)
in ()
22 print "Shopping MAE: {:.2f}".format(mean_absolute_error(pmml_data[:,pmml_data.shape[1]-1],clf.predict(pmml_data[:,0:pmml_data.shape[1]-1])))
23
---> 24 sklearn2pmml(clf, mapper, "Shopping.pmml",debug=True)

/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.pyc in sklearn2pmml(estimator, mapper, pmml, with_repr, debug)
63 if(debug):
64 print(" ".join(cmd))
---> 65 subprocess.check_call(cmd)
66 finally:
67 if(debug):

/usr/local/Cellar/python/2.7.12/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.pyc in check_call(*popenargs, **kwargs)
539 if cmd is None:
540 cmd = popenargs[0]
--> 541 raise CalledProcessError(retcode, cmd)
542 return 0
543

CalledProcessError: Command '['java', '-cp', '/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/guava-19.0.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-converter-1.1.1.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.1.4.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.1.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-agent-1.3.3.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-1.3.3.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.3.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-schema-1.3.3.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pyrolite-4.14.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/serpent-1.15.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:/Users/Matias/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/var/folders/6s/mbslfxzj6h90fn5h1xn1kxj80000gn/T/estimator-neXjJM.pkl.z', '--pkl-mapper-input', '/var/folders/6s/mbslfxzj6h90fn5h1xn1kxj80000gn/T/mapper-QPTKmO.pkl.z', '--pmml-output', 'Shopping.pmml']' returned non-zero exit status 1

Do you have any idea what is not working?

Thanks in advance

StandardScaler in pipeline causes PMML export to error out

I am getting an error when I try to export the StandardScaler transformer through a PMMLPipeline. Here is the stack trace:

SEVERE: Failed to convert
java.lang.ClassCastException: sklearn.preprocessing.StandardScaler cannot be cast to sklearn.HasNumberOfFeatures
	at sklearn2pmml.PMMLPipeline.initFeatures(PMMLPipeline.java:153)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:115)
	at org.jpmml.sklearn.Main.run(Main.java:144)
	at org.jpmml.sklearn.Main.main(Main.java:93)
Exception in thread "main" java.lang.ClassCastException: sklearn.preprocessing.StandardScaler cannot be cast to sklearn.HasNumberOfFeatures
	at sklearn2pmml.PMMLPipeline.initFeatures(PMMLPipeline.java:153)
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:115)
	at org.jpmml.sklearn.Main.run(Main.java:144)
	at org.jpmml.sklearn.Main.main(Main.java:93)

Below is the code that reproduces this error.

import numpy as np
from sklearn import datasets, svm
from sklearn.preprocessing import StandardScaler
from sklearn2pmml import sklearn2pmml
from sklearn2pmml import PMMLPipeline

iris = datasets.load_iris()
X = iris.data
y = iris.target

X = X[y != 0, :2]
y = y[y != 0]
y = y.astype(float)

feat_normalizer = StandardScaler()
clf = svm.SVC(kernel='rbf', gamma=10)

clf_module = PMMLPipeline([('normalizer', feat_normalizer), ('classifier', clf)])
clf_module.fit(X, y)
sklearn2pmml(clf_module, "classifier.pmml", with_repr=True)

Any thoughts on what might be going on here?

Called Process Error

I keep getting the error message below when I try to convert a fitted pipeline to a PMML file.

CalledProcessError Traceback (most recent call last)
in ()
13 from sklearn2pmml import sklearn2pmml
14
---> 15 sklearn2pmml(iris_pipeline, "DecisionTreeIris.pmml", with_repr = True)

C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\__init__.py in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug)
130 if(debug):
131 print(" ".join(cmd))
--> 132 subprocess.check_call(cmd)
133 finally:
134 if(debug):

C:\Users\dla805\AppData\Local\Continuum\Anaconda3\lib\subprocess.py in check_call(*popenargs, **kwargs)
579 if cmd is None:
580 cmd = popenargs[0]
--> 581 raise CalledProcessError(retcode, cmd)
582 return 0
583

CalledProcessError: Command '['java', '-cp', 'C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\guava-19.0.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\istack-commons-runtime-2.21.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\jaxb-core-2.2.11.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\jaxb-runtime-2.2.11.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\jcommander-1.48.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\jpmml-converter-1.2.1.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\jpmml-lightgbm-1.0.2.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\jpmml-sklearn-1.2.6.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\jpmml-xgboost-1.1.5.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\pmml-agent-1.3.4.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\pmml-model-1.3.4.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\pmml-model-metro-1.3.4.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\pmml-schema-1.3.4.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\pyrolite-4.16.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\serpent-1.16.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\slf4j-api-1.7.22.jar;C:\Users\dla805\AppData\Roaming\Python\Python35\site-packages\sklearn2pmml\resources\slf4j-jdk14-1.7.22.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', 'C:\Users\dla805\AppData\Local\Temp\pipeline-xpbmt8jp.pkl.z', 
'--repr-pipeline', "PMMLPipeline(steps=[('classifier', DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n max_features=None, max_leaf_nodes=None,\n min_impurity_split=1e-07, min_samples_leaf=1,\n min_samples_split=2, min_weight_fract

error msg when converting randomforest to pmml

I get errors when converting the example random forest to PMML (the last step). Can you help me out?

My code:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import pandas
import sklearn_pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml

iris = load_iris()
iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pandas.DataFrame(iris.target, columns = ["Species"])), axis = 1)
iris_mapper = sklearn_pandas.DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], PCA(n_components = 3)),
    ("Species", None)
])

iris = iris_mapper.fit_transform(iris_df)

iris_X = iris[:, 0:3]
iris_y = iris[:, 3]

iris_forest = RandomForestClassifier(min_samples_leaf = 5)
iris_forest.fit(iris_X, iris_y)

sklearn2pmml(iris_forest, iris_mapper, "forest.pmml")

The error message:

Exception in thread "main" java.lang.StackOverflowError
    at sun.misc.FDBigInteger.leftShift(Unknown Source)
    at sun.misc.FDBigInteger.valueOfPow52(Unknown Source)
    at sun.misc.FloatingDecimal$BinaryToASCIIBuffer.dtoa(Unknown Source)
    at sun.misc.FloatingDecimal$BinaryToASCIIBuffer.access$100(Unknown Source)
    at sun.misc.FloatingDecimal.getBinaryToASCIIConverter(Unknown Source)
    at sun.misc.FloatingDecimal.getBinaryToASCIIConverter(Unknown Source)
    at sun.misc.FloatingDecimal.toJavaFormatString(Unknown Source)
    at java.lang.Double.toString(Unknown Source)
    at org.jpmml.converter.ValueUtil.formatValue(ValueUtil.java:118)
    at sklearn.tree.TreeModelUtil.encodeNode(TreeModelUtil.java:81)
    at sklearn.tree.TreeModelUtil.encodeNode(TreeModelUtil.java:96)
    at sklearn.tree.TreeModelUtil.encodeNode(TreeModelUtil.java:96)
    at sklearn.tree.TreeModelUtil.encodeNode(TreeModelUtil.java:96)
    at sklearn.tree.TreeModelUtil.encodeNode(TreeModelUtil.java:96)

Imputations for Categorical Non-numerical Data

Hello,
I was trying to handle missing or previously unseen values with Imputer(). I was able to handle numerical encodings by chaining Imputer() and LabelEncoder() in a DataFrameMapper pipeline. However, Imputer() fails on input features with non-numerical content, so I tried writing my own CustomImputer for non-numerical categorical variables. The Python pipeline runs without errors, but I get an error from the Java side while creating the PMML:

INFO: Converting..
at sklearn.Initializer.encodeFeatures(Initializer.java:53)
at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:122)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)
java.lang.IllegalArgumentException: com.highradius.ai.jpmml.functions.NonNumericalImputer(document_type)
at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:145)
at sklearn.preprocessing.LabelEncoder.encodeFeatures(LabelEncoder.java:97)
at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
Aug 17, 2017 11:33:08 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: com.highradius.ai.jpmml.functions.NonNumericalImputer(document_type)
at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:145)
at sklearn.preprocessing.LabelEncoder.encodeFeatures(LabelEncoder.java:97)
at sklearn_pandas.DataFrameMapper.initializeFeatures(DataFrameMapper.java:75)
at sklearn.Initializer.encodeFeatures(Initializer.java:53)
at sklearn.pipeline.Pipeline.encodeFeatures(Pipeline.java:93)
at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:122)
at org.jpmml.sklearn.Main.run(Main.java:144)
at org.jpmml.sklearn.Main.main(Main.java:93)

The following is my DataFrameMapper:

mapper = DataFrameMapper([(['amount'],[Imputer(strategy='median'),CustomTransformFunctionGenerator(function="Binning",arguments="10,100,1000,10000")]),
                          ('document_type',[CustomTransformFunctionGenerator(function="NonNumericalImputer"),LabelEncoder()]),
                          ])

XGBClassifier error

When I run the following code:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain
import xgboost as xgb

import pandas
import sklearn_pandas

iris = load_iris()

iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pandas.DataFrame(iris.target, columns = ["Species"])), axis = 1)

iris_mapper = sklearn_pandas.DataFrameMapper([
    (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), PCA(n_components = 3),StandardScaler()]),
    ("Species", None)
])

iris = iris_mapper.fit_transform(iris_df)

iris_X = iris[:, 0:3]
iris_y = iris[:, 3]

iris_classifier = xgb.XGBClassifier()
iris_classifier.fit(iris_X, iris_y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_classifier, iris_mapper, "XGBClassifier.pmml", with_repr = True,debug =True)

I got the following error:

python:  3.5.1
sklearn:  0.17
sklearn.externals.joblib: 0.9.3
sklearn_pandas:  1.1.0
sklearn2pmml:  0.10.0
java -cp /home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/serpent-1.12.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-agent-1.2.17.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.0.6.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/guava-19.0.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pyrolite-4.12.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-schema-1.2.17.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-1.2.17.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.0.0.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-metro-1.2.17.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-converter-1.0.8.jar org.jpmml.sklearn.Main --pkl-estimator-input /tmp/estimator-izxjv8gy.pkl.z --repr-estimator XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1) --pkl-mapper-input /tmp/mapper-wve3iwbh.pkl.z --repr-mapper DataFrameMapper(default=False,
        features=[(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width'], TransformerPipeline(steps=[('continuousdomain', ContinuousDomain(invalid_value_treatment='return_invalid')), ('pca', PCA(copy=True, n_components=3, whiten=False)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))])), ('Species', None)],
        sparse=False) --pmml-output XGBClassifier.pmml
Sep 01, 2016 12:50:53 PM org.jpmml.sklearn.Main run
INFO: Parsing DataFrameMapper PKL..
Sep 01, 2016 12:50:53 PM org.jpmml.sklearn.Main run
INFO: Parsed DataFrameMapper PKL in 84 ms.
Sep 01, 2016 12:50:53 PM org.jpmml.sklearn.Main run
INFO: Converting DataFrameMapper..
Sep 01, 2016 12:50:53 PM org.jpmml.sklearn.Main run
INFO: Converted DataFrameMapper in 56 ms.
Sep 01, 2016 12:50:53 PM org.jpmml.sklearn.Main run
INFO: Parsing Estimator PKL..
Sep 01, 2016 12:50:53 PM org.jpmml.sklearn.Main run
INFO: Parsed Estimator PKL in 14 ms.
Sep 01, 2016 12:50:53 PM org.jpmml.sklearn.Main run
INFO: Converting Estimator..
Sep 01, 2016 12:50:53 PM org.jpmml.sklearn.Main run
SEVERE: Failed to convert Estimator
java.lang.RuntimeException: java.io.IOException
    at xgboost.sklearn.Booster.loadLearner(Booster.java:53)
    at xgboost.sklearn.Booster.getLearner(Booster.java:41)
    at xgboost.sklearn.BoosterUtil.getNumberOfFeatures(BoosterUtil.java:35)
    at xgboost.sklearn.XGBClassifier.getNumberOfFeatures(XGBClassifier.java:38)
    at sklearn.Classifier.createSchema(Classifier.java:59)
    at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
    at org.jpmml.sklearn.Main.run(Main.java:189)
    at org.jpmml.sklearn.Main.main(Main.java:107)
Caused by: java.io.IOException
    at org.jpmml.xgboost.XGBoostDataInput.readReserved(XGBoostDataInput.java:68)
    at org.jpmml.xgboost.GBTree.load(GBTree.java:62)
    at org.jpmml.xgboost.Learner.load(Learner.java:88)
    at org.jpmml.xgboost.XGBoostUtil.loadLearner(XGBoostUtil.java:34)
    at xgboost.sklearn.Booster.loadLearner(Booster.java:51)
    ... 7 more

Exception in thread "main" java.lang.RuntimeException: java.io.IOException
    at xgboost.sklearn.Booster.loadLearner(Booster.java:53)
    at xgboost.sklearn.Booster.getLearner(Booster.java:41)
    at xgboost.sklearn.BoosterUtil.getNumberOfFeatures(BoosterUtil.java:35)
    at xgboost.sklearn.XGBClassifier.getNumberOfFeatures(XGBClassifier.java:38)
    at sklearn.Classifier.createSchema(Classifier.java:59)
    at sklearn.EstimatorUtil.encodePMML(EstimatorUtil.java:47)
    at org.jpmml.sklearn.Main.run(Main.java:189)
    at org.jpmml.sklearn.Main.main(Main.java:107)
Caused by: java.io.IOException
    at org.jpmml.xgboost.XGBoostDataInput.readReserved(XGBoostDataInput.java:68)
    at org.jpmml.xgboost.GBTree.load(GBTree.java:62)
    at org.jpmml.xgboost.Learner.load(Learner.java:88)
    at org.jpmml.xgboost.XGBoostUtil.loadLearner(XGBoostUtil.java:34)
    at xgboost.sklearn.Booster.loadLearner(Booster.java:51)
    ... 7 more
Preserved joblib dump file(s):  /tmp/estimator-izxjv8gy.pkl.z /tmp/mapper-wve3iwbh.pkl.z
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-94-3559b9c97359> in <module>()
----> 1 sklearn2pmml(iris_classifier, iris_mapper, "XGBClassifier.pmml", with_repr = True,debug =True)

/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/__init__.py in sklearn2pmml(estimator, mapper, pmml, with_repr, debug)
     63                 if(debug):
     64                         print(" ".join(cmd))
---> 65                 subprocess.check_call(cmd)
     66         finally:
     67                 if(debug):

/home/shuangyangwang/anaconda3/lib/python3.5/subprocess.py in check_call(*popenargs, **kwargs)
    582         if cmd is None:
    583             cmd = popenargs[0]
--> 584         raise CalledProcessError(retcode, cmd)
    585     return 0
    586 

CalledProcessError: Command '['java', '-cp', '/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/serpent-1.12.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-agent-1.2.17.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.0.6.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/guava-19.0.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pyrolite-4.12.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-schema-1.2.17.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-1.2.17.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.0.0.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-metro-1.2.17.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar:/home/shuangyangwang/anaconda3/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-converter-1.0.8.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/tmp/estimator-izxjv8gy.pkl.z', '--repr-estimator', "XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,\n       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,\n       min_child_weight=1, 
missing=None, n_estimators=100, nthread=-1,\n       objective='multi:softprob', reg_alpha=0, reg_lambda=1,\n       scale_pos_weight=1, seed=0, silent=True, subsample=1)", '--pkl-mapper-input', '/tmp/mapper-wve3iwbh.pkl.z', '--repr-mapper', "DataFrameMapper(default=False,\n        features=[(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width'], TransformerPipeline(steps=[('continuousdomain', ContinuousDomain(invalid_value_treatment='return_invalid')), ('pca', PCA(copy=True, n_components=3, whiten=False)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))])), ('Species', None)],\n        sparse=False)", '--pmml-output', 'XGBClassifier.pmml']' returned non-zero exit status 1

How can I deal with that?

Sklearn2pmml 0.10.0 Throwing Java Exception

I want to create a PMML file of version 4.2 or lower from Python, and I downloaded sklearn2pmml version 0.10.0 for this purpose. However, it fails with the exception below on my Windows 7 64-bit machine with Anaconda 2.5 and Python 2.7.13.

Code

from sklearn_pandas import DataFrameMapper
default_mapper = DataFrameMapper([(i, None) for i in list(x_train.columns) + ['Response']])

from sklearn2pmml import sklearn2pmml
sklearn2pmml(estimator=fit_cart, 
             mapper=default_mapper, 
             pmml="./DT.pmml")

Exception

CalledProcessError                        Traceback (most recent call last)
<ipython-input-17-8c8eafc3f31e> in <module>()
      5 sklearn2pmml(estimator=fit_cart, 
      6              mapper=default_mapper,
----> 7              pmml="./DT.pmml")

C:\Anaconda2\lib\site-packages\sklearn2pmml\__init__.pyc in sklearn2pmml(estimator, mapper, pmml, with_repr, debug)
     63                 if(debug):
     64                         print(" ".join(cmd))
---> 65                 subprocess.check_call(cmd)
     66         finally:
     67                 if(debug):

C:\Anaconda2\lib\subprocess.pyc in check_call(*popenargs, **kwargs)
    184         if cmd is None:
    185             cmd = popenargs[0]
--> 186         raise CalledProcessError(retcode, cmd)
    187     return 0
    188 

CalledProcessError: Command '['java', '-cp', 'C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\guava-19.0.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\istack-commons-runtime-2.21.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\jaxb-core-2.2.11.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\jaxb-runtime-2.2.11.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\jcommander-1.48.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\jpmml-converter-1.0.8.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\jpmml-sklearn-1.0.0.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\jpmml-xgboost-1.0.6.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\pmml-agent-1.2.17.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\pmml-model-1.2.17.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\pmml-model-metro-1.2.17.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\pmml-schema-1.2.17.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\pyrolite-4.12.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\serpent-1.12.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\slf4j-api-1.7.21.jar;C:\\Anaconda2\\lib\\site-packages\\sklearn2pmml\\resources\\slf4j-jdk14-1.7.21.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', 'c:\\users\\5239647\\appdata\\local\\temp\\estimator-fwa91n.pkl.z', '--pkl-mapper-input', 'c:\\users\\5239647\\appdata\\local\\temp\\mapper-kmzk1k.pkl.z', '--pmml-output', './DT.pmml']' returned non-zero exit status 1

Please help me resolve this. Thank you.

Why is the PMML file so large?

Here is my code:

from sklearn.datasets import load_svmlight_file
from sklearn.ensemble import RandomForestClassifier
from sklearn2pmml import sklearn2pmml

X_train_1, y_train_1 = load_svmlight_file('test.txt')
clf = RandomForestClassifier(n_estimators=10, n_jobs=-1, class_weight="balanced")
clf = clf.fit(X_train_1, y_train_1)
sklearn2pmml(clf, None, "rfmodel_test.pmml", with_repr = False)

I expected the size of the output (the PMML file) to be independent of the size of my training dataset; it should depend only on n_estimators and the number of features.

In the end, however, the PMML file turned out to be very large (3.8 GB, only slightly smaller than my training dataset), and when I use a small dataset the PMML file becomes small as well.

I am so confused.
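
One plausible explanation, sketched under assumptions: with the default `max_depth=None`, scikit-learn grows each tree until its leaves are pure, so the number of tree nodes grows with the number of training samples, and the exported file presumably grows with the number of nodes. The snippet below uses random data rather than the original `test.txt`, and `total_nodes` and `counts` are hypothetical names:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def total_nodes(forest):
    """Sum the node counts of all fitted trees in the forest."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)


rng = np.random.RandomState(13)
X = rng.rand(2000, 5)
y = rng.randint(0, 2, size=2000)  # random labels, so trees must grow deep

counts = {}
for n_samples in (200, 2000):
    forest = RandomForestClassifier(n_estimators=10, random_state=13)
    forest.fit(X[:n_samples], y[:n_samples])
    counts[n_samples] = total_nodes(forest)
    print(n_samples, "samples ->", counts[n_samples], "nodes")

# Capping the depth bounds the tree size regardless of the sample count.
capped = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=13)
capped.fit(X, y)
print("max_depth=5 ->", total_nodes(capped), "nodes")
```

If this is the cause, limiting `max_depth` or raising `min_samples_leaf` should shrink the output, at some cost in training accuracy.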

Deprecation warning when using sklearn >= 0.18

The sklearn.cross_validation module was deprecated in favor of sklearn.model_selection.
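
A common way to stay compatible with both old and new scikit-learn releases is a guarded import. The sketch below uses `train_test_split` purely as an example; the issue does not say which names from the old module are affected:

```python
try:
    # scikit-learn 0.18 and newer
    from sklearn.model_selection import train_test_split
except ImportError:
    # older scikit-learn releases
    from sklearn.cross_validation import train_test_split

X = [[0], [1], [2], [3]]
y = [0, 1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
print(len(X_train), len(X_test))
```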

LabelEncoder + Imputer + LabelBinarizer in mapper fails

Hi,

It looks like placing a LabelBinarizer after an Imputer in a mapper fails.

mapper = DataFrameMapper([      
           (['column'], [LabelEncoder(), Imputer(), LabelBinarizer()]),                                                                                                                                  
])

-> ValueError: Multioutput target data is not supported with label binarization

I think the problem comes from the Imputer: its output has shape (1, n_samples) instead of (n_samples, 1). The new CategoricalImputer from sklearn-pandas works, but it can't be exported to PMML. Is there any way to fix this?
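
The shape explanation can be checked in isolation. The sketch below does not use the Imputer class at all; it builds a (1, n_samples) array by hand, which is an assumption about what the Imputer step hands over, and shows that LabelBinarizer rejects it while accepting the flattened version:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Hypothetical label-encoded (and imputed) values, one per sample.
encoded = np.array([0.0, 1.0, 2.0, 0.0, 1.0])

row = encoded.reshape(1, -1)  # shape (1, n_samples): looks like one sample, many outputs
try:
    LabelBinarizer().fit(row)
    failed = False
except ValueError as exc:
    failed = True
    print("ValueError:", exc)

flat = row.ravel()  # back to one value per sample
binarized = LabelBinarizer().fit_transform(flat)
print(binarized.shape)
```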

Many thanks!!

PMML 4.2 compatible

The following link to the PMML 4.2 compatible version, which was posted in the Google group, is no longer available:
https://github.com/jpmml/sklearn2pmml.git@8304e7466c9138a081aa09ca1a3af5c74c8df150

I tried installing an older version of sklearn2pmml from a local directory. I tried versions that were committed before August 2016, but the installation does not work properly: PMMLPipeline is not available.

from sklearn2pmml import P 
pkg_resources platform

The other versions have the same issue. Kindly suggest a solution.

Customizing JVM options of the background Java process

As requested in the JPMML mailing list:
https://groups.google.com/forum/#!topic/jpmml/nIpr9gWcAq8
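
Until the package offers a dedicated setting, one possible workaround is the `JAVA_TOOL_OPTIONS` environment variable, which HotSpot-based JVMs read at startup. This is a sketch under that assumption, not a documented sklearn2pmml feature: the option values are examples only, and a child Python process stands in for `java` to show that the variable is inherited by processes started through `subprocess`:

```python
import os
import subprocess
import sys

# Assumption: the JVM on the system path is a HotSpot/OpenJDK build,
# which picks up extra options from JAVA_TOOL_OPTIONS at startup.
os.environ["JAVA_TOOL_OPTIONS"] = "-Xmx4g -Xss16m"  # heap size and thread stack size

# Child processes started via subprocess inherit os.environ by default.
# A child Python process stands in for `java` so the sketch runs anywhere.
child_view = subprocess.check_output(
    [sys.executable, "-c", "import os; print(os.environ['JAVA_TOOL_OPTIONS'])"],
    universal_newlines=True,
).strip()
print(child_view)
```

The variable has to be set before the conversion function is called, because the Java process is launched from within that call.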

Support for feature selection

Is it possible to directly use a regular sklearn pipeline with sklearn2pmml?

How can I specify the label encoding and the split into target / X for PMML?
How can I perform the selection of continuous / factor fields? Would I need to hard-code them?

# Preprocessing: takes yX as one concatenated DataFrame, because some filters affect both y and X
prep_pipe = Pipeline([
    ('clean', Preprocessor()),
    ('enrich', Enricher()),
])

prep_pipe.fit(bigDf)
bigDf = prep_pipe.transform(bigDf)
X, y = transformToXy(bigDf)

CONTINUOUS_FIELDS = X.select_dtypes(include=['number']).columns
FACTOR_FIELDS = X.select_dtypes(include=['category']).columns
X = labelEncodeCategoricalData(X)

prediction_pipe = Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('extract', ColumnExtractor(CONTINUOUS_FIELDS)),
        ])),
        ('factors', Pipeline([
            ('extract', ColumnExtractor(FACTOR_FIELDS)),
            ('oneht', OneHotEncoder()),
        ]))
    ], n_jobs=1)),
    ('clf', XGBClassifier())
])

prediction_pipe.fit(X, y)

sklearn2pmml(prediction_pipe, prep_pipe, "xgbPipeline.pmml", with_repr = True)

Random Forest Conversions and Consumption

Hello,

I'm having a few issues in testing a random forest classifier from sklearn2pmml in JPMML. I'm producing a simple PMML file from the code here:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

import pandas
import sklearn_pandas

iris = load_iris()

iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pandas.DataFrame(iris.target, columns = ["Species"])), axis = 1)

iris_mapper = sklearn_pandas.DataFrameMapper([('Sepal.Length', None),
                                              ('Sepal.Width', None),
                                              ('Petal.Length', None),
                                              ('Petal.Width', None),
                                              ('Species', None)])

iris = iris_mapper.fit_transform(iris_df)

from sklearn.ensemble import RandomForestClassifier

iris_X = iris[:, 0:4]
iris_y = iris[:, 4]

iris_classifier = RandomForestClassifier(n_estimators=10)
iris_classifier.fit(iris_X, iris_y)

from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_classifier, iris_mapper, "randomforest.pmml")

1. I'd like to apply no transformations to my data set. Leaving the transformers blank (None) changes the data type from double to float.

            <DerivedField name="x1" optype="continuous" dataType="float">
                <FieldRef field="Sepal.Length"/>

This causes an issue with JPMML throwing the following error:

org.jpmml.evaluator.TypeCheckException: Expected FLOAT, but got DOUBLE (5.3)
    at org.jpmml.evaluator.TypeUtil.toFloat(TypeUtil.java:456)
    at org.jpmml.evaluator.TypeUtil.cast(TypeUtil.java:330)
    at org.jpmml.evaluator.TypeUtil.parseOrCast(TypeUtil.java:61)
    at org.jpmml.evaluator.FieldValueUtil.create(FieldValueUtil.java:92)
    at org.jpmml.evaluator.FieldValueUtil.refine(FieldValueUtil.java:144)
    at org.jpmml.evaluator.FieldValueUtil.refine(FieldValueUtil.java:116)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:82)
    at org.jpmml.evaluator.ExpressionUtil.evaluate(ExpressionUtil.java:63)
    at org.jpmml.evaluator.PredicateUtil.evaluateSimplePredicate(PredicateUtil.java:95)
    at org.jpmml.evaluator.PredicateUtil.evaluate(PredicateUtil.java:54)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateNode(TreeModelEvaluator.java:171)
    at org.jpmml.evaluator.TreeModelEvaluator.handleTrue(TreeModelEvaluator.java:186)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateTree(TreeModelEvaluator.java:139)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluateClassification(TreeModelEvaluator.java:111)
    at org.jpmml.evaluator.TreeModelEvaluator.evaluate(TreeModelEvaluator.java:80)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateSegmentation(MiningModelEvaluator.java:463)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluateClassification(MiningModelEvaluator.java:244)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:133)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:106)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:263)
    at com.ea.eadp.risk.service.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:114)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.evaluate(PMMLLoadTestServiceImpl.java:167)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:99)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:31)
    at com.ea.eadp.test.jpmml.Program.main(Program.java:22)

Is there a way to not use DataFrameMapper or would I have to manually change each of the float types back into double?
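To see why a float field and a double value are not interchangeable, the following sketch rounds the value from the error message (5.3) to 32-bit precision using only the standard library. It illustrates the numeric difference between the two types; it says nothing about how the evaluator performs its type check internally.

```python
import struct

def to_float32(value):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", value))[0]

as_double = 5.3
as_float = to_float32(as_double)

# The two values are close but not identical, so a field declared as
# dataType="float" does not hold exactly the same number as a double input.
print(repr(as_double), repr(as_float))
print(as_double == as_float)
```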

2. After working around the issue above, JPMML complains about the model with the following error, and I'm unable to evaluate it. Any ideas on what is causing this?

org.jpmml.evaluator.TypeCheckException: Expected org.jpmml.evaluator.HasProbability, but got org.jpmml.evaluator.ClassificationMap (ClassificationMap{type=VOTE, vote_entries=[0=0.0, 1=0.3, 2=0.7]})
    at org.jpmml.evaluator.OutputUtil.asResultFeature(OutputUtil.java:862)
    at org.jpmml.evaluator.OutputUtil.getProbability(OutputUtil.java:489)
    at org.jpmml.evaluator.OutputUtil.evaluate(OutputUtil.java:182)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:143)
    at org.jpmml.evaluator.MiningModelEvaluator.evaluate(MiningModelEvaluator.java:106)
    at org.jpmml.evaluator.ModelEvaluator.evaluate(ModelEvaluator.java:263)
    at com.ea.eadp.risk.service.pmml.impl.PMMLEvaluatorImpl.evaluate(PMMLEvaluatorImpl.java:114)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.evaluate(PMMLLoadTestServiceImpl.java:167)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:99)
    at com.ea.eadp.risk.service.pmml.impl.PMMLLoadTestServiceImpl.runLoadTest(PMMLLoadTestServiceImpl.java:31)
    at com.ea.eadp.test.jpmml.Program.main(Program.java:22)
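The `type=VOTE` entries in the message above look like vote fractions rather than probabilities. Assuming (this is not confirmed in the report) that the ten trees are combined by majority vote, the numbers 0.0, 0.3 and 0.7 can be reproduced with the sketch below; `vote_fractions` and the per-tree predictions are made up for illustration.

```python
from collections import Counter

def vote_fractions(tree_predictions, classes):
    """Return the share of trees that voted for each class label."""
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    return {label: counts[label] / total for label in classes}

# Ten hypothetical trees: three vote for class 1, seven for class 2.
predictions = [1] * 3 + [2] * 7
fractions = vote_fractions(predictions, classes=[0, 1, 2])
winner = max(fractions, key=fractions.get)

print(fractions)  # {0: 0.0, 1: 0.3, 2: 0.7}
print(winner)     # 2
```

A result of this kind carries a winning label and vote shares, which would be consistent with an output field that asks for a probability being rejected.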

Thank you!

Put the package on PyPI

This would make it installable via pip and dramatically ease handling the package as dependency.

Packing estimator and mapper together

Is there a way to pack both .pkl files (estimator and mapper) in one file? That would really help me a lot as the current java model infrastructure we use in my company only supports a single file per model.
I could pack them both in a .zip file but that would be very gruesome.
I could also just provide the jpmml file instead of 2 .pkl files, but that would inflate the model file sizes and require a lot of extra bandwidth.
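On the Python side, two objects can be written into a single pickle file by wrapping them in one container. The sketch below shows only that mechanism with placeholder objects; the dictionary keys are hypothetical, and whether the Java-side converter could read such a bundle is not established here.

```python
import io
import pickle

def pack(estimator, mapper, fileobj):
    """Write both objects into one pickle stream as a single dict."""
    pickle.dump({"estimator": estimator, "mapper": mapper}, fileobj)

def unpack(fileobj):
    """Read the dict back and return the two objects."""
    bundle = pickle.load(fileobj)
    return bundle["estimator"], bundle["mapper"]

# Placeholder objects stand in for a fitted estimator and a mapper.
buffer = io.BytesIO()
pack({"kind": "estimator placeholder"}, ["mapper placeholder"], buffer)
buffer.seek(0)
estimator, mapper = unpack(buffer)
print(estimator, mapper)
```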

10x

Save feature statistics in python and retrieve them in jpmml-evaluator

Hello,

I am hoping to store feature statistics (e.g., the mean and std of each numerical feature) in the PMML file. Then when the model is evaluated on the java side, I can use the statistics to better interpret the prediction results.

Is there an easy way to do this? It seems related to the following component. However, I do not know how to add the ModelStats element using sklearn2pmml...
http://dmg.org/pmml/v4-3/Statistics.html
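For orientation, the sketch below computes the mean and standard deviation of one feature and writes them into a `ModelStats` / `UnivariateStats` / `NumericInfo` fragment of the kind the linked page describes. It builds the XML with the standard library only; the helper name and the sample values are made up, and how such a fragment would be merged into a file produced by sklearn2pmml is left open.

```python
import statistics
import xml.etree.ElementTree as ET

def univariate_stats(field, values):
    """Build a UnivariateStats element holding the mean and the sample
    standard deviation of one numeric feature."""
    stats = ET.Element("UnivariateStats", field=field)
    ET.SubElement(
        stats,
        "NumericInfo",
        mean=str(statistics.mean(values)),
        standardDeviation=str(statistics.stdev(values)),
    )
    return stats

# Illustrative values for a single feature.
sample = [5.1, 4.9, 4.7, 4.6, 5.0]
model_stats = ET.Element("ModelStats")
model_stats.append(univariate_stats("Sepal.Length", sample))
print(ET.tostring(model_stats, encoding="unicode"))
```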

Thank you so much!

movingname

Multiple Models

Hi,
Is there any way to create and export multiple models to PMML format as described in: PMML 4.3 - Multiple Models?

Thanks!

Generate pmml file error when using adaboost

Hi
When I use sklearn2pmml with AdaBoostClassifier, it generates some errors.
My script is the same as the example except for the classifier. If I change the classifier to RandomForestClassifier or other models, it works well. Could you kindly help me find out what the problem is? Thank you.
My version is:

>>> import sklearn2pmml
>>> sklearn2pmml.__version__
'0.20.3'

python:

import pandas

iris_df = pandas.read_csv("iris.csv")

from sklearn2pmml import PMMLPipeline
from sklearn2pmml.decoration import ContinuousDomain
from sklearn_pandas import DataFrameMapper
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.preprocessing import Imputer
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier,AdaBoostRegressor

iris_pipeline = PMMLPipeline([
        ("mapper", DataFrameMapper([
                (["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), Imputer()])
        ])),
        ("pca", PCA(n_components = 3)),
        ("selector", SelectKBest(k = 2)),
        ("classifier", AdaBoostClassifier())
])
iris_pipeline.fit(iris_df, iris_df["Species"])

from sklearn2pmml import sklearn2pmml

sklearn2pmml(iris_pipeline, "model.pmml", with_repr = True)

and the error is:

java -cp /home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.3.3.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pyrolite-4.19.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-agent-1.3.6.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.6.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/guava-20.0.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/serpent-1.18.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-schema-1.3.6.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-1.3.6.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-converter-1.2.3.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.7.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.7.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-gtq64tqp.pkl.z --pmml-output model.pmml
Jun 19, 2017 11:50:19 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Jun 19, 2017 11:50:19 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 119 ms.
Jun 19, 2017 11:50:19 AM org.jpmml.sklearn.Main run
INFO: Converting..
Jun 19, 2017 11:50:19 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:74)
	at org.jpmml.sklearn.Main.run(Main.java:144)
	at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.IllegalArgumentException
	at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:74)
	at org.jpmml.sklearn.Main.run(Main.java:144)
	at org.jpmml.sklearn.Main.main(Main.java:93)
Traceback (most recent call last):
  File "/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/__init__.py", line 142, in sklearn2pmml
    subprocess.check_call(cmd)
  File "/home/ly/anaconda3/lib/python3.5/subprocess.py", line 581, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['java', '-cp', '/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.3.3.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pyrolite-4.19.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-agent-1.3.6.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-api-1.7.25.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.6.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/guava-20.0.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/serpent-1.18.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-schema-1.3.6.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.25.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-1.3.6.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-converter-1.2.3.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.7.jar:/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.7.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', '/tmp/pipeline-gtq64tqp.pkl.z', '--pmml-output', 'model.pmml']' returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test6.py", line 27, in <module>
    sklearn2pmml(iris_pipeline, "model.pmml", with_repr = True)
  File "/home/ly/.local/lib/python3.5/site-packages/sklearn2pmml/__init__.py", line 144, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

add a lookup table into the PMML

Hi,

I was wondering if there is any way I can embed a lookup table into the PMML file using sklearn2pmml. For example, my model has 40 features.
Of those 40, I fill 10 features from a CSV file of shape 100000 x 10.
I would like to attach this information to the PMML file so that, when the model runs, it takes the values from there.
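In PMML terms, a lookup table of this kind corresponds roughly to a `MapValues` element with an `InlineTable`. The sketch below builds such a fragment with the standard library for a tiny made-up mapping; the helper name, the field name and the column names `key` / `value` are illustrative, and a 100000-row table would make the file correspondingly large.

```python
import xml.etree.ElementTree as ET

def map_values(input_field, mapping, default_value=None):
    """Build a MapValues element that maps input_field through an
    inline table with the columns "key" and "value"."""
    attrs = {"outputColumn": "value"}
    if default_value is not None:
        attrs["defaultValue"] = str(default_value)
    element = ET.Element("MapValues", attrs)
    ET.SubElement(element, "FieldColumnPair", field=input_field, column="key")
    table = ET.SubElement(element, "InlineTable")
    for key, value in mapping.items():
        row = ET.SubElement(table, "row")
        ET.SubElement(row, "key").text = str(key)
        ET.SubElement(row, "value").text = str(value)
    return element

# Hypothetical two-row lookup keyed by a customer identifier.
lookup = map_values("customer_id", {"c1": 0.25, "c2": 0.75}, default_value=0.0)
print(ET.tostring(lookup, encoding="unicode"))
```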

Thanks,
Jiby

Failing to map a categorical column

mapper = DataFrameMapper(
        [([column], [ContinuousDomain(), Imputer(strategy='mean')]) for column in ['Age_in_yrs','NET_INC_AMT','NET_BENEFIT_AMT','HOUSING_EXP_AMT','NET_UNERN_INC_AMT','Family_size','Four_to_eight_Changes_EDBC']] +
        [([column], [CategoricalDomain(),LabelEncoder(),LabelBinarizer()]) for column in ['MARITAL_STATUS_CD_MAL']]+
        [([column], [PolynomialFeatures(degree = 2)]) for column in ['Age_in_yrs','NET_INC_AMT','NET_BENEFIT_AMT','HOUSING_EXP_AMT','NET_UNERN_INC_AMT']]
    )

The above is my mapper. I am getting an error due to the CategoricalDomain variables, i.e. the [([column], [CategoricalDomain(),LabelEncoder(),LabelBinarizer()]) for column in ['MARITAL_STATUS_CD_MAL']] line. As soon as I take it out, the sklearn2pmml function starts working; otherwise it throws an error.

It would be really helpful if you could help me understand why this is happening!

Preserving feature names

Is it possible to preserve feature names for the estimators?

For example if I fit a RandomForest classifier with a feature named 'feature1', the PMML classifier will currently have a feature named 'x1' instead.

I know that using a Mapper I could preserve the feature names, but for my current needs that is overkill.
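The README describes the naming rule that applies here: column names are used when the pipeline is fitted with a pandas.DataFrame, and the names default to "x1", "x2", and so on otherwise. The sketch below restates that rule in plain Python; `resolve_feature_names` is a hypothetical helper, not the library's code.

```python
def resolve_feature_names(n_features, column_names=None):
    """Use the given column names if there are any, else fall back to
    the default names "x1" .. "x{n_features}"."""
    if column_names is not None:
        return list(column_names)
    return ["x{}".format(i) for i in range(1, n_features + 1)]

print(resolve_feature_names(3))                # ['x1', 'x2', 'x3']
print(resolve_feature_names(1, ["feature1"]))  # ['feature1']
```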

Can not be installed with a pip requirement file

Hello,
I am having issues installing sklearn2pmml with a requirements file for pip. I am using -e git+https://github.com/jpmml/sklearn2pmml.git@70a14db90833ae64fef84cf4ad70e9e6cd227fe3#egg=sklearn2pmml to install. However, the problem comes during the install phase, because it looks like sklearn2pmml imports sklearn before pip has installed it, which causes the install to fail.

The temporary solution has been to create a second requirements.txt file and then do a second round of installs; however, multiple packages then need to have their versions specified again to prevent unexpected upgrades to sklearn and pandas.

Is it possible to make the installer compatible with the normal pip install process?

net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 120

I'm receiving the following error after installing everything properly and running the example given in README.md.

Jul 20, 2016 4:46:11 PM org.jpmml.sklearn.Main run
INFO: Parsing DataFrameMapper PKL..
Jul 20, 2016 4:46:11 PM org.jpmml.sklearn.Main run
SEVERE: Failed to parse DataFrameMapper PKL
net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 120
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:330)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
        at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:230)
        at org.jpmml.sklearn.Main.run(Main.java:126)
        at org.jpmml.sklearn.Main.main(Main.java:107)

Exception in thread "main" net.razorvine.pickle.InvalidOpcodeException: invalid pickle opcode: 120
        at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:330)
        at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
        at org.jpmml.sklearn.PickleUtil.unpickle(PickleUtil.java:230)
        at org.jpmml.sklearn.Main.run(Main.java:126)
        at org.jpmml.sklearn.Main.main(Main.java:107)
Traceback (most recent call last):

Thank you for your effort.
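One possible reading of "invalid pickle opcode: 120": the value 120 is the byte 0x78, which is also the first byte of a zlib stream at default settings. If the dump file was zlib-compressed and the unpickler was handed the compressed bytes directly (an assumption, the report does not say how the file was written), it would see exactly this byte first. The sketch below shows the two leading bytes side by side.

```python
import pickle
import zlib

# Protocol 2 is used so that the pickle starts with the PROTO opcode.
payload = pickle.dumps({"example": [1, 2, 3]}, protocol=2)
compressed = zlib.compress(payload)

print(payload[0])     # 128, the PROTO opcode at the start of the pickle
print(compressed[0])  # 120, the first byte of the zlib stream
```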

Dealing with categorical variables to predict new values in the input feature

In a sklearn pipeline we are able to create transformations for categorical variables via cat codes. The cat codes transformation is able to deal with new values at prediction time, whereas LabelEncoder in sklearn2pmml isn't able to handle new values that were not available during training. My model can receive new values for a column as an input feature at test time, which would be a problem if we are using LabelEncoder. Can you suggest a way to deal with this encoding in PMML, either on the Python side or on the JPMML side?
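The behaviour being asked for can be sketched in plain Python: learn integer codes from the training values and map anything unseen to a sentinel instead of raising an error. The function names and the sentinel -1 are illustrative; this is not how LabelEncoder or the PMML converter work internally.

```python
def fit_codes(values):
    """Assign an integer code to every distinct training value."""
    return {category: code for code, category in enumerate(sorted(set(values)))}

def encode(values, codes, unknown=-1):
    """Encode values, mapping categories not seen in training to `unknown`."""
    return [codes.get(value, unknown) for value in values]

codes = fit_codes(["red", "green", "blue", "green"])
print(codes)                               # {'blue': 0, 'green': 1, 'red': 2}
print(encode(["green", "purple"], codes))  # [1, -1]
```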

Generating TransformationDictionary

Hi,
Can you please explain how to generate a TransformationDictionary in PMML for binning? How do I write Java code to generate a transformation dictionary like this:

<!-- declaration of formal parameters -->
<ParameterField name="TimeVal" optype="continuous" dataType="integer"/>
<!-- there can be more than one parameter field -->

<!-- The function body can be any expression-->
<!-- Parameter names are used like field names in the expression -->

<Discretize field="TimeVal">  <!-- uses name of parameter field -->
  <DiscretizeBin binValue="AM">
    <Interval closure="closedClosed" leftMargin="0" rightMargin="43199"/>
  </DiscretizeBin>
  <DiscretizeBin binValue="PM">
    <Interval closure="closedOpen" leftMargin="43200" rightMargin="86400"/>
  </DiscretizeBin>
</Discretize>
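The question asks for Java, but to stay consistent with the Python used elsewhere on this page, the sketch below generates the same `Discretize` fragment with Python's standard XML library. The helper name `discretize` is hypothetical; the field name, bin values and interval margins are copied from the snippet above.

```python
import xml.etree.ElementTree as ET

def discretize(field, bins):
    """Build a Discretize element from (binValue, closure, left, right) tuples."""
    element = ET.Element("Discretize", field=field)
    for bin_value, closure, left, right in bins:
        bin_element = ET.SubElement(element, "DiscretizeBin", binValue=bin_value)
        ET.SubElement(
            bin_element,
            "Interval",
            closure=closure,
            leftMargin=str(left),
            rightMargin=str(right),
        )
    return element

am_pm = discretize("TimeVal", [
    ("AM", "closedClosed", 0, 43199),
    ("PM", "closedOpen", 43200, 86400),
])
print(ET.tostring(am_pm, encoding="unicode"))
```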

sklearn2pmml 0.18 pipeline with SMOTE oversampling

I have recently been working with the new sklearn2pmml version requiring pipelines, and have stumbled upon an issue when trying to convert to PMML including a label encoding mapper along with SMOTE oversampling of input classes. I include all versions of my packages at the bottom for your reference.

The issue is that it appears as though sklearn2pmml requires all steps in a pipeline to be fit at the same time. This causes an issue for me, since in my pipeline I have a mapper (including a labelencoder) and a random forest classifier. However, the system that eventually will interface with my output PMML will provide raw features (categorical) to the PMML. So, I believe I am required to input raw categorical data to the pipeline.fit() method in order for the PMML to reflect the labelencoding in its data mapping. However, I also want to oversample the data for the classifier training included in the pipeline. This oversampling (using SMOTE) results in non-categorical nd-array data (floats). While these floats allow my classifier to be fit, the mapper doesn't transform these features, since they are now floats and not categorical features. Is there any way to preserve the dataframe mapper (categorical-->labelencoding in this case), while also providing oversampled data (nd-array of floats) to the classifier in the pipeline?

Thanks

Versions
('python: ', '2.7.12')
('sklearn: ', '0.18')
('sklearn.externals.joblib:', '0.10.2')
('pandas: ', u'0.19.0')
('sklearn_pandas: ', '1.2.0')
('sklearn2pmml: ', '0.15.0')

non-zero exit status 1

I am currently attempting to transfer my sklearn model into a pmml file which will be read in a java program later on. My datasets have already gone through preprocessing so further transformations will not be very helpful. However, I keep receiving the error below even when I try different transformations.

The model seems to have been generated successfully and the output from the DataFrameMapper also seems correct. However, the error pops up when attempting to create and export the pmml file.

Code:

import pandas
import sklearn_pandas

import numpy as np
import pandas as pd

import sklearn
from sklearn import cross_validation
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv('features.csv')
df = df.rename(columns={'Unnamed: 0': 'rowNumber'})
df = df.drop('rowNumber', 1)
print(len(df))
df = df.drop_duplicates()
print(len(df))
df = df[(df.pitch >= 0) & (df.pitch <= 100)]
print(len(df))
df = df.drop('matrix', 1)

df60 = df[df.participant < 22]
dfTrainInput = df60.drop('pitch', 1)

listofColumns = list(dfTrainInput.columns.values)

#DataFrameMapper step
df_mapper = sklearn_pandas.DataFrameMapper([
    (listofColumns, None),
    ("pitch", None)
])

data = df_mapper.fit_transform(df)

data_Input = data[:, 0:len(data[0]) - 1]
data_Target = data[:, len(data[0]) - 1]

#Training Step
neigh = KNeighborsRegressor(n_neighbors=400, algorithm='kd_tree', leaf_size=30, n_jobs = -1)
neigh.fit(data_Input, data_Target)

#PMML conversion and export step
from sklearn2pmml import sklearn2pmml
sklearn2pmml(neigh, df_mapper, "FeaturesAsFeaturesKNN.pmml", with_repr = True)

Error:

Traceback (most recent call last):
  File "/Users/leslie/Downloads/test_sklearn2pmml.py", line 73, in <module>
    sklearn2pmml(neigh, df_mapper, "FeaturesAsFeaturesKNN.pmml", with_repr = True)
  File "/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/__init__.py", line 65, in sklearn2pmml
    subprocess.check_call(cmd)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/subprocess.py", line 541, in check_call
    raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['java', '-cp', '/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/guava-19.0.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-converter-1.1.1.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.1.2.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.1.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-agent-1.3.3.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-1.3.3.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.3.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pmml-schema-1.3.3.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/pyrolite-4.13.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/serpent-1.12.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-api-1.7.21.jar:/Users/leslie/Library/Python/2.7/lib/python/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.21.jar', 'org.jpmml.sklearn.Main', '--pkl-estimator-input', '/var/folders/ph/8rj4z16x1w3cy86hj3qh7nxh0000gn/T/estimator-8TGJgQ.pkl.z', '--repr-estimator', "KNeighborsRegressor(algorithm='kd_tree', leaf_size=30, metric='minkowski',\n          metric_params=None, n_jobs=-1, n_neighbors=400, p=2,\n          
weights='uniform')", '--pkl-mapper-input', '/var/folders/ph/8rj4z16x1w3cy86hj3qh7nxh0000gn/T/mapper-1hiTl4.pkl.z', '--repr-mapper', "DataFrameMapper(features=[(['participant', 'condition', 'yaw', 'touchX', 'touchY', 'touchW', 'touchH', 'S0cx', 'S0cy', 'S0eo', 'S0evp', 'S0evm', 'S0ee', 'S1cx', 'S1cy', 'S1eo', 'S1evp', 'S1evm', 'S1ee', 'S2cx', 'S2cy', 'S2eo', 'S2evp', 'S2evm', 'S2ee', 'T0cx', 'T0cy', 'T0eo', 'T0evp', 'T0evm', 'T0ee', 'T1cx', 'T1cy', 'T1eo', 'T1evp', 'T1evm', 'T1ee', 'T2cx', 'T2cy', 'T2eo', 'T2evp', 'T2evm', 'T2ee', 'Ucx', 'Ucy', 'Ueo', 'Uevp', 'Uevm', 'Uee'], None), ('pitch', None)],\n        sparse=False)", '--pmml-output', 'FeaturesAsFeaturesKNN.pmml']' returned non-zero exit status 1

Weirdly enough, attempting to run the same code with Jupyter returns "OSError: [Errno 2] No such file or directory".

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-39-3b3cc81c782e> in <module>()
      1 from sklearn2pmml import sklearn2pmml
      2 
----> 3 sklearn2pmml(neigh, df_mapper, 'FeaturesAsFeaturesKNN.pmml', with_repr = True)

/home/mayersn/.local/lib/python2.7/site-packages/sklearn2pmml/__init__.pyc in sklearn2pmml(estimator, mapper, pmml, with_repr, debug)
     63                 if(debug):
     64                         print(" ".join(cmd))
---> 65                 subprocess.check_call(cmd)
     66         finally:
     67                 if(debug):

/usr/lib/python2.7/subprocess.pyc in check_call(*popenargs, **kwargs)
    533     check_call(["ls", "-l"])
    534     """
--> 535     retcode = call(*popenargs, **kwargs)
    536     if retcode:
    537         cmd = kwargs.get("args")

/usr/lib/python2.7/subprocess.pyc in call(*popenargs, **kwargs)
    520     retcode = call(["ls", "-l"])
    521     """
--> 522     return Popen(*popenargs, **kwargs).wait()
    523 
    524 

/usr/lib/python2.7/subprocess.pyc in __init__(self, args, bufsize, executable, stdin, stdout, stderr, preexec_fn, close_fds, shell, cwd, env, universal_newlines, startupinfo, creationflags)
    708                                 p2cread, p2cwrite,
    709                                 c2pread, c2pwrite,
--> 710                                 errread, errwrite)
    711         except Exception:
    712             # Preserve original exception in case os.close raises.

/usr/lib/python2.7/subprocess.pyc in _execute_child(self, args, executable, preexec_fn, close_fds, cwd, env, universal_newlines, startupinfo, creationflags, shell, to_close, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite)
   1325                         raise
   1326                 child_exception = pickle.loads(data)
-> 1327                 raise child_exception
   1328 
   1329 

OSError: [Errno 2] No such file or directory
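An `OSError: [Errno 2]` raised while starting a subprocess usually means that the program itself could not be found. The README requires the Java executable to be on the system path, so one thing worth checking from inside the notebook is whether `java` is visible to that process. The sketch below uses `shutil.which`, which needs Python 3.3 or newer (the report above uses Python 2.7); that the missing file is `java` is an assumption.

```python
import shutil

def find_java():
    """Return the full path of the `java` executable, or None if it is
    not on the PATH of the current process."""
    return shutil.which("java")

java_path = find_java()
if java_path is None:
    print("java not found on PATH; the subprocess call would fail to start")
else:
    print("java found at", java_path)
```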

Any help or insight into the problem would be tremendously helpful!

imbalanced-learn package support

I am using the imbalanced-learn package (part of the scikit-learn-contrib projects) to balance the two classes using SMOTE. I wanted to export my already fitted pipeline ("SMOTE_clf") to PMML to use it in Spark by simply writing:

my_pipeline = PMMLPipeline([
  ("estimator", SMOTE_clf)
])
sklearn2pmml(my_pipeline, "my_pipeline.pmml", with_repr = True)

but it gives me this error: "The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams".

Is it because the PMML conversion does not support the imbalanced-learn package?

returned non-zero exit status 1 when using GridSearchCV

I am getting the "returned non-zero exit status 1" error with the new version 0.17 of sklearn2pmml when using it with GridSearchCV.

Version info

('python: ', '2.7.6')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('pandas: ', u'0.19.2')
('sklearn_pandas: ', '1.3.0')
('sklearn2pmml: ', '0.17.0')

Code to reproduce

Working correctly:

from sklearn.datasets import load_boston
boston_data = load_boston()
X = boston_data.data
y = boston_data.target

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml

knn_pipe = PMMLPipeline([
("regressor", KNeighborsRegressor())
])

knn_pipe.fit(X,y)
sklearn2pmml(knn_pipe, ".../SimpleFit.pmml", with_repr = True, debug = True)

Throwing error:

from sklearn.datasets import load_boston
boston_data = load_boston()
X = boston_data.data
y = boston_data.target

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn2pmml import PMMLPipeline
from sklearn2pmml import sklearn2pmml

knn_pipe = PMMLPipeline([
("regressor", KNeighborsRegressor())
])

param_grid = {"regressor__n_neighbors": [3, 2,10],
          "regressor__weights": ["uniform","distance"],
          "regressor__algorithm": ["auto", "ball_tree", "kd_tree"]}
cv = GridSearchCV(knn_pipe, param_grid=param_grid)
cv.fit(X,y)

Using the following line gives "TypeError: The pipeline object is not an instance of PMMLPipeline" which is understandable.

sklearn2pmml(cv, ".../GridSearchFit.pmml", with_repr = True, debug = True)

So I tried using cv.best_estimator_ in it, but it throws the "returned non-zero exit status 1" error.

sklearn2pmml(cv.best_estimator_, ".../GridSearchFit.pmml", with_repr = True, debug = True)

Stack trace of error:

('python: ', '2.7.6')
('sklearn: ', '0.18.1')
('sklearn.externals.joblib:', '0.10.3')
('pandas: ', u'0.19.2')
('sklearn_pandas: ', '1.3.0')
('sklearn2pmml: ', '0.17.0')
java -cp /usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-api-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-schema-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-metro-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pyrolite-4.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-agent-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jcommander-1.48.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-sklearn-1.2.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/guava-19.0.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-converter-1.2.1.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/serpent-1.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.2.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.5.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-1.3.4.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-yd1bTD.pkl.z --repr-pipeline PMMLPipeline(steps=[('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=1, n_neighbors=10, p=2,
          weights='distance'))]) --pmml-output /home/.../GridSearchFit.pmml
('Preserved joblib dump file(s): ', '/tmp/pipeline-yd1bTD.pkl.z')
Traceback (most recent call last):

  File "<ipython-input-12-b7a0923021e7>", line 1, in <module>
    sklearn2pmml(cv.best_estimator_, "/home/.../GridSearchFit.pmml", with_repr = True, debug = True)

  File "/usr/local/lib/python2.7/dist-packages/sklearn2pmml/__init__.py", line 132, in sklearn2pmml
    subprocess.check_call(cmd)

  File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
    raise CalledProcessError(retcode, cmd)

CalledProcessError: Command '['java', '-cp', '/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-api-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-schema-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-metro-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pyrolite-4.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-agent-1.3.4.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jcommander-1.48.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-sklearn-1.2.6.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/guava-19.0.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.22.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-converter-1.2.1.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/serpent-1.16.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.2.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.5.jar:/usr/local/lib/python2.7/dist-packages/sklearn2pmml/resources/pmml-model-1.3.4.jar', 'org.jpmml.sklearn.Main', '--pkl-pipeline-input', '/tmp/pipeline-yd1bTD.pkl.z', '--repr-pipeline', "PMMLPipeline(steps=[('regressor', KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',\n          metric_params=None, n_jobs=1, n_neighbors=10, p=2,\n          weights='distance'))])", '--pmml-output', '/home/.../GridSearchFit.pmml']' returned non-zero exit status 1

Here is the saved pickle file for this error. I renamed it from Grid_pipeline-yd1bTD.pkl.z to Grid_pipeline-yd1bTD.pkl.zip so that I could upload it here.
Grid_pipeline-yd1bTD.pkl.zip

Convenience method for transforming existing estimator objects to `PMMLPipeline` objects

I have about a hundred sklearn classifiers (each stored pickled, if that matters) that I'd like to export to PMML. How can I apply this tool to an existing classifier, versus wrapping a new training process?
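
One possible approach, sketched under assumptions: later sklearn2pmml releases provide a `make_pmml_pipeline` helper that wraps an already-fitted estimator in a `PMMLPipeline`. The sketch below assumes that helper is available in the installed version, that each classifier was saved with the standard `pickle` module, and that the feature and target names are known. The names `pmml_path_for` and `convert_pickled_estimators` are hypothetical.

```python
import pickle
from pathlib import Path


def pmml_path_for(pkl_path, out_dir):
    """Map 'models/clf_a.pkl' to '<out_dir>/clf_a.pmml' (hypothetical naming scheme)."""
    return Path(out_dir) / (Path(pkl_path).stem + ".pmml")


def convert_pickled_estimators(pkl_paths, out_dir, active_fields=None, target_fields=None):
    """Wrap each already-fitted estimator in a PMMLPipeline and export it to PMML."""
    # Imported lazily so that the path helper above works without sklearn2pmml.
    # Assumption: the installed sklearn2pmml version provides make_pmml_pipeline.
    from sklearn2pmml import make_pmml_pipeline, sklearn2pmml

    for pkl_path in pkl_paths:
        with open(pkl_path, "rb") as f:
            estimator = pickle.load(f)  # only unpickle files from a trusted source
        pipeline = make_pmml_pipeline(
            estimator,
            active_fields=active_fields,  # feature names, e.g. ["x1", "x2"]
            target_fields=target_fields,  # target name(s), e.g. ["y"]
        )
        sklearn2pmml(pipeline, str(pmml_path_for(pkl_path, out_dir)))
```

Because the estimators are not refitted, no training data is needed; the feature names have to be supplied by hand, otherwise the defaults apply.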

Pmml Binning Issue

Hi,

I am trying to write a transformation that bins the invoice amount in the PMML, but I don't know how to write it. Could you point me to a resource or provide some sample code? I am a beginner at integrating Python models with PMML, so I am having a difficult time.

I looked into the MapValues and RuleSet documentation, but it did not help, or I couldn't understand the flow.
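
A minimal sketch of what binning means here, assuming right-closed intervals with the lowest edge included. The bin edges, labels and function names are hypothetical. The second function assumes that the installed sklearn2pmml version provides `sklearn2pmml.preprocessing.CutTransformer` (a wrapper around `pandas.cut`) and that it accepts these keyword arguments.

```python
from bisect import bisect_left

# Hypothetical bin edges and labels for an invoice amount.
BINS = [0, 100, 1000, 10000]
LABELS = ["small", "medium", "large"]


def bin_amount(amount, bins=BINS, labels=LABELS):
    """Pure-Python reference: intervals are (lo, hi], and the lowest edge is included."""
    if amount < bins[0] or amount > bins[-1]:
        return None  # the amount falls outside all bins
    # bisect_left finds the first edge >= amount; the bin index is one less.
    index = max(bisect_left(bins, amount) - 1, 0)
    return labels[index]


def make_cut_transformer():
    """Build the equivalent transformer for use as a pipeline step.

    Assumption: CutTransformer exists in the installed sklearn2pmml version.
    """
    from sklearn2pmml.preprocessing import CutTransformer

    return CutTransformer(bins=BINS, right=True, labels=LABELS, include_lowest=True)
```

The pure-Python function is only there to make the interval semantics explicit; in a pipeline, the transformer would be applied to the invoice amount column before the estimator step.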

Support for `MultiOutputRegressor` estimator type

Hi, I want to generate PMML that includes multiple targets. For RandomForestRegressor, the generated PMML appears to be single-target. For MultiOutputRegressor, the error "Failed to convert java.lang.IllegalArgumentException" pops up.

RandomForestRegressor example:
Here is the code:

from sklearn import datasets
from sklearn.datasets.base import Bunch
import csv
import numpy as np
from time import time
import pandas as pd
import scipy

caseName = "6MultipleOutputRandomForestRegressor_conti"
df = pd.read_csv("/D/AC/5.0/ScoringWithScikitLearn/Tests/data/employ_salary.csv",sep=",")
test_X = df.iloc[:,4:7]
test_y = df[['average_montly_hours','satisfaction_level']]

from sklearn2pmml import sklearn2pmml
from sklearn2pmml import PMMLPipeline
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import ContinuousDomain
from sklearn.preprocessing import Imputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

max_depth = 30
pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (list(test_X.columns.values), [ContinuousDomain(), Imputer()])
    ])),
    ("regression", RandomForestRegressor(max_depth=max_depth, random_state=0)),
])

pipeline.fit(test_X,test_y)
sklearn2pmml(pipeline, "/D/AC/5.0/ScoringWithScikitLearn/Tests/out/"+caseName+"_PyPMML.xml", with_repr = True)

Expected behavior: multiple target fields in the MiningSchema element of the PMML.
Current behavior: only one target field in the MiningSchema element of the PMML, like this:

<MiningSchema>
  <MiningField name="y" usageType="target"/>
  <MiningField name="Work_accident" missingValueReplacement="0.1446096406427095" missingValueTreatment="asMean"/>
  <MiningField name="time_spend_company" missingValueReplacement="3.498233215547703" missingValueTreatment="asMean"/>
  <MiningField name="left" missingValueReplacement="0.2380825388359224" missingValueTreatment="asMean"/>
</MiningSchema>

MultiOutputRegressor example:
Here is the code (the same as the RandomForestRegressor example, only the model differs):

pipeline = PMMLPipeline([
    ("mapper", DataFrameMapper([
        (list(test_X.columns.values), [ContinuousDomain(), Imputer()])
    ])),
    ("regression", MultiOutputRegressor(RandomForestRegressor(max_depth=max_depth, random_state=0))),
])

Here is the error:

Aug 24, 2017 10:40:18 AM org.jpmml.sklearn.Main run
INFO: Parsing PKL..
Aug 24, 2017 10:40:18 AM org.jpmml.sklearn.Main run
INFO: Parsed PKL in 56 ms.
Aug 24, 2017 10:40:18 AM org.jpmml.sklearn.Main run
INFO: Converting..
Aug 24, 2017 10:40:18 AM org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:74)
    at org.jpmml.sklearn.Main.run(Main.java:144)
    at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.IllegalArgumentException
    at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:74)
    at org.jpmml.sklearn.Main.run(Main.java:144)
    at org.jpmml.sklearn.Main.main(Main.java:93)
Traceback (most recent call last):
  File "6MultipleOutputReg_conti.py", line 40, in <module>
    sklearn2pmml(pipeline, "/D/AC/5.0/ScoringWithScikitLearn/Tests/out/"+caseName+"_PyPMML.xml", with_repr = True)
  File "/Users/lihuaw/.local/lib/python2.7/site-packages/sklearn2pmml/__init__.py", line 142, in sklearn2pmml
    raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")
RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

Note: I also tried KNeighborsClassifier, KNeighborsRegressor and MLPRegressor; all of them hit the same error with MultiOutputRegressor. I guess multi-target PMML is not currently supported by sklearn2pmml. Am I right?
Is there any plan to support this?
Please correct me if I'm wrong.
Thanks a lot,
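
If multi-target conversion is not available in the installed version, one possible workaround (not the library's own method) is to fit one single-target pipeline per target column and export each one separately. A minimal sketch follows; `fit_one_pipeline_per_target` is a hypothetical helper, and `make_pipeline` stands for any zero-argument function that returns a fresh, unfitted pipeline.

```python
def fit_one_pipeline_per_target(make_pipeline, X, y, target_names):
    """Fit an independent single-target pipeline for every target column.

    make_pipeline: zero-argument callable returning a fresh, unfitted pipeline
    y: object indexable by column name, e.g. a pandas.DataFrame or a dict
    """
    pipelines = {}
    for name in target_names:
        pipeline = make_pipeline()
        # With a pandas.DataFrame, y[name] is a Series whose name becomes the target name.
        pipeline.fit(X, y[name])
        pipelines[name] = pipeline
    return pipelines
```

Each fitted pipeline can then be passed to `sklearn2pmml(pipeline, path)` on its own, which yields one PMML file per target rather than a single multi-target document.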

pip install fails on a clean system

$ mkvirtualenv sklearn2pmml
(sklearn2pmml) $ pip install git+https://github.com/jpmml/sklearn2pmml.git
Collecting git+https://github.com/jpmml/sklearn2pmml.git
  Cloning https://github.com/jpmml/sklearn2pmml.git to /tmp/pip-vk80_S-build
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-vk80_S-build/setup.py", line 3, in <module>
        from sklearn2pmml import __license__, __version__
      File "sklearn2pmml/__init__.py", line 3, in <module>
        from sklearn.base import BaseEstimator
    ImportError: No module named sklearn.base
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-vk80_S-build/

The reason is that pip needs to execute setup.py to get the install dependencies, and those dependencies cannot be imported before that has happened. Importing __version__ and __license__ from __init__.py in turn imports the dependencies.

This is quite annoying and can be fixed in a few different ways. The simplest is to write the strings directly in setup.py as well. That is not a problem for the license, which will rarely change, but may be annoying for the version, which you bump every now and then.

I suggest copying the license string and moving the version to version.py, then importing it in both sklearn2pmml/__init__.py and setup.py, with setup.py modifying sys.path so that it can import version.py on its own. I can send a pull request.
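
A minimal sketch of a variation on this proposal: instead of modifying sys.path, setup.py could read the version string from version.py as plain text, so that nothing from the package is imported at install time. The function name `read_version` and the file layout are assumptions for illustration.

```python
import re
from pathlib import Path


def read_version(version_file):
    """Extract __version__ from a file by parsing its text, without importing the package."""
    text = Path(version_file).read_text(encoding="utf-8")
    match = re.search(r"^__version__\s*=\s*['\"]([^'\"]+)['\"]", text, re.MULTILINE)
    if match is None:
        raise RuntimeError("No __version__ string found in %s" % version_file)
    return match.group(1)


# Hypothetical use inside setup.py:
#   setup(name="sklearn2pmml", version=read_version("sklearn2pmml/version.py"), ...)
```

The package itself can still do `from .version import __version__`, so the version is defined in exactly one place.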

visibility into java errors

Is there a way to get more information about the failure than the output below when issuing this command?
sklearn2pmml(lm, df_mapper, "my_model.pmml", with_repr = True)
CalledProcessError:
returned non-zero exit status 1
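
Other reports on this page pass `debug = True`, which prints the full Java command line so that it can be re-run by hand. As a general illustration of how a wrapper can surface a child process's error output instead of a bare exit status, here is a minimal standard-library sketch; `run_and_report` is a hypothetical helper, not part of sklearn2pmml.

```python
import subprocess


def run_and_report(cmd):
    """Run a command; on failure, raise an error that includes its captured output."""
    process = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode the output as text
    )
    if process.returncode != 0:
        raise RuntimeError(
            "Command failed with exit status %d\nstdout:\n%s\nstderr:\n%s"
            % (process.returncode, process.stdout, process.stderr)
        )
    return process.stdout
```

Passing the printed Java command (as a list of arguments) to such a helper would put the Java stack trace into the Python exception message.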

pkl file not found error

sklearn2pmml(bnb_pipeline, "menu_description_ml.pmml", with_repr = True, debug = True)

When I run the above command inside IPython, I see the following output:

python:  3.5.2
sklearn:  0.18.1
sklearn.externals.joblib: 0.10.3
pandas:  0.18.1
sklearn_pandas:  1.3.0
sklearn2pmml:  0.17.4
java -cp /root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-agent-1.3.5.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-core-2.2.11.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-converter-1.2.2.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-api-1.7.24.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-sklearn-1.2.10.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/serpent-1.17.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pyrolite-4.18.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jcommander-1.48.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-lightgbm-1.0.4.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-metro-1.3.5.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/istack-commons-runtime-2.21.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-schema-1.3.5.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jaxb-runtime-2.2.11.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/guava-20.0.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/jpmml-xgboost-1.1.6.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/pmml-model-1.3.5.jar:/root/.local/lib/python3.5/site-packages/sklearn2pmml/resources/slf4j-jdk14-1.7.24.jar org.jpmml.sklearn.Main --pkl-pipeline-input /tmp/pipeline-fyqa1gdv.pkl.z --pmml-output menu_description_ml.pmml
Preserved joblib dump file(s):  /tmp/pipeline-fyqa1gdv.pkl.z

When I run the command manually, Java is not able to find /tmp/pipeline-XXXXXXX.pkl.z.

Questions: When does the pipeline-XXXXXXX.pkl.z ... Is it failing because of permission issues? (I am running inside a Docker container.)
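
Whether the dump file is visible to the current process can be checked directly. This is a minimal diagnostic sketch using only the standard library; `describe_file` is a hypothetical helper. It assumes the check runs in the same container and as the same user as the Java command, since a container's /tmp is normally not shared with the host unless it is mounted.

```python
import os


def describe_file(path):
    """Report whether a file exists, is readable by the current user, and its size."""
    exists = os.path.isfile(path)
    return {
        "path": path,
        "exists": exists,
        "readable": exists and os.access(path, os.R_OK),
        "size_bytes": os.path.getsize(path) if exists else None,
    }


# Example with the path printed in the log above:
#   print(describe_file("/tmp/pipeline-fyqa1gdv.pkl.z"))
```

If `exists` is false, the path itself is the problem (wrong name, or a different filesystem); if `exists` is true but `readable` is false, permissions are the more likely cause.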
