
Failing to map a categorical column about sklearn2pmml HOT 7 CLOSED

jpmml commented on August 20, 2024

Failing to map a categorical column

from sklearn2pmml.

Comments (7)

vruusmann commented on August 20, 2024

Getting the error: 'The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams' while trying to convert a regression pipeline into PMML format.

The above is a "wrapper" Python exception. What was the original Java exception that was thrown?

[([column], [CategoricalDomain(),LabelEncoder(),LabelBinarizer()]) for column in ['MARITAL_STATUS_CD_MAL']]

The LabelEncoder transformation is not necessary here. Simply do [CategoricalDomain(), LabelBinarizer()].


gdhruv80 commented on August 20, 2024

Hi, thanks for your help! I am running LabelEncoder() first because the categorical inputs are non-numeric. To convert them to numeric, I am running the LabelEncoder first as the LabelBinarizer does not take non-numeric categorical inputs.

Also below is the complete error that I am getting:

RuntimeError                              Traceback (most recent call last)
<ipython-input-5-745dd10d7243> in <module>()
    125 
    126 from sklearn2pmml import sklearn2pmml
--> 127 sklearn2pmml(Elastic_net_1, out_PMML + 'Elastic_Net_1_xx.pmml', with_repr = True)

/opt/wakari/anaconda/lib/python2.7/site-packages/sklearn2pmml/__init__.pyc in sklearn2pmml(pipeline, pmml, user_classpath, with_repr, debug)
    140                         subprocess.check_call(cmd)
    141                 except CalledProcessError:
--> 142                         raise RuntimeError("The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams")
    143         finally:
    144                 if(debug):

RuntimeError: The JPMML-SkLearn conversion application has failed. The Java process should have printed more information about the failure into its standard output and/or error streams

Any help or suggestions on this would be greatly appreciated. Also, the error goes away if I run only LabelBinarizer() (after converting the text categorical variables to numbers using pandas).


vruusmann commented on August 20, 2024

I am running the LabelEncoder first as the LabelBinarizer does not take non-numeric categorical inputs.

The LabelBinarizer transformation accepts non-numeric inputs. For example, it can be directly applied to string columns.

Also below is the complete error that I am getting:

This is a front-end Python exception. I need to see the back-end Java exception.

You appear to be working in the IPython/Jupyter Notebook environment. Please check its log file for this Java exception.
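
As an illustration of why the front-end exception carries no detail: the traceback above shows the wrapper calling subprocess.check_call, which reports only the exit status, while the child process's error text goes to its own stderr stream. The following is a minimal Python 3 sketch of capturing that stream, not sklearn2pmml's actual code. The names run_and_report and failing_cmd are hypothetical, and a small Python child process stands in for the Java converter.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run a child process; on failure, raise an error that includes its stderr."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            "Command failed with exit code {}:\n{}".format(result.returncode, result.stderr)
        )
    return result.stdout


# Stand-in child process (an assumption for this sketch): it writes a message
# to stderr and exits with a non-zero status, as a failing converter might.
failing_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('IllegalArgumentException: expected one element'); sys.exit(1)",
]

try:
    run_and_report(failing_cmd)
except RuntimeError as error:
    captured_message = str(error)
    print(captured_message)
```

In a notebook, the stderr that a child process inherits typically ends up in the terminal or log of the notebook server, which is why the log file is the place to look.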

Alternatively, please provide a reproducible example using a publicly available dataset. For example, here are my integration test cases (based on the Audit dataset), where the LabelBinarizer transformation is applied directly to string columns: https://github.com/jpmml/jpmml-sklearn/blob/master/src/test/resources/main.py#L121-L152


gdhruv80 commented on August 20, 2024

Hi !

The reason for running both the LabelEncoder and the LabelBinarizer is that I am using an elastic net regression model, which needs numeric inputs, so just a binarizer won't do, although you are absolutely right that a binarizer can take non-numeric inputs.

Below is the complete error that I am getting from Java. It is also interesting that the error occurs in the sklearn2pmml(.....) step and not in the Elastic_net_1.fit() step, i.e. the mapper that I created is compatible with the sklearn library functions but does not work correctly in the sklearn2pmml library.

I also checked the output of running LabelEncoder and LabelBinarizer on the categorical variable that I am using, and I get an array whose number of columns equals the number of unique values in the variable:

array([[0, 1, 0, 0, 0, 0],
       ..., 
       [0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0]])

which I suppose means that both label binarizer and label encoder functions are running just fine.

Error -

org.jpmml.sklearn.Main run
SEVERE: Failed to convert
java.lang.IllegalArgumentException: expected one element but was: <class java.lang.Double, class java.lang.String>
        at com.google.common.collect.Iterators.getOnlyElement(Iterators.java:322)
        at com.google.common.collect.Iterables.getOnlyElement(Iterables.java:294)
        at sklearn.TypeUtil.getDataType(TypeUtil.java:49)
        at sklearn.preprocessing.LabelEncoder.getDataType(LabelEncoder.java:59)
        at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:76)
        at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:110)
        at org.jpmml.sklearn.Main.run(Main.java:146)
        at org.jpmml.sklearn.Main.main(Main.java:93)

Exception in thread "main" java.lang.IllegalArgumentException: expected one element but was: <class java.lang.Double, class java.lang.String>
        at com.google.common.collect.Iterators.getOnlyElement(Iterators.java:322)
        at com.google.common.collect.Iterables.getOnlyElement(Iterables.java:294)
        at sklearn.TypeUtil.getDataType(TypeUtil.java:49)
        at sklearn.preprocessing.LabelEncoder.getDataType(LabelEncoder.java:59)
        at sklearn_pandas.DataFrameMapper.encodeFeatures(DataFrameMapper.java:76)
        at sklearn2pmml.PMMLPipeline.encodePMML(PMMLPipeline.java:110)
        at org.jpmml.sklearn.Main.run(Main.java:146)
        at org.jpmml.sklearn.Main.main(Main.java:93)

Let me know your thoughts and thanks again for all your help :)


vruusmann commented on August 20, 2024

java.lang.IllegalArgumentException: expected one element but was: <class java.lang.Double, class java.lang.String>

That's the Java exception that I was waiting for!

The converter assumes that all the values in a column are of the same data type. However, there is a column in your dataset that contains strings mixed with float64s (aka doubles).

You can solve this issue by explicitly casting all values to the same data type. For example:

df["x1"] = df["x1"].astype("float64")

Sure, the converter could be a little bit smarter, and assume that if the column contains mixed data type values, one of which is string, then the other data type should "win".

Reopening this issue to implement this idea.


gdhruv80 commented on August 20, 2024

Hey !

Firstly, thanks for looking into this, and sorry for the slow response; I was tied up elsewhere. I looked into what you said and went through the levels of the variable that was giving the error, to see whether it contained multiple data types. I found the following values:

array(['MA', 'DI', 'SE', 'NM', 'WI', nan], dtype=object)

I suppose it treats the other values as strings but not the nan. So, as you suggested, I force-converted the variable to string using:

df["x1"] = df["x1"].astype(str)

After this I got the following distinct levels in the variable (The 'nan' got converted to string)

array(['MA', 'DI', 'SE', 'NM', 'WI', 'nan'], dtype=object)
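
A side note on this cast: it turns the missing value into the literal category 'nan'. If an explicit label for missing values is preferred, the NaN can be replaced before casting. The sketch below shows the idea in plain Python; fill_missing and the "MISSING" placeholder are hypothetical names chosen for illustration. With pandas, the equivalent would be along the lines of df["x1"].fillna("MISSING").astype(str).

```python
import math


def fill_missing(values, placeholder="MISSING"):
    """Return a list of strings, with NaN/None replaced by a placeholder label."""
    cleaned = []
    for value in values:
        is_nan = isinstance(value, float) and math.isnan(value)
        if value is None or is_nan:
            cleaned.append(placeholder)
        else:
            cleaned.append(str(value))
    return cleaned


# The levels reported above, with NaN as a float among strings.
levels = ["MA", "DI", "SE", "NM", "WI", float("nan")]
cleaned_levels = fill_missing(levels)
print(cleaned_levels)
```

Either way the column ends up holding a single data type, which is what the converter expects.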

After this I tried running the sklearn2pmml function but I am still getting the same error. In this run there are 7 columns in total: 6 are float64 (these do not raise an error) and 1 is the categorical string variable described above, which still gives the same error when I run LabelEncoder().

Would appreciate your response !

Thanks
Dhruv


vruusmann commented on August 20, 2024

After this I tried running the sklearn2pmml function but I am still getting the same error.

Please double-check, there must be some other mixed-data type column in your dataset.
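
One way to double-check is to list the Python types that actually occur in each column. This is a minimal sketch under assumptions: mixed_type_columns is a hypothetical helper, and the toy data only mimics the column described in this thread. For a pandas DataFrame, something like df.to_dict("list") should produce a suitable mapping of column names to values.

```python
def mixed_type_columns(columns):
    """Return {column name: sorted type names} for columns holding more than one type."""
    report = {}
    for name, values in columns.items():
        type_names = sorted({type(value).__name__ for value in values})
        if len(type_names) > 1:
            report[name] = type_names
    return report


# Toy data (an assumption): one string column containing a NaN,
# and one purely numeric column.
columns = {
    "MARITAL_STATUS_CD_MAL": ["MA", "DI", "SE", float("nan")],
    "x1": [1.0, 2.5, 3.0, 4.2],
}
report = mixed_type_columns(columns)
print(report)
```

Any column that shows up in the report mixes types, for example strings with a float NaN, and is a candidate for an explicit cast.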

Also, are you using the latest sklearn2pmml package version? This should print 0.20.3:

import sklearn2pmml

print(sklearn2pmml.__version__)

Regarding "the same error" - are you still using [LabelEncoder(), LabelBinarizer()], or did you switch to [LabelBinarizer()]?

As a last resort, you could save your problematic pipeline/model object in Pickle data format, and attach it here/send to my e-mail for deeper analysis:

from sklearn.externals import joblib

joblib.dump(model, "model.pkl.z", compress = 9)
