
Comments (23)

vruusmann commented on July 20, 2024

sklearn2pmml(pmml_pipe, "irisClf.pmml", with_repr = True)

The updated pmml_pipe object must contain some unsupported Pandas' data type objects, because the sklearn2pmml utility function fails with the following error:

Caused by: net.razorvine.pickle.PickleException: Expected 8 attribute(s), got 9 attribute(s)
        at org.jpmml.python.CustomPythonObject.createAttributeMap(CustomPythonObject.java:81)
        at numpy.DType.__setstate__(DType.java:49)

Exactly what I wanted to have! Can work on it locally now.


vruusmann commented on July 20, 2024

ExpressionTransformer("math.sin(2 * math.pi * (X[0] + 1) / 86400)")

Well, the math.sin function is not recognized by the JPMML-Python library yet, so I replaced it with the numpy.sin universal function.

After that, the pmml_pipe object converts nicely!
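For reference, a minimal sketch of the corrected transformer (assuming that the expression translator accepts numpy.sin and math.pi, as discussed above):

ExpressionTransformer("numpy.sin(2 * math.pi * (X[0] + 1) / 86400)")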


vruusmann commented on July 20, 2024

How can I simply add up two features to create a new one? The following doesn't work:

Your code has two issues with it:

  1. The {'alias': 'sepal_sum'} construct is misplaced. It should be the third tuple element. You have placed it after a two-element tuple.
  2. Remove the DiscreteDomain decorator. Both input variables are continuous, not discrete. If you want to place a domain decorator there, make it ContinuousDomain instead.

Corrected code:

pretransformer = DataFrameMapper(
    [
        (
            ["sepal length (cm)", "sepal width (cm)"],
            [
                #DiscreteDomain(),
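                #ContinuousDomain(),  # use this instead, if a domain decorator is desired (see point 2 above)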
                ExpressionTransformer("X[0] + X[1]"),
            ],
            {'alias': 'sepal_sum'}
        )
    ],
    df_out=True,
    default=None,   # pass through 
)

print(pretransformer.fit_transform(iris_X))

Edit: both issues are entirely unrelated to (SkLearn2)PMML.


vruusmann commented on July 20, 2024

Is there a way to include some basic string operations?

You can convert a string value to lowercase using ExpressionTransformer("X[0].lower()").

After that, is the some_string value "sanitized" or not? I mean, can we assume a full "equals" match (i.e. string.equals(x)), or is it more like a "contains" match (i.e. string.indexOf(x) > -1)?

In case of a full match, we can use the expression transformer again: ExpressionTransformer("X[0] if (X[0] in ['red', 'green', 'yellow', 'blue', 'purple', 'black']) else None").

Otherwise, if it's a "contains" match, then you would need to sanitize your string manually. For example, using sklearn2pmml.preprocessing.ReplaceTransformer. The idea is to use regex to replace everything that is not an expected color name.

Example:

from sklearn.pipeline import make_pipeline
from sklearn2pmml.preprocessing import ExpressionTransformer, ReplaceTransformer

extract_color = make_pipeline(
  ExpressionTransformer("X[0].lower()"),
  ReplaceTransformer(...),
  ExpressionTransformer("X[0] if (X[0] in ['red', 'green', 'yellow', 'blue', 'purple', 'black']) else None")
)


vruusmann commented on July 20, 2024

I would like to create a boolean (0/1) feature called email_contains_name, that reflects whether at least one of the names appears somewhere within the email_address.

There is a sklearn2pmml.preprocessing.MatchesTransformer class that implements "$haystack matches $needle":
https://github.com/jpmml/sklearn2pmml/blob/0.92.2/sklearn2pmml/preprocessing/__init__.py#L491-L506

Here, "matches" doesn't mean a full match. It can be a partial match as well.

Finally, the first_name and last_name should be dropped and the email_address should be reduced to the provider name only.

Use ReplaceTransformer (see the sketch below) to:

  1. Replace everything up to the first @ character with an empty string.
  2. Replace everything after the last . character with an empty string.
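A possible sketch of these two steps (the regexes and the pattern/replacement constructor arguments are illustrative assumptions, not taken verbatim from this thread):

from sklearn_pandas import DataFrameMapper
from sklearn2pmml.preprocessing import ReplaceTransformer

# e.g. "john.doe@example.com" -> "example.com" -> "example"
email_to_provider = DataFrameMapper([
    (["email_address"], [
        ReplaceTransformer(pattern = r"^.*@", replacement = ""),     # 1. drop everything up to and including the '@'
        ReplaceTransformer(pattern = r"\.[^.]+$", replacement = "")  # 2. drop the trailing '.' and TLD
    ])
])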


vruusmann commented on July 20, 2024

I can see a number of issues coming up here. Luckily enough, they all seem fixable/resolvable.

I have created a large PMMLPipeline using some custom sklearn.preprocessing.FunctionTransformer ..

What's the enclosed function? Is it some Numpy Universal function (aka ufunc), or some user-defined function?

It must be the former, because you claim that the fitted pipeline object is pickleable.

If the pipeline object needs to be converted into the PMML representation, then I'd suggest switching from sklearn.preprocessing.FunctionTransformer to sklearn2pmml.preprocessing.ExpressionTransformer. There are some new and exciting capabilities available:
https://openscoring.io/blog/2023/03/09/sklearn_udf_expression_transformer/
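A minimal sketch of the suggested switch, with an illustrative transformation (the function name is made up for this example):

from sklearn.preprocessing import FunctionTransformer
from sklearn2pmml.preprocessing import ExpressionTransformer

# user-defined function wrapped into FunctionTransformer - opaque to the PMML converter
def add_one(X):
    return X + 1

ft = FunctionTransformer(add_one)

# the equivalent, PMML-translatable expression
et = ExpressionTransformer("X[0] + 1")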

.. as well as an imblearn.FunctionSampler

This class is not listed under JPMML-SkLearn supported transformers and models:
https://github.com/jpmml/jpmml-sklearn#supported-packages

However, most Imbalanced-Learn samplers are no-op transformers from the PMML perspective, so they can be supported very easily, simply by mapping them to the imblearn.Sampler pseudo-transformer class:
https://github.com/jpmml/jpmml-sklearn/blob/1.7.27/pmml-sklearn-extension/src/main/java/imblearn/Sampler.java

This mapping could be added to the next SkLearn2PMML version. Are there any other Imbalanced-Learn sampler classes that should be added next to it?

Well, starting from SkLearn2PMML 0.92.1 (released earlier today), it's possible to define custom Python-to-JPMML mappings on the fly. I'm writing a small article about it right now. It will be posted to https://openscoring.io/blog in a few days' time, and also linked here.


vruusmann commented on July 20, 2024

net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pandas._libs.tslibs.timestamps._unpickle_timestamp).

And now a few comments about the current show-stopper.

In brief, while unpickling your pipeline object, the JPMML-SkLearn/JPMML-Python software stack has encountered a CPython class definition related to some Pandas' timestamp type.

The Java unpickler component must be informed about all potential CPython types in advance (there is no such requirement for pure Python types).

Two ways to fix it:

  • Add the missing CPython type definition to the JPMML-Python library.
  • Remove the offending CPython object(s) from your pipeline object. For example, unpickle the pipeline object in Python, nullify all Pandas' timestamps, and pickle it again (see the sketch below).
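A rough sketch of the second option (the file name, step name and the attribute holding the timestamp are hypothetical; locate the actual Timestamp-valued attributes in your own object):

import pickle

# load the previously pickled pipeline
with open("pipeline.pkl", "rb") as f:
    pipeline = pickle.load(f)

# nullify the offending Pandas timestamp(s); the step name and attribute are assumptions
pipeline.named_steps["pretransformer"].timestamp_ = None

# pickle the cleaned object again
with open("pipeline.pkl", "wb") as f:
    pickle.dump(pipeline, f)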

Can you provide a reproducible code example about your use of Pandas' timestamps? Is it being used as some column dtype, or is it involved in more complex calculations (such as your FunctionTransformer work)?


woodly0 commented on July 20, 2024

Dear Villu,

Thank you for your super quick answer. I will investigate the issue and get back to you with a reproducible example asap.
I must admit that I am not really familiar with the pickling topic which appears to be crucial for this library.

This mapping could be added to the next SkLearn2PMML version.

That would be very nice.

Are there any other Imbalanced-Learn sampler classes that should be added next to it?

Not from my point of view.


woodly0 commented on July 20, 2024

Oh boy... I'm sorry. The error I've mentioned is just one of many.

I actually didn’t read the docs properly and was trying to cobble together a set of UDFs that can definitely not be unpickled without the original context. There is even a stateful custom class Preprocessor(BaseEstimator, TransformerMixin) that takes in a pandas.DataFrame containing a datetime64[ns] column that is then parsed into years and hours, which is probably the source of the original error.

I guess I should have a look at the sklearn-pandas and sklearn2pmml.preprocessing modules to see if there are alternative ways of doing what I’m trying to do.


vruusmann commented on July 20, 2024

.. that takes in a pandas.DataFrame containing a datetime64[ns] column that is then parsed into years and hours, which is probably the source of the original error.

Any chance of getting a Python code example about the intended business logic as well?

In (J)PMML we have pretty decent date/time/datetime support, including doing basic arithmetic with these types. The key is to express a Python UDF in terms of predefined sklearn2pmml.preprocessing.CastTransformer, DaysSinceYearTransformer, SecondsSinceMidnightTransformer etc. transformers.


woodly0 commented on July 20, 2024

Sure. The following example gives a good idea of what I am doing, except that the original UDF transforms many more columns than just a single one.

import math
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

def some_udf(X: pd.DataFrame):
    # transform sine of hours
    X["test_feature"] = X.some_timestamp.dt.hour.apply(
        lambda x: math.sin(2 * math.pi * (x + 1) / 24.0)
    )
    X = X.drop("some_timestamp", axis=1)
    return X


# load dummy data
data_bunch = load_iris(as_frame=True)

iris_X = data_bunch["data"]
iris_y = data_bunch["target"]

# add dummy timestamp
iris_X["some_timestamp"] = pd.to_datetime("now", utc=True)

pretransformer = FunctionTransformer(
    some_udf,
    validate=False,
)

# create and fit pipeline
pmml_pipe = PMMLPipeline(
    [
        ("pretransformer", pretransformer),
        (
            "classifier",
            LogisticRegression(
                penalty="l1",
                solver="liblinear",
            ),
        ),
    ]
)

pmml_pipe.fit(iris_X, iris_y)

# create xml
sklearn2pmml(pmml_pipe, "irisClf.pmml", with_repr = True)


vruusmann commented on July 20, 2024

pretransformer = FunctionTransformer(some_udf)

Your pretransformer object, translated from Python UDF to proper Scikit-Learn transformer representation:

from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import DateTimeDomain
from sklearn2pmml.preprocessing import ExpressionTransformer, SecondsSinceMidnightTransformer

pretransformer = DataFrameMapper([
    (["some_timestamp"], [
        DateTimeDomain(),
        SecondsSinceMidnightTransformer(),
        ExpressionTransformer("math.sin(2 * math.pi * (X[0] + 1) / 86400)")
    ])
])

print(pretransformer.fit_transform(iris_X))

Please note that I'm using "seconds since midnight" instead of "hours since midnight". However, after the sine transform, the result should be identical between the two.


woodly0 commented on July 20, 2024

Hey Villu,
thank you for your help and sorry for my absence! I have more questions for you (still very basic ones).

  • How can I simply add up two features to create a new one? The following doesn't work:
pretransformer = DataFrameMapper(
    [
        (
            ["sepal length (cm)", "sepal width (cm)"],
            [
                DiscreteDomain(),
                ExpressionTransformer("X[0] + X[1]"),
            ],
        ), 
        {'alias': 'sepal_sum'}
    ],
    df_out=True,
    default=None,   # pass through 
)

pretransformer.fit_transform(iris_X)
  • Is there a way to include some basic string operations? Something like:
def f_extract_color_from_string(some_string: str) -> str:
    output = None
    if some_string is None:
        return output

    colors = ["red", "green", "yellow", "blue", "purple", "black"]
    for c in colors:
        if c in some_string.lower():
            output = c
            break

    return output


vruusmann commented on July 20, 2024

As for support for custom Imbalanced-Learn samplers such as imblearn.FunctionSampler: you can make them supported by mapping them to the imblearn.Sampler converter class:

from imblearn import FunctionSampler
from sklearn2pmml import make_class_mapping_jar, sklearn2pmml
from sklearn2pmml.util import fqn

mapping_cust = {
  fqn(FunctionSampler) : "imblearn.Sampler"
}

make_class_mapping_jar(mapping_cust, "imblearn.jar")

sklearn2pmml(pipeline, "pipeline.pmml", user_classpath = ["imblearn.jar"])

Full details here: https://openscoring.io/blog/2023/05/03/converting_sklearn_subclass_pmml/


woodly0 commented on July 20, 2024

Hey Villu,
thank you so much for your help. You really do have a solution for everything!
I dare ask one last question: Is there a way of introducing a custom "stateful" transformer that remembers what was fit? E.g.:

import pandas as pd

from sklearn.base import BaseEstimator, TransformerMixin

class ColumnMatcher(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.columns_ = None

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        self.columns_ = X.columns
        return self

    def transform(self, X: pd.DataFrame):
        return X.reindex(columns=self.columns_)

Those can be very useful in more complex pipelines, e.g. when they follow stateful/implicit encoding steps.


vruusmann commented on July 20, 2024

Is there a way of introducing a custom "stateful" transformer that remembers what was fit?

You can implement (J)PMML converters for custom transformer and model classes. However, custom estimators carry a maintenance cost: you'd need to maintain your own project(s), and build and distribute artifacts for extended periods of time.

My suggestion is to stick with existing Scikit-Learn and SkLearn2PMML estimators for as long as possible... there's almost always a way to achieve your goals, if you do some thinking.

For example, this ColumnMatcher class could be easily implemented using sklearn.compose.ColumnTransformer - select columns in whatever order you like, and map them to the "passthrough" transformer (aka identity transform). You could create a utility function for constructing such ColumnTransformer objects semi-automatically.

For example, if you want to swap the first two columns of a pandas.DataFrame object:

from sklearn.compose import ColumnTransformer

col_swapper = ColumnTransformer([
  ("second_as_first", "passthrough", [1]),
  ("first_as_second", "passthrough", [0])
], remainder = "passthrough")


vruusmann commented on July 20, 2024

class ColumnMatcher

Here's another try:

from sklearn.compose import ColumnTransformer

def make_column_matcher(X):
  return ColumnTransformer([
    ("ordered_cols", "passthrough", X.columns.tolist())
  ], remainder = "drop")


woodly0 commented on July 20, 2024

Hello again,
Let's say there are 3 raw nominal features:

  • first_name
  • last_name
  • email_address

First, I would like to sanitize/normalize all three, which is doable using ExpressionTransformer and ReplaceTransformer.
Second, I would like to create a boolean (0/1) feature called email_contains_name, that reflects whether at least one of the names appears somewhere within the email_address. That's where I get stuck. Do you know how to achieve this?
Finally, the first_name and last_name should be dropped and the email_address should be reduced to the provider name only. This is again more straightforward.


vruusmann commented on July 20, 2024

The problem with MatchesTransformer right now is that it expects the pattern to be supplied in its constructor. It doesn't really support a workflow where the pattern is a variable in itself (e.g. some dataframe column).

It would be so much better if the two regex functions (re.sub() and re.search()) could be used from within the ExpressionTransformer transformer.

Something like:

transformer = ExpressionTransformer("re.search(X['name'], X['email'])")


woodly0 commented on July 20, 2024

Thank you Sir for your quick response!

Did I understand it correctly? The MatchesTransformer does search a nominal feature for a certain pattern; however, it uses a static pattern input. If I want a dynamic pattern, i.e. based on another feature/column, something like

transformer = ExpressionTransformer("re.search(X['name'], X['email'])")

would be needed which is currently not implemented.

Do you think this could be part of a future update?


vruusmann commented on July 20, 2024

would be needed which is currently not implemented.

Do you think this could be part of a future update?

Implementing in-expression support for re.sub() and re.search() is technically easy/doable, as demonstrated by the existence of the standalone ReplaceTransformer and MatchesTransformer transformer classes, respectively.

The trouble is that this code change needs to go into the JPMML-Python library, and it will take time to propagate the updated library version throughout the JPMML library stack, up until the SkLearn2PMML package.


vruusmann commented on July 20, 2024

The trouble is that this code change needs to go into the JPMML-Python library

Actually, almost everything that has been discussed in this issue (e.g. temporal Pandas' data types, (un-)pickling errors) needs to go into the JPMML-Python project.


woodly0 commented on July 20, 2024

almost everything that has been discussed in this issue (e.g. temporal Pandas' data types, (un-)pickling errors) needs to go into the JPMML-Python project

I see. Unfortunately, I won't be of great help since I don't know any Java 😄

