Comments (23)
sklearn2pmml(pmml_pipe, "irisClf.pmml", with_repr = True)
The updated pmml_pipe object now contains some unsupported Pandas' data type objects, because the sklearn2pmml utility function fails with the following error:
Caused by: net.razorvine.pickle.PickleException: Expected 8 attribute(s), got 9 attribute(s)
at org.jpmml.python.CustomPythonObject.createAttributeMap(CustomPythonObject.java:81)
at numpy.DType.__setstate__(DType.java:49)
Exactly what I wanted to have! Can work on it locally now.
from sklearn2pmml.
ExpressionTransformer("math.sin(2 * math.pi * (X[0] + 1) / 86400)")
Well, the math.sin function is not recognized by the JPMML-Python library yet, so I replaced it with the numpy.sin universal function.
After that, the pmml_pipe object converts nicely!
from sklearn2pmml.
How can I simply add up two features to create a new one? The following doesn't work:
Your code has two issues with it:
- The {'alias': 'sepal_sum'} construct is misplaced. It should be the third tuple element. You have placed it after a two-element tuple.
- Remove the DiscreteDomain decorator. Both input variables are continuous, not discrete. If you want to place a domain decorator there, make it ContinuousDomain instead.
Corrected code:
pretransformer = DataFrameMapper(
    [
        (
            ["sepal length (cm)", "sepal width (cm)"],
            [
                #DiscreteDomain(),
                ExpressionTransformer("X[0] + X[1]"),
            ],
            {'alias': 'sepal_sum'}
        )
    ],
    df_out=True,
    default=None, # pass through
)
print(pretransformer.fit_transform(iris_X))
Edit: both issues are totally not related to (SkLearn2)PMML.
from sklearn2pmml.
Is there a way to include some basic string operations?
You can convert a string value to lowercase using ExpressionTransformer("X[0].lower()").
After that - is the some_string value "sanitized" or not? I mean, can we assume a full "equals" match (i.e. string.equals(x)) or is it more like a "contains" match (i.e. string.indexOf(x) > -1)?
In case of a full match, we can use the expression transformer again: ExpressionTransformer("X[0] if (X[0] in ['red', 'green', 'yellow', 'blue', 'purple', 'black']) else None").
Otherwise, if it's a "contains" match, then you would need to sanitize your string manually, for example using sklearn2pmml.preprocessing.ReplaceTransformer. The idea is to use regex to replace everything that is not an expected color name.
Example:
extract_color = make_pipeline(
    ExpressionTransformer("X[0].lower()"),
    ReplaceTransformer(...),
    # single quotes inside the expression, so the outer Python string stays valid
    ExpressionTransformer("X[0] if (X[0] in ['red', 'green', 'yellow', 'blue', 'purple', 'black']) else None")
)
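For the "contains" case, the regex idea can be prototyped in plain Python before wiring it into ReplaceTransformer. The sketch below uses only the standard re module. The pattern and the helper name extract_color_plain are my own assumptions for illustration, not something taken from SkLearn2PMML, and the regex dialect on the PMML side may differ in details.

```python
import re

COLORS = ["red", "green", "yellow", "blue", "purple", "black"]

# Keep only the first color name found in the string and drop everything around it.
# If no color name occurs, the pattern does not match and the string is left as-is.
PATTERN = re.compile(r"^.*?(" + "|".join(COLORS) + r").*$")

def extract_color_plain(some_string):
    if some_string is None:
        return None
    lowered = some_string.lower()          # step 1: normalize case
    cleaned = PATTERN.sub(r"\1", lowered)  # step 2: strip non-color text
    return cleaned if cleaned in COLORS else None  # step 3: full match or None

print(extract_color_plain("Dark BLUE jacket"))  # blue
print(extract_color_plain("transparent"))       # None
```

One difference from the original UDF: this picks the color that occurs earliest in the string, whereas the UDF picks the first hit in the order of the colors list.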
from sklearn2pmml.
I would like to create a boolean (0/1) feature called email_contains_name, that reflects if at least one of the names appears somewhere within the email_address.
There is a sklearn2pmml.preprocessing.MatchesTransformer class that implements "$haystack matches $needle":
https://github.com/jpmml/sklearn2pmml/blob/0.92.2/sklearn2pmml/preprocessing/__init__.py#L491-L506
Here, "matches" doesn't mean a full match. It can be a partial match as well.
Finally, the first_name and last_name should be dropped and the email_address should be reduced to the provider name only.
Use ReplaceTransformer to:
- Replace everything up to the first @ character with an empty string.
- Replace everything after the last . character with an empty string.
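The two replacements can be tried out in plain Python first. This is only a sketch with the standard re module; the function name provider_name and both patterns are my assumptions, and I chose to remove the @ and the final . themselves as well, so that only the provider name remains.

```python
import re

def provider_name(email_address):
    # Drop everything up to and including the first "@".
    domain = re.sub(r"^[^@]*@", "", email_address)
    # Drop the last "." and everything after it (the top-level domain).
    return re.sub(r"\.[^.]*$", "", domain)

print(provider_name("jane.doe@gmail.com"))  # gmail
```

The same two patterns would then go into two chained ReplaceTransformer steps, subject to whatever regex flavor the PMML engine accepts.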
from sklearn2pmml.
I can see a number of issues coming up here. Luckily enough, they all seem fixable/resolvable.
I have created a large PMMLPipeline using some custom sklearn.preprocessing.FunctionTransformer ..
What's the enclosed function? Is it some Numpy Universal function (aka ufunc), or some user-defined function?
It must be the former, because you claim that the fitted pipeline object is pickleable.
If the pipeline object needs to be converted into the PMML representation, then I'd suggest switching from sklearn.preprocessing.FunctionTransformer to sklearn2pmml.preprocessing.ExpressionTransformer. There are some new and exciting capabilities available:
https://openscoring.io/blog/2023/03/09/sklearn_udf_expression_transformer/
.. as well as an imblearn.FunctionSampler
This class is not listed under JPMML-SkLearn supported transformers and models:
https://github.com/jpmml/jpmml-sklearn#supported-packages
However, most Imbalanced-Learn samplers are no-op transformers from the PMML perspective, so they can be supported very easily by simply mapping them to the imblearn.Sampler pseudo-transformer class:
https://github.com/jpmml/jpmml-sklearn/blob/1.7.27/pmml-sklearn-extension/src/main/java/imblearn/Sampler.java
This mapping could be added to the next SkLearn2PMML version. Are there any other Imbalanced-Learn sampler classes that should be added next to it?
Well, starting from SkLearn2PMML 0.92.1 (released earlier today), it's possible to define custom Python-to-JPMML mappings on the fly. I'm writing a small article about it right now. Will get posted to https://openscoring.io/blog in a few days' time, and be also linked here.
from sklearn2pmml.
net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pandas._libs.tslibs.timestamps._unpickle_timestamp).
And now a few comments about the current show-stopper.
In brief, while unpickling your pipeline object, the JPMML-SkLearn/JPMML-Python software stack has encountered a CPython class definition related to some Pandas' timestamp type.
The Java unpickler component must be informed about all potential CPython types in advance (there is no such requirement for pure Python types).
Two ways to fix it:
- Add the missing CPython type definition to the JPMML-Python library.
- Remove the offending CPython object(s) from your pipeline object. For example, unpickle the pipeline object in Python, nullify all Pandas' timestamps, and pickle again.
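The second option (unpickle, nullify, pickle again) could look roughly like the sketch below. It uses only the standard library, and the SimpleNamespace object is a hypothetical stand-in for a fitted pipeline; the helper name nullify_datetimes is also my own. It only looks at direct attributes, so a real pipeline with nested steps would need a recursive walk. Since pandas.Timestamp subclasses datetime.datetime, the same isinstance check would also catch Pandas' timestamps.

```python
import datetime
import pickle
from types import SimpleNamespace

def nullify_datetimes(obj):
    """Set every direct attribute that holds a datetime to None; return their names."""
    cleared = []
    for name, value in list(vars(obj).items()):
        if isinstance(value, datetime.datetime):
            setattr(obj, name, None)
            cleared.append(name)
    return cleared

# Hypothetical stand-in for a fitted pipeline object (not a real estimator).
step = SimpleNamespace(coef_=[0.1, 0.2],
                       fitted_at_=datetime.datetime(2023, 5, 3, 14, 30))

restored = pickle.loads(pickle.dumps(step))  # unpickle
cleared = nullify_datetimes(restored)        # nullify
cleaned_bytes = pickle.dumps(restored)       # pickle again
print(cleared)  # ['fitted_at_']
```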
Can you provide a reproducible code example of your use of Pandas' timestamps? Is it being used as some column dtype, or is it involved in more complex calculations (such as your FunctionTransformer work)?
from sklearn2pmml.
Dear Villu,
Thank you for your super quick answer. I will investigate the issue and get back to you with a reproducible example asap.
I must admit that I am not really familiar with the pickling topic which appears to be crucial for this library.
This mapping could be added to the next SkLearn2PMML version.
That would be very nice.
Are there any other Imbalanced-Learn sampler classes that should be added next to it?
Not from my point of view.
from sklearn2pmml.
Oh boy... I'm sorry. The error I've mentioned is just one of many.
I actually didn’t read the docs properly and was trying to cobble together a set of UDFs that can definitely not be unpickled without the original context. There is even a stateful custom class Preprocessor(BaseEstimator, TransformerMixin) that takes in a pandas.DataFrame containing a datetime64[ns] column that is then parsed into years and hours, which is probably the source of the original error.
I guess I should have a look at the sklearn-pandas and sklearn2pmml.preprocessing modules to see if there are alternative ways of doing what I’m trying to do.
from sklearn2pmml.
.. that takes in a pandas.DataFrame containing a datetime64[ns] column that is then parsed into years and hours, which is probably the source of the original error.
Any chance of getting a Python code example about the intended business logic as well?
In (J)PMML we have pretty decent date/time/datetime support, including doing basic arithmetic with these types. The key is to express a Python UDF in terms of predefined sklearn2pmml.preprocessing.CastTransformer, DaysSinceYearTransformer, SecondsSinceMidnightTransformer etc. transformers.
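To get a feel for the numbers such transformers produce, here is a plain-Python sketch of the two quantities their names suggest. The helper functions are hypothetical and only use the standard datetime module; I'm assuming "days since year" counts from January 1st of the given year, so please check the actual transformer output before relying on this.

```python
from datetime import datetime

def seconds_since_midnight(ts):
    # Seconds elapsed since 00:00:00 of the same day.
    return ts.hour * 3600 + ts.minute * 60 + ts.second

def days_since_year(ts, year):
    # Whole days elapsed since January 1st of the given year (assumed convention).
    return (ts.date() - datetime(year, 1, 1).date()).days

ts = datetime(2023, 5, 3, 14, 30, 15)
print(seconds_since_midnight(ts))  # 52215
print(days_since_year(ts, 2023))   # 122
```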
from sklearn2pmml.
Sure. The following example gives a good idea of what I am doing. Just that the original UDF transforms way more columns than just a single one.
import math
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

def some_udf(X: pd.DataFrame):
    # transform sine of hours
    X["test_feature"] = X.some_timestamp.dt.hour.apply(
        lambda x: math.sin(2 * math.pi * (x + 1) / 24.0)
    )
    X = X.drop("some_timestamp", axis=1)
    return X

# load dummy data
data_bunch = load_iris(as_frame=True)
iris_X = data_bunch["data"]
iris_y = data_bunch["target"]

# add dummy timestamp
iris_X["some_timestamp"] = pd.to_datetime("now", utc=True)

pretransformer = FunctionTransformer(
    some_udf,
    validate=False,
)

# create and fit pipeline
pmml_pipe = PMMLPipeline(
    [
        ("pretransformer", pretransformer),
        (
            "classifier",
            LogisticRegression(
                penalty="l1",
                solver="liblinear",
            ),
        ),
    ]
)
pmml_pipe.fit(iris_X, iris_y)

# create xml
sklearn2pmml(pmml_pipe, "irisClf.pmml", with_repr = True)
from sklearn2pmml.
pretransformer = FunctionTransformer(some_udf)
Your pretransformer object, translated from Python UDF to proper Scikit-Learn transformer representation:
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.decoration import DateTimeDomain
from sklearn2pmml.preprocessing import ExpressionTransformer, SecondsSinceMidnightTransformer
pretransformer = DataFrameMapper([
    (["some_timestamp"], [DateTimeDomain(), SecondsSinceMidnightTransformer(), ExpressionTransformer("math.sin(2 * math.pi * (X[0] + 1) / 86400)")])
])
print(pretransformer.fit_transform(iris_X))
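Please note that I'm using "seconds since midnight" instead of "hours since midnight". After the sine transform, the two results follow the same daily cycle, but they are not exactly identical: the + 1 offset now shifts the phase by one second rather than one hour (use + 3600 to keep the original shift), and the seconds-based value also changes within an hour, whereas dt.hour is truncated to whole hours.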
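A quick numeric check of how the hour-based and the seconds-based expressions relate, using only the standard math module. The two helper functions are hypothetical and just restate the two formulas from this thread. The values come out equal once the one-hour offset is expressed in seconds.

```python
import math

def sine_from_hour(hour):
    # Original UDF formula, hour of day in 0..23.
    return math.sin(2 * math.pi * (hour + 1) / 24.0)

def sine_from_seconds(seconds):
    # Expression from the DataFrameMapper version, seconds since midnight.
    return math.sin(2 * math.pi * (seconds + 1) / 86400)

hour = 14
seconds = hour * 3600  # exactly 14:00:00

a = sine_from_hour(hour)
b = sine_from_seconds(seconds)             # phase shifted by one second only
c = sine_from_seconds(seconds + 3600 - 1)  # phase shifted by a full hour

print(round(a, 4), round(b, 4), round(c, 4))
```

Here a and c agree up to floating-point rounding, while b lags by roughly one hour of phase.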
from sklearn2pmml.
Hey Villu,
thank you for your help and sorry for my absence! I have more questions for you (still very basic ones).
- How can I simply add up two features to create a new one? The following doesn't work:
pretransformer = DataFrameMapper(
    [
        (
            ["sepal length (cm)", "sepal width (cm)"],
            [
                DiscreteDomain(),
                ExpressionTransformer("X[0] + X[1]"),
            ],
        ),
        {'alias': 'sepal_sum'}
    ],
    df_out=True,
    default=None, # pass through
)
pretransformer.fit_transform(iris_X)
- Is there a way to include some basic string operations? Something like:
def f_extract_color_from_string(some_string: str) -> str:
    output = None
    if some_string is None:
        return output
    colors = ["red", "green", "yellow", "blue", "purple", "black"]
    for c in colors:
        if c in some_string.lower():
            output = c
            break
    return output
from sklearn2pmml.
As for custom Imbalanced-Learn samplers such as imblearn.FunctionSampler, you can make them supported by mapping them to the imblearn.Sampler converter class:
from imblearn import FunctionSampler
from sklearn2pmml import make_class_mapping_jar, sklearn2pmml
from sklearn2pmml.util import fqn
# map the fully qualified Python class name to the JPMML converter class
mapping_cust = {
    fqn(FunctionSampler) : "imblearn.Sampler"
}
make_class_mapping_jar(mapping_cust, "imblearn.jar")
sklearn2pmml(pipeline, "pipeline.pmml", user_classpath = ["imblearn.jar"])
Full details here: https://openscoring.io/blog/2023/05/03/converting_sklearn_subclass_pmml/
from sklearn2pmml.
Hey Villu,
thank you so much for your help. You really do have a solution for everything!
I dare asking a last question: Is there a way of introducing a custom "stateful" transformer that remembers what was fit? E.g.:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnMatcher(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.columns_ = None

    def fit(self, X: pd.DataFrame, y: pd.Series = None):
        # remember the column layout seen during fitting
        self.columns_ = X.columns
        return self

    def transform(self, X: pd.DataFrame):
        return X.reindex(columns=self.columns_)
Those can be very useful in more complex pipelines when they follow stateful/implicit encoding steps.
from sklearn2pmml.
Is there a way of introducing a custom "stateful" transformer that remembers what was fit?
You can implement (J)PMML converters for custom transformer and model classes. However, custom estimators have maintenance cost - you'd need to maintain your own project(s), and build and distribute artifacts for extended periods of time.
My suggestion is to stick with existing Scikit-Learn and SkLearn2PMML estimators for as long as possible... there's almost always a way to achieve your goals, if you do some thinking.
For example, this ColumnMatcher class could be easily implemented using sklearn.compose.ColumnTransformer - select columns in whatever order you like, and map them to the "passthrough" transformer (aka identity transform). You could create a utility function for constructing such ColumnTransformer objects semi-automatically.
For example, if you want to swap the first two columns of a pandas.DataFrame object:
from sklearn.compose import ColumnTransformer
col_swapper = ColumnTransformer([
    ("second_as_first", [1], "passthrough"),
    ("first_as_second", [0], "passthrough")
], remainder = "passthrough")
from sklearn2pmml.
class ColumnMatcher
Here's another try:
def make_column_matcher(X):
    return ColumnTransformer([
        ("ordered_cols", X.columns.tolist(), "passthrough")
    ], remainder = "drop")
from sklearn2pmml.
Hello again,
Let's say there are 3 raw nominal features:
- first_name
- last_name
- email_address
First, I would like to sanitize/normalize all three which is doable using ExpressionTransformer and ReplaceTransformer.
Second, I would like to create a boolean (0/1) feature called email_contains_name, that reflects if at least one of the names appears somewhere within the email_address. That's where I get stuck.. Do you know how to achieve this?
Finally, the first_name and last_name should be dropped and the email_address should be reduced to the provider name only. This is again more straightforward.
from sklearn2pmml.
The problem with MatchesTransformer right now is that it expects the pattern to be supplied in its constructor. It doesn't really support a workflow where the pattern is a variable in itself (e.g. some dataframe column).
It would be so much better if the underlying RegEx functions (re.sub() and re.search()) could be used from within the ExpressionTransformer transformer.
Something like:
transformer = ExpressionTransformer("re.search(X['name'], X['email'])")
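For reference, this is the per-row logic that such an expression would have to reproduce for the email_contains_name feature, written as plain Python with the standard re module. It is a sketch of the intended semantics only, not something ExpressionTransformer can evaluate today. The function name is hypothetical, the re.escape call is my own addition (so that names are matched literally), and the inputs are assumed to be lowercased already.

```python
import re

def email_contains_name(first_name, last_name, email_address):
    # Return 1 if at least one non-empty name occurs anywhere in the email address.
    for name in (first_name, last_name):
        if name and re.search(re.escape(name), email_address):
            return 1
    return 0

print(email_contains_name("jane", "doe", "jane.d@gmail.com"))  # 1
print(email_contains_name("jane", "doe", "jd@example.com"))    # 0
```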
from sklearn2pmml.
Thank you Sir for your quick response!
Did I understand it correctly? The MatchesTransformer searches a nominal feature for a certain pattern, but it uses a static pattern input. If I want a dynamic pattern, i.e. one based on another feature/column, something like
transformer = ExpressionTransformer("re.search(X['name'], X['email'])")
would be needed which is currently not implemented.
Do you think this could be part of a future update?
from sklearn2pmml.
would be needed which is currently not implemented.
Do you think this could be part of a future update?
Implementing in-expression support for re.sub() and re.search() is technically easy/doable, as demonstrated by the existence of the standalone ReplaceTransformer and MatchesTransformer transformer classes, respectively.
The trouble is that this code change needs to go into the JPMML-Python library, and it will take time to propagate the updated library version throughout the JPMML library stack, up until the SkLearn2PMML package.
from sklearn2pmml.
The trouble is that this code change needs to go into the JPMML-Python library
Ackshually, almost everything that has been discussed in this issue (eg. temporal Pandas' data types, (un-)pickling errors) needs to go into the JPMML-Python project.
from sklearn2pmml.
almost everything that has been discussed in this issue (eg. temporal Pandas' data types, (un-)pickling errors) needs to go into the JPMML-Python project
I see.. Unfortunately, I won't be of great help since I don't know any Java 😄
from sklearn2pmml.