Patching 0.51.1 for unknown categorical variables about sklearn2pmml HOT 4 CLOSED

mshahmoh commented on August 20, 2024

from sklearn2pmml.

Comments (4)

mshahmoh commented on August 20, 2024

Essentially, what we're asking is to revise the PMMLLabelEncoder class with the following code. The change is intended to improve the handling of missing and previously unseen categorical values:

class PMMLLabelEncoder(BaseEstimator, TransformerMixin):

	def __init__(self, missing_values = None):
		self.missing_values = missing_values

	def fit(self, X, y = None):
		X = column_or_1d(X, warn = True)
		self.classes_ = numpy.unique(X[~pandas.isnull(X)])
		return self

	def transform(self, X):
		X = column_or_1d(X, warn = True)
		mapper = {}
		for i,x in enumerate(list(self.classes_)):
			mapper[x] = i
		Xt = numpy.array([self.missing_values if pandas.isnull(v) else mapper.get(v,self.missing_values) for v in X])
		return _col2d(Xt)

vruusmann commented on August 20, 2024

@mshahmoh I have no objections to the suggested changes, but would like to ask a few clarifying questions first:

First - what is the reason for replacing sklearn2pmml.util.ensure_1d with sklearn.utils.validation.column_or_1d? Is the former restrictive in some way?

The reason behind coding up ensure_1d was to improve interoperability with Pandas' data containers (single-column pandas.DataFrame and pandas.Series). Any idea if column_or_1d is better now in this area? I'm thinking about a hybrid solution, where I keep the first, Pandas-oriented part of ensure_1d, but replace its second half with a column_or_1d call (maybe SkLearn warnings/errors will be more informative).
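The hybrid idea described above could look roughly like the following sketch. It is not the actual ensure_1d implementation; the function name ensure_1d_hybrid and the error message are made up for illustration. It only assumes that the Pandas-oriented part unwraps a single-column pandas.DataFrame or a pandas.Series, and that the rest is delegated to column_or_1d.

```python
import pandas
from sklearn.utils import column_or_1d

def ensure_1d_hybrid(X):
    """Hypothetical helper: unwrap Pandas containers, then delegate to column_or_1d."""
    # Pandas-oriented part: accept a single-column DataFrame or a Series
    if isinstance(X, pandas.DataFrame):
        if X.shape[1] != 1:
            raise ValueError("Expected a single-column DataFrame, got {} columns".format(X.shape[1]))
        X = X.iloc[:, 0]
    if isinstance(X, pandas.Series):
        X = X.to_numpy()
    # SkLearn part: validate the shape and flatten column vectors
    return column_or_1d(X, warn = True)
```

With this shape, a pandas.Series, a single-column pandas.DataFrame and a plain list all come out as a one-dimensional NumPy array, while a multi-column DataFrame is rejected up front.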

Second - is the intent of the modified transform(X) method to treat invalid values (ie. previously unseen category levels) simply as missing values? In PMML this would map to the MiningField@invalidValueTreatment="asMissing" markup.

Looks like a slightly breaking change. Maybe there should be a dedicated attribute for controlling it (eg. an invalid_as_missing boolean attribute).

Whatever the solution, the Python code in SkLearn2PMML will need to be accompanied by some new Java code in the underlying JPMML-SkLearn library (in order to see the PMML markup change). Will do it for you when the intent is clarified.

mshahmoh commented on August 20, 2024

Hi @vruusmann ,
Sorry for the confusion. The earlier diff was taken against master. There is no need to modify anything related to ensure_1d.
All we're asking is to have the PMMLLabelEncoder code changed to the code snippet in my comment above. Here's the diff against 0.51.1:

diff --git a/sklearn2pmml/preprocessing/__init__.py b/sklearn2pmml/preprocessing/__init__.py
index 9c2f4e6..e0394c8 100644
--- a/sklearn2pmml/preprocessing/__init__.py
+++ b/sklearn2pmml/preprocessing/__init__.py
@@ -219,6 +219,7 @@ class PMMLLabelBinarizer(BaseEstimator, TransformerMixin):
 			Xt = Xt.tocsr()
 		return Xt
 
+
 class PMMLLabelEncoder(BaseEstimator, TransformerMixin):
 
 	def __init__(self, missing_values = None):
@@ -231,10 +232,13 @@ class PMMLLabelEncoder(BaseEstimator, TransformerMixin):
 
 	def transform(self, X):
 		X = column_or_1d(X, warn = True)
-		index = list(self.classes_)
-		Xt = numpy.array([self.missing_values if pandas.isnull(v) else index.index(v) for v in X])
+		mapper = {}
+		for i,x in enumerate(list(self.classes_)):
+			mapper[x] = i
+		Xt = numpy.array([self.missing_values if pandas.isnull(v) else mapper.get(v,self.missing_values) for v in X])
 		return _col2d(Xt)
 
+
 class PowerFunctionTransformer(BaseEstimator, TransformerMixin):
 
 	def __init__(self, power):

vruusmann commented on August 20, 2024

I'm fixing this issue in two stages:

Performance update - replacing list.index(v) with dict[v].
Functional update - returning a default value when dict[v] is about to raise a KeyError (ie. the transformer encountered an invalid value).

The first one is done, the second one needs some more thinking.
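The performance stage can be illustrated with a small standalone sketch in plain Python. This is not the library's code; the function names and the sample category levels are made up for illustration. Both variants return the same indices for known levels, but list.index(v) scans the list on every call, whereas the dict lookup takes constant time on average. For an unknown level the former raises a ValueError and the latter a KeyError.

```python
def encode_with_list(values, classes):
    # Old approach: a linear scan of the class list for every value
    index = list(classes)
    return [index.index(v) for v in values]

def encode_with_dict(values, classes):
    # New approach: build the level-to-index mapping once, then do hash lookups
    mapper = {x : i for i, x in enumerate(classes)}
    return [mapper[v] for v in values]

classes = ["a", "b", "c"]
values = ["c", "a", "b", "a"]

# Both approaches agree on valid input
list_result = encode_with_list(values, classes)
dict_result = encode_with_dict(values, classes)
```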

The suggestion was to replace dict[k] with dict.get(k, default_value), where the default value is the placeholder for the missing value. Looks good, but it's a functional change - previously invalid values raised an error, but now they are consumed silently.

Still think it would be nice to have some kind of switch for toggling between these two behaviours.
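A minimal sketch of such a switch, written as a standalone function rather than as the actual PMMLLabelEncoder code. The invalid_as_missing parameter name follows the attribute floated earlier in this thread; the encode and _is_missing helpers are hypothetical, and _is_missing is only a simplified stand-in for pandas.isnull on scalar values.

```python
import math

def _is_missing(v):
    # Simplified stand-in for pandas.isnull: treats None and float NaN as missing
    return v is None or (isinstance(v, float) and math.isnan(v))

def encode(values, classes, missing_values = None, invalid_as_missing = False):
    mapper = {x : i for i, x in enumerate(classes)}
    result = []
    for v in values:
        if _is_missing(v):
            result.append(missing_values)
        elif invalid_as_missing:
            # New behaviour: unseen levels are silently mapped to the missing value placeholder
            result.append(mapper.get(v, missing_values))
        else:
            # Old behaviour: unseen levels raise a KeyError
            result.append(mapper[v])
    return result
```

Keeping invalid_as_missing = False as the default would preserve the existing error-raising behaviour, so only callers that opt in would see invalid values consumed silently.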

@mshahmoh In your Python code, is adding this switch parameter OK or not? Or is your Python code effectively immutable (eg. has been shipped far and wide)?

Reopening; will re-close when the second stage has also been completed.
