
Comments (4)

mshahmoh commented on August 20, 2024

Essentially, what we're asking is to revise the PMMLLabelEncoder class with the following code. The change is intended to improve the handling of missing categorical variables:

class PMMLLabelEncoder(BaseEstimator, TransformerMixin):

	def __init__(self, missing_values = None):
		self.missing_values = missing_values

	def fit(self, X, y = None):
		X = column_or_1d(X, warn = True)
		self.classes_ = numpy.unique(X[~pandas.isnull(X)])
		return self

	def transform(self, X):
		X = column_or_1d(X, warn = True)
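		# Map each known class label to its integer code
		mapper = {}
		for i, x in enumerate(list(self.classes_)):
			mapper[x] = i
		# Missing values and previously unseen labels both map to self.missing_values
		Xt = numpy.array([self.missing_values if pandas.isnull(v) else mapper.get(v, self.missing_values) for v in X])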
		return _col2d(Xt)

from sklearn2pmml.

vruusmann commented on August 20, 2024

@mshahmoh I have no objections to the suggested changes, but would like to ask a few clarifying questions first:

First - what is the reason for replacing sklearn2pmml.util.ensure_1d with sklearn.utils.validation.column_or_1d? Is the former restrictive in some way?

The reason behind coding up ensure_1d was to improve interoperability with Pandas' data containers (single-column pandas.DataFrame and pandas.Series). Any idea if column_or_1d is better now in this area? I'm thinking about a hybrid solution, where I keep the first, Pandas-oriented part of ensure_1d, but replace its second half with a column_or_1d call (maybe SkLearn warnings/errors will be more informative).

Second - is the intent of the modified transform(X) method to treat invalid values (ie. previously unseen category levels) simply as missing values? In PMML this would map to the MiningField@invalidValueTreatment="asMissing" markup.

Looks like a slightly breaking change. Maybe there should be a dedicated attribute for controlling it (eg. an invalid_as_missing boolean attribute).
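A minimal sketch of how such a switch could behave, written as a stand-alone function rather than as the actual PMMLLabelEncoder. The names encode_labels and is_missing are hypothetical, the invalid_as_missing parameter only mirrors the attribute floated above, and is_missing is a simplified stand-in for pandas.isnull:

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (simplified stand-in for pandas.isnull)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def encode_labels(values, classes, missing_values=None, invalid_as_missing=False):
    """Encode labels as integer codes, with a switch for unseen category levels.

    Hypothetical helper: with invalid_as_missing=False an unseen level raises
    an error, with invalid_as_missing=True it is encoded as missing_values.
    """
    # Build the label -> code mapping once
    mapper = {label: code for code, label in enumerate(classes)}
    encoded = []
    for value in values:
        if is_missing(value):
            encoded.append(missing_values)
        elif value in mapper:
            encoded.append(mapper[value])
        elif invalid_as_missing:
            # Unseen level, consumed silently as a missing value
            encoded.append(missing_values)
        else:
            # Unseen level, reported to the caller
            raise ValueError("Unseen category level: {!r}".format(value))
    return encoded
```

With the switch off, encode_labels(["z"], ["a", "b"]) raises a ValueError; with invalid_as_missing=True and missing_values=-1 the same call returns [-1]. Defaulting the switch to off would keep the existing "fail on invalid values" behaviour for code that does not opt in.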

Whatever the solution, the Python code in SkLearn2PMML will need to be accompanied by some new Java code in the underlying JPMML-SkLearn library (in order to see the PMML markup change). Will do it for you when the intent is clarified.

mshahmoh commented on August 20, 2024

Hi @vruusmann,
Sorry for the confusion. The diff is off of master. Here's the diff against 0.51.1. No need to modify anything in ensure_1d.
All we're asking is to change the PMMLLabelEncoder code to the code snippet in my comment above:

diff --git a/sklearn2pmml/preprocessing/__init__.py b/sklearn2pmml/preprocessing/__init__.py
index 9c2f4e6..e0394c8 100644
--- a/sklearn2pmml/preprocessing/__init__.py
+++ b/sklearn2pmml/preprocessing/__init__.py
@@ -219,6 +219,7 @@ class PMMLLabelBinarizer(BaseEstimator, TransformerMixin):
 			Xt = Xt.tocsr()
 		return Xt
 
+
 class PMMLLabelEncoder(BaseEstimator, TransformerMixin):
 
 	def __init__(self, missing_values = None):
@@ -231,10 +232,13 @@ class PMMLLabelEncoder(BaseEstimator, TransformerMixin):
 
 	def transform(self, X):
 		X = column_or_1d(X, warn = True)
-		index = list(self.classes_)
-		Xt = numpy.array([self.missing_values if pandas.isnull(v) else index.index(v) for v in X])
+		mapper = {}
+		for i,x in enumerate(list(self.classes_)):
+			mapper[x] = i
+		Xt = numpy.array([self.missing_values if pandas.isnull(v) else mapper.get(v,self.missing_values) for v in X])
 		return _col2d(Xt)
 
+
 class PowerFunctionTransformer(BaseEstimator, TransformerMixin):
 
 	def __init__(self, power):

vruusmann commented on August 20, 2024

I'm fixing this issue in two stages:

  1. Performance update - replacing list.index(v) with dict[v].
  2. Functional update - returning a default value when dict[v] is about to raise a KeyError (ie. the transformer encountered an invalid value).

The first one is done, the second one needs some more thinking.
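For illustration, the first stage can be sketched in plain Python as follows. The function names encode_with_list and encode_with_dict are hypothetical and are not part of sklearn2pmml; the sketch only shows that the two lookups produce the same codes:

```python
def encode_with_list(values, classes):
    """Old approach: list.index(v) scans the class list for every value."""
    index = list(classes)
    return [index.index(v) for v in values]


def encode_with_dict(values, classes):
    """New approach: build the mapping once, then do one hash lookup per value."""
    mapper = {label: code for code, label in enumerate(classes)}
    return [mapper[v] for v in values]


classes = ["blue", "green", "red"]
values = ["red", "blue", "red", "green"]

# Both approaches yield identical integer codes
assert encode_with_list(values, classes) == encode_with_dict(values, classes)
```

Both variants still fail on a previously unseen value: list.index(v) raises a ValueError, whereas dict[v] raises a KeyError. The performance update therefore keeps the "error on invalid value" behaviour, although the exception type differs.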

The suggestion was to replace dict[k] with dict.get(k, default_value), where the default value is the placeholder for the missing value. Looks good, but it's a functional change - previously invalid values raised an error, but now they are consumed silently.

Still think it would be nice to have some kind of switch for toggling between these two behaviours.

@mshahmoh In your Python code, is adding this switch parameter OK or not? Or is your Python code effectively immutable (eg. has been shipped far and wide)?

Reopening; will re-close when the second stage has also been completed.
