Hello, I am hoping to store feature statistics (e.g., the mean and s

Save feature statistics in python and retrieve them in jpmml-evaluator about sklearn2pmml HOT 7 CLOSED

jpmml commented on July 19, 2024 1

Comments (7)

vruusmann commented on July 19, 2024

This kind of functionality has been requested before. I even remember seeing such an issue somewhere here (or maybe in some other JPMML-Converter sub-project), but I can't locate it at the moment.

Essentially, it will be necessary to enhance the business logic of sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain classes to collect statistics about the X argument of the fit method. For example, class ContinuousDomain currently collects the lower and upper bound (this information is used to define the "applicability interval" of individual features):
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L50-L51

Here's the deal - code up the Python part, and I'll do the other parts.

movingname commented on July 19, 2024

Thank you @vruusmann for the reply and for creating these pmml-related packages.

Here is a possible plan based on your comment (forgive me if it is unreasonable; I have only just started with sklearn2pmml):

1. Calculate and save data_mean_ and data_std_ after https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L50-L51
2. Then, on the jpmml-sklearn side, add these statistics to the ValidValueDecorator after https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/decoration/ContinuousDomain.java#L64
3. Finally, in jpmml-sklearn, put the statistics in the NumericInfo part like the example below (from http://dmg.org/pmml/v4-3/Statistics.html)

<ModelStats>
  <UnivariateStats field="Age">
    <Counts totalFreq="240"/>
    <NumericInfo mean="54.43" minimum="29" maximum="77"/>
    <ContStats>
      <Interval closure="openClosed" leftMargin="29" rightMargin="33.8"/>
      <Array type="int" n="10"> 1 8 28 30 30 43 51 33 13 3</Array>
    </ContStats>
  </UnivariateStats>
  ...
</ModelStats>

It seems that only step 1 is done in Python and it only takes 2 lines of code. I am happy to create a PR (and do more) if needed. Thanks!
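A minimal sketch of what the Python part of this plan might look like. The class below is a hypothetical stand-in, not the real sklearn2pmml.decoration.ContinuousDomain; the attribute names data_min_ and data_max_ are assumed to mirror the bounds the existing class collects, while data_mean_ and data_std_ are the names proposed in this comment.

```python
import numpy as np


class ContinuousStatsSketch:
    """Hypothetical stand-in for a ContinuousDomain-like decorator (not the real class)."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-column bounds, analogous to what the thread says is already collected.
        self.data_min_ = np.nanmin(X, axis=0)
        self.data_max_ = np.nanmax(X, axis=0)
        # The two extra lines proposed in the plan; the nan* variants skip missing values.
        self.data_mean_ = np.nanmean(X, axis=0)
        self.data_std_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        # A decorator only records metadata; the data passes through unchanged.
        return X


X = np.array([[29.0, 1.0], [54.0, np.nan], [77.0, 3.0]])
domain = ContinuousStatsSketch().fit(X)
```

After fitting, domain.data_mean_ and domain.data_std_ hold one value per column, computed over the non-missing entries only.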

vruusmann commented on July 19, 2024

Exactly so. I'd only need some assistance with the first item (as Python is not my main language, and I don't know Numpy and/or Pandas libraries that well).

The NumericInfo element has the following attributes; what's the most appropriate Numpy/Pandas library call for computing them (needs to be missing value safe)?

minimum
maximum
mean
standardDeviation
median
interQuartileRange

The NumericInfo element also takes Quantile child elements. Any interest in adding those?
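One possible answer to the question above, as a sketch rather than the code that ended up in the PR: NumPy's nan-aware functions cover all six attributes. The helper name numeric_info is hypothetical, and the choice of population standard deviation (ddof=0) is an assumption, since the thread does not say which variant is wanted.

```python
import numpy as np


def numeric_info(column):
    """Compute NumericInfo-style statistics for one column, ignoring NaN values."""
    x = np.asarray(column, dtype=float)
    # One call yields the quartiles and the median together.
    q1, median, q3 = np.nanpercentile(x, [25, 50, 75])
    return {
        "minimum": float(np.nanmin(x)),
        "maximum": float(np.nanmax(x)),
        "mean": float(np.nanmean(x)),
        # Assumption: population standard deviation; pass ddof=1 for the sample variant.
        "standardDeviation": float(np.nanstd(x, ddof=0)),
        "median": float(median),
        "interQuartileRange": float(q3 - q1),
    }


stats = numeric_info([1.0, 2.0, np.nan, 3.0, 4.0])
```

The pandas equivalents (Series.min, Series.mean, Series.quantile and so on) skip missing values by default, but Series.std defaults to ddof=1, so the two libraries disagree on the standard deviation unless ddof is set explicitly.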

movingname commented on July 19, 2024

Thanks. I have created a short PR above to add the statistics. Could you please take a look and see if it makes sense?

I am currently not interested in the Quantile child elements :) But if you think it is good to include them in the change, I can help.

vruusmann commented on July 19, 2024

Thanks! I expect to work on this issue this week, once I have caught up with super-critical things (e.g. updating SkLearn from 0.18.1 to 0.18.2, which appears to include several breaking API changes).

The NumericInfo element corresponds to continuous features. Does your use case involve any categorical features (e.g. are there any string or integer columns in the original dataset)? We should try to enhance the sklearn2pmml.decoration.CategoricalDomain class as well, in order to keep things "balanced".

What would be appropriate Python code for collecting data for the Counts element? At minimum, should populate Counts@totalFreq and Counts@missingFreq attributes.
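A rough sketch of how the two Counts attributes could be collected. The helper names are hypothetical, and two things are assumed: that totalFreq counts every record including missing ones, and that "missing" means None or NaN. The function uses only plain Python so that it works for string-valued (categorical) columns as well as numeric ones.

```python
def is_missing(value):
    """Treat None and NaN as missing; NaN is the only value not equal to itself."""
    return value is None or value != value


def counts(column):
    """Return Counts-style frequencies for a single column of values."""
    values = list(column)
    missing = sum(1 for v in values if is_missing(v))
    return {
        "totalFreq": len(values),  # assumption: all records, missing ones included
        "missingFreq": missing,
    }


numeric_counts = counts([29.0, float("nan"), 54.0, 77.0])
categorical_counts = counts(["red", None, "blue", None, "red"])
```

For data already held in pandas, Series.isna().sum() gives the same missing count in a single call.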

movingname commented on July 19, 2024

Sure! Please take your time to handle the more urgent issue!

Changes to the PR:

Added totalFreq and missingFreq to ContinuousDomain.
My use case currently does not have any categorical features, so I just added totalFreq and missingFreq to CategoricalDomain to keep things a bit balanced.

I am happy to make more changes if needed. Thanks a lot!

movingname commented on July 19, 2024

Hi vruusmann,

I think your previous suggestion of adding Quantile elements is reasonable. In most cases people would use the 1st and 3rd quartiles, so I have added them to the PR.
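As an illustration only (the PR itself is not shown in this thread), the first and third quartiles can be computed in a missing-value-safe way with numpy.nanpercentile. The helper name quartiles is hypothetical, and mapping the dictionary keys to a percentage-style limit on the Quantile element is an assumption.

```python
import numpy as np


def quartiles(column):
    """Return the 25th and 75th percentiles of a column, ignoring NaN values."""
    x = np.asarray(column, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    # Keys are percentage limits; values are the corresponding quantile values.
    return {25: float(q1), 75: float(q3)}


result = quartiles([1.0, 2.0, 3.0, 4.0, np.nan])
```

numpy.nanpercentile interpolates linearly between neighbouring observations by default, so the reported values need not be members of the original column.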

Please take your time to handle more urgent issues and then take a look at the PR.
