
Comments (7)

vruusmann commented on July 19, 2024

This kind of functionality has been requested before. I even remember seeing such an issue somewhere here (or maybe in some other JPMML-Converter sub-project), but I can't locate it at the moment.

Essentially, it will be necessary to enhance the business logic of sklearn2pmml.decoration.CategoricalDomain and sklearn2pmml.decoration.ContinuousDomain classes to collect statistics about the X argument of the fit method. For example, class ContinuousDomain currently collects the lower and upper bound (this information is used to define the "applicability interval" of individual features):
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L50-L51

Here's the deal - code up the Python part, and I'll do the other parts.

from sklearn2pmml.

movingname commented on July 19, 2024

Thank you @vruusmann for the reply and for creating these pmml-related packages.

Here is a possible plan based on your comment (forgive me if it is unreasonable; I have only just started with sklearn2pmml):

  1. Calculate and save data_mean_ and data_std_ after https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L50-L51

  2. Then, on the jpmml-sklearn side, add these statistics to the ValidValueDecorator after https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/decoration/ContinuousDomain.java#L64

  3. Finally, in jpmml-sklearn, put the statistics in the NumericInfo part, as in the example below (from http://dmg.org/pmml/v4-3/Statistics.html):

<ModelStats>
  <UnivariateStats field="Age">
    <Counts totalFreq="240"/>
    <NumericInfo mean="54.43" minimum="29" maximum="77"/>
    <ContStats>
      <Interval closure="openClosed" leftMargin="29" rightMargin="33.8"/>
      <Array type="int" n="10"> 1 8 28 30 30 43 51 33 13 3</Array>
    </ContStats>
  </UnivariateStats>
  ...
</ModelStats>

It seems that only step 1 is done in Python, and it takes only two lines of code. I am happy to create a PR (and do more) if needed. Thanks!
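
A minimal sketch of what step 1 might look like, assuming the statistics are collected in a NumPy-backed fit method. The class name ContinuousStatsSketch and the data_min_/data_max_ attribute names are hypothetical; only data_mean_ and data_std_ come from the plan above, and the actual ContinuousDomain code may be organized differently.

```python
import numpy as np

class ContinuousStatsSketch:
    """Hypothetical stand-in; NOT the real sklearn2pmml ContinuousDomain."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Per-column bounds (attribute names are assumptions for illustration)
        self.data_min_ = np.nanmin(X, axis=0)
        self.data_max_ = np.nanmax(X, axis=0)
        # The two extra statistics proposed in step 1; NaN entries are ignored
        self.data_mean_ = np.nanmean(X, axis=0)
        self.data_std_ = np.nanstd(X, axis=0)  # population std (ddof=0)
        return self

    def transform(self, X):
        # Statistics are only recorded; the data passes through unchanged
        return X

stats = ContinuousStatsSketch().fit([[29.0, 1.0], [77.0, 3.0], [np.nan, 5.0]])
print(stats.data_mean_, stats.data_std_)
```

The nan-prefixed NumPy functions skip missing values, so a column with gaps still yields finite statistics as long as it has at least one observed value.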

vruusmann commented on July 19, 2024

Exactly so. I'd only need some assistance with the first item (as Python is not my main language, and I don't know Numpy and/or Pandas libraries that well).

The NumericInfo element has the following attributes; what's the most appropriate Numpy/Pandas library call for computing them (needs to be missing value safe)?

  • minimum
  • maximum
  • mean
  • standardDeviation
  • median
  • interQuartileRange

The NumericInfo element also takes Quantile child elements. Any interest in adding those?
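
One possible answer to the question about missing-value-safe calls, sketched with NumPy's nan-aware functions. The helper name numeric_info and the returned dictionary layout are hypothetical; the keys simply mirror the NumericInfo attribute names listed above.

```python
import numpy as np

def numeric_info(values):
    """Compute NumericInfo-style statistics for one column, ignoring NaN."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])
    return {
        "minimum": float(np.nanmin(x)),
        "maximum": float(np.nanmax(x)),
        "mean": float(np.nanmean(x)),
        # Population standard deviation; pass ddof=1 for the sample version
        "standardDeviation": float(np.nanstd(x)),
        "median": float(np.nanmedian(x)),
        "interQuartileRange": float(q3 - q1),
    }

info = numeric_info([1.0, 2.0, np.nan, 3.0, 4.0, 5.0])
print(info)
```

With pandas, the corresponding Series methods (min, max, mean, std, median, quantile) skip NaN by default; note that Series.std defaults to ddof=1, whereas np.nanstd defaults to ddof=0, so the two can disagree unless ddof is set explicitly.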

movingname commented on July 19, 2024

Thanks. I have created a short PR above to add the statistics. Could you please take a look and see if it makes sense?

I am not currently interested in the Quantile child elements :) But if you think it would be good to include them in the change, I can help.

vruusmann commented on July 19, 2024

Thanks! I expect to work on this issue this week, once I have caught up with super-critical things (e.g. updating SkLearn from 0.18.1 to 0.18.2, which appears to include several breaking API changes).

The NumericInfo element corresponds to continuous features. Does your use case involve any categorical features (e.g. are there any string or integer columns in the original dataset)? We should try to enhance the sklearn2pmml.decoration.CategoricalDomain class as well, in order to keep things "balanced".

What would be appropriate Python code for collecting data for the Counts element? At minimum, should populate Counts@totalFreq and Counts@missingFreq attributes.
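
A hedged sketch of how the two Counts attributes could be collected, assuming pandas is available. The helper name counts is hypothetical. pandas is used here because its null check treats both None and NaN as missing, which also works for string (categorical) columns where np.isnan would raise an error.

```python
import numpy as np
import pandas as pd

def counts(values):
    """Return totalFreq and missingFreq for one column."""
    s = pd.Series(values)
    return {
        # totalFreq counts every row, including missing ones
        "totalFreq": int(s.size),
        # isnull() flags both None and NaN
        "missingFreq": int(s.isnull().sum()),
    }

categorical = counts(["a", None, "b", np.nan])
continuous = counts([29.0, np.nan, 77.0])
print(categorical, continuous)
```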

movingname commented on July 19, 2024

Sure! Please take your time to handle the more urgent issues!

Changes to the PR:

  • Added totalFreq and missingFreq to ContinuousDomain.
  • My use case currently does not have any categorical features, so I just added totalFreq and missingFreq to CategoricalDomain to keep things a bit balanced.

I am happy to make more changes if needed. Thanks a lot!

movingname commented on July 19, 2024

Hi vruusmann,

I think your earlier suggestion to add Quantile elements is reasonable. In most cases people would use the 1st and 3rd quartiles, so I have added them to the PR.
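
For illustration, the 1st and 3rd quartiles could be computed as (limit, value) pairs along these lines. The helper name quartile_pairs is hypothetical, and the pairing with percentage limits is an assumption about how the values would later be mapped onto Quantile elements; the code in the PR itself is not shown in this thread.

```python
import numpy as np

def quartile_pairs(values, limits=(25, 75)):
    """Return (percentage limit, value) pairs, ignoring NaN entries."""
    x = np.asarray(values, dtype=float)
    # np.nanpercentile uses linear interpolation between data points by default
    q = np.nanpercentile(x, limits)
    return [(int(limit), float(value)) for limit, value in zip(limits, q)]

pairs = quartile_pairs([29.0, 33.0, np.nan, 54.0, 61.0, 77.0])
print(pairs)  # [(25, 33.0), (75, 61.0)]
```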

Please take your time to handle more urgent issues and then take a look at the PR.
