Comments (7)
This kind of functionality has been requested before. I even remember seeing such an issue somewhere here (or maybe in some other JPMML-Converter sub-project), but I can't locate it at the moment.
Essentially, it will be necessary to enhance the business logic of the `sklearn2pmml.decoration.CategoricalDomain` and `sklearn2pmml.decoration.ContinuousDomain` classes to collect statistics about the `X` argument of the `fit` method. For example, the `ContinuousDomain` class currently collects the lower and upper bounds (this information is used to define the "applicability interval" of individual features):
https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L50-L51
Here's the deal - code up the Python part, and I'll do the other parts.
from sklearn2pmml.
Thank you @vruusmann for the reply and for creating these pmml-related packages.
Here is a possible plan based on your comment (forgive me if it is unreasonable, since I have only just started with sklearn2pmml):
- Calculate and save `data_mean_` and `data_std_` after https://github.com/jpmml/sklearn2pmml/blob/master/sklearn2pmml/decoration/__init__.py#L50-L51
- Then, on the jpmml-sklearn side, add these statistics to the ValidValueDecorator after https://github.com/jpmml/jpmml-sklearn/blob/master/src/main/java/sklearn2pmml/decoration/ContinuousDomain.java#L64
- Finally, in jpmml-sklearn, put the statistics in the `NumericInfo` part, like the example below (from http://dmg.org/pmml/v4-3/Statistics.html)
<ModelStats>
<UnivariateStats field="Age">
<Counts totalFreq="240"/>
<NumericInfo mean="54.43" minimum="29" maximum="77"/>
<ContStats>
<Interval closure="openClosed" leftMargin="29" rightMargin="33.8"/>
<Array type="int" n="10"> 1 8 28 30 30 43 51 33 13 3</Array>
</ContStats>
</UnivariateStats>
...
</ModelStats>
It seems that only step 1 is done in Python and it only takes 2 lines of code. I am happy to create a PR (and do more) if needed. Thanks!
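A minimal sketch of what step 1 could look like, for illustration only. The class `ContinuousStatsSketch` is a hypothetical stand-in, not the actual `ContinuousDomain` implementation; the attribute names `data_min_` and `data_max_` are assumed here for the existing bounds, and the NaN-aware NumPy functions are an assumption about how missing values should be treated.

```python
import numpy as np


class ContinuousStatsSketch:
    """Hypothetical stand-in for a domain decorator; not the sklearn2pmml class."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Bounds of each column (the "applicability interval" mentioned above)
        self.data_min_ = np.nanmin(X, axis=0)
        self.data_max_ = np.nanmax(X, axis=0)
        # The two additional statistics proposed in step 1
        self.data_mean_ = np.nanmean(X, axis=0)
        self.data_std_ = np.nanstd(X, axis=0)
        return self

    def transform(self, X):
        # A decorator leaves the data unchanged
        return X


sketch = ContinuousStatsSketch().fit([[29.0, 1.0], [77.0, np.nan], [53.0, 3.0]])
print(sketch.data_mean_, sketch.data_std_)
```

The statistics are computed per column (`axis=0`), so each feature gets its own mean and standard deviation.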
Exactly so. I'd only need some assistance with the first item (as Python is not my main language, and I don't know Numpy and/or Pandas libraries that well).
The `NumericInfo` element has the following attributes; what's the most appropriate Numpy/Pandas library call for computing them (needs to be missing value safe)?
minimum
maximum
mean
standardDeviation
median
interQuartileRange
The `NumericInfo` element also takes `Quantile` child elements. Any interest in adding those?
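One possible answer to the question above, as a sketch rather than the code that ended up in the PR: NumPy's NaN-aware functions cover all six attributes for a single continuous column. The helper name `numeric_info` is hypothetical, and the choice of the population standard deviation (`ddof=0`, NumPy's default) is an assumption.

```python
import numpy as np


def numeric_info(values):
    """Compute NumericInfo-style statistics for one column, ignoring NaN."""
    x = np.asarray(values, dtype=float)
    # First and third quartiles, using NumPy's default linear interpolation
    q1, q3 = np.nanpercentile(x, [25, 75])
    return {
        "minimum": float(np.nanmin(x)),
        "maximum": float(np.nanmax(x)),
        "mean": float(np.nanmean(x)),
        "standardDeviation": float(np.nanstd(x)),  # population std (ddof=0)
        "median": float(np.nanmedian(x)),
        "interQuartileRange": float(q3 - q1),
    }


info = numeric_info([1.0, 2.0, np.nan, 3.0, 4.0])
print(info)
```

With Pandas, the equivalent `Series` methods (`min`, `max`, `mean`, `std`, `median`, `quantile`) skip missing values by default, but `Series.std` uses `ddof=1`, so the two libraries give different standard deviations unless `ddof` is set explicitly.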
Thanks. I have created a short PR above to add the statistics. Could you please take a look and see if it makes sense?
I am currently not interested in the Quantile child elements :) But if you think it is good to include them in the change, I can help.
Thanks! Expecting to work on this issue already this week, after I have caught up with super-critical things (eg. updating SkLearn from 0.18.1 to 0.18.2, which appears to include several breaking API changes).
The `NumericInfo` element corresponds to continuous features. Does your use case involve any categorical features (eg. are there any string or integer columns in the original dataset)? We should try to enhance the `sklearn2pmml.decoration.CategoricalDomain` class as well, in order to keep things "balanced".
What would be appropriate Python code for collecting data for the `Counts` element? At minimum, it should populate the `Counts@totalFreq` and `Counts@missingFreq` attributes.
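A sketch of one way to collect these two counts, assuming that "missing" means `None` or NaN as detected by Pandas. The helper name `counts` is hypothetical and is not taken from the PR.

```python
import numpy as np
import pandas as pd


def counts(values):
    """Return Counts-style frequencies for one column."""
    column = pd.Series(values)
    return {
        # All rows, whether or not a value is present
        "totalFreq": int(len(column)),
        # Rows holding None or NaN
        "missingFreq": int(column.isnull().sum()),
    }


c = counts(["a", None, "b", np.nan])
print(c)
```

Because `isnull` works on object columns as well as numeric ones, the same helper would apply to both categorical and continuous features.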
Sure! Please take your time to handle the more urgent issue!
Changes to the PR:
- Added totalFreq and missingFreq to ContinuousDomain.
- My use case currently does not have any categorical features, so I just added totalFreq and missingFreq to CategoricalDomain to keep things a bit balanced.
I am happy to make more changes if needed. Thanks a lot!
Hi vruusmann,
I think your previous suggestion of adding Quantile elements is reasonable. In most cases people would use the 1st and 3rd quartiles, so I have added them to the PR.
Please take your time to handle the more urgent issues, and then take a look at the PR.
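For illustration, the quartiles could be gathered in a shape that maps onto PMML `Quantile` elements (attributes `quantileLimit` and `quantileValue`). This is only a sketch: the helper name `quantile_entries` and the default limits are assumptions, not the code from the PR.

```python
import numpy as np


def quantile_entries(values, limits=(25, 75)):
    """Pair each percentage limit with its NaN-safe quantile value."""
    x = np.asarray(values, dtype=float)
    quantile_values = np.nanpercentile(x, limits)
    return [
        {"quantileLimit": limit, "quantileValue": float(value)}
        for limit, value in zip(limits, quantile_values)
    ]


entries = quantile_entries([1.0, 2.0, np.nan, 3.0, 4.0])
print(entries)
```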