Here is my code: X_train_1, y_train_1 = load_svmlight_file('test.txt

Why the pmml file is too large? about sklearn2pmml HOT 3 CLOSED

jpmml commented on August 20, 2024

Why the pmml file is too large?

Comments (3)

vruusmann commented on August 20, 2024

rfcf.compact = True;

That's an outdated "option configuration" syntax. You should be using PMMLPipeline.configure(**pmml_options) now.

The compact = True option, when actually applied, should decrease the size of the PMML file by around 50%. To verify, export the same pipeline first with compact = False and then with compact = True, and compare the file sizes.

Also, please note that this 2.7 GB PMML file probably contains 2 GB of "XML markup" and 0.7 GB of whitespace (in the form of tab characters). You may safely strip the latter.
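A minimal sketch of stripping that indentation, using only the Python standard library. It assumes the whitespace is leading tabs or spaces on each line and that no element carries multi-line text whose indentation matters. The function name `strip_indentation` and the file paths are hypothetical, not part of sklearn2pmml.

```python
def strip_indentation(src_path, dst_path):
    """Copy a PMML file line by line, dropping leading tabs and spaces.

    Streams the file, so a multi-gigabyte document never has to fit in RAM.
    Returns the number of whitespace characters removed.
    """
    removed = 0
    # newline="" keeps the original line endings untouched
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            stripped = line.lstrip(" \t")
            removed += len(line) - len(stripped)
            dst.write(stripped)
    return removed

# Hypothetical usage:
# strip_indentation("pmml/RandomForestClassifierPipeline.pmml",
#                   "pmml/RandomForestClassifierPipeline.stripped.pmml")
```

This only shrinks the file on disk. It does not change how much memory the parsed model needs.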

Anyway, the size of the PMML file in local filesystem is quite irrelevant. What really needs to be optimized is RAM consumption when it is parsed/deployed in the production system.

vruusmann commented on August 20, 2024

It is a "special property" of tree models (and their ensembles such as Random Forest models) that the size of the model object increases as the size/complexity of the dataset increases.

To solve the problem, you must change the parameterization of RandomForestClassifier object so that the learning algorithm would limit the complexity of tree objects:

max_depth. Assuming binary splits, a tree of this depth can have up to 2^max_depth leaf nodes.
min_samples_split
min_samples_leaf
max_leaf_nodes

For example:

clf = RandomForestClassifier(n_estimators = 100, max_depth = 10, min_samples_split = 100, min_samples_leaf = 5)
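To put the max_depth bound in numbers, here is a small sketch of the worst-case size of a single binary tree. The helper name `max_tree_size` is made up for illustration. Real trees are usually smaller, because min_samples_split and min_samples_leaf stop splitting earlier.

```python
def max_tree_size(max_depth):
    """Upper bounds for a full binary tree with the root at depth 0.

    Returns (max_leaves, max_nodes).
    """
    max_leaves = 2 ** max_depth
    max_nodes = 2 ** (max_depth + 1) - 1  # internal nodes plus leaves
    return max_leaves, max_nodes

for depth in (5, 10, 20):
    leaves, nodes = max_tree_size(depth)
    print(f"max_depth={depth}: up to {leaves} leaves, {nodes} nodes per tree")
```

Multiply the per-tree bound by n_estimators to get a ceiling for the whole forest.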

I would advise you to increase the number of estimators (i.e. set the n_estimators parameter to a much greater value than 10), and focus on configuring the size of individual estimator trees.

It's not a PMML problem per se. For example, if you simply serialized your RF models in Python's native pickle data format, then you'd see exactly the same happening - a small dataset would give you a small pickle file, whereas a large dataset would give you a much bigger pickle file.
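The pickle comparison can be reproduced with a rough sketch like the following. It assumes scikit-learn and NumPy are installed. The synthetic data, the sample counts and the helper name `pickled_size` are illustrative choices, not taken from the original report.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pickled_size(n_samples, **rf_params):
    """Fit a small forest on synthetic data; return its pickle size in bytes."""
    rng = np.random.RandomState(0)
    X = rng.rand(n_samples, 20)
    # Noisy labels push unconstrained trees to grow deep
    y = (X[:, 0] + 0.5 * rng.randn(n_samples) > 0.5).astype(int)
    clf = RandomForestClassifier(n_estimators=10, random_state=0, **rf_params)
    clf.fit(X, y)
    return len(pickle.dumps(clf))

small = pickled_size(500)    # unconstrained trees, small dataset
large = pickled_size(5000)   # unconstrained trees, 10x more rows
capped = pickled_size(5000, max_depth=5, min_samples_leaf=5)

print("small dataset: ", small, "bytes")
print("large dataset: ", large, "bytes")
print("large, capped: ", capped, "bytes")
```

With unconstrained trees the serialized size grows with the number of rows, while the capped forest stays bounded no matter how much data is used. Whether the capped model keeps acceptable accuracy has to be checked on a validation set.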

raosudhir commented on August 20, 2024

Hello! I am running into the same issue, if I can call it that: too large a PMML file!
Here is my code with pipeline, and classifier params:

rfcf_pipeline = PMMLPipeline([("classifier", rfcf)])
rfcf_pipeline.fit(x_train, y_train)
rfcf.compact = True;
sklearn2pmml(rfcf_pipeline, "pmml/RandomForestClassifierPipeline.pmml")

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False))

My training dataset has 557K records and the generated PMML file is 2.7G in size. I currently have 20 features and haven't finalized them yet. Likely I'll remove some but add other ones.

I think that the basic issue causing the PMML file size to bloat would be the depth of my trees.

The predicted results are good and I wouldn't want to have to lower the quality of the results (who would? :~))

Any recommendations on the parameters to tinker with? I was hoping setting "classifier.compact = True" will help reduce the size of the generated PMML file, but it didn't.
