
Comments (3)

vruusmann commented on August 20, 2024

rfcf.compact = True;

That's an outdated "option configuration" syntax. You should be using PMMLPipeline.configure(**pmml_options) now.

The compact = True option, when actually applied, should decrease the size of the PMML file by around 50%. To verify this, export the same pipeline first with compact = False and then with compact = True, and compare the two file sizes.
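A minimal sketch of that comparison, assuming a fitted PMMLPipeline. The helper name export_and_measure and the output file names are hypothetical; only PMMLPipeline.configure(**pmml_options) and sklearn2pmml(pipeline, path) come from the thread. The export step needs the sklearn2pmml package and a Java runtime, so the import is deferred until the helper is called.

```python
import os


def export_and_measure(pipeline, pmml_path, **pmml_options):
    """Apply conversion options to a fitted PMMLPipeline, export it,
    and return the size of the resulting PMML file in bytes.

    Hypothetical helper for illustration; not part of sklearn2pmml.
    """
    # Deferred import: the conversion requires sklearn2pmml and Java.
    from sklearn2pmml import sklearn2pmml

    # Replaces the outdated `estimator.compact = True` attribute syntax.
    pipeline.configure(**pmml_options)
    sklearn2pmml(pipeline, pmml_path)
    return os.path.getsize(pmml_path)
```

With the pipeline from the question, this could be called as export_and_measure(rfcf_pipeline, "rf_verbose.pmml", compact=False) and then export_and_measure(rfcf_pipeline, "rf_compact.pmml", compact=True). If the two returned sizes are identical, the option was most likely not applied.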

Also, please note that this 2.7 GB PMML file probably contains about 2 GB of "XML markup" and 0.7 GB of whitespace (in the form of tab characters). You may safely strip the latter.
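One way to strip that indentation is sketched below. It uses only the Python standard library and streams the file line by line, so a multi-gigabyte document is never held in memory. The function name strip_indentation is hypothetical, and the sketch assumes that leading tabs and spaces on each line are pure indentation (that is, no element carries significant multi-line text content).

```python
def strip_indentation(src_path, dst_path):
    """Copy an XML/PMML file, dropping leading tabs and spaces on every line.

    Assumes leading whitespace is indentation only; line breaks are kept.
    """
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            # lstrip with an explicit character set keeps the newline intact.
            dst.write(line.lstrip(" \t"))
```

The element structure is unchanged, so the stripped file should parse to the same model as the original.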

Anyway, the size of the PMML file in the local filesystem is quite irrelevant. What really needs to be optimized is RAM consumption when the model is parsed/deployed in the production system.

from sklearn2pmml.

vruusmann commented on August 20, 2024

It is a "special property" of tree models (and their ensembles such as Random Forest models) that the size of the model object increases as the size/complexity of the dataset increases.

To solve the problem, you must change the parameterization of the RandomForestClassifier object so that the learning algorithm limits the complexity of the tree objects:

  • max_depth. Assuming binary splits, the final tree can have up to 2^max_depth leaf nodes (and up to 2^(max_depth + 1) - 1 nodes in total).
  • min_samples_split
  • min_samples_leaf
  • max_leaf_nodes

For example:

clf = RandomForestClassifier(n_estimators = 100, max_depth = 10, min_samples_split = 100, min_samples_leaf = 5)
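To see the effect of these parameters on model size, the sketch below counts tree nodes in an unconstrained forest and in one using the limits from the example above. The dataset is a synthetic stand-in generated with make_classification (an assumption for illustration, not the data from the question), and the forests are kept small so the script runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def total_nodes(forest):
    """Sum the node counts of all trees in a fitted forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)


# Synthetic placeholder data; substitute the real training set.
X, y = make_classification(n_samples=5000, n_features=20,
                           n_informative=10, random_state=0)

# Default settings: trees grow until the leaves are pure.
unlimited = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Same limits as in the example above.
limited = RandomForestClassifier(n_estimators=20, max_depth=10,
                                 min_samples_split=100, min_samples_leaf=5,
                                 random_state=0).fit(X, y)

unlimited_nodes = total_nodes(unlimited)
limited_nodes = total_nodes(limited)
deepest_limited = max(est.tree_.max_depth for est in limited.estimators_)

print("nodes without limits:", unlimited_nodes)
print("nodes with limits:   ", limited_nodes)
print("deepest limited tree:", deepest_limited)
```

Since every node becomes an element in the exported file, the node count is a reasonable proxy for PMML size. Comparing held-out accuracy of the two forests shows how much predictive quality, if any, the limits cost.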

I would advise you to increase the number of estimators (i.e. set the n_estimators parameter to a much greater value than 10), and to focus on configuring the size of the individual estimator trees.

It's not a PMML problem per se. For example, if you simply serialized your RF models in Python's native pickle data format, then you'd see exactly the same thing happening: a small dataset would give you a small pickle file, whereas a large dataset would give you a much bigger pickle file.
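This can be checked without PMML at all. The sketch below pickles two forests with identical default settings, trained on a small and a large synthetic dataset (generated with make_classification as a placeholder for real data), and compares the serialized sizes. The helper name pickled_size is hypothetical.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def pickled_size(n_samples):
    """Train a default random forest on n_samples rows and
    return the size of its pickle in bytes."""
    X, y = make_classification(n_samples=n_samples, n_features=20,
                               random_state=0)
    forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
    return len(pickle.dumps(forest))


small_size = pickled_size(500)
large_size = pickled_size(5000)

print("pickle size, 500 rows: ", small_size)
print("pickle size, 5000 rows:", large_size)
```

With unconstrained trees, the larger dataset produces more splits per tree, so the second pickle comes out larger even though the hyperparameters are the same.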


raosudhir commented on August 20, 2024

Hello! I am running into the same issue, if I can call it that: too large a PMML file!
Here is my code with the pipeline and the classifier params:

rfcf_pipeline = PMMLPipeline([("classifier", rfcf)])
rfcf_pipeline.fit(x_train, y_train)
rfcf.compact = True;
sklearn2pmml(rfcf_pipeline, "pmml/RandomForestClassifierPipeline.pmml")

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=True, random_state=None, verbose=0, warm_start=False))

My training dataset has 557K records and the generated PMML file is 2.7 GB in size. I currently have 20 features and haven't finalized them yet. I'll likely remove some and add others.

I think that the basic issue causing the PMML file size to bloat would be the depth of my trees.

The predicted results are good and I wouldn't want to have to lower the quality of the results (who would? :~))

Any recommendations on the parameters to tinker with? I was hoping that setting "classifier.compact = True" would help reduce the size of the generated PMML file, but it didn't.
