
Comments (4)

jameslamb commented on July 18, 2024

Thanks for your interest in LightGBM.

I think we need a lot more information to consider this.

I can see from PyPI that this project is yours. I'm just noting that here because it's important for understanding this request.

Some initial questions:

  • where is the source code managed? Can you please share that? there is no link to source control on the PyPI page you shared
  • what specifically would you like to see by "integrating" this library here? for example:
    • copying relevant parts of its code into LightGBM
    • having lightgbm (the LightGBM Python package) depend on it and consume it as a library
    • something else

from lightgbm.

Rajat376 commented on July 18, 2024

We manage the source code on our private GitLab servers.
"copying relevant parts of its code into LightGBM" should be the ideal scenario.


i-plusplus commented on July 18, 2024

Let me try to rewrite it with some data points.

We are using LightGBM for CTR prediction. Most of the features in our dataset are categorical features with very high cardinality.

We observed that LightGBM uses only 1-2% of the feature values in the final model, namely the ones that matter.
We further observed that the LightGBM model dump contains all the feature values, and they are repeated at every place where a decision is taken based on that feature.

This leads to a very big model dump, which created memory issues for us.

We recreated the model string after removing the unused categorical values in "pandas_categorical", then changed "cat_threshold" and "cat_boundaries" to match the new pandas_categorical. We also removed "tree_sizes" to account for the changed tree sizes. This reduced the model size from ~7 GB to ~100 MB. We further observed a reduction in inference time of about 30% to 60% on different models.
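The pruning step described above can be sketched roughly as follows. This is a minimal illustration, not the code from lightgbmmodeloptimizer (which is not shown in this thread). It assumes that each categorical split stores its selected categories as a bitset of 32-bit words, that a boundaries list holds the offset of each split's bitset, and that there is a single categorical feature. The names `decode_bitset`, `encode_bitset`, and `prune_categories` are hypothetical.

```python
from typing import List, Sequence, Tuple

BITS = 32  # assumed width of each bitset word


def decode_bitset(words: Sequence[int]) -> List[int]:
    """Return the category codes whose bit is set in a list of 32-bit words."""
    return [
        i * BITS + b
        for i, word in enumerate(words)
        for b in range(BITS)
        if (word >> b) & 1
    ]


def encode_bitset(codes: Sequence[int]) -> List[int]:
    """Pack category codes into the shortest list of 32-bit words that holds them."""
    if not codes:
        return [0]
    words = [0] * (max(codes) // BITS + 1)
    for code in codes:
        words[code // BITS] |= 1 << (code % BITS)
    return words


def prune_categories(
    categories: Sequence[str], split_bitsets: Sequence[Sequence[int]]
) -> Tuple[List[str], List[List[int]], List[int]]:
    """Drop categories that no split references and renumber the rest."""
    # Codes referenced by at least one split, in ascending order.
    used = sorted({c for words in split_bitsets for c in decode_bitset(words)})
    remap = {old: new for new, old in enumerate(used)}
    new_categories = [categories[c] for c in used]
    # Re-encode every split's bitset with the compacted codes.
    new_bitsets = [
        encode_bitset([remap[c] for c in decode_bitset(words)])
        for words in split_bitsets
    ]
    # Rebuild the offsets of each split's bitset in the flattened threshold list.
    boundaries = [0]
    for words in new_bitsets:
        boundaries.append(boundaries[-1] + len(words))
    return new_categories, new_bitsets, boundaries


# Hypothetical example: 100 category values, but the splits use only three.
categories = [f"ad_{i}" for i in range(100)]
split_bitsets = [encode_bitset([3, 70]), encode_bitset([70, 99])]
new_categories, new_bitsets, boundaries = prune_categories(categories, split_bitsets)
```

In this toy case the category list shrinks from 100 entries to 3, and each split's bitset shrinks from several words to one. With real high-cardinality features the same renumbering is what would shrink both the category list and every repeated bitset.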

This improvement is very useful for very high cardinality features when LightGBM uses only a small fraction of their values.

Even though the code is fairly small (~200 lines of Python) and simple, please allow us some more time to open source it. We are in the process of making the code open source.

I believe someone from the LightGBM core team can better decide whether they want to include this change in the LightGBM core library or as an extension in the Python wrapper.

All the above numbers can be easily verified with the following code:

import lightgbm as lgb
from lightgbmmodeloptimizer import Optimizer

# Option 1: optimize an in-memory Booster.
model = lgb.Booster(model_file='<path to model>')
optimizer = Optimizer()
model = optimizer.optimize_booster(model)

# Option 2: optimize a model file directly.
# The current implementation overwrites the input path with the optimized model file.
optimizer = Optimizer()
m = optimizer.optimize_model_file('<path to model>')


jameslamb commented on July 18, 2024

Thanks for that!

> I believe someone from the LightGBM core team can better decide whether they want to include this change in the LightGBM core library or as an extension in the Python wrapper.

I agree with this. If you could share source code we could look through (ideally somewhere on the internet that we could link to), I'd be happy to investigate this and offer an opinion on which parts if any should be integrated into LightGBM.

> All the above numbers can be easily verified with the following code

Thanks for trying to provide some code, but this claim is not true.

Your code doesn't include any information about the model or dataset. Based on the features that you described (like somehow condensing high-cardinality categoricals), it sounds to me like the performance gains that might be observed are very dependent on the specific dataset.

