Comments (2)

richcaruana commented on May 29, 2024

Because EBMs are a restricted model class, kept simple so that they remain intelligible, they do not need as much data as some other model types such as neural nets or boosted decision trees. In practice their data requirements are closer to those of linear and logistic regression than to those of deep neural nets, but they often need, or at least benefit from, more data than a linear model would. The more features in the dataset, the more data you need to learn an accurate model, and the more complex the function needed for each feature, the more data is needed to shape those functions accurately, so it is difficult to give numbers without knowing more about the data and the problem. Our experience is that useful models with a few dozen features can be trained on data with 500 or more cases if the data is not too imbalanced. I like to look at the size of the smallest important class when I think about data size. If there are 10k training cases but only 1% are positives, then there are only 100 positive cases, and this no longer behaves like a large 10k sample. There is also a difference between classification and regression: regression can often work with fewer samples because the label of each sample carries more information than a Boolean classification label, which is only 0 or 1.
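The "smallest important class" check above is easy to run before training. Here is a minimal sketch in plain Python; the helper name `smallest_class_size` and the synthetic label list are made up for illustration and are not part of interpret.

```python
from collections import Counter


def smallest_class_size(labels):
    """Return (label, count) for the rarest class in a sequence of labels.

    Hypothetical helper for illustration; not part of the interpret package.
    """
    counts = Counter(labels)
    # Pick the (label, count) pair with the lowest count.
    return min(counts.items(), key=lambda item: item[1])


# Synthetic labels matching the example above: 10k cases, 1% positives.
labels = [1] * 100 + [0] * 9900

rare_label, rare_count = smallest_class_size(labels)
print(f"total cases: {len(labels)}, rarest class: {rare_label}, count: {rare_count}")
```

If the count of the rarest class is small, it is probably safer to judge the dataset by that number than by the total row count.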

In summary, EBMs are reasonably sample efficient: they need somewhat more data than linear methods, but usually not as much as more complex black-box methods such as neural nets and unrestricted boosted trees, and they often work well with sample sizes of about 1000 cases or more. If there are very few samples for training, it sometimes helps to adjust the EBM hyperparameters to use more outer bagging, fewer bins, and even shorter trees.
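As a rough sketch of what that adjustment might look like, the snippet below collects the three settings into keyword arguments for `ExplainableBoostingClassifier`. The parameter names `outer_bags`, `max_bins`, and `max_leaves` are the ones interpret exposes; the specific values and the helper `small_sample_ebm_params` are illustrative assumptions, not recommendations from this thread, and would need tuning on your own data.

```python
def small_sample_ebm_params(outer_bags=25, max_bins=32, max_leaves=2):
    """Build EBM keyword arguments for a small training set.

    Hypothetical helper; the default values here are illustrative guesses.
    - outer_bags: more outer bagging to average out noise
    - max_bins:   fewer bins per feature
    - max_leaves: fewer leaves per tree, i.e. shorter trees
    """
    return {"outer_bags": outer_bags, "max_bins": max_bins, "max_leaves": max_leaves}


params = small_sample_ebm_params()

try:
    from interpret.glassbox import ExplainableBoostingClassifier
except ImportError:
    # interpret is not installed; the params dict can still be inspected.
    ExplainableBoostingClassifier = None

if ExplainableBoostingClassifier is not None:
    ebm = ExplainableBoostingClassifier(**params)
    # ebm.fit(X_train, y_train)  # X_train / y_train are your own data
```

Comparing cross-validated performance against the default settings is a reasonable way to check whether any of these changes actually help on a given dataset.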

from interpret.

basnetpro3 commented on May 29, 2024

Thank you very much, sir. I really like your EBM model. So if we have only 3 or 4 features, we can still get insights from an EBM with less data. Some research papers using EBMs have worked with around 300 samples or fewer. From reading research papers alone, we can't really say how much data a particular model actually requires, because the data size varies from paper to paper, from 50-100 samples to many more.

