Add sklearn's HistGradientBoosting about hummingbird HOT 13 CLOSED

microsoft commented on May 22, 2024 1

Add sklearn's HistGradientBoosting

from hummingbird.

Comments (13)

NicolasHug commented on May 22, 2024 1

I'm one of the authors of the histgradientboosting estimators, feel free to ping me if you have any question related to them!

from hummingbird.

interesaaat commented on May 22, 2024 1

Glad to see you here Nicholas :)

from hummingbird.

NicolasHug commented on May 22, 2024 1

Yup I think you got it right

The array of nodes is initialized with all fields being 0. If a node doesn't have a left child that means it doesn't have a right child either so the left/right fields are 0, and the is_leaf field is True/1.

from hummingbird.

NicolasHug commented on May 22, 2024 1

Why not convert the array "upstream" so that you can rely on the existing code for the non-hist estimators?

lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]
lefts = [idx if idx != 0 else -1 for idx in lefts]

from hummingbird.

ahmedkrmn commented on May 22, 2024

Hi Matteo,
Can you give more context on how can I approach the implementation of this feature?

from hummingbird.

interesaaat commented on May 22, 2024

Hi Ahmed,
thanks for showing interest in Hummingbird! @ksaur has create a branch with some test. If you pull the branch and try to run this test file you should be getting something like the following:

hummingbird.ml.exceptions.MissingConverter: Unable to find converter for model type <class 'sklearn.ensemble._hist_gradient_boosting.gradient_boosting.HistGradientBoostingClassifier'>. It usually means the pipeline being converted contains a transformer or a predictor with no corresponding converter implemented. Please fill an issue at https://github.com/microsoft/hummingbird.

Next, to add a new operator in Hummingbird:

Go into hummingbird.ml.supported and add the HistGradientBoostingClassifier class to the _build_sklearn_operator_list (and to the documentation at the beginning of the file please!). This will basically tell to Hummingbird to recognize this new operator.
Now we need to actually add the operator converter. You can add the converter to hummingbird.ml.operator_converters.gbdt.py (just copy and past the last line in the file, and change "SklearnGradientBoostingClassifier" with "SklearnHistGradientBoostingClassifier"). This will tell Hummingbird that, to convert HistGradientBoostingClassifier it can use the same function of GradientBoostingClassifier. This will probably not work :)
The final step is to make the convert work. This will require some work on your side. Basically you can copy what we have for the GradientBoostingClassifier in convert_sklearn_gbdt_classifier into a new convert_sklearn_hist_gbdt_classifier function and try to map the tree parameters into the format understood by convert_sklearn_gbdt_classifier. You don't need to do anything more than this. convert_sklearn_gbdt_classifier will already pick the pytorch tree implementation for you, so there is no need to go deeper than this or implement anything in PyTorch.

Please share any doubt or question you may have!

from hummingbird.

ahmedkrmn commented on May 22, 2024

Thanks @interesaaat for the detailed introduction and @NicolasHug for offering help!

I've installed the dependencies, and built the library using python setup.py install.
Everything is working fine. I ran the test_sklearn_histgbdt_converters.py that you've mentioned and it indeed provides the error message that you've said.

Anyways, I started implementing the convert_sklearn_hist_gbdt_classifier() function, but there is a problem that I would like to know your thoughts on:

First, I believe that the equivalent of tree_infos = operator.raw_operator.estimators_ from the convert_sklearn_gbdt_classifier() function would be tree_infos = operator.raw_operator._predictors in the new convert_sklearn_hist_gbdt_classifier().

The problem is that estimators_ is an array of DecisionTreeRegressor objects which have a tree_ property. This tree_ object has the following list of properties:
children_left, children_right, feature, threshold and value.

On the other hand, _predictors is an array of TreePredictor objects which has only one property nodes (arrays of nodes).

So my question is, how can we map the tree_infos in HistGradientBoostingClassifier to GradientBoostingClassifier?

from hummingbird.

NicolasHug commented on May 22, 2024

So my question is, how can we map the tree_infos in HistGradientBoostingClassifier to GradientBoostingClassifier?

A tree object has many arrays of size n_nodes, i.e. it has one array per property as you noticed (children_left, children_right, etc)

However the predictor object of the hist-GBDT is different: it's a single structured numpy array, i.e. it's an array whose elements have a specific dtype with multiple entries. It's basically an array of structs, if we were in C.
The dtype is specified here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/_hist_gradient_boosting/common.pyx#L18

For example the threshold property of the root can be accessed via nodes[0]['threshold']. Its left child is in nodes[nodes[0]['left']], etc.

from hummingbird.

ahmedkrmn commented on May 22, 2024

Thanks @NicolasHug for the clarification!

IIUC, the equivalent of:

tree_info = operator.raw_operator.estimators_[0][0]
lefts = tree_info.tree_.children_left

should be:

tree_info = operator.raw_operator._predictors[0][0]
lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]

when using hist-GBDT.

If that is the case, it seems like nodes which don't have left nodes are represented with 0 instead of -1:
lefts for GBDT: [1, 2, -1, -1, 5, -1, -1]
lefts for hist-GBDT: [1, 2, 0, 0, 5, 0, 0]

from hummingbird.

interesaaat commented on May 22, 2024

I think that for left, right and threshold we should have -1 instead of 0 in Hummingbird because the implementation looks for -1 values. (I might be wrong, but I am on the phone and I am having hard time checking the code) Anyway this is not hard :) Thanks Nicolas for the help!

from hummingbird.

ahmedkrmn commented on May 22, 2024

OK great,
But wouldn't that require changing all the conditions with -1 to 0 in

hummingbird/hummingbird/ml/operator_converters/_tree_commons.py

Line 62 in aac965a

def _find_max_depth(tree_parameters):

and

hummingbird/hummingbird/ml/operator_converters/_tree_commons.py

Line 106 in aac965a

def _find_depth(node, current_depth):

from hummingbird.

ahmedkrmn commented on May 22, 2024

@interesaaat would it be a good idea to compare against both -1 and 0 instead of -1 only in _find_max_depth() and leave the base case of recursion -1 as is for _find_depth(), or is there a better approach that you suggest?

from hummingbird.

ahmedkrmn commented on May 22, 2024

Why not convert the array "upstream" so that you can rely on the existing code for the non-hist estimators?
lefts = [tree_info.nodes[x]['left'] for x in range(len(tree_info.nodes))]
lefts = [idx if idx != 0 else -1 for idx in lefts]

Yeah, that's better!

from hummingbird.

Add sklearn's HistGradientBoosting about hummingbird HOT 13 CLOSED

Comments (13)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs