tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.

License: Apache License 2.0

Starlark 4.03% Python 71.90% JavaScript 1.17% C++ 20.24% Batchfile 0.17% Shell 2.00% PureBasic 0.49%

python machine-learning random-forest tensorflow ml decision-trees gradient-boosting interpretability decision-forest keras

decision-forests's Introduction

TensorFlow Decision Forests (TF-DF) is a library to train, run and interpret decision forest models (e.g., Random Forests, Gradient Boosted Trees) in TensorFlow. TF-DF supports classification, regression and ranking.

TF-DF is powered by Yggdrasil Decision Forests (YDF), a library to train and use decision forests in C++, JavaScript, CLI, and Go. TF-DF models are compatible with YDF models, and vice versa.

TensorFlow Decision Forests is available on Linux and Mac. Windows users can use the library through WSL+Linux.

Usage example

A minimal end-to-end run looks as follows:

import tensorflow_decision_forests as tfdf
import pandas as pd

# Load the dataset in a Pandas dataframe.
train_df = pd.read_csv("project/train.csv")
test_df = pd.read_csv("project/test.csv")

# Convert the dataset into a TensorFlow dataset.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="my_label")
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_df, label="my_label")

# Train the model
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# Look at the model.
model.summary()

# Evaluate the model.
model.evaluate(test_ds)

# Export to a TensorFlow SavedModel.
# Note: the model is compatible with Yggdrasil Decision Forests.
model.save("project/model")

Google I/O Presentation

Documentation & Resources

The following resources are available:

TF-DF on TensorFlow.org (API Reference, Guides and Tutorials)
Tutorials (on tensorflow.org)
YDF documentation (also applicable to TF-DF)
Issue tracker
Known issues
Changelog
More examples

Installation

To install TensorFlow Decision Forests, run:

pip3 install tensorflow_decision_forests --upgrade

See the installation page for more details, troubleshooting and alternative installation solutions.

Contributing

Contributions to TensorFlow Decision Forests and Yggdrasil Decision Forests are welcome. If you want to contribute, make sure to review the developer manual and contribution guidelines.

Citation

If you use TensorFlow Decision Forests in a scientific publication, please cite the following paper: Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library.

Bibtex

@inproceedings{GBBSP23,
  author       = {Mathieu Guillame{-}Bert and
                  Sebastian Bruch and
                  Richard Stotz and
                  Jan Pfeifer},
  title        = {Yggdrasil Decision Forests: {A} Fast and Extensible Decision Forests
                  Library},
  booktitle    = {Proceedings of the 29th {ACM} {SIGKDD} Conference on Knowledge Discovery
                  and Data Mining, {KDD} 2023, Long Beach, CA, USA, August 6-10, 2023},
  pages        = {4068--4077},
  year         = {2023},
  url          = {https://doi.org/10.1145/3580305.3599933},
  doi          = {10.1145/3580305.3599933},
}

Raw

Yggdrasil Decision Forests: A Fast and Extensible Decision Forests Library, Guillame-Bert et al., KDD 2023: 4068-4077. doi:10.1145/3580305.3599933

Contact

You can contact the core development team at [email protected].

Credits

TensorFlow Decision Forests was developed by:

Mathieu Guillame-Bert (gbm AT google DOT com)
Jan Pfeifer (janpf AT google DOT com)
Richard Stotz (richardstotz AT google DOT com)
Sebastian Bruch (sebastian AT bruch DOT io)
Arvind Srinivasan (arvnd AT google DOT com)

License

Apache License 2.0

decision-forests's Issues

logit output of RandomForest

Unlike GBT, it seems that RandomForest does not have a logit output.

The logits are available in v0.1.7, but the signature is different from sklearn:

# Train a Gradient Boosted Trees model that returns logits (assuming the dataset is a binary classification).

model = tfdf.keras.GradientBoostedTreesModel(apply_link_function=False)
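
With apply_link_function=False the model returns raw scores rather than probabilities. As a minimal sketch, assuming the usual logistic link for binary classification, a raw score can be mapped back to a probability with a sigmoid. The helper name logit_to_probability is illustrative and is not part of TF-DF.

```python
import math


def logit_to_probability(logit: float) -> float:
    """Map a raw score (logit) to a probability with the logistic sigmoid."""
    # Branch on the sign so math.exp never receives a large positive argument.
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


# Hypothetical raw scores, only to show the mapping.
for raw_score in (-2.0, 0.0, 2.0):
    print(raw_score, round(logit_to_probability(raw_score), 4))
```

The sign branch keeps the computation finite for very large negative or positive scores.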

Name your decision forest models

I would like to be able to explicitly name my model. I've seen that the models have a name attribute but it does not appear to be possible to set this manually.

I've tried:

setattr(model, 'name', 'my_cool_model')

And:

tfdf.keras.RandomForestModel(name = 'my_cool_model')

variable importance option

Hello!

First of all, I highly appreciate your efforts on TFDF.
I found that there are multiple options for variable importance, such as NUM_AS_ROOT:

variable_importance = model.make_inspector().variable_importances()['NUM_AS_ROOT']

Could you let me know which option I should use to get an importance list similar to sklearn's?
Where can I find detailed descriptions of those options (how to use them, what they mean)?

Thank you!

NotFoundError when load model from different device

I got an error when loading a model with keras.models.load_model on a different device. This is my complete code:
from tensorflow import keras
model_path = '/content/drive/MyDrive/saved_model/my_model'
imported = keras.models.load_model(model_path)

I got an error like this:
NotFoundError: Op type not registered 'SimpleMLCreateModelResource' in binary running on 135bfc4bd927. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

The training and evaluation were completed successfully.
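
The snippet in this report never imports tensorflow_decision_forests, and the "Op type not registered" message is what TensorFlow prints when a SavedModel uses custom ops that the current process has not registered. A hedged sketch of the usual remedy is to import the package before loading. The function name load_tfdf_saved_model is hypothetical, and the sketch assumes a TensorFlow version whose Keras loader accepts a SavedModel directory, as in the reports here.

```python
def load_tfdf_saved_model(model_path):
    """Load a TF-DF SavedModel after registering the TF-DF custom ops."""
    # Importing the package is what registers ops such as
    # 'SimpleMLCreateModelResource'; the module object itself is not used.
    import tensorflow_decision_forests  # noqa: F401
    from tensorflow import keras

    return keras.models.load_model(model_path)
```

The imports live inside the function so that the definition can be pasted into a script that does not otherwise depend on TF-DF.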

Does TFDF support lazy dataset loading during training?

I'm asking for this feature because the dataset I'm working on is generally greater than RAM size (>1.5TiB)

For regular TensorFlow tasks, this can be worked around by tweaking the training loop and the dataset API.

As for TFDF, if I understand correctly, it is a wrapper over the Yggdrasil C API, and datasets are either copied or moved to Yggdrasil as a whole:

decision-forests/tensorflow_decision_forests/tensorflow/ops/training/features.h

Lines 381 to 393 in 0114e4a

// Initialize a dataset (including the dataset's dataspec) from the linked
// resource aggregators.
tensorflow::Status InitializeDatasetFromFeatures(
    tensorflow::OpKernelContext* ctx,
    const ::yggdrasil_decision_forests::dataset::proto::
        DataSpecificationGuide& guide,
    ::yggdrasil_decision_forests::dataset::VerticalDataset* dataset);

// Moves the feature values contained in the aggregators into the dataset.
// Following this call, the feature aggregators are empty.
tensorflow::Status MoveExamplesFromFeaturesToDataset(
    tensorflow::OpKernelContext* ctx,
    ::yggdrasil_decision_forests::dataset::VerticalDataset* dataset);

However, I'm seeing some interesting code in Yggdrasil:

https://github.com/google/yggdrasil-decision-forests/blob/52ed2571c46baa9738f81d7341dc27700dbfec73/yggdrasil_decision_forests/utils/filesystem_test.cc#L84-L93
https://github.com/google/yggdrasil-decision-forests/blob/52ed2571c46baa9738f81d7341dc27700dbfec73/yggdrasil_decision_forests/utils/filesystem_test.cc#L132-L140

I wonder if you could clarify how datasets are handled in and between TFDF and Yggdrasil. Is it even possible to train on a large dataset (> RAM size)? If that can be achieved with TFRecord, does it depend on how we define the TFRecord data layout?

Tensorflow decision forests after update to tf 2.6.0

There is a problem with tensorflow_decision_forests after updating TensorFlow to version 2.6.0.

Here is the gist: https://colab.research.google.com/gist/lukebor/70f7abd84d547bf39c4a8b47394e7017/beginner_colab.ipynb

I used the TensorFlow beginner tutorial and upgraded tf. If there is another way to import tfdf, please let me know.

predict_log_proba is missing on 0.1.7

I heard that predict_log_proba will be supported on 0.1.7 but it is still missing
Please take a look at #26

Shape error when using model.evaluate and model.fit(validation_data=validation_ds)

Dear authors,

I used tfdf.keras.pd_dataframe_to_tf_dataset for the train and test sets respectively, after making sure that both train and test had all 4 classes (a single label for each data point).

I found that labels in two sets were integer encoded ([0 1 2 3]).
I defined:

train = tfdf.keras.pd_dataframe_to_tf_dataset(df_train, label=label_column_name)
test = tfdf.keras.pd_dataframe_to_tf_dataset(df_test, label=label_column_name)
model = tfdf.keras.RandomForestModel(num_trees=5)
model.fit(train, validation_data=test)

It raised error:
ValueError: Shapes (None, 4) and (None, 1) are incompatible
Then I move to this code:

model.fit(train)
model.evaluate(test)

It raised error:
ValueError: Shapes (None, 4) and (None, 1) are incompatible
Then, I checked:

pred = model.predict(test)
print(pred[0])
print(np.unique(pred))

Output:

[0. 1. 0. 0.]
[0.  0.2 0.4 0.6 0.8 1. ]

Please help me to fix this error.
Thank you so much.
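
The two shapes in the error match what is printed above: each prediction row holds 4 class probabilities, while each label is a single integer. The sketch below does not fix the Keras error. It only shows, under the assumption that column i of the prediction corresponds to integer label i, how accuracy could be computed by hand from such an output. The function name and the toy values are illustrative.

```python
def accuracy_from_probabilities(probabilities, labels):
    """Fraction of rows whose highest-probability class equals the integer label."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    correct = 0
    for row, label in zip(probabilities, labels):
        # Index of the largest probability in this row (argmax).
        predicted_class = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted_class == label)
    return correct / len(labels)


# Toy values shaped like the output above: 4 class probabilities per example.
toy_predictions = [
    [0.0, 1.0, 0.0, 0.0],
    [0.2, 0.0, 0.8, 0.0],
    [0.6, 0.4, 0.0, 0.0],
]
toy_labels = [1, 2, 1]
print(accuracy_from_probabilities(toy_predictions, toy_labels))
```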

Getting value error for model.save()

I trained a model successfully. I was also able to use model.evaluate, model.summary, and tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=4).
But when I tried to save it using:
model.save("hypermodels/model")

I am getting the following error:

ValueError: Got non-flat/non-unique argument names for SavedModel signature 'serving_default': more than one argument to '__inference_signature_wrapper_12650' was named 'build_existing_model.geometry_foundation_type_Heated Basement'. Signatures have one Tensor per named input, so to have predictable names Python functions used to generate these signatures should avoid *args and Tensors in nested structures unless unique names are specified for each. Use tf.TensorSpec(..., name=...) to provide a name for a Tensor input.
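
The feature named in this error contains a space. One possible cause, which is a guess and not confirmed by the report, is that two column names become identical once characters such as spaces are normalized for the SavedModel signature. Under that assumption, a minimal sketch of a workaround is to rename the columns to safe, unique names before building the dataset. The helper name and the example column names are hypothetical.

```python
import re


def make_unique_safe_names(names):
    """Replace characters outside [0-9A-Za-z_] with '_' and keep names unique."""
    used = set()
    result = []
    for name in names:
        base = re.sub(r"[^0-9A-Za-z_]", "_", name)
        candidate, suffix = base, 1
        # Add a numeric suffix until the name no longer clashes with an earlier one.
        while candidate in used:
            suffix += 1
            candidate = f"{base}_{suffix}"
        used.add(candidate)
        result.append(candidate)
    return result


columns = ["foundation_type_Heated Basement", "foundation_type_Heated_Basement", "area"]
print(make_unique_safe_names(columns))
```

With pandas, the mapping could be applied with df.rename(columns=dict(zip(df.columns, make_unique_safe_names(df.columns)))) before calling pd_dataframe_to_tf_dataset.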

Performance issues in tensorflow_decision_forests/keras/keras_test.py(P2)

Hello, I found a performance issue in the definition of _synthetic_train_and_test in
tensorflow_decision_forests/keras/keras_test.py:
compression_type="GZIP").map(parse) is called without num_parallel_calls.
I think adding it would make the input pipeline more efficient.

The same issue also exists in test_path, compression_type="GZIP").map(parse).batch(50).map(preprocess)

Here is the TensorFlow documentation that supports this.

Looking forward to your reply. By the way, I would be glad to create a PR to fix it if you are too busy.

Add a parameter to support "interaction constraints"

The upstream dmlc xgboost has a feature called interaction constraints.

This feature is useful to train highly explainable models for high-risk applications like lending. It would be wonderful if TFDF boosting supported a similar option.

mkdtemp not getting cleaned up

I just ran out of space on /tmp/ after training about 200 decision forests. I think the temporary directory created at

decision-forests/tensorflow_decision_forests/keras/core.py

Line 378 in 58a5eb0

self._temp_directory = tempfile.mkdtemp()

is never cleaned up, even after the python process ends. Each model that I was training required about 20 megs. So after 200 models I had 4 gigs in /tmp/ and my operating system said "what is all that doing there??" and got mad at me.

I have two ideas about this.

1. It seems like it should be possible to insist that the learner doesn't use disk unless explicitly permitted by the user. But perhaps that is wildly naive.
2. It looks like the temporary directory is only actually used in _train_model. So we should be able to use a https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryDirectory context manager, just for that call (unless the user explicitly provides a temporary directory).

Thoughts?
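
The context-manager idea can be sketched with the standard library alone. This is not the TF-DF implementation: run_with_scratch_dir and fake_train are hypothetical stand-ins for the training call, used only to show that the scratch directory disappears once the call returns, unless the caller supplied a directory of their own.

```python
import os
import tempfile
from typing import Callable, Optional


def run_with_scratch_dir(train_fn: Callable[[str], object],
                         temp_directory: Optional[str] = None):
    """Call train_fn(directory); delete the directory unless the caller supplied it."""
    if temp_directory is not None:
        # User-provided directory: the caller owns it, so it is left in place.
        os.makedirs(temp_directory, exist_ok=True)
        return train_fn(temp_directory)
    with tempfile.TemporaryDirectory() as scratch:
        return train_fn(scratch)


def fake_train(directory: str) -> str:
    # Stand-in for the real training call: writes one file, returns the directory used.
    with open(os.path.join(directory, "working_cache.bin"), "wb") as handle:
        handle.write(b"\0" * 1024)
    return directory


used_directory = run_with_scratch_dir(fake_train)
print(os.path.exists(used_directory))  # False: the scratch directory was removed
```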

Checkpointing models during training

It seems the Keras ModelCheckpoint callback doesn't work with TFDF. Is there an alternate way to create checkpoints during training? I am training on a data set with tens of millions of samples and it takes several hours to train. I want to save the progress so that it doesn't need to retrain from scratch in case training crashes.

Please support threads or processes

background
If GPU support is difficult (and takes long), multiple threads or processes can speed up inference as well

feature request
Could you support a parameter like n_jobs?

README file Update

Hi,
Please review my small update for the README file here.
I know it's not big, but it's just a beginning.

C5.0 decision tree algorithm implementation request

Hi,

It would be very helpful to have a C5.0 decision tree algorithm implementation in tfdf, as there is none so far for Python, and I guess there is quite some demand to have this well-known algorithm, one of the best, at hand in Python!

It is quite different from CART:
- multiple branches,
- Information Gain (Entropy) as its splitting criterion,
- a different pruning technique (Binomial Confidence Limit),
- a different handling of missing values (estimates missing values as a function of other attributes, or apportions the case statistically among the results).
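
To make the splitting criterion named in the list concrete, here is a small sketch of entropy and information gain for a multi-branch split. It illustrates only the formula, not C5.0 or any TF-DF code, and the labels are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())


def information_gain(parent, branches):
    """Entropy of the parent minus the size-weighted entropy of the branches."""
    total = len(parent)
    weighted = sum(len(branch) / total * entropy(branch) for branch in branches)
    return entropy(parent) - weighted


parent = ["yes", "yes", "no", "no", "yes", "no"]
# A three-way split, as a multi-branch tree would produce.
branches = [["yes", "yes"], ["no", "no"], ["yes", "no"]]
print(round(information_gain(parent, branches), 4))
```

The two pure branches contribute zero entropy, so only the mixed third branch reduces the gain.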

I am sure it would boost the recognition and usability of tfdf library and make it especially useful for when strong and simple models that are directly explainable are needed.

Thank you for taking note!

FileNotFoundError: Op type not registered 'SimpleMLInferenceOpWithHandle'

Hi
An error happens when I try to load the saved model.
The code worked well with other Keras models. Thus, this may be a TFDF bug.

How to reproduce the issue

1) model save

   I saved a random forest model as follows
   model = tfdf.keras.RandomForestModel(num_trees=4000,
                                           max_depth=16,
                                           min_examples=1,
                                           winner_take_all=False,
                                           categorical_algorithm="RANDOM")
   model.fit(x=X, y=y)
   model.save('./random_forest_model')

2) model load

   When I tried to load the saved model in a different file, an error happened as follows.
   The issue did not happen if I loaded the model in the same file where the model was generated and saved.

error log

  classifier_model_loaded = tf.keras.models.load_model(classifier_model, compile=False)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/keras/saving/save.py", line 206, in load_model
  return saved_model_load.load(filepath, compile, options)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/keras/saving/saved_model/load.py", line 152, in load
  loaded = tf_load.load_partial(path, nodes_to_load, options=options)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/saved_model/load.py", line 775, in load_partial
  return load_internal(export_dir, tags, options, filters=filters)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/saved_model/load.py", line 908, in load_internal
  raise FileNotFoundError(

FileNotFoundError: Op type not registered 'SimpleMLInferenceOpWithHandle' in binary running on c1boes2. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) tf.contrib.resampler should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
If trying to load on a different device from the computational device, consider using setting the experimental_io_device option on tf.saved_model.LoadOptions to the io_device such as '/job:localhost'.

multiprocessing problem

Running a few RF models in parallel with multiprocessing works. But running a few RF models with multiprocessing after an RF has already been run gets stuck. I'm using the multiprocessing module with the following commands:

pool = multiprocessing.Pool()
pool.map(func, input)

In func I'm running a TensorFlow RF model.

Any idea why this is happening?

Thanks,
Tsachi

When loading a saved model from disk, predict is missing

Issue symptom
When loading a saved model from disk, predict is missing
How to reproduce
1. Create a model
2. Save it to disk
3. Load it from disk
4. Call model.predict (calling model(X, training=False) instead works well)
  
  model creation
  
  model = tfdf.keras.RandomForestModel(num_trees=n_trees, max_depth=depth, min_examples=1)
  model.fit(x=x_selected, y=y)
  
  model save
  
  path = './random_forest'
  os.makedirs(path, exist_ok=True)
  file_name = tempfile.TemporaryDirectory(dir=path).name
  model.save(file_name)
  
  model load
  
  loaded_model = tf.saved_model.load(file_name)
  
  inference
  
  score = loaded_model.predict(x_selected)

AttributeError: '_UserObject' object has no attribute 'predict'

Thanks!

TF DF model serving with TF Serving docker

Hi,

I have built the TF DF model and I am trying to serve it using Docker, I am using the following commands:

# Saved the model using the command:
model.save(MODEL_SAVE_PATH)


# Docker commands

docker pull tensorflow/serving

docker run -d --name serv_base_img tensorflow/serving

docker cp $PWD/models/my_classifier1 serv_base_img:/models/my_classifier1

docker commit --change "ENV MODEL_NAME my_classifier1" serv_base_img my_classifier1

docker run -p 8501:8501 --mount type=bind,source=$PWD/models/my_classifier1,target=/models/my_classifier1 -e MODEL_NAME=my_classifier1 -t tensorflow/serving &

I am getting the following issue:

[1] 76832
2021-06-16 13:03:59.138269: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config:  model_name: my_classifier1 model_base_path: /models/my_classifier1
2021-06-16 13:03:59.138494: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2021-06-16 13:03:59.138511: I tensorflow_serving/model_servers/server_core.cc:591]  (Re-)adding model: my_classifier1
2021-06-16 13:03:59.258773: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: my_classifier1 version: 1}
2021-06-16 13:03:59.258814: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: my_classifier1 version: 1}
2021-06-16 13:03:59.258834: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: my_classifier1 version: 1}
2021-06-16 13:03:59.259636: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:38] Reading SavedModel from: /models/my_classifier1/001
2021-06-16 13:03:59.300033: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:90] Reading meta graph with tags { serve }
2021-06-16 13:03:59.300099: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:132] Reading SavedModel debug info (if present) from: /models/my_classifier1/001
2021-06-16 13:03:59.301471: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-16 13:03:59.351039: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:277] SavedModel load for tags { serve }; Status: fail: Not found: Op type not registered 'SimpleMLCreateModelResource' in binary running on de74cefbb44d. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.. Took 91403 microseconds.
2021-06-16 13:03:59.351122: E tensorflow_serving/util/retrier.cc:37] Loading servable: {name: my_classifier1 version: 1} failed: Not found: Op type not registered 'SimpleMLCreateModelResource' in binary running on de74cefbb44d. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.

Any solution for this?
Thank you!!!

TFDF won't load on Google Colab

I can't install TFDF on Google Colab. The minimum working example I've made is here, and the first cell where I install and load the library fails.

The error is

NotFoundError                             Traceback (most recent call last)
<ipython-input-5-4f2d5416ffd4> in <module>()
      1 get_ipython().system('pip install tensorflow_decision_forests')
      2 import tensorflow as tf
----> 3 import tensorflow_decision_forests as tfdf

/usr/local/lib/python3.7/dist-packages/tensorflow_decision_forests/__init__.py in <module>()
     49 __author__ = "Mathieu Guillame-Bert"
     50 
---> 51 from tensorflow_decision_forests import keras
     52 from tensorflow_decision_forests.component import py_tree
     53 from tensorflow_decision_forests.component.builder import builder

/usr/local/lib/python3.7/dist-packages/tensorflow_decision_forests/keras/__init__.py in <module>()
     47 from typing import Callable, List
     48 
---> 49 from tensorflow_decision_forests.keras import core
     50 from tensorflow_decision_forests.keras import wrappers
     51 

/usr/local/lib/python3.7/dist-packages/tensorflow_decision_forests/keras/core.py in <module>()
     58 from tensorflow.python.training.tracking import base as base_tracking  # pylint: disable=g-direct-tensorflow-import
     59 from tensorflow_decision_forests.component.inspector import inspector as inspector_lib
---> 60 from tensorflow_decision_forests.tensorflow import core as tf_core
     61 from tensorflow_decision_forests.tensorflow.ops.inference import api as tf_op
     62 from tensorflow_decision_forests.tensorflow.ops.training import op as training_op

/usr/local/lib/python3.7/dist-packages/tensorflow_decision_forests/tensorflow/core.py in <module>()
     29 import tensorflow as tf
     30 
---> 31 from tensorflow_decision_forests.tensorflow.ops.training import api as training_op
     32 from yggdrasil_decision_forests.dataset import data_spec_pb2
     33 from yggdrasil_decision_forests.learner import abstract_learner_pb2

/usr/local/lib/python3.7/dist-packages/tensorflow_decision_forests/tensorflow/ops/training/api.py in <module>()
     22 from tensorflow.python.framework import load_library
     23 from tensorflow.python.platform import resource_loader
---> 24 tf.load_op_library(resource_loader.get_path_to_datafile("training.so"))

/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename)
     55   Raises:
     56     RuntimeError: when unable to load the library or get the python wrappers.
---> 57   """
     58   lib_handle = py_tf.TF_LoadLibrary(library_filename)
     59   try:

NotFoundError: /usr/local/lib/python3.7/dist-packages/tensorflow_decision_forests/tensorflow/ops/training/training.so: undefined symbol: _ZN10tensorflow14kernel_factory17OpKernelRegistrar12InitInternalEPKNS_9KernelDefEN4absl14lts_2020_09_2311string_viewESt10unique_ptrINS0_15OpKernelFactoryESt14default_deleteIS9_EE

Windows support

Problem:
Installing via pip install tensorflow-decision-forests returns a warning message:

Warning message

```
WARNING: The candidate selected for download or install is a yanked version: 'tensorflow-decision-forests' candidate (version 0.0.0 at https://files.pythonhosted.org/packages/9b/c2/5b5d8796ea5cb8c19457cd9b563c71b536ffc3d14936049fa9cf49b31dcf/tensorflow_decision_forests-0.0.0-py3-none-any.whl#sha256=18ed810f415437ef8e8a4848d3cbf2cff51e3ea2b66224b75f9d9c0f689629a7 (from https://pypi.org/simple/tensorflow-decision-forests/))
Reason for being yanked:
```

After installation the library can't be used because of:

Warning message

```
Traceback (most recent call last):
  File "c:/Users/kluse/Documents/python/how-to-active-learning/main.py", line 5, in
    import model as mdl
  File "c:\Users\kluse\Documents\python\how-to-active-learning\model.py", line 12, in
    import tensorflow_decision_forests as tfdf
ModuleNotFoundError: No module named 'tensorflow_decision_forests'
```

Specifying the version via ==0.1.3 doesn't help, and neither does reinstalling.

Training Duration of Penguin Example

Hello everyone,
I have a short question regarding the training time needed by this specific model.

To dig into the material, I used the example from the TensorFlow website with the Penguin data and started the training on my Linux laptop with an NVIDIA GeForce GTX 1050 Ti with GPU support enabled.

Now I am wondering why the model takes more than an hour to train on only 300 rows of data with 5 features or so...

Does anyone have a benchmark value?
I would really appreciate your help guys.
Best regards
Julian

predict_log_proba support

Could you support something like model.predict_log_proba?
The last layer of TFDF should have something like sigmoid(A) or exp(A) to make [0, 1] ranged output. This is good.
However, I also do need A output without sigmoid or exp. This will allow me to have much wider ranged output than sigmoid(A) or exp(A).
Hope that you consider this positively because this is very important to my tasks ;-)
Really look forward to seeing this feature on next release

Thanks a lot!
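
For a binary model, what a predict_log_proba-style output would contain can be sketched directly from the raw score A. This is plain math with the standard library, not a TF-DF API. The function name is hypothetical and the sketch assumes the probability is sigmoid(A).

```python
import math
from typing import Tuple


def log_proba_from_logit(a: float) -> Tuple[float, float]:
    """Return (log P(class 0), log P(class 1)) for a binary raw score `a`."""

    def softplus(x: float) -> float:
        # log(1 + exp(x)), written so that exp never overflows.
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    # log(1 - sigmoid(a)) = -softplus(a) and log(sigmoid(a)) = -softplus(-a).
    return -softplus(a), -softplus(-a)


for raw_score in (-800.0, 0.0, 3.0):
    print(raw_score, log_proba_from_logit(raw_score))
```

Working from A keeps the result finite for scores where computing sigmoid(A) first and then taking its logarithm would underflow to log(0).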

Clarification on Consuming Text as Categorical Sets

Hello,

The intermediate_colab ("Combine With Other Models") tutorial does a good job at showing how to preprocess a string to a categorical set. This is the example function provided:

def prepare_dataset(example):
  label = (example["label"] + 1) // 2
  return {"sentence" : tf.strings.split(example["sentence"])}, label

train_ds = all_ds["train"].batch(64).map(prepare_dataset)
test_ds = all_ds["validation"].batch(64).map(prepare_dataset)

From my understanding, tf.strings.split isn't the best way of doing this because it won't drop duplicates. For example, a text feature "The TV is the best" would be represented by {"The", "TV", "is", "the", "best"} when using tf.strings.split. According to this article, it should instead be transformed to the following categorical set: {"best", "is", "the", "TV"}.

Is dropping duplicates necessary?
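
To show what dropping duplicates would mean for the sentence in the question, here is a plain-Python sketch (no TensorFlow). It does not settle whether the step is necessary. The function name is illustrative, and lowercasing is an assumption: "The" and "the" are different strings, so without it a plain de-duplication changes nothing for this sentence.

```python
def to_categorical_set(sentence: str, lowercase: bool = True):
    """Split on whitespace and drop repeated tokens, keeping first-seen order."""
    tokens = sentence.split()
    if lowercase:
        tokens = [token.lower() for token in tokens]
    # dict.fromkeys keeps one entry per token and preserves insertion order.
    return list(dict.fromkeys(tokens))


print(to_categorical_set("The TV is the best"))
# ['the', 'tv', 'is', 'best']
print(to_categorical_set("The TV is the best", lowercase=False))
# ['The', 'TV', 'is', 'the', 'best']
```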

Unable to convert RandomForestModel into tflite form

I was able to fit my RandomForest model. However, when I try to convert it into tflite format, it throws an error.
The error is: InvalidArgumentError: Cannot convert a Tensor of dtype resource to a NumPy array.

Packaging multiple models into one?

Hi, thanks for this package!

I'm looking into converting an existing neural network model to use the decision forests approach, for comparison and/or use in an ensemble.

The existing neural network I've been developing has multiple outputs (mixture of regression and classification), and some of these outputs feed into other outputs (the other outputs also have access to the rest of the training data).

It might be easier to explain via simple example (in the likely case that I'm not putting it into words well!). Let's say I have input features In and 3 target labels A, B, and C. My neural network works something like this:

A = Model(In)
B = Model(In + A)
C = Model(In + A + B)

This gives me a unified model for A, B, and C, which can be trained, saved, and loaded as one entity.

I can see how I might achieve something like this with decision-forests by using the preprocessing argument, passing in the training data and the previous model, and returning the training data with an added column for the prediction of the previous model. The final model would give a single output, but I could write something to load each model and make a list of all the predictions. In a similar vein, I could write something to load and augment the training data before the training of each model, as an alternative to using preprocessing.

Is there a way to obtain multi-output models in a way that is nicer than the (potentially silly) approach above?

Thanks for any help, sorry if this question doesn't make much sense, I can clarify if needed!
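
The chaining described in this question can be sketched independently of any model library. ChainedModels below is a hypothetical wrapper, and the lambdas stand in for trained models: each stage sees the input features plus the predictions of all earlier stages. It assumes no input feature shares a name with a stage.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]


class ChainedModels:
    """Run stages in order, feeding each stage's prediction to the later ones."""

    def __init__(self, stages: List[Tuple[str, Callable[[Features], float]]]):
        self.stages = stages

    def predict(self, features: Features) -> Dict[str, float]:
        augmented = dict(features)  # copy, so the caller's dict is not modified
        outputs: Dict[str, float] = {}
        for name, predict_fn in self.stages:
            outputs[name] = predict_fn(augmented)
            # Expose this prediction to later stages as an extra feature.
            augmented[name] = outputs[name]
        return outputs


# Toy stand-ins for A = Model(In), B = Model(In + A), C = Model(In + A + B).
chain = ChainedModels([
    ("A", lambda f: f["x"] * 2.0),
    ("B", lambda f: f["x"] + f["A"]),
    ("C", lambda f: f["A"] + f["B"]),
])
print(chain.predict({"x": 1.0}))
# {'A': 2.0, 'B': 3.0, 'C': 5.0}
```

Saving and loading would still have to be done per stage. The wrapper only gives a single predict entry point.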

Efficient training of models with large number of input features (>10k)

Background

Currently, the training graph contains one tf op for each input feature. With a large number of features (or with multi-dimensional features), this can lead to a large overhead (large memory consumption, long training initialization stage).

Feature request

Support for multi-dimensional features without creating an op for each dimension.

Tflite format for on-device inference

Good day.

Just to check:
Would I be able to save a tfdf model (in, for example, a tflite format) and then load the model to perform on-device inference in smartphones?

Thank you.
Cheers.

Make it easy to run cross-validations on small datasets

Decision Forests work well on small datasets, where cross-validation is commonly used. It would be valuable to easily run cross-validations and report cross-validation metrics (evaluation metrics, confidence intervals, statistical tests, etc.).
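
Until such a feature exists, the fold bookkeeping can be done by hand. The sketch below only produces train and test index lists for k-fold cross-validation with the standard library. The function name is illustrative, and training and evaluating a model on each split is left to the caller.

```python
import random


def k_fold_indices(num_examples: int, num_folds: int, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= num_folds <= num_examples:
        raise ValueError("num_folds must be between 2 and num_examples")
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # seeded so the folds are reproducible
    # Deal the shuffled indices round-robin into num_folds folds.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for held_out in range(num_folds):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test


for train_idx, test_idx in k_fold_indices(num_examples=10, num_folds=5):
    print(len(train_idx), sorted(test_idx))
```

With a Pandas dataframe, each split could be materialized with df.iloc[train_idx] and df.iloc[test_idx] before converting to a TensorFlow dataset.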

Unable to import tensorflow_decision_forests

After running !pip install tensorflow_decision_forests --upgrade and trying to import tensorflow_decision_forests as tfdf,
I got this error: NotFoundError: /opt/conda/lib/python3.7/site-packages/tensorflow_decision_forests/tensorflow/ops/training/training.so: undefined symbol: _ZN10tensorflow11GetNodeAttrERKNS_9AttrSliceEN4absl14lts_2020_09_2311string_viewEPSs

I've tried uninstalling tensorflow and reinstalling tensorflow==2.3.0, but it does not work.
Please let me know if you have any comments.
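
An "undefined symbol" error when loading training.so usually means the compiled TF-DF ops and the installed TensorFlow were built against different versions. A small, hedged diagnostic is to print both installed versions without importing either package, so the check works even when the import itself fails. The helper name is illustrative, and on older Python versions the distribution name may need dashes instead of underscores.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for package in ("tensorflow", "tensorflow_decision_forests"):
    print(package, installed_version(package) or "not installed")
```

The two versions can then be compared against the compatibility information on the installation page.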

pip install does not work on Mac

Hey there,

First of all, congratulations for your effort, this is a great initiative!

I am raising this issue because I have faced a problem with installation. I have created a Python 3.8.6 virtual environment on my Mac and installed tensorflow 2.5.0 successfully. When I ran the installation command for the "Tensorflow Decision Forests" package,
pip3 install tensorflow_decision_forests --upgrade

I got:

ERROR: Could not find a version that satisfies the requirement tensorflow_decision_forests (from versions: none) ERROR: No matching distribution found for tensorflow_decision_forests

It's a bit confusing because the installation command on PyPI (I guess this is the right one) contains dashes, instead of underscores, in the package name.

Any ideas?

Thanks a lot

builder.close() failing when using GradientBoostedTreeBuilder

I've been following the example posted here to obtain predictions from individual trees within a GradientBoostedTreesModel i.e.

# Train model
model = tfdf.keras.GradientBoostedTreesModel()
model.compile(metrics=["accuracy"])
model.fit(train_ds)

# Extract trees (keep the inspector, it is needed again for the objective)
inspector = model.make_inspector()
trees = inspector.extract_all_trees()

# Build model with one tree
builder = tfdf.builder.GradientBoostedTreeBuilder(
    path="model",
    objective=inspector.objective()
)
builder.add_tree(trees[0])
builder.close()

However, it fails when calling builder.close() with the following error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-21-f4a8f4f498e3> in <module>
      7 # Add first tree
      8 builder_bt.add_tree(trees_bt[0])
----> 9 builder_bt.close()

/usr/local/lib/python3.6/site-packages/tensorflow_decision_forests/component/builder/builder.py in close(self)
    737 
    738     # Should be called last.
--> 739     super(GradientBoostedTreeBuilder, self).close()
    740 
    741   def specialized_header(self) -> Any:

/usr/local/lib/python3.6/site-packages/tensorflow_decision_forests/component/builder/builder.py in close(self)
    500 
    501     for tree in self._trees:
--> 502       self._write_branch(tree.root)
    503     self._trees = []
    504 

/usr/local/lib/python3.6/site-packages/tensorflow_decision_forests/component/builder/builder.py in _write_branch(self, node)
    586 
    587     # Converts the node into a proto node.
--> 588     core_node = py_tree.node.node_to_core_node(node, self.dataspec)
    589 
    590     # Write the node to disk.

/usr/local/lib/python3.6/site-packages/tensorflow_decision_forests/component/py_tree/node.py in node_to_core_node(node, dataspec)
    153     condition_lib.set_core_node(node.condition, dataspec, core_node)
    154     if node.value is not None:
--> 155       value_lib.set_core_node(node.value, core_node)
    156 
    157   elif isinstance(node, LeafNode):

/usr/local/lib/python3.6/site-packages/tensorflow_decision_forests/component/py_tree/value.py in set_core_node(value, core_node)
    154     core_node.regressor.top_value = value.value
    155     if value.standard_deviation is not None:
--> 156       dist = core_node.regressor.dist
    157       dist.count = value.num_examples
    158       dist.sum = 0

AttributeError: dist

I've tested a possible fix for this by changing this line (line 156 above) to dist = core_node.regressor.distribution as used elsewhere in the codebase (see here) and it seems to work, but I'd appreciate the eyes of someone that is more familiar with the code than I am.

It's possible that this hasn't been caught previously as none of the tests here seem to include the standard deviation in the RegressionValue.

Getting error at end of training: AbstractFeatureResourceE does not exist. [Op:SimpleMLModelTrainer]

I am getting the following error when I try a simple model.

csv_feature_columns = ['weekday_weekend'] + weather_columns + building_columns + schedules_columns + encoded_time_columns + ["total_site_electricity_kwh"]

train_df = pd.read_csv(timeseries_file_path, usecols=csv_feature_columns, nrows=10000)

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="total_site_electricity_kwh")

model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

157/157 [==============================] - 6s 18ms/step
---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
<ipython-input-6-ce1e05e4d2c8> in <module>
      1 # Train a Random Forest model.
      2 model = tfdf.keras.RandomForestModel()
----> 3 model.fit(train_ds)
      4 

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow_decision_forests/keras/core.py in fit(self, x, y, callbacks, **kwargs)
    743 
    744     history = super(CoreModel, self).fit(
--> 745         x=x, y=y, epochs=1, callbacks=callbacks, **kwargs)
    746 
    747     self._build(x)

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1227           epoch_logs.update(val_logs)
   1228 
-> 1229         callbacks.on_epoch_end(epoch, epoch_logs)
   1230         training_logs = epoch_logs
   1231         if self.stop_training:

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow/python/keras/callbacks.py in on_epoch_end(self, epoch, logs)
    433     logs = self._process_logs(logs)
    434     for callback in self.callbacks:
--> 435       callback.on_epoch_end(epoch, logs)
    436 
    437   def on_train_batch_begin(self, batch, logs=None):

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow_decision_forests/keras/core.py in on_epoch_end(***failed resolving arguments***)
    930     del logs
    931     if epoch == 0:
--> 932       self._model._train_model()  # pylint:disable=protected-access
    933 
    934 

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow_decision_forests/keras/core.py in _train_model(self)
    864         guide=guide,
    865         training_config=self._advanced_arguments.yggdrasil_training_config,
--> 866         deployment_config=self._advanced_arguments.yggdrasil_deployment_config,
    867     )
    868 

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow_decision_forests/tensorflow/core.py in train(input_ids, label_id, model_id, learner, task, generic_hparms, ranking_group, training_config, deployment_config, guide, model_dir, keep_model_in_resource)
    503       training_config=training_config.SerializeToString(),
    504       deployment_config=deployment_config.SerializeToString(),
--> 505       guide=guide.SerializeToString())
    506 
    507 

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow/python/util/tf_export.py in wrapper(*args, **kwargs)
    402           'Please pass these args as kwargs instead.'
    403           .format(f=f.__name__, kwargs=f_argspec.args))
--> 404     return f(**kwargs)
    405 
    406   return tf_decorator.make_decorator(f, wrapper, decorator_argspec=f_argspec)

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow_decision_forests/tensorflow/ops/training/op.py in simple_ml_model_trainer(feature_ids, label_id, weight_id, model_id, model_dir, learner, hparams, task, training_config, deployment_config, guide, name)
    510       return _result
    511     except _core._NotOkStatusException as e:
--> 512       _ops.raise_from_not_ok_status(e, name)
    513     except _core._FallbackException:
    514       pass

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
   6895   message = e.message + (" name: " + name if name is not None else "")
   6896   # pylint: disable=protected-access
-> 6897   six.raise_from(core._status_to_exception(e.code, message), None)
   6898   # pylint: enable=protected-access
   6899 

~/.conda/envs/tensorflow25/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

NotFoundError: Resource decision_forests/ 12-in/N27tensorflow_decision_forests3ops23AbstractFeatureResourceE does not exist. [Op:SimpleMLModelTrainer]

Issue with Max_depth in tfdf

df_and_nn_model = tfdf.keras.GradientBoostedTreesModel(
    preprocessing=regmodel_wo_head,
    task=tfdf.keras.Task.REGRESSION,
    num_trees=500,
    max_depth=2,
    max_num_nodes=-1,
    min_examples=5,
    validation_ratio=0.2,
    subsample=0.9,
    early_stopping='MIN_LOSS_FINAL',
    shrinkage=0.001)

and after


df_and_nn_model.compile(metrics=[tf.keras.metrics.RootMeanSquaredError()])
with sys_pipes():
    df_and_nn_model.fit(train_dataset, validation_data=val_dataset)

[INFO kernel.cc:772] Configure learner
[FATAL hyper_parameters.cc:49] Already consumed hyper-parameter "max_depth".

This was working yesterday morning, but I made updates on Kaggle and now it throws this exception. I have no idea what it means.

Please support GPU

Background
My TensorFlow code runs on GPU. It contains matrix operations that are fast on GPU. When it is combined with tfdf, the data must be copied off the GPU and back onto it every time classification is performed. This is a large loss of throughput.

Feature Request
Please support GPU, especially for inference (for example, the predict function). Training can take time because a user may try various configurations to find the best one; this is understandable. However, applying the trained model must meet the runtime requirements.

Feature importance

Hey,

I ran the feature importance and compared the results to the sklearn output. Not only are the results different, but the results I get with this implementation also don't make sense given what I know about my data.
For example, a feature that is constant is one of the most important features (it got the highest value).

Maybe I don't know how to read the output properly?
("data:0.33" (1; #27), 235)
Does this mean that feature number 27 got a score of 235?

Tsachi

2nd file write error

First of all, thanks a lot. I love TF-DF!

Issue symptom
I found the issue when I ran the same code twice.
If I overwrite an existing model, an error occurs when I load the model from disk.

How to reproduce the issue

Generate a model, save it to a file, and then load it to perform inference. This works.

model = tfdf.keras.RandomForestModel(num_trees=n_trees, max_depth=depth, min_examples=1)
model.fit(x=X, y=y)
model.save('./slf_random_forest')
loaded_model = tf.saved_model.load('./random_forest')
Score = loaded_model(X, training=False)

Run the same code again and the following error occurs. If I remove './random_forest' first, then everything works.

File "/dl_data/users/howardlee/opwi_algo/Post/Selfi/Algo/Classifiers.py", line 76, in decision_function
  Score = self.clf(X, training=False)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/saved_model/load.py", line 670, in _call_attribute
  return instance.__call__(*args, **kwargs)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
  result = self._call(*args, **kwds)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 956, in _call
  return self._concrete_stateful_fn._call_flat(
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1960, in _call_flat
  return self._build_call_outputs(self._inference_function.call(
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 591, in call
  outputs = execute.execute(
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
  tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Unexpected dimension of numerical_features bank.
[[{{node StatefulPartitionedCall/StatefulPartitionedCall/inference_op}}]]
(1) Invalid argument: Unexpected dimension of numerical_features bank.
[[{{node StatefulPartitionedCall/StatefulPartitionedCall/inference_op}}]]
[[StatefulPartitionedCall/StatefulPartitionedCall/inference_op/_4]]
0 successful operations.
0 derived errors ignored. [Op:__inference_restored_function_body_5262]

Function call stack:
restored_function_body -> restored_function_body
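The workaround mentioned above (removing the existing directory before saving again) can be sketched with the standard library. clear_model_dir is a hypothetical helper name, not part of TF-DF, and this only avoids the symptom; it does not explain why overwriting an existing model fails.

```python
import os
import shutil


def clear_model_dir(path):
    """Delete a previously exported model directory, if any, and return the path."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    return path
```

It could be used as model.save(clear_model_dir('./random_forest')), so that every export starts from an empty location.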

GPU support

I just wanted to know if there is a plan for GPU support.
My code is written in TensorFlow, so running on GPU is very important for throughput.

Crashed on Colab due to high memory usage

TensorFlow Decision Forests appears to be memory hungry. I compared it with PyCaret on Colab. TensorFlow Decision Forests crashed with the message “Your session crashed after using all available RAM.”, while PyCaret completed the work. Is there any feasible way to solve this problem?

Intermediate colab is erroring out. See attached.

How to check the F1 score for multi-class classification task?

I have successfully run this Decision Forest algorithm. However, my data is severely imbalanced between categories, so accuracy is not a fair measure of model performance. Are F1, precision, and recall available as metrics?
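One way to get these metrics without relying on a built-in option is to compute them from the predicted labels. The sketch below is plain Python and independent of TF-DF. It assumes the model output has already been turned into one predicted label per example (for instance by taking the arg-max of the predicted class probabilities); per_class_f1 and macro_f1 are illustrative names.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from two label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an example of the true class
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (each class counts equally)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)
```

Macro averaging is used here because it gives rare classes the same weight as frequent ones, which is usually what matters with imbalanced data.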

Error when doing model.evaluate

I tried to evaluate the model using:

evaluation = model.evaluate(test_ds, return_dict=True)

But I am getting the following error:
ValueError: SyncOnReadVariable does not support assign_add in cross-replica context when aggregation is set to tf.VariableAggregation.SUM.

The training was completed successfully.

Using a re-loaded model for prediction or evaluation is giving error

I am able to save and then re-load a model. But when I use the re-loaded model for prediction or evaluation, I get the following error:

model.save("hypermodels/model")
model = tf.keras.models.load_model("hypermodels/model/")
energy_predictions = model.predict(train_ds,verbose=1)

InvalidArgumentError:  Unexpected dimension of numerical_features bank.
	 [[{{node gradient_boosted_trees_model_1/StatefulPartitionedCall/StatefulPartitionedCall/inference_op}}]] [Op:__inference_predict_function_28619]

Function call stack:
predict_function

Support tf.distribute strategies in TF-DF

Which tf.distribute strategy would be most suitable to use with tfdf if we were to use it across multiple nodes of an HPC cluster?

How to get probabilities for all classes?

I want to predict the probabilities of all classes in a multiclass classification problem. How do I do it?

Retrieving standard deviation of predictions from ensemble

I'm working on an application where I'd like to retrieve the standard deviation of the predictions made by the trees within an ensemble (currently a tfdf.keras.RandomForestModel) to use as an estimate of the confidence of a given prediction.

It looks like I could do this by running a prediction on each individual tree with inspector.iterate_on_nodes() but is there a better way to do this via the main predict method, and if not would you consider this as an enhancement?
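To make the request concrete, the aggregation step itself is small once per-tree predictions are available. The sketch below assumes those predictions have already been collected by some means (for example by evaluating each extracted tree separately); it does not use any TF-DF call, and ensemble_mean_and_std is a hypothetical name.

```python
import statistics


def ensemble_mean_and_std(per_tree_predictions):
    """Summarise per-tree predictions for each example.

    per_tree_predictions[t][i] is the prediction of tree t for example i.
    Returns one (mean, population standard deviation) pair per example.
    """
    results = []
    # zip(*...) regroups the values so that each tuple holds all trees'
    # predictions for a single example.
    for example_preds in zip(*per_tree_predictions):
        results.append(
            (statistics.mean(example_preds), statistics.pstdev(example_preds))
        )
    return results
```

The population standard deviation is used because the trees are the whole ensemble rather than a sample from it; statistics.stdev could be substituted if the sample estimate is preferred.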

Disabling early stopping does not work

https://github.com/google/yggdrasil-decision-forests/blob/main/documentation/user_manual.md#disabling-the-validation-dataset-for-gbt

I tried to disable early stopping and the validation data, but it does not seem to work.

Model generation without early-stopping & validation data

model = tfdf.keras.GradientBoostedTreesModel(
    num_trees=n_trees,
    growing_strategy="BEST_FIRST_GLOBAL",
    max_depth=depth,
    min_examples=1,
    shrinkage=learning_rate,
    categorical_algorithm="RANDOM",
    use_hessian_gain=True,
    validation_ratio=0.0,
    early_stopping=None,
    temp_directory=tmp_dir_name
)
model.fit(x=x_selected, y=y)

error message

File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow_decision_forests/keras/core.py", line 780, in fit
history = super(CoreModel, self).fit(
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1229, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/keras/callbacks.py", line 435, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow_decision_forests/keras/core.py", line 994, in on_epoch_end
self._model._train_model() # pylint:disable=protected-access
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow_decision_forests/keras/core.py", line 915, in _train_model
tf_core.train(
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow_decision_forests/tensorflow/core.py", line 494, in train
return training_op.SimpleMLModelTrainer(
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/util/tf_export.py", line 404, in wrapper
return f(**kwargs)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow_decision_forests/tensorflow/ops/training/op.py", line 512, in simple_ml_model_trainer
_ops.raise_from_not_ok_status(e, name)
File "/opt/conda/envs/tf2.5.0/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError: TensorFlow: INVALID_ARGUMENT: Early stopping requires a validation set. Either set "validation_set_ratio" to be greater than 0, or disable early stopping. [Op:SimpleMLModelTrainer]
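The error message suggests that early stopping was still enabled even though early_stopping=None was passed. One possible reading, stated here as an assumption rather than a confirmed diagnosis, is that the Python value None leaves the library default in place, and that early stopping is turned off with the string "NONE" instead. A minimal sketch of the two settings under that assumption (gbt_kwargs and build_model are illustrative names):

```python
# Hyper-parameters intended to train a GBT model without a validation split.
# Assumption: early stopping is disabled with the string "NONE"; the Python
# value None is assumed to fall back to the default (early stopping enabled).
gbt_kwargs = {
    "validation_ratio": 0.0,
    "early_stopping": "NONE",
}


def build_model(**extra_kwargs):
    """Create the model lazily, so the import only happens when it is called."""
    import tensorflow_decision_forests as tfdf

    return tfdf.keras.GradientBoostedTreesModel(**gbt_kwargs, **extra_kwargs)
```

The other arguments from the report (num_trees, max_depth, shrinkage, and so on) would be passed through extra_kwargs unchanged.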

Does it support multi-label classification?

Can tfdf work with a streaming tf dataset?

My training data is in a multi-GB CSV file. I have built a data pipeline using tf.data to stream this data and do some pre-processing. Can I use these dataset objects in tfdf model.fit (similar to how it is done in Keras), or does tfdf need the dataset to have all the data stored in memory?

	// Initialize a dataset (including the dataset's dataspec) from the linked
	// resource aggregators.
	tensorflow::Status InitializeDatasetFromFeatures(
	    tensorflow::OpKernelContext* ctx,
	    const ::yggdrasil_decision_forests::dataset::proto::
	        DataSpecificationGuide& guide,
	    ::yggdrasil_decision_forests::dataset::VerticalDataset* dataset);

	// Moves the feature values contained in the aggregators into the dataset.
	// Following this call, the feature aggregators are empty.
	tensorflow::Status MoveExamplesFromFeaturesToDataset(
	    tensorflow::OpKernelContext* ctx,
	    ::yggdrasil_decision_forests::dataset::VerticalDataset* dataset);