Comments (10)

janpfeifer commented on June 7, 2024

The problem seems to be with the shape of the output of call_get_leaves. I wonder if there is a way for us to fix that in TFDF, but in the meantime you could force the shape like this:

...
        #TODO: want to change this to leaves predictions
        tfdf_output = self.tfdf_model.call_get_leaves(inputs)
        tfdf_output = tf.cast(tfdf_output, tf.float32)
        tfdf_output = tf.reshape(tfdf_output, [tf.shape(inputs)[0], 5])
        tfdf_output = tf.stop_gradient(tfdf_output)
...

But you have to know in advance the number of trees (5) and hardcode that into the model.

from decision-forests.

janpfeifer commented on June 7, 2024

Btw, I simply converted the leaf numbers to float values, but if I were to combine the models, I'd definitely either embed the leaf numbers (different embedding per tree) or just add an extra NN (Dense) layer on top (which is equivalent).
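
The per-tree embedding idea mentioned above could be sketched roughly as follows. This is only an illustration, not code from the thread: the layer name PerTreeLeafEmbedding and the sizes (5 trees, at most 64 leaves per tree, embedding dimension 4) are assumptions, and the leaf indices are hard-coded stand-ins for the output of call_get_leaves.

```python
import tensorflow as tf


class PerTreeLeafEmbedding(tf.keras.layers.Layer):
    """Embeds the leaf index of each tree with its own embedding table."""

    def __init__(self, num_trees, max_leaves, embedding_dim, **kwargs):
        super().__init__(**kwargs)
        # One table per tree: leaf 3 of tree 0 and leaf 3 of tree 1 are unrelated,
        # so they should not share an embedding vector.
        self.tables = [
            tf.keras.layers.Embedding(max_leaves, embedding_dim)
            for _ in range(num_trees)
        ]

    def call(self, leaves):
        # leaves: integer tensor of shape [batch_size, num_trees].
        embedded = [table(leaves[:, i]) for i, table in enumerate(self.tables)]
        # Concatenate to [batch_size, num_trees * embedding_dim].
        return tf.concat(embedded, axis=-1)


# Hypothetical sizes and leaf indices, for illustration only.
layer = PerTreeLeafEmbedding(num_trees=5, max_leaves=64, embedding_dim=4)
leaves = tf.constant([[0, 3, 7, 1, 2], [5, 5, 0, 63, 9]], dtype=tf.int32)
embedded_leaves = layer(leaves)
```

The embedded tensor can then be concatenated with the other NN features in place of the raw float-cast leaf numbers. max_leaves has to be at least the largest leaf index plus one for every tree.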

advahadr commented on June 7, 2024

Updated the Colab link with a shareable notebook.

rstz commented on June 7, 2024

Hi, thank you for updating the link to the Colab, I will have a look!

advahadr commented on June 7, 2024

Hi, thank you for your solution!
Implementing it in the Colab worked fine; however, when I tried to implement it in our environment I got an error.

Environment details: (followed the compatibility table here)
Working on SageMaker Pipelines, the configuration I used is:
tensorflow_decision_forests==1.5.0
tensorflow==2.13.0
(Can't upgrade to a higher TensorFlow version due to a SageMaker limitation)

Got this error on the call_get_leaves (regular prediction worked just fine):

File "/usr/local/lib/python3.10/site-packages/tensorflow_decision_forests/keras/core_inference.py", line 767, in call_get_leaves  *
    assert len(self._models_get_leaves) == 1
TypeError: object of type 'NoneType' has no len()

This was also reproduced in the Colab notebook using the versions mentioned above.
I would appreciate your help with these limitations. Thanks!

janpfeifer commented on June 7, 2024

Hi @advahadr, I'm sorry you are having these difficulties -- call_get_leaves is not often used, hence not as well tested.

I have 2 hypotheses:

  1. AFAIK, saved models may not work with it -- maybe @rstz could confirm? You are saying that you are having issues in the inference code, is that correct?

  2. I took a peek at your Colab, and it is missing the tf.reshape line of the fix. Here is my copy of your Colab. If that fixes it, then problem solved.

One short-term alternative, which would also work for (1) above: generate the leaf values first, as a separate step, and then concatenate them to the inputs of the Keras model. This is not convenient :(, but it will work if your environment allows this intermediate step. You could even materialize the leaf values (save them to disk alongside the inputs) after training the TFDF model.
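
A minimal sketch of that two-step alternative, under stated assumptions: the leaf indices are assumed to have been computed beforehand (for example with the predict_get_leaves API mentioned in this thread) and are replaced here by a hard-coded stand-in array. The helper name append_leaves is hypothetical.

```python
import numpy as np


def append_leaves(features, leaves):
    """Concatenates per-tree leaf indices to dense features, row by row."""
    features = np.asarray(features, dtype=np.float32)
    leaves = np.asarray(leaves, dtype=np.float32)
    if features.shape[0] != leaves.shape[0]:
        raise ValueError("features and leaves must have the same number of examples")
    # Result shape: [num_examples, num_features + num_trees].
    return np.concatenate([features, leaves], axis=1)


# Stand-in data: 3 examples, 2 features, 5 trees.
features = np.array([[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]])
leaves = np.array([[0, 3, 7, 1, 2], [5, 5, 0, 6, 9], [1, 1, 2, 0, 4]])
nn_inputs = append_leaves(features, leaves)
```

The resulting array can be saved next to the original inputs and fed to the Keras model, which then no longer needs to call the TFDF model itself.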

We'll look into this (most likely tomorrow, there is a conference going on today), and get back to you.

If you could provide more details on how you are using it in your environment (SageMaker), it would be very helpful! Is your pipeline something that reads the model and then runs inference on it, in Python? Or is it using the TensorFlow C++ API? etc.

advahadr commented on June 7, 2024

Hi @janpfeifer,
Thanks for your response!

Regarding 2:
I'll start with hypothesis 2 because we can rule it out: I was also working on a copy of the Colab, since I didn't want to edit the version I shared here. I added the reshape there and it worked fine.

Regarding 1:
We use it when training the ensemble (similarly to the Colab example) with the pre-trained TFDF model as a layer. So in a sense it is technically inference for the TFDF model, but overall it is a training step of the ensemble (that's the reason I can't use the predict_get_leaves API).
This is the flow of our training process:

  1. train the TFDF model and save it under an S3 path
  2. load the TFDF model from S3 and pass it as an argument to the NN ensemble initialization
  3. train the ensemble
  4. save the ensemble model

Later in the funnel we also run predictions with this ensemble model.
It's important to note that this flow worked fine without the call_get_leaves call, including while serving real traffic.

Regarding saved model problem:
I assumed that it might be related to saving and loading the TFDF model, so I already checked it with a newly created instance and unfortunately faced the same error (I checked both in a Colab notebook and in our environment).

Regarding the alternative:
Unfortunately, persisting the outcome of the TFDF predictions is not an option for serving (we are limited in latency since we run in a real-time environment).
In that case I want to compute the leaves on the fly, and then I need to do it inside a @tf.function again, which limits me to call_get_leaves (since higher-level functions like predict_get_leaves are not available in that context).

Regarding SM environment:
We run it as Python code (importing tensorflow_decision_forests) and use the classes as shown in the notebook.

I hope this is a bit clearer; if not, I can elaborate more.

Regards,
Adva

achoum commented on June 7, 2024

Hi Adva,

Sorry to hear about your troubles. Let me also try to help :).

Regarding the shape: call_get_leaves returns an array of shape [num_examples, num_trees]. As you noticed, the shape is not inferred when the graph is created; it only becomes known to TensorFlow when the graph runs. Since we as users know the shape, we can simply set it with "set_shape". Note that "set_shape" is purely a bookkeeping operation: it does not involve any computation. This is different from tf.reshape.

def __init__(self, tfdf_model):
  ...
  self.tfdf_model = tfdf_model
  # Read the number of trees from the model instead of hardcoding it.
  self.num_trees = tfdf_model.make_inspector().num_trees()

@tf.function
def call(self, inputs):
  ...
  tfdf_output_leaves = self.tfdf_model.call_get_leaves(inputs)
  tfdf_output_leaves_casted = tf.cast(tfdf_output_leaves, tf.float32)
  # Only annotates the static shape; no computation is added to the graph.
  tfdf_output_leaves_casted.set_shape((None, self.num_trees))
  concatenated = self.concat_nn_tfdf([x, tfdf_output_leaves_casted])

stop_gradient is not necessary. The TF-DF inference operations do not propagate gradients by default.
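
To see the difference between set_shape and tf.reshape in isolation, here is a small self-contained sketch that does not use TF-DF at all. The function annotate_leaves and the constant NUM_TREES are hypothetical; the input signature with a fully unknown shape merely stands in for the output of call_get_leaves, whose static shape is not known at graph creation.

```python
import tensorflow as tf

NUM_TREES = 3  # Hypothetical value; in the thread it is read from the model.

# Filled in while the function is traced, to record static (graph-time) shapes.
static_shapes = {}


@tf.function(input_signature=[tf.TensorSpec(shape=[None, None], dtype=tf.int32)])
def annotate_leaves(leaves):
    # Stand-in for the call_get_leaves output: nothing is known statically.
    static_shapes["before"] = leaves.shape.as_list()

    casted = tf.cast(leaves, tf.float32)
    # Bookkeeping only: refines the static shape, adds no op to the graph.
    casted.set_shape((None, NUM_TREES))
    static_shapes["after_set_shape"] = casted.shape.as_list()

    # tf.reshape reaches a similar static shape but adds a real op to the graph.
    reshaped = tf.reshape(
        tf.cast(leaves, tf.float32), [tf.shape(leaves)[0], NUM_TREES])
    static_shapes["after_reshape"] = reshaped.shape.as_list()
    return casted


result = annotate_leaves(tf.constant([[1, 4, 2], [0, 3, 5]], dtype=tf.int32))
```

Because set_shape only refines what TensorFlow already believes about the tensor, an annotation that contradicts the data at run time is not corrected by it, which is worth keeping in mind when a dimension such as the batch size is set to a fixed number.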

About saving your model: saving a model (e.g. model.save) does not save the predict_get_leaves function; however, it does save the "call_get_leaves" function that you are using. For call_get_leaves to be saved, make sure to call either call_get_leaves or predict_get_leaves once before saving the model.

I copied and updated your notebook with those changes. You can find it here: https://colab.research.google.com/drive/1TIPdzDN0UDLAXtcVICmsdh9YEDhW12LO?usp=sharing

Cheers,

advahadr commented on June 7, 2024

Hi @rstz thank you for your great help! I'm getting there but still have some issues:

I tried to use the code you provided in our repo on SageMaker:

        tfdf_output_leaves = self.tfdf_model.call_get_leaves(inputs_for_tfdf)
        print(f'\ntfdf_output_leaves: {tfdf_output_leaves}')

        tfdf_output_leaves_casted = tf.cast(tfdf_output_leaves, tf.float32)
        tfdf_output_leaves_casted.set_shape((None, 3))
        print(f'\ntfdf_output_leaves_casted: {tfdf_output_leaves_casted}')

        concatenated = self.concat_nn_tfdf([x, tfdf_output_leaves_casted])

The print logs show:

2023-11-20T18:23:15.784+02:00 | tfdf_output_leaves: Tensor("StatefulPartitionedCall:0", shape=(2048, None), dtype=int32)
2023-11-20T18:23:15.784+02:00 | tfdf_output_leaves_casted: Tensor("Cast:0", shape=(2048, 3), dtype=float32)

And the error I got (Incompatible shapes: [80,1] vs. [2048,1]):

ErrorMessage "tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error

Detected at node 'gradient_tape/binary_crossentropy/mul_1/Mul' defined at (most recent call last)
File "/opt/ml/code/sagemaker_training_entrypoint.py", line 430, in
trained_model, preprocessing_layer = TrainFnSageMaker.train_fn(
File "/opt/ml/code/RankingTF/Training/train_nn_fn.py", line 244, in train_fn
model.fit(ds_train_input,
File "/usr/local/lib/python3.10/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1742, in fit
tmp_logs = self.train_function(iterator)
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1338, in train_function
return step_function(self, iterator)
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1322, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1303, in run_step
outputs = model.train_step(data)
File "/usr/local/lib/python3.10/site-packages/keras/src/engine/training.py", line 1084, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/usr/local/lib/python3.10/site-packages/keras/src/optimizers/optimizer.py", line 543, in minimize
grads_and_vars = self.compute_gradients(loss, var_list, tape)
File "/usr/local/lib/python3.10/site-packages/keras/src/optimizers/optimizer.py", line 276, in compute_gradients
grads = tape.gradient(loss, var_list)
Node: 'gradient_tape/binary_crossentropy/mul_1/Mul'
Incompatible shapes: [80,1] vs. [2048,1]
#11 [[{{node gradient_tape/binary_crossentropy/mul_1/Mul}}]] [Op:__inference_train_function_48268]

One thing to note:
In my repo the inputs (inputs_for_tfdf) are represented as a dict of tensors (the key is the feature name and the value is the tensor), so maybe this different representation is causing the issue.

This is not a blocker for me, but when calling:
self.num_trees = tfdf_model.make_inspector().num_trees()
the inspector was not available on the loaded model. I tried an alternative,
tfdf_model.get_config()['num_trees'], but the config object was an empty dict,
so eventually I set the number of trees manually.

Would appreciate your help! thank you, Adva

advahadr commented on June 7, 2024

Hi,
An update: I found the cause of the shape mismatch. I trained the TFDF model on a dataset with a batch size of 2048, and when training the ensemble I used a batch size of 80. However, I'm still not sure why the batch size used for the TFDF model matters.
