Hi, Thanks for sharing this great work. I successfully run the evaluating code for

Got stuck during training, about google-research/deeplab2

Comments (9)

csrhddlam commented on June 17, 2024

Just found out why! Finally!

In short, this is caused by a bug in SyncBatchNormalization. We fixed the bug during the development of DeepLab2. Unfortunately, the fix was not included in the TensorFlow 2.5 release, which is what your Docker image uses.

A quick way to verify that this bug is causing the issue is to set use_sync_batchnorm: false in the config. Training should run successfully with this workaround.

Solution: the issue should be resolved by using a newer version of TensorFlow that includes the fix commit, e.g., TensorFlow 2.6 or the TensorFlow GitHub master branch.
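A minimal sketch of guarding against the affected release before training starts. It assumes, as stated above, that TensorFlow 2.6 and later contain the fix; the helper name `has_sync_batchnorm_fix` is hypothetical and not part of DeepLab2 or TensorFlow.

```python
def has_sync_batchnorm_fix(version: str, minimum=(2, 6)) -> bool:
    """Return True if `version` is at least `minimum` (major, minor).

    Assumption: the SyncBatchNormalization fix ships in TensorFlow 2.6 and
    later, as stated in the comment above. Hypothetical helper.
    """
    parts = version.split(".")
    # Only the major and minor components matter for this check.
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) >= tuple(minimum)


print(has_sync_batchnorm_fix("2.5.0"))  # False
print(has_sync_batchnorm_fix("2.6.0"))  # True
```

In a training script this would be called with `tf.__version__` after importing TensorFlow, falling back to `use_sync_batchnorm: false` or aborting when it returns False.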

Explanation: the bug in SyncBatchNormalization causes NaN batch norm outputs in attention layers (as verified by your Panoptic-DeepLab experiment with max_deeplab_s_backbone), so the network outputs are mostly NaN. These NaN outputs caused all the errors in Hungarian matching.
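To illustrate why NaN values break a matching step that tests for zero entries, here is a small NumPy sketch. It is not the DeepLab2 matchers_ops code: NumPy stands in for TensorFlow so the example runs without a GPU, and `check_finite` is a hypothetical helper.

```python
import numpy as np


def check_finite(weights: np.ndarray) -> None:
    """Raise early if a cost matrix contains NaN or Inf (hypothetical helper)."""
    if not np.all(np.isfinite(weights)):
        raise ValueError("cost matrix contains non-finite values")


healthy = np.array([[0.0, 2.0], [1.0, 0.0]], dtype=np.float32)
broken = np.full((2, 2), np.nan, dtype=np.float32)

# NaN never compares equal to anything, so a zero test finds no candidate edges.
healthy_zeros = np.equal(healthy, 0.0)
broken_zeros = np.equal(broken, 0.0)

# Subtracting the row minimum, a common reduction step in Hungarian-style
# matching, leaves NaN in place instead of creating new zeros.
reduced = broken - broken.min(axis=1, keepdims=True)

print(int(healthy_zeros.sum()), int(broken_zeros.sum()), bool(np.isnan(reduced).all()))
```

Calling a check like `check_finite` on the matching weights before the matching loop would turn a silent failure into an immediate, readable error.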

from deeplab2.

csrhddlam commented on June 17, 2024

Hi,

Thanks for letting us know about this issue.

First of all, the full memory usage is expected: by default, TensorFlow reserves all available GPU memory while it runs. It does not indicate an out-of-memory (OOM) error.
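If the fully occupied memory makes monitoring confusing, TensorFlow can be asked to allocate GPU memory on demand instead. A minimal sketch, assuming the environment variable is set before TensorFlow is imported; it only changes the allocation strategy and is optional.

```python
import os

# Must be set before `import tensorflow`, because TensorFlow reads it when
# it initializes the GPU devices.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import TensorFlow only after setting the variable
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```

The in-process equivalent is to call `tf.config.experimental.set_memory_growth(gpu, True)` for each device returned by `tf.config.list_physical_devices('GPU')`.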

Regarding the stuck issue, are you able to run the basic resnet-50 config on COCO successfully? If yes, could you try replacing the backbone with "max_deeplab_s_backbone" and see if it runs? This gives us more information about where the problem might be.

from deeplab2.

lxa9867 commented on June 17, 2024

Thanks for your reply.

Here is the log and GPU utilization under the ResNet-50 config. I think it runs well.

I0626 03:57:13.319347 139850457837760 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': True, 'backbone_type': 'resnet', 'use_axial_beyond_stride': 0, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 
'tensorflow.python.keras.layers.normalization_v2.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0626 03:57:13.511222 139850457837760 deeplab.py:96] Setting pooling size to (41, 41)
I0626 03:57:13.511468 139850457837760 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:13.511590 139850457837760 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
W0626 03:57:17.840097 139850457837760 deprecation.py:534] From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
######### 100
I0626 03:57:20.698017 139850457837760 controller.py:391] restoring or initializing model...
restoring or initializing model...
I0626 03:57:20.698222 139850457837760 controller.py:397] initialized model.
initialized model.
I0626 03:57:21.889940 139850457837760 api.py:446] Eval with scales ListWrapper([1.0])
I0626 03:57:23.253040 139850457837760 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:23.282651 139850457837760 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:23.310259 139850457837760 api.py:446] Eval scale 1.0; setting pooling size to [41, 41]
I0626 03:57:28.292841 139850457837760 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:28.322828 139850457837760 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:30.499344 139850457837760 controller.py:487] saved checkpoint to output/panoptic_resnet50_os16/ckpt-0.
saved checkpoint to output/panoptic_resnet50_os16/ckpt-0.
I0626 03:57:30.499933 139850457837760 controller.py:236] train | step:      0 | training until step 200000...
train | step:      0 | training until step 200000...
2021-06-26 03:57:30.978694: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-06-26 03:57:30.979509: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2593990000 Hz
2021-06-26 03:58:10.995738: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-26 03:58:11.336968: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-26 03:58:11.732933: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-26 03:58:12.006126: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
I0626 03:59:31.502029 139850457837760 controller.py:458] train | step:    100 | steps/sec:    0.8 | output:
    {'learning_rate': 2.5000001e-05,
     'losses/train_center_loss': 1.6140642,
     'losses/train_regression_loss': 0.5272915,
     'losses/train_semantic_loss': 5.4412446,
     'losses/train_total_loss': 7.5826}
train | step:    100 | steps/sec:    0.8 | output:
    {'learning_rate': 2.5000001e-05,
     'losses/train_center_loss': 1.6140642,
     'losses/train_regression_loss': 0.5272915,
     'losses/train_semantic_loss': 5.4412446,
     'losses/train_total_loss': 7.5826}

After I changed the backbone to "max_deeplab_s_backbone", it also runs without getting stuck (note the nan losses in the log below).

I0626 03:49:47.413747 139760778813632 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': True, 'backbone_type': 'resnet_beta', 'use_axial_beyond_stride': 16, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 
'tensorflow.python.keras.layers.normalization_v2.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0626 03:49:47.644548 139760778813632 deeplab.py:96] Setting pooling size to (41, 41)
I0626 03:49:47.644795 139760778813632 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:49:47.644916 139760778813632 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
W0626 03:49:52.028648 139760778813632 deprecation.py:534] From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
######### 100
I0626 03:49:54.941907 139760778813632 controller.py:391] restoring or initializing model...
restoring or initializing model...
I0626 03:49:54.942111 139760778813632 controller.py:397] initialized model.
initialized model.
I0626 03:49:56.121497 139760778813632 api.py:446] Eval with scales ListWrapper([1.0])
I0626 03:49:57.515579 139760778813632 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:49:57.544761 139760778813632 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:49:57.572241 139760778813632 api.py:446] Eval scale 1.0; setting pooling size to [41, 41]
I0626 03:50:05.040328 139760778813632 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:50:05.070598 139760778813632 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:50:07.557797 139760778813632 controller.py:487] saved checkpoint to output/panoptic_resnet50_os16/ckpt-0.
saved checkpoint to output/panoptic_resnet50_os16/ckpt-0.
I0626 03:50:07.558384 139760778813632 controller.py:236] train | step:      0 | training until step 200000...
train | step:      0 | training until step 200000...
2021-06-26 03:50:08.046499: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-06-26 03:50:08.047468: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2593990000 Hz
2021-06-26 03:51:25.691272: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-26 03:51:26.297113: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-26 03:51:26.674273: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-26 03:51:26.948483: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11



I0626 03:53:33.448543 139760778813632 controller.py:458] train | step:    100 | steps/sec:    0.5 | output:
    {'learning_rate': 2.5000001e-05,
     'losses/train_center_loss': nan,
     'losses/train_regression_loss': nan,
     'losses/train_semantic_loss': nan,
     'losses/train_total_loss': nan}
train | step:    100 | steps/sec:    0.5 | output:
    {'learning_rate': 2.5000001e-05,
     'losses/train_center_loss': nan,
     'losses/train_regression_loss': nan,
     'losses/train_semantic_loss': nan,
     'losses/train_total_loss': nan}
I0626 03:55:37.245568 139760778813632 controller.py:458] train | step:    200 | steps/sec:    0.8 | output:
    {'learning_rate': 5.0000002e-05,
     'losses/train_center_loss': nan,
     'losses/train_regression_loss': nan,
     'losses/train_semantic_loss': nan,
     'losses/train_total_loss': nan}
train | step:    200 | steps/sec:    0.8 | output:
    {'learning_rate': 5.0000002e-05,
     'losses/train_center_loss': nan,
     'losses/train_regression_loss': nan,
     'losses/train_semantic_loss': nan,
     'losses/train_total_loss': nan}
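NaN losses like the ones logged above are easy to miss in a long run. A minimal sketch of a check over such an output dictionary; `find_nan_losses` is a hypothetical helper, not part of DeepLab2 or Orbit, and the values are copied from the step-100 logs in this comment.

```python
import math


def find_nan_losses(output: dict) -> list:
    """Return the sorted keys whose float values are NaN (hypothetical helper)."""
    return sorted(
        key
        for key, value in output.items()
        if isinstance(value, float) and math.isnan(value)
    )


# Step-100 outputs from the two runs logged above.
resnet50_step_100 = {
    "learning_rate": 2.5000001e-05,
    "losses/train_center_loss": 1.6140642,
    "losses/train_regression_loss": 0.5272915,
    "losses/train_semantic_loss": 5.4412446,
    "losses/train_total_loss": 7.5826,
}
max_deeplab_s_step_100 = {
    "learning_rate": 2.5000001e-05,
    "losses/train_center_loss": float("nan"),
    "losses/train_regression_loss": float("nan"),
    "losses/train_semantic_loss": float("nan"),
    "losses/train_total_loss": float("nan"),
}

print(find_nan_losses(resnet50_step_100))       # []
print(len(find_nan_losses(max_deeplab_s_step_100)))  # 4
```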

from deeplab2.

lxa9867 commented on June 17, 2024

I created a Docker image for you to reproduce the issue. It reproduces the stuck issue for MaX-DeepLab and also successfully runs the Panoptic-DeepLab ResNet-50 version. You can pull the image from ang9867/max_tf.

This image is built upon tensorflow/tensorflow:2.5.0-gpu. I have installed all required packages, but you still need to set up the paths to Orbit and cocoapi.

docker pull ang9867/max_tf
docker run -it --gpus all -v $deeplab2 path in host$:$deeplab2 path in docker$ -v $data folder path in host$:$data folder path in docker$ ang9867/max_tf /bin/bash
cd $path to cocoapi$/cocoapi/PythonAPI
make
cd $path to deeplab2$
export PYTHONPATH=$PYTHONPATH:$path to deeplab2$:$path to models$/models:$path to cocoapi$/cocoapi/PythonAPI
python3 trainer/train.py --config_file=configs/coco/max_deeplab/max_deeplab_s_os16_res641_400k.textproto --mode=train --model_dir=output --num_gpus=1

I also observed that once training gets stuck, Ctrl+C doesn't work. I have to kill the process manually to terminate it.

from deeplab2.

csrhddlam commented on June 17, 2024

The docker image looks great! Thanks for sharing it.

I'll try it on my machine and get back to you later.

from deeplab2.

lxa9867 commented on June 17, 2024

Thx!

I suspect the matching operation causes the hang (it gets stuck inside the Orbit loop when it should raise an error). I disabled the loop in Orbit and ran it one step at a time, which produced the following errors. Looking forward to your feedback.

2021-06-26 09:57:53.491916: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ****************************************************************************************************
2021-06-26 09:57:53.492005: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at base_op.h:129 : Invalid argument: input and output shapes/data type sizes are not compatible
Traceback (most recent call last):
  File "trainer/train.py", line 76, in <module>
    app.run(main)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "trainer/train.py", line 72, in main
    FLAGS.num_gpus)
  File "/home/mcg/deeplab2/trainer/train_lib.py", line 187, in run_experiment
    steps=config.trainer_options.solver_options.training_number_of_steps)
  File "/home/mcg/deeplab2/models/orbit/controller.py", line 241, in train
    self._train_n_steps(num_steps)
  File "/home/mcg/deeplab2/models/orbit/controller.py", line 440, in _train_n_steps
    train_output = self.trainer.train(num_steps_tensor)
  File "/home/mcg/deeplab2/models/orbit/standard_runner.py", line 153, in train
    self.train_step(self._train_iter)
  File "/home/mcg/deeplab2/trainer/trainer.py", line 221, in train_step
    self._strategy.run(step_fn, args=(next(iterator),))
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/one_device_strategy.py", line 188, in run
    return super(OneDeviceStrategy, self).run(fn, args, kwargs, options)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1285, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2833, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/one_device_strategy.py", line 396, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 597, in wrapper
    return func(*args, **kwargs)
  File "/home/mcg/deeplab2/trainer/trainer.py", line 219, in step_fn
    self._train_step(inputs)
  File "/home/mcg/deeplab2/trainer/trainer.py", line 238, in _train_step
    loss_dict = self._loss(inputs, outputs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/mcg/deeplab2/model/loss/loss_builder.py", line 207, in call
    loss_dict = multi_term_loss((y_true, y_pred))
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/mcg/deeplab2/model/loss/max_deeplab_loss.py", line 605, in call
    nonsquare_hungarian_matching(hungarian_weights))
  File "/home/mcg/deeplab2/model/loss/max_deeplab_loss.py", line 139, in nonsquare_hungarian_matching
    square_permutation = matchers_ops.hungarian_matching(weights)
  File "/home/mcg/deeplab2/model/loss/matchers_ops.py", line 513, in hungarian_matching
    back_prop=False)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 602, in new_func
    return func(*args, **kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2541, in while_loop_v2
    return_same_structure=True)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2777, in while_loop
    loop_vars = body(*loop_vars)
  File "/home/mcg/deeplab2/model/loss/matchers_ops.py", line 501, in _update_weights_and_match
    adj_matrix = tf.equal(weights, 0.)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 1729, in equal
    return gen_math_ops.equal(x, y, name=name)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 3215, in equal
    _ops.raise_from_not_ok_status(e, name)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: input and output shapes/data type sizes are not compatible [Op:Equal]
2021-06-26 09:57:53.840640: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-26 09:57:53.840712: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted

Thread 0x00007fa83d1620c0 (most recent call first):
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1264 in delete_iterator
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 546 in __del__
Aborted (core dumped)

from deeplab2.

csrhddlam commented on June 17, 2024

Thanks for pinpointing the issue! This is really helpful.

It seems to me that the 0. is not correctly converted to a float32 tensor.

While I look into the issue more closely, a possible fix is to use an explicit tf.constant(0.0, dtype=tf.float32) instead of the 0. at line 470 and line 483.

from deeplab2.

lxa9867 commented on June 17, 2024

I tried replacing 0.0 with tf.constant(0.0, dtype=tf.float32), but the error stays the same. Then I tried tf.constant(0.0, dtype=weights.dtype), and the error changes. There are still errors in matchers_ops.py, though, and it outputs thousands of lines like

2021-06-26 15:10:21.042821: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 6 Chunks of size 4286720 totalling 24.53MiB
2021-06-26 15:10:21.042831: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 4456704 totalling 4.25MiB

It then ends with

2021-06-26 15:20:03.694437: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ****************************************************************************************************
2021-06-26 15:20:03.694569: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at base_op.h:129 : Invalid argument: input and output shapes/data type sizes are not compatible
Traceback (most recent call last):
  File "trainer/train.py", line 76, in <module>
    app.run(main)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "trainer/train.py", line 72, in main
    FLAGS.num_gpus)
  File "/home/mcg/deeplab2/trainer/train_lib.py", line 187, in run_experiment
    steps=config.trainer_options.solver_options.training_number_of_steps)
  File "/home/mcg/deeplab2/models/orbit/controller.py", line 241, in train
    self._train_n_steps(num_steps)
  File "/home/mcg/deeplab2/models/orbit/controller.py", line 440, in _train_n_steps
    train_output = self.trainer.train(num_steps_tensor)
  File "/home/mcg/deeplab2/models/orbit/standard_runner.py", line 153, in train
    self.train_step(self._train_iter)
  File "/home/mcg/deeplab2/trainer/trainer.py", line 221, in train_step
    self._strategy.run(step_fn, args=(next(iterator),))
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/one_device_strategy.py", line 188, in run
    return super(OneDeviceStrategy, self).run(fn, args, kwargs, options)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1285, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2833, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/one_device_strategy.py", line 396, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 597, in wrapper
    return func(*args, **kwargs)
  File "/home/mcg/deeplab2/trainer/trainer.py", line 219, in step_fn
    self._train_step(inputs)
  File "/home/mcg/deeplab2/trainer/trainer.py", line 238, in _train_step
    loss_dict = self._loss(inputs, outputs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/mcg/deeplab2/model/loss/loss_builder.py", line 207, in call
    loss_dict = multi_term_loss((y_true, y_pred))
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
    outputs = call_fn(inputs, *args, **kwargs)
  File "/home/mcg/deeplab2/model/loss/max_deeplab_loss.py", line 605, in call
    nonsquare_hungarian_matching(hungarian_weights))
  File "/home/mcg/deeplab2/model/loss/max_deeplab_loss.py", line 139, in nonsquare_hungarian_matching
    square_permutation = matchers_ops.hungarian_matching(weights)
  File "/home/mcg/deeplab2/model/loss/matchers_ops.py", line 515, in hungarian_matching
    back_prop=False)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 602, in new_func
    return func(*args, **kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2541, in while_loop_v2
    return_same_structure=True)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2777, in while_loop
    loop_vars = body(*loop_vars)
  File "/home/mcg/deeplab2/model/loss/matchers_ops.py", line 503, in _update_weights_and_match
    adj_matrix = tf.equal(weights, tf.constant(0.0, dtype=weights.dtype))
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
    return target(*args, **kwargs)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 1729, in equal
    return gen_math_ops.equal(x, y, name=name)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 3215, in equal
    _ops.raise_from_not_ok_status(e, name)
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: input and output shapes/data type sizes are not compatible [Op:Equal]
2021-06-26 15:20:04.257207: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-26 15:20:04.257269: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted

Thread 0x00007fcf86f650c0 (most recent call first):
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1264 in delete_iterator
  File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 546 in __del__
Aborted (core dumped)

from deeplab2.

lxa9867 commented on June 17, 2024

Thank you!

TF 2.6 works!

from deeplab2.
