Comments (9)
Just found out why! Finally!
In short, this is caused by a bug in SyncBatchNormalization. We fixed the bug during the development of DeepLab2. Unfortunately, the fix was not included in the TensorFlow 2.5 release, which is used in your Docker image.
A quick way to verify that this bug is causing the issue is to set use_sync_batchnorm: false in the config. Training should run successfully with this quick workaround.
Solution: the issue should be resolved by using a newer version of TensorFlow that includes the fix commit, e.g., TensorFlow 2.6 or the TensorFlow GitHub master branch.
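A version check before training starts is one way to guard against the affected release. The sketch below is only an illustration: has_sync_bn_fix is a hypothetical helper (not part of DeepLab2 or TensorFlow), and the 2.6 threshold is taken from the statement above, not from the TensorFlow changelog.

```python
import re


def has_sync_bn_fix(version: str, minimum=(2, 6)) -> bool:
    """Return True if `version` (e.g. "2.6.0") is at least `minimum`.

    Hypothetical helper: the (2, 6) threshold is an assumption based on
    this thread, which reports that TensorFlow 2.6 contains the fix.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Compare as integer tuples so that "2.10" sorts after "2.6".
    return (major, minor) >= minimum


# Usage (assumes TensorFlow is installed):
#   import tensorflow as tf
#   if not has_sync_bn_fix(tf.__version__):
#       print("Upgrade TensorFlow or set use_sync_batchnorm: false")
```

Comparing integer tuples avoids the string-comparison pitfall where "2.10" would sort before "2.6".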
Explanation: The bug in SyncBatchNormalization causes nan batch norm outputs in attention layers (as verified by your Panoptic-DeepLab experiment with max_deeplab_s_backbone), and thus the network outputs are mostly nan. These nan outputs caused all the errors in Hungarian matching.
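To illustrate why nan values are a problem for a matcher that looks for zero entries (the traceback in this thread fails at a tf.equal(weights, 0.) call), here is a simplified sketch in plain Python. It does not reproduce the actual matchers_ops.py logic; zero_mask and subtract_row_min are hypothetical stand-ins, and the only point is that nan never compares equal to zero, so no candidate matches are ever found.

```python
import math


def subtract_row_min(weights):
    """Simplified row-reduction step: subtract each row's minimum."""
    return [[w - min(row) for w in row] for row in weights]


def zero_mask(weights):
    """Mark entries that are exactly zero (candidate matches)."""
    return [[w == 0.0 for w in row] for row in weights]


finite = [[4.0, 1.0], [2.0, 3.0]]
broken = [[math.nan, math.nan], [math.nan, math.nan]]

# With finite weights, every row gets at least one zero after reduction.
print(zero_mask(subtract_row_min(finite)))  # [[False, True], [True, False]]

# With nan weights, nan - nan is still nan, and nan == 0.0 is False.
print(zero_mask(subtract_row_min(broken)))  # [[False, False], [False, False]]
```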
from deeplab2.
Hi,
Thanks for letting us know about this issue.
First of all, the full memory usage is expected: TensorFlow reserves nearly all GPU memory by default while it runs. It doesn't indicate an out-of-memory (OOM) error.
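If the full-memory reading makes GPU monitoring confusing, TensorFlow can be asked to grow its allocation on demand instead. A minimal sketch, assuming the environment variable is set before TensorFlow is first imported in the process:

```python
import os

# Ask TensorFlow to allocate GPU memory on demand instead of reserving
# almost all of it at start-up. Set this before `import tensorflow`.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import only after the variable is set
```

This only changes how memory usage is reported by tools such as nvidia-smi. It is not expected to affect the stuck issue itself.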
Regarding the stuck issue, are you able to run the basic resnet-50 config on COCO successfully? If yes, could you try replacing the backbone with "max_deeplab_s_backbone" and see if it runs? This gives us more information about where the problem might be.
from deeplab2.
Thanks for your reply.
This is the log and GPU utilization under the resnet50 config. I think it runs well.
I0626 03:57:13.319347 139850457837760 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': True, 'backbone_type': 'resnet', 'use_axial_beyond_stride': 0, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 
'tensorflow.python.keras.layers.normalization_v2.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0626 03:57:13.511222 139850457837760 deeplab.py:96] Setting pooling size to (41, 41)
I0626 03:57:13.511468 139850457837760 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:13.511590 139850457837760 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
W0626 03:57:17.840097 139850457837760 deprecation.py:534] From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
######### 100
I0626 03:57:20.698017 139850457837760 controller.py:391] restoring or initializing model...
restoring or initializing model...
I0626 03:57:20.698222 139850457837760 controller.py:397] initialized model.
initialized model.
I0626 03:57:21.889940 139850457837760 api.py:446] Eval with scales ListWrapper([1.0])
I0626 03:57:23.253040 139850457837760 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:23.282651 139850457837760 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:23.310259 139850457837760 api.py:446] Eval scale 1.0; setting pooling size to [41, 41]
I0626 03:57:28.292841 139850457837760 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:28.322828 139850457837760 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:57:30.499344 139850457837760 controller.py:487] saved checkpoint to output/panoptic_resnet50_os16/ckpt-0.
saved checkpoint to output/panoptic_resnet50_os16/ckpt-0.
I0626 03:57:30.499933 139850457837760 controller.py:236] train | step: 0 | training until step 200000...
train | step: 0 | training until step 200000...
2021-06-26 03:57:30.978694: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-06-26 03:57:30.979509: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2593990000 Hz
2021-06-26 03:58:10.995738: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-26 03:58:11.336968: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-26 03:58:11.732933: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-26 03:58:12.006126: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
I0626 03:59:31.502029 139850457837760 controller.py:458] train | step: 100 | steps/sec: 0.8 | output:
{'learning_rate': 2.5000001e-05,
'losses/train_center_loss': 1.6140642,
'losses/train_regression_loss': 0.5272915,
'losses/train_semantic_loss': 5.4412446,
'losses/train_total_loss': 7.5826}
train | step: 100 | steps/sec: 0.8 | output:
{'learning_rate': 2.5000001e-05,
'losses/train_center_loss': 1.6140642,
'losses/train_regression_loss': 0.5272915,
'losses/train_semantic_loss': 5.4412446,
'losses/train_total_loss': 7.5826}
After I changed the backbone to "max_deeplab_s", it also runs without getting stuck, although the losses in the log below are nan.
I0626 03:49:47.413747 139760778813632 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': True, 'backbone_type': 'resnet_beta', 'use_axial_beyond_stride': 16, 'backbone_use_transformer_beyond_stride': 0, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 1.0, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'constant', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 
'tensorflow.python.keras.layers.normalization_v2.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0626 03:49:47.644548 139760778813632 deeplab.py:96] Setting pooling size to (41, 41)
I0626 03:49:47.644795 139760778813632 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:49:47.644916 139760778813632 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
W0626 03:49:52.028648 139760778813632 deprecation.py:534] From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
######### 100
I0626 03:49:54.941907 139760778813632 controller.py:391] restoring or initializing model...
restoring or initializing model...
I0626 03:49:54.942111 139760778813632 controller.py:397] initialized model.
initialized model.
I0626 03:49:56.121497 139760778813632 api.py:446] Eval with scales ListWrapper([1.0])
I0626 03:49:57.515579 139760778813632 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:49:57.544761 139760778813632 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:49:57.572241 139760778813632 api.py:446] Eval scale 1.0; setting pooling size to [41, 41]
I0626 03:50:05.040328 139760778813632 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:50:05.070598 139760778813632 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0626 03:50:07.557797 139760778813632 controller.py:487] saved checkpoint to output/panoptic_resnet50_os16/ckpt-0.
saved checkpoint to output/panoptic_resnet50_os16/ckpt-0.
I0626 03:50:07.558384 139760778813632 controller.py:236] train | step: 0 | training until step 200000...
train | step: 0 | training until step 200000...
2021-06-26 03:50:08.046499: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-06-26 03:50:08.047468: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2593990000 Hz
2021-06-26 03:51:25.691272: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-26 03:51:26.297113: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-26 03:51:26.674273: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-26 03:51:26.948483: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
I0626 03:53:33.448543 139760778813632 controller.py:458] train | step: 100 | steps/sec: 0.5 | output:
{'learning_rate': 2.5000001e-05,
'losses/train_center_loss': nan,
'losses/train_regression_loss': nan,
'losses/train_semantic_loss': nan,
'losses/train_total_loss': nan}
train | step: 100 | steps/sec: 0.5 | output:
{'learning_rate': 2.5000001e-05,
'losses/train_center_loss': nan,
'losses/train_regression_loss': nan,
'losses/train_semantic_loss': nan,
'losses/train_total_loss': nan}
I0626 03:55:37.245568 139760778813632 controller.py:458] train | step: 200 | steps/sec: 0.8 | output:
{'learning_rate': 5.0000002e-05,
'losses/train_center_loss': nan,
'losses/train_regression_loss': nan,
'losses/train_semantic_loss': nan,
'losses/train_total_loss': nan}
train | step: 200 | steps/sec: 0.8 | output:
{'learning_rate': 5.0000002e-05,
'losses/train_center_loss': nan,
'losses/train_regression_loss': nan,
'losses/train_semantic_loss': nan,
'losses/train_total_loss': nan}
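Since nan losses like these are easy to miss in a long log, a small check on the logged output can flag them early. The sketch below is hypothetical (nan_losses is not part of deeplab2 or Orbit) and assumes the step output is a plain dict of floats like the one printed above.

```python
import math


def nan_losses(step_output):
    """Return the sorted names of entries whose value is nan."""
    return sorted(
        name for name, value in step_output.items()
        if math.isnan(float(value))
    )


# Values copied from the step-100 output above.
step_100 = {
    "learning_rate": 2.5000001e-05,
    "losses/train_center_loss": float("nan"),
    "losses/train_regression_loss": float("nan"),
    "losses/train_semantic_loss": float("nan"),
    "losses/train_total_loss": float("nan"),
}

bad = nan_losses(step_100)
if bad:
    print("nan detected in:", ", ".join(bad))
```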
from deeplab2.
I created a Docker image for you to reproduce the issue. It reproduces the stuck issue for MaX-DeepLab and also successfully runs the Panoptic-DeepLab ResNet-50 version. You may pull the image from ang9867/max_tf.
This image is built upon tensorflow/tensorflow:2.5.0-gpu. I have installed all required packages, but you still need to set up the paths to Orbit and cocoapi.
docker pull ang9867/max_tf
docker run -it --gpus all -v $deeplab2 path in host$:$deeplab2 path in docker$ -v $data folder path in host$:$data folder path in docker$ ang9867/max_tf /bin/bash
cd $path to cocoapi$/cocoapi/PythonAPI
make
cd $path to deeplab2$
export PYTHONPATH=$PYTHONPATH:$path to deeplab2$:$path to models$/models:$path to cocoapi$/cocoapi/PythonAPI
python3 trainer/train.py --config_file=configs/coco/max_deeplab/max_deeplab_s_os16_res641_400k.textproto --mode=train --model_dir=output --num_gpus=1
I also observe that once training gets stuck, Ctrl+C doesn't work. I have to kill the process manually to terminate it.
from deeplab2.
The docker image looks great! Thanks for sharing it.
I'll try it on my machine and get back to you later.
from deeplab2.
Thx!
I suppose the matching operation causes the hang (it hangs because of Orbit, when it should raise an error). I disabled the loop in Orbit and ran it one step at a time, which produced the following errors. Looking forward to your feedback.
2021-06-26 09:57:53.491916: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ****************************************************************************************************
2021-06-26 09:57:53.492005: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at base_op.h:129 : Invalid argument: input and output shapes/data type sizes are not compatible
Traceback (most recent call last):
File "trainer/train.py", line 76, in <module>
app.run(main)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "trainer/train.py", line 72, in main
FLAGS.num_gpus)
File "/home/mcg/deeplab2/trainer/train_lib.py", line 187, in run_experiment
steps=config.trainer_options.solver_options.training_number_of_steps)
File "/home/mcg/deeplab2/models/orbit/controller.py", line 241, in train
self._train_n_steps(num_steps)
File "/home/mcg/deeplab2/models/orbit/controller.py", line 440, in _train_n_steps
train_output = self.trainer.train(num_steps_tensor)
File "/home/mcg/deeplab2/models/orbit/standard_runner.py", line 153, in train
self.train_step(self._train_iter)
File "/home/mcg/deeplab2/trainer/trainer.py", line 221, in train_step
self._strategy.run(step_fn, args=(next(iterator),))
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/one_device_strategy.py", line 188, in run
return super(OneDeviceStrategy, self).run(fn, args, kwargs, options)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1285, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2833, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/one_device_strategy.py", line 396, in _call_for_each_replica
return fn(*args, **kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 597, in wrapper
return func(*args, **kwargs)
File "/home/mcg/deeplab2/trainer/trainer.py", line 219, in step_fn
self._train_step(inputs)
File "/home/mcg/deeplab2/trainer/trainer.py", line 238, in _train_step
loss_dict = self._loss(inputs, outputs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/home/mcg/deeplab2/model/loss/loss_builder.py", line 207, in call
loss_dict = multi_term_loss((y_true, y_pred))
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/home/mcg/deeplab2/model/loss/max_deeplab_loss.py", line 605, in call
nonsquare_hungarian_matching(hungarian_weights))
File "/home/mcg/deeplab2/model/loss/max_deeplab_loss.py", line 139, in nonsquare_hungarian_matching
square_permutation = matchers_ops.hungarian_matching(weights)
File "/home/mcg/deeplab2/model/loss/matchers_ops.py", line 513, in hungarian_matching
back_prop=False)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 602, in new_func
return func(*args, **kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2541, in while_loop_v2
return_same_structure=True)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2777, in while_loop
loop_vars = body(*loop_vars)
File "/home/mcg/deeplab2/model/loss/matchers_ops.py", line 501, in _update_weights_and_match
adj_matrix = tf.equal(weights, 0.)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 1729, in equal
return gen_math_ops.equal(x, y, name=name)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 3215, in equal
_ops.raise_from_not_ok_status(e, name)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: input and output shapes/data type sizes are not compatible [Op:Equal]
2021-06-26 09:57:53.840640: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-26 09:57:53.840712: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted
Thread 0x00007fa83d1620c0 (most recent call first):
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1264 in delete_iterator
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 546 in __del__
Aborted (core dumped)
from deeplab2.
Thanks for pinpointing the issue! This is really helpful.
It seems to me that the 0. is not correctly converted to a float32 tensor.
While I am looking more closely into the issue, a fix could be to use an explicit tf.constant(0.0, dtype=tf.float32) instead of the 0. at line 470 and line 483.
from deeplab2.
I tried replacing 0.0 with tf.constant(0.0, dtype=tf.float32), but the error stays the same. Then I tried tf.constant(0.0, dtype=weights.dtype), and the error changes. But there are still errors in matchers_ops.py, which outputs thousands of lines like
2021-06-26 15:10:21.042821: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 6 Chunks of size 4286720 totalling 24.53MiB
2021-06-26 15:10:21.042831: I tensorflow/core/common_runtime/bfc_allocator.cc:1054] 1 Chunks of size 4456704 totalling 4.25MiB
Then it ends with
2021-06-26 15:20:03.694437: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ****************************************************************************************************
2021-06-26 15:20:03.694569: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at base_op.h:129 : Invalid argument: input and output shapes/data type sizes are not compatible
Traceback (most recent call last):
File "trainer/train.py", line 76, in <module>
app.run(main)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "trainer/train.py", line 72, in main
FLAGS.num_gpus)
File "/home/mcg/deeplab2/trainer/train_lib.py", line 187, in run_experiment
steps=config.trainer_options.solver_options.training_number_of_steps)
File "/home/mcg/deeplab2/models/orbit/controller.py", line 241, in train
self._train_n_steps(num_steps)
File "/home/mcg/deeplab2/models/orbit/controller.py", line 440, in _train_n_steps
train_output = self.trainer.train(num_steps_tensor)
File "/home/mcg/deeplab2/models/orbit/standard_runner.py", line 153, in train
self.train_step(self._train_iter)
File "/home/mcg/deeplab2/trainer/trainer.py", line 221, in train_step
self._strategy.run(step_fn, args=(next(iterator),))
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/one_device_strategy.py", line 188, in run
return super(OneDeviceStrategy, self).run(fn, args, kwargs, options)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1285, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2833, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/distribute/one_device_strategy.py", line 396, in _call_for_each_replica
return fn(*args, **kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 597, in wrapper
return func(*args, **kwargs)
File "/home/mcg/deeplab2/trainer/trainer.py", line 219, in step_fn
self._train_step(inputs)
File "/home/mcg/deeplab2/trainer/trainer.py", line 238, in _train_step
loss_dict = self._loss(inputs, outputs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/home/mcg/deeplab2/model/loss/loss_builder.py", line 207, in call
loss_dict = multi_term_loss((y_true, y_pred))
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/keras/engine/base_layer.py", line 1030, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/home/mcg/deeplab2/model/loss/max_deeplab_loss.py", line 605, in call
nonsquare_hungarian_matching(hungarian_weights))
File "/home/mcg/deeplab2/model/loss/max_deeplab_loss.py", line 139, in nonsquare_hungarian_matching
square_permutation = matchers_ops.hungarian_matching(weights)
File "/home/mcg/deeplab2/model/loss/matchers_ops.py", line 515, in hungarian_matching
back_prop=False)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 602, in new_func
return func(*args, **kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2541, in while_loop_v2
return_same_structure=True)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2777, in while_loop
loop_vars = body(*loop_vars)
File "/home/mcg/deeplab2/model/loss/matchers_ops.py", line 503, in _update_weights_and_match
adj_matrix = tf.equal(weights, tf.constant(0.0, dtype=weights.dtype))
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py", line 206, in wrapper
return target(*args, **kwargs)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py", line 1729, in equal
return gen_math_ops.equal(x, y, name=name)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 3215, in equal
_ops.raise_from_not_ok_status(e, name)
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 6897, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.InvalidArgumentError: input and output shapes/data type sizes are not compatible [Op:Equal]
2021-06-26 15:20:04.257207: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-06-26 15:20:04.257269: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted
Thread 0x00007fcf86f650c0 (most recent call first):
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1264 in delete_iterator
File "/home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 546 in __del__
Aborted (core dumped)
from deeplab2.
Thank you!
TF 2.6 works!
from deeplab2.