google-research / deeplab2 Goto Github PK

DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a unified and state-of-the-art TensorFlow codebase for dense pixel labeling tasks.

License: Apache License 2.0

Python 93.53% Shell 0.24% C++ 1.80% Jupyter Notebook 4.43%

deeplab2's Issues

Occlusion Array?

In reference to the following comment in a previously closed issue.

For the occlusion array, how was the forward-backward consistency check implemented?
Was eq. (5) in Sundaram et al., 2010 used? If so, were the same hyperparameters used, or were they tuned, and if so, how?

Weighted loss function?

Hello All,
In order to segment unbalanced dataset, a class-weighted loss function is a must have requirements. Do you have plans to implement class-weighted loss functions? Any hint about how to go for implementing it by myself?
Best,

Error when compiling custom ops (GPU)

When compiling the custom ops GPU support I get the error below from a tensorflow header file

Followed instructions for setting compile and loader flags
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )
OP_NAME='deeplab2/tensorflow_ops/kernels/merge_semantic_and_instance_maps_op'

Compiling for CPU completed without error

Compiling for GPU
nvcc -std=c++14 -c -o ${OP_NAME}_kernel.cu.o ${OP_NAME}_kernel.cu.cc ${TF_CFLAGS[@]} -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPIC --expt-relaxed-constexpr

/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/include/tensorflow/core/platform/file_system.h(556): warning: overloaded virtual function "tensorflow::FileSystem::FilesExist" is only partially overridden in class "tensorflow::WrappedFileSystem"

...
snip
...

/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/include/absl/functional/function_ref.h:\100:29: error: parameter packs not expanded with ‘...’:
template <typename F, typename = EnableIfCompatible<const F&>>
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/include/absl/functional/function_ref.h:\100:29: note: ‘Args’
/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/include/absl/functional/function_ref.h:114:13: error: parameter packs not expanded with ‘...’:
typename F, typename = EnableIfCompatible<F*>,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

tensorflow-gpu 2.5.0

Any suggestions?

AttributeError: 'NoneType' object has no attribute 'dumps'

Hey, I can't solve this problem

ssh://[email protected]:22/data1/nqx/anaconda3/bin/python3 -u /data1/nqx/Max-deeplab/deeplab2/trainer/train.py --config_file=/data1/nqx/Max-deeplab/deeplab2/configs/coco/max_deeplab/max_deeplab_l_os16_res1025_400k.textproto --mode=train --model_dir=/data1/nqx/Max-deeplab/init_checkpoint/max_deeplab_l_backbone_imagenet1k_strong_training_strategy.tar.gz --num_gpus=8
I1012 11:38:41.230723 139697198446400 train.py:65] Reading the config file.
I1012 11:38:41.233097 139697198446400 train.py:69] Starting the experiment.
2021-10-12 11:38:41.234652: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-12 11:38:46.108032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20504 MB memory: -> device: 0, name: GeForce RTX 3090, pci bus id: 0000:1a:00.0, compute capability: 8.6
2021-10-12 11:38:46.110730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 20504 MB memory: -> device: 1, name: GeForce RTX 3090, pci bus id: 0000:1b:00.0, compute capability: 8.6
2021-10-12 11:38:46.113205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 20526 MB memory: -> device: 2, name: GeForce RTX 3090, pci bus id: 0000:3d:00.0, compute capability: 8.6
2021-10-12 11:38:46.115386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 20506 MB memory: -> device: 3, name: GeForce RTX 3090, pci bus id: 0000:3e:00.0, compute capability: 8.6
2021-10-12 11:38:46.117429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:4 with 20504 MB memory: -> device: 4, name: GeForce RTX 3090, pci bus id: 0000:88:00.0, compute capability: 8.6
2021-10-12 11:38:46.119449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:5 with 20506 MB memory: -> device: 5, name: GeForce RTX 3090, pci bus id: 0000:89:00.0, compute capability: 8.6
2021-10-12 11:38:46.121430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 20526 MB memory: -> device: 6, name: GeForce RTX 3090, pci bus id: 0000:b1:00.0, compute capability: 8.6
2021-10-12 11:38:46.123378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 20506 MB memory: -> device: 7, name: GeForce RTX 3090, pci bus id: 0000:b2:00.0, compute capability: 8.6
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I1012 11:38:48.733316 139697198446400 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I1012 11:38:48.734684 139697198446400 train_lib.py:104] Using strategy <class 'tensorflow.python.distribute.mirrored_strategy.MirroredStrategy'> with 8 replicas
I1012 11:38:48.757744 139697198446400 deeplab.py:57] Synchronized Batchnorm is used.
I1012 11:38:48.759390 139697198446400 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 6, 3, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': False, 'backbone_type': 'wider_resnet', 'use_axial_beyond_stride': 16, 'backbone_use_transformer_beyond_stride': 16, 'extra_decoder_use_transformer_beyond_stride': 16, 'backbone_decoder_num_stacks': 1, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 1, 'extra_decoder_blocks_per_stage': 3, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 512, 'base_transformer_expansion': 2.0, 'global_feed_forward_network_channels': 512, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 4, 'drop_path_keep_prob': 0.800000011920929, 'drop_path_beyond_stride': 4, 'drop_path_schedule': 'linear', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 2, 'value_expansion': 4, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'keras.layers.normalization.batch_normalization.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I1012 11:38:49.366679 139697198446400 deeplab.py:96] Setting pooling size to (65, 65)
I1012 11:38:49.366883 139697198446400 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.613164 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.615339 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.620485 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.621790 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.626796 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.628052 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.631933 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.633151 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.638095 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.639389 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1012 11:38:55.679392 139697198446400 controller.py:391] restoring or initializing model...
restoring or initializing model...
WARNING:tensorflow:From /data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py:1359: NameBasedSaverStatus.init (from tensorflow.python.training.tracking.util) is deprecated and will be removed in a future version.
Instructions for updating:
Restoring a name-based tf.train.Saver checkpoint using the object-based restore API. This mode uses global names to match variables, and so is somewhat fragile. It also adds new restore ops to the graph each time it is called when graph building. Prefer re-encoding training checkpoints in the object-based format: run save() on the object-based saver (the same one this message is coming from) and use that checkpoint in the future.
W1012 11:38:55.741179 139697198446400 deprecation.py:339] From /data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py:1359: NameBasedSaverStatus.init (from tensorflow.python.training.tracking.util) is deprecated and will be removed in a future version.
Instructions for updating:
Restoring a name-based tf.train.Saver checkpoint using the object-based restore API. This mode uses global names to match variables, and so is somewhat fragile. It also adds new restore ops to the graph each time it is called when graph building. Prefer re-encoding training checkpoints in the object-based format: run save() on the object-based saver (the same one this message is coming from) and use that checkpoint in the future.
I1012 11:38:55.788634 139697198446400 controller.py:397] initialized model.
initialized model.
I1012 11:38:56.717390 139697198446400 api.py:446] Eval with scales ListWrapper([1.0])
I1012 11:38:57.717404 139697198446400 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I1012 11:38:57.743686 139697198446400 api.py:446] Eval scale 1.0; setting pooling size to [65, 65]
I1012 11:39:26.300476 139697198446400 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
Traceback (most recent call last):
File "/data1/nqx/Max-deeplab/deeplab2/trainer/train.py", line 76, in
app.run(main)
File "/data1/nqx/anaconda3/lib/python3.8/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/data1/nqx/anaconda3/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/data1/nqx/Max-deeplab/deeplab2/trainer/train.py", line 71, in main
train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
File "/data1/nqx/Max-deeplab/deeplab2/trainer/train_lib.py", line 188, in run_experiment
controller.save_checkpoint()
File "/data1/nqx/Max-deeplab/models/orbit/controller.py", line 412, in save_checkpoint
self._maybe_save_checkpoint(check_interval=False)
File "/data1/nqx/Max-deeplab/models/orbit/controller.py", line 482, in _maybe_save_checkpoint
ckpt_path = self.checkpoint_manager.save(
File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 817, in save
save_path = self._checkpoint.write(prefix)
File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 2071, in write
output = self._saver.save(file_prefix=file_prefix, options=options)
File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 1261, in save
file_io.recursive_create_dir(os.path.dirname(file_prefix))
File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 499, in recursive_create_dir
recursive_create_dir_v2(dirname)
File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 514, in recursive_create_dir_v2
_pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.FailedPreconditionError: /data1/nqx/Max-deeplab/init_checkpoint/max_deeplab_l_backbone_imagenet1k_strong_training_strategy.tar.gz is not a directory
Exception ignored in: <function Pool.del at 0x7f0d55dbeca0>
Traceback (most recent call last):
File "/data1/nqx/anaconda3/lib/python3.8/multiprocessing/pool.py", line 268, in del
File "/data1/nqx/anaconda3/lib/python3.8/multiprocessing/queues.py", line 362, in put
AttributeError: 'NoneType' object has no attribute 'dumps'

Process finished with exit code 1

Tfrecords not getting read during training

I'm getting this error while running deeplab2/trainer/train.py' --config_file"deeplab2/configs/coco/max_deeplab/max_deeplab_s_os16_res641_400k.textproto" --mode "train" --model_dir checkpoint/max_deeplab_s_os16_res641_400k_coco_train --num_gpus 0

Traceback (most recent call last):
  File "/volumes2/Other/deeplab2/trainer/train.py", line 76, in <module>
    app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/volumes2/Other/deeplab2/trainer/train.py", line 72, in main
    FLAGS.num_gpus)
  File "/volumes2/Other/deeplab2/trainer/train_lib.py", line 191, in run_experiment
    steps=config.trainer_options.solver_options.training_number_of_steps)
  File "/volumes2/Other/models/orbit/controller.py", line 240, in train
    self._train_n_steps(num_steps)
  File "/volumes2/Other/models/orbit/controller.py", line 439, in _train_n_steps
    train_output = self.trainer.train(num_steps_tensor)
  File "/volumes2/Other/models/orbit/standard_runner.py", line 146, in train
    self._train_loop_fn(self._train_iter, num_steps)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
    return self._stateless_fn(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.FailedPreconditionError:  /volumes2/Other/data; Is a directory
	 [[{{node MultiDeviceIteratorGetNextFromShard}}]]
	 [[RemoteCall]]
	 [[while/body/_1/IteratorGetNext]] [Op:__inference_loop_fn_123475]

Function call stack:
loop_fn


Process finished with exit code 1

/volumes2/Other/data has 3000 .tfrecord files as a result of running build_coco_data.py file. Isn't this file supposed to return just three sharded tfrecord files?

Any help would be much appreciated!

Problems about how to use optical flow in iou_tracker.py

I want to employ the optional parameters: optical_flow while runing iou_tracker.py for evaluating STQ. But it involves two files with different suffixes: _OCCLUSION_EXT = '.occ_forward' and _FLOW_EXT = '.flow_forward'. I want to know the meaning the above two files and how to generate above two files through optical flow diagram.

Panoptic Segmentation COCO training OOM errors

I'm trying to train one of the ResNet50 models into a panoptic segmentation model using COCO data.

I have access to a node with 4 Nvidia V100 32GB GPUs, yet, I'm still getting OOM errors.

Do these models really require more than 4 V100s? Or am I misconfiguring something?

I'm looking for general guidance here.

the speed record

does this project also report basic speed comparasion on model implemented?

I run deeplabdemo ipython notebook, the time is:

93s

Cannot do evaluation on vip-deeplab

Hi, @joe-siyuan-qiao

Thanks for your work on deeplab2.
I am trying to use deeplab2 to run vip-deeplab. Although the training process is good, I met the following error when evaluating:

E tensorflow/core/grappler/optimizers/meta_optimizer.cc:801] layout failed: Invalid argument: Size of values 3 does not match size of permutation 4 @ fanin shape inViPDeepLab/PostProcessor/StatefulPartitionedCall/while/body/_166/while/SelectV2_1-1-TransposeNHWCToNCHW-LayoutOptimizer

By the way, how many gpus should I use to reproduce the results in the paper (with the config in the "resnet50_beta_os32.textproto", batch size = 4)?

Thanks.

Object detection

I was wondering if this library offers a pre-trained model for object detection (I mean by object detection outputting the four coordinates of the detected boxes, and there is an indeterminate number of boxes)?

About pretraining with Mapillary Vistas

Hi, thanks for you codes. And I have a question about pretrain on Mapillary Vistas.
It is general to pretrain with Mapillary Vistas and fine-tune on Cityscapes. The thing and stuff categories are quite different for these two data sets, is there some special treatment required?
Especially, some things (e.g. traffic lights, poles) in Mapillary Vistas are marked as stuff in Cityscapes, how you deal with such categories?

Can't run trainer/train.py well

1, Stopped halfway the trainer/train.py because the learning didn't progress at all.

2021-07-20 06:31:51.027365: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-07-20 06:31:51.032914: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcusolver.so.11'; dlerror: libcusolver.so.11: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/lib64-nvidia
2021-07-20 06:31:51.033924: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1835] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
I0720 06:31:51.034065 140350081660800 train.py:65] Reading the config file.
I0720 06:31:51.036973 140350081660800 train.py:69] Starting the experiment.
2021-07-20 06:31:51.037464: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
I0720 06:31:51.038176 140350081660800 train_lib.py:105] Using strategy <class 'tensorflow.python.distribute.one_device_strategy.OneDeviceStrategy'> with 1 replicas
I0720 06:31:51.042598 140350081660800 deeplab.py:57] Synchronized Batchnorm is used.
I0720 06:31:51.043589 140350081660800 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': False, 'backbone_type': 'resnet_beta', 'use_axial_beyond_stride': 16, 'backbone_use_transformer_beyond_stride': 32, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 0.800000011920929, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'linear', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'keras.layers.normalization.batch_normalization.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0720 06:31:51.322444 140350081660800 deeplab.py:96] Setting pooling size to (65, 65)
I0720 06:31:51.322864 140350081660800 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0720 06:31:57.261267 140350081660800 controller.py:391] restoring or initializing model...
restoring or initializing model...
I0720 06:31:57.299474 140350081660800 controller.py:395] restored model from ../drive/MyDrive/checkpoints/ImageNet_pretrained_checkpoints/max_deeplab_s_os16_res1025_100k_coco_train/ckpt-0.
restored model from ../drive/MyDrive/checkpoints/ImageNet_pretrained_checkpoints/max_deeplab_s_os16_res1025_100k_coco_train/ckpt-0.
I0720 06:31:57.299665 140350081660800 controller.py:217] restored from checkpoint: ../drive/MyDrive/checkpoints/ImageNet_pretrained_checkpoints/max_deeplab_s_os16_res1025_100k_coco_train/ckpt-0
restored from checkpoint: ../drive/MyDrive/checkpoints/ImageNet_pretrained_checkpoints/max_deeplab_s_os16_res1025_100k_coco_train/ckpt-0
I0720 06:31:58.465277 140350081660800 api.py:446] Eval with scales ListWrapper([1.0])
I0720 06:31:59.732715 140350081660800 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0720 06:31:59.757121 140350081660800 api.py:446] Eval scale 1.0; setting pooling size to [65, 65]
I0720 06:32:12.421342 140350081660800 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0720 06:32:13.528391 140350081660800 controller.py:236] train | step: 0 | training until step 3...
train | step: 0 | training until step 3...
2021-07-20 06:32:13.646223: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-07-20 06:32:14.011432: W tensorflow/core/framework/dataset.cc:679] Input of GeneratorDatasetOp::Dataset will not be optimized because the dataset does not implement the AsGraphDefInternal() method needed to apply optimizations.
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206: calling foldl_v2 (from tensorflow.python.ops.functional_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.foldl(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldl(fn, elems))
W0720 06:32:25.062542 140350081660800 deprecation.py:616] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/util/dispatch.py:206: calling foldl_v2 (from tensorflow.python.ops.functional_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.foldl(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldl(fn, elems))
WARNING:tensorflow:From /usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py:463: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))
W0720 06:32:25.398839 140350081660800 deprecation.py:616] From /usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py:463: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))
^C

2, Command tried

!python ./deeplab2/trainer/train.py
--config_file=./deeplab2/configs/coco/max_deeplab/max_deeplab_s_os16_res1025_100k.textproto
--mode='train'
--model_dir=../drive/MyDrive/checkpoints/ImageNet_pretrained_checkpoints/
--num_gpus=1

3, Env

On google colab;
tf.version = 2.6.0-rc1

I think the GPU is not working prperly.
I'd appreciate it if anyone could tell me how to solve it.

STEP dataset questions

Hi! Thanks for sharing the code. Is there any validation set for STEP MOTS? Or any suggestions for making the subset from the train set.

Missing config files

https://github.com/google-research/deeplab2/blob/main/g3doc/projects/panoptic_deeplab.md
the links for the config files of MobilenetV3-S and MobilenetV3-L are dead

https://github.com/google-research/deeplab2/tree/main/configs/cityscapes/panoptic_deeplab
and the provided config files with output stride 32 don't match

Missing 'checkpoint' File in the ImageNet Checkpoint Provided for ResNet-50-Beta

I downloaded ImageNet checkpoint for ResNet-50-Beta from here. After extraction, folder structure for this checkpoint looks like this:

resnet50_beta_imagenet1k_strong_training_strategy/
- ckpt-350.data-00000-of-00001
- ckpt-350.index

I adjusted the initial_checkpoint variable in the configs/coco/panoptic_deeplab/resnet50_beta_os32.textproto file such that it points to the downloaded resnet50_beta_imagenet1k_strong_training_strategy folder. However, when I run the train.py script, I receive TypeError: Expected binary or unicode string, got None. while the checkpoint is being loaded with tf.train.load_checkpoint method (within trainer/runner_utils.py: _load_tf_model_garden_vision_checkpoint).

I inspected the problem in debug mode and realized that tf.train.load_checkpoint method internally requires a file named checkpoint under the specified initial_checkpoint path. I made a run without setting initial_checkpoint to generate an example file. Then, it revealed that a checkpoint file like the one below is required by the tensorflow loader to load checkpoints properly. The following file was missing in the provided checkpoint. So, I created it manually:

checkpoint:
model_checkpoint_path: "ckpt-350"
all_model_checkpoint_paths: "ckpt-350"

And placed it under resnet50_beta_imagenet1k_strong_training_strategy folder, resulting in the following folder structure for initial_checkpoint path:

resnet50_beta_imagenet1k_strong_training_strategy/
- checkpoint
- ckpt-350.data-00000-of-00001
- ckpt-350.index

This additional file seems to solve the problem. Briefly, I think such a checkpoint file must be added wherever it is missing.

Cannot be training?

thank you for the great codes!
and when i run training codes "python trainer/train.py --config_file='configs/coco/max_deeplab/max_deeplab_s_os16_res641_100k.textproto' --mode='train' --model_dir='checkpoint/max_deeplab_s_os16_res641_100k' --num_gpus=1"
there error logs:

I0814 02:46:35.469203 139737541474112 train.py:65] Reading the config file.
I0814 02:46:35.472224 139737541474112 train.py:69] Starting the experiment.
2021-08-14 02:46:35.472912: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-14 02:46:36.288440: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30987 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:88:00.0, compute capability: 7.0
I0814 02:46:36.301358 139737541474112 train_lib.py:105] Using strategy <class 'tensorflow.python.distribute.one_device_strategy.OneDeviceStrategy'> with 1 replicas
I0814 02:46:36.621985 139737541474112 deeplab.py:57] Synchronized Batchnorm is used.
I0814 02:46:36.622916 139737541474112 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': False, 'backbone_type': 'resnet_beta', 'use_axial_beyond_stride': 16, 'backbone_use_transformer_beyond_stride': 32, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 0.800000011920929, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'linear', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'keras.layers.normalization.batch_normalization.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0814 02:46:36.977007 139737541474112 deeplab.py:96] Setting pooling size to (41, 41)
I0814 02:46:36.977362 139737541474112 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
WARNING:tensorflow:AutoGraph could not transform <function resize_to_range at 0x7f16dcac2a70> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output.
Cause: [Errno 28] No space left on device: '/tmp/tmp3onbzwhs.py'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
W0814 02:46:40.066303 139737541474112 ag_logging.py:146] AutoGraph could not transform <function resize_to_range at 0x7f16dcac2a70> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output.
Cause: [Errno 28] No space left on device: '/tmp/tmp3onbzwhs.py'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function get_random_scale at 0x7f16dcac25f0> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output.
Cause: [Errno 28] No space left on device: '/tmp/tmpfij6_rpa.py'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
W0814 02:46:40.161577 139737541474112 ag_logging.py:146] AutoGraph could not transform <function get_random_scale at 0x7f16dcac25f0> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output.
Cause: [Errno 28] No space left on device: '/tmp/tmpfij6_rpa.py'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function randomly_scale_image_and_label at 0x7f16dcac2680> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output.
Cause: [Errno 28] No space left on device: '/tmp/tmpjt5b9l39.py'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
W0814 02:46:40.261782 139737541474112 ag_logging.py:146] AutoGraph could not transform <function randomly_scale_image_and_label at 0x7f16dcac2680> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, export AUTOGRAPH_VERBOSITY=10) and attach the full output.
Cause: [Errno 28] No space left on device: '/tmp/tmpjt5b9l39.py'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
Traceback (most recent call last):
File "trainer/train.py", line 76, in
app.run(main)
File "/data/anaconda3/lib/python3.7/site-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/data/anaconda3/lib/python3.7/site-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "trainer/train.py", line 72, in main
FLAGS.num_gpus)
File "/data/PanopticFCN-main/deeplab2/trainer/train_lib.py", line 137, in run_experiment
trainer = trainer_lib.Trainer(config, deeplab_model, losses, global_step)
File "/data/PanopticFCN-main/deeplab2/trainer/trainer.py", line 137, in init
only_semantic_annotations=not support_panoptic)
File "/data/PanopticFCN-main/deeplab2/trainer/runner_utils.py", line 130, in create_dataset
return reader(dataset_config.batch_size)
File "/data/PanopticFCN-main/deeplab2/data/dataloader/input_reader.py", line 89, in call
self._generator_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 1868, in map
preserve_cardinality=True)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 5024, in init
use_legacy_function=use_legacy_function)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4218, in init
self._function = fn_factory()
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3151, in get_concrete_function
*args, **kwargs)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3116, in _get_concrete_function_garbage_collected
graph_function, _ = self._maybe_define_function(args, kwargs)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3463, in _maybe_define_function
graph_function = self._create_graph_function(args, kwargs)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 3308, in _create_graph_function
capture_by_value=self._capture_by_value),
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/func_graph.py", line 1007, in func_graph_from_py_func
func_outputs = python_func(*func_args, **func_kwargs)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4195, in wrapped_fn
ret = wrapper_helper(*args)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4125, in wrapper_helper
ret = autograph.tf_convert(self._func, ag_ctx)(*nested_args)
File "/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py", line 695, in wrapper
raise e.ag_error_metadata.to_exception(e)
tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: in user code:

/data/PanopticFCN-main/deeplab2/data/sample_generator.py:159 __call__  *
    return self.call(**sample_dict)
/data/PanopticFCN-main/deeplab2/data/sample_generator.py:215 call  *
    resized_image, image, label, prev_image, prev_label, depth = (
/data/PanopticFCN-main/deeplab2/data/preprocessing/input_preprocessing.py:266 preprocess_image_and_label  *
    processed_image, label = preprocess_utils.randomly_scale_image_and_label(
/data/PanopticFCN-main/deeplab2/data/preprocessing/preprocess_utils.py:267 randomly_scale_image_and_label  **
    if scale == 1.0:
/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:900 __bool__
    self._disallow_bool_casting()
/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:507 _disallow_bool_casting
    self._disallow_in_graph_mode("using a `tf.Tensor` as a Python `bool`")
/data/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py:496 _disallow_in_graph_mode
    " this function with @tf.function.".format(task))

OperatorNotAllowedInGraphError: using a `tf.Tensor` as a Python `bool` is not allowed in Graph execution. Use Eager execution or decorate this function with @tf.function.

I'd appreciate it if anyone could tell me how to solve it.

Found 2 (potential) bugs when training on semantic-only datasets.

Thank you for making this amazing work publicly available.

I am using deeplab2 with a semantic-only dataset, i.e. the red channel of the image contains the class label and the other channels are 0.
I followed the instructions, but found 2 issues that appear when using semantic-only datasets:

Issue 1
original:

label = tf.io.decode_image(parsed_tensors[label_key], channels=1)

proposed change:

label = tf.io.decode_image(parsed_tensors[label_key], channels=3)
# Select red channel
label = label[:, :, 0:1]

The original code converts the image to grayscale instead of specifically selecting the red channel, which remaps the class labels.

Issue 2
original:

semantic_label = label // panoptic_label_divisor

proposed change:

semantic_label = label
if dataset_info['is_panoptic_dataset']:
  semantic_label = semantic_label // panoptic_label_divisor

Only divide by the panoptic_label_divisor if the dataset is a panoptic dataset, otherwise the labels of a semantic-only dataset are overwritten.
Do you agree with me regarding these two issues?

Setting panoptic_label_divisor=None gives error

I am trying to run training with the cityscapes dataset with only semantic segmentation annotations.
I have set panoptic_label_divisor=None in dataset.py. I'm using ${DEEPLAB2}/configs/cityscapes/panoptic_deeplab/ resnet50_os32_semseg.textproto, where the instance branch is not initiated.

When running:
python trainer/train.py
--config_file= deeplab2/configs/cityscapes/panoptic_deeplab/resnet50_os32_semseg.textproto
--mode=train
--model_dir=deeplab2/model
--num_gpus=1

I get the following error

Traceback (most recent call last):
File "trainer/train.py", line 81, in
app.run(main)
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "trainer/train.py", line 77, in main
FLAGS.num_gpus)
File "/home/mimi/code/deeplab2_proj/deeplab2/trainer/train_lib.py", line 137, in run_experiment
trainer = trainer_lib.Trainer(config, deeplab_model, losses, global_step)
File "/home/mimi/code/deeplab2_proj/deeplab2/trainer/trainer.py", line 137, in init
only_semantic_annotations=not support_panoptic)
File "/home/mimi/code/deeplab2_proj/deeplab2/trainer/runner_utils.py", line 122, in create_dataset
focus_small_instances=focus_small_instances)
File "/home/mimi/code/deeplab2_proj/deeplab2/data/sample_generator.py", line 127, in init
*self._dataset_info['panoptic_label_divisor'],
TypeError: unsupported operand type(s) for : 'int' and 'NoneType'

Any suggestions of what may be the problem.

Cannot compile protocol buffers

Hi, thanks for your framework. I followed the installation guide but got problems when compiling the protocol buffers. I followed the instructions exactly as you specified, except that I'm working on CentOS whereas you seem to work in Ubuntu.

(deeplab2) [user1@localhost deeplab2]$ pwd
/home/user1/deeplab2/deeplab2
(deeplab2) [user1@localhost deeplab2]$ cd ..
(deeplab2) [user1@localhost deeplab2]$ ls
cocoapi deeplab2 init.sh models
(deeplab2) [user1@localhost deeplab2]$ protoc deeplab2/*.proto --python_out=.
deeplab2/model.proto:178:3: Expected "required", "optional", or "repeated".
deeplab2/model.proto:178:27: Missing field number.
deeplab2/config.proto: Import "deeplab2/model.proto" was not found or had errors.
deeplab2/config.proto:31:12: "ModelOptions" is not defined.

Is it possible that I use the wrong protoc version?

(deeplab2) [user1@localhost deeplab2]$ protoc --version
libprotoc 2.5.0

Cannot execute tests - no module named 'deeplab2'

Hi, I'm trying to execute deeplab_test.py but I get the following error message:

(deeplab2) [user1@localhost deeplab2]$ python deeplab2/model/deeplab_test.py
2021-06-23 11:51:23.528410: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-06-23 11:51:23.528448: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Traceback (most recent call last):
File "deeplab2/model/deeplab_test.py", line 24, in
from deeplab2 import common
ModuleNotFoundError: No module named 'deeplab2'

I did not chance any files cloned from the repo and there is a folder deeplab2 with a file common.py in it but I guess it is not referenced correctly?

Out of Memory during evaluation

Description

I train the model on a custom dataset, and the training finished without any issues (I needed to decrease the crop size as discussed in the FAQ).
However, when running the evaluation (mode=eval), the compute node runs out of memory. I'm using the same compute node and GPU as during training.
So, my main question is what is the difference between training and evaluation? Are the images not resized to the same sizes as during training, and the original resolution is used?

Here is the experiment config file:

# proto-file: deeplab2/config.proto
# proto-message: ExperimentOptions
#
# Panoptic-DeepLab with ResNet-50-beta model variant and output stride 32.
#
############### PLEASE READ THIS BEFORE USING THIS CONFIG ###############
# Before using this config, you need to update the following fields:
# - experiment_name: Use a unique experiment name for each experiment.
# - initial_checkpoint: Update the path to the initial checkpoint.
# - train_dataset_options.file_pattern: Update the path to the
#   training set. e.g., your_dataset/train*.tfrecord
# - eval_dataset_options.file_pattern: Update the path to the
#   validation set, e.g., your_dataset/eval*.tfrecord
# - (optional) set merge_semantic_and_instance_with_tf_op: true, if you
#   could successfully compile the provided efficient merging operation
#   under the folder `tensorflow_ops`.
#########################################################################
#
# The `resnet50_beta` model variant replaces the first 7x7 convolutions in the
# original `resnet50` with three 3x3 convolutions, which is useful for dense
# prediction tasks.
#
# References:
# For resnet-50-beta, see
# https://github.com/tensorflow/models/blob/master/research/deeplab/core/resnet_v1_beta.py
# For Panoptic-DeepLab, see
# - Bowen Cheng, et al. "Panoptic-DeepLab: A Simple, Strong, and Fast Baseline
#   for Bottom-Up Panoptic Segmentation." In CVPR, 2020.

# Use a unique experiment_name for each experiment.
experiment_name: "air_flange_panoptic_segmentation_resnet50"
model_options {
  # Update the path to the initial checkpoint (e.g., ImageNet
  # pretrained checkpoint).
  initial_checkpoint: "air_flange_panoptic_segmentation_resnet50/ckpt-2000"
  backbone {
    name: "resnet50_beta"
    output_stride: 32
  }
  decoder {
    feature_key: "res5"
    decoder_channels: 256
    aspp_channels: 256
    atrous_rates: 3
    atrous_rates: 6
    atrous_rates: 9
  }
  panoptic_deeplab {
    low_level {
      feature_key: "res3"
      channels_project: 64
    }
    low_level {
      feature_key: "res2"
      channels_project: 32
    }
    instance {
      low_level_override {
        feature_key: "res3"
        channels_project: 32
      }
      low_level_override {
        feature_key: "res2"
        channels_project: 16
      }
      instance_decoder_override {
        feature_key: "res5"
        decoder_channels: 128
        atrous_rates: 3
        atrous_rates: 6
        atrous_rates: 9
      }
      center_head {
        output_channels: 1
        head_channels: 32
      }
      regression_head {
        output_channels: 2
        head_channels: 32
      }
    }
    semantic_head {
      output_channels: 2
      head_channels: 256
    }
  }
}
trainer_options {
  save_checkpoints_steps: 1000
  save_summaries_steps: 100
  steps_per_loop: 100
  loss_options {
    semantic_loss {
      name: "softmax_cross_entropy"
      weight: 1.0
      top_k_percent: 0.2
    }
    center_loss {
      name: "mse"
      weight: 200
    }
    regression_loss {
      name: "l1"
      weight: 0.01
    }
  }
  solver_options {
    base_learning_rate: 0.00025
    # training_number_of_steps: 60000
    training_number_of_steps: 2000
  }
}
train_dataset_options {
  dataset: "air_flange_panoptic"
  # Update the path to training set.
  file_pattern: "train*.tfrecord"
  # Adjust the batch_size accordingly to better fit your GPU/TPU memory.
  # Also see Q1 in g3doc/faq.md.
  batch_size: 8
  crop_size: 641
  crop_size: 641
  min_resize_value: 641
  max_resize_value: 641
  augmentations {
    min_scale_factor: 0.5
    max_scale_factor: 2.0
    scale_factor_step_size: 0.1
    autoaugment_policy_name: "simple_classification_policy_magnitude_scale_0.2"
  }
  increase_small_instance_weights: true
  small_instance_weight: 3.0
}
eval_dataset_options {
  dataset: "air_flange_panoptic"
  # Update the path to validation set.
  file_pattern: "val*.tfrecord"
  batch_size: 1
  crop_size: 641
  crop_size: 641
  min_resize_value: 641
  max_resize_value: 641
  # Add options to make the evaluation loss comparable to the training loss.
  increase_small_instance_weights: true
  small_instance_weight: 3.0
}
evaluator_options {
  continuous_eval_timeout: 43200
  stuff_area_limit: 2048
  center_score_threshold: 0.1
  nms_kernel: 13
  save_predictions: true
  save_raw_predictions: false
  # Use pure tf functions (i.e., no CUDA kernel) to merge semantic and
  # instance maps. For faster speed, compile TensorFlow with provided kernel
  # implementation under the folder `tensorflow_ops`, and set
  # merge_semantic_and_instance_with_tf_op to true.
  merge_semantic_and_instance_with_tf_op: false
  eval_interval: 1000
}

Exporting MaX-DeepLab - iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function.

Hey guys,

Thanks for your awesome work in this repo.

Getting the following error when attempting to export a Max-DeepLab model:

tensorflow.python.framework.errors_impl.OperatorNotAllowedInGraphError: 
iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

Which is generated from this line whilst attempting to iterate over a Tensor("strided_slice:0", shape=(), dtype=int32)
https://github.com/google-research/deeplab2/blob/main/model/post_processor/max_deeplab.py#L389

I'm using Tensorflow version 2.6.0, this also happens with TF 2.5.0.

I've also attempted to export max_deeplab_s_os16_res641_400k using the provided config and checkpoint and got the same error.

I'll keep investigating if its an issue with my environment.

Execution of trainer/train.py stops in the middle.

I executed the following command, but it stopped in the middle of execution.

python trainer/train.py
--config_file=configs/coco/max_deeplab/max_deeplab_s_os16_res641_400k.textproto
--mode=train
--model_dir=checkpoints/max_deeplab_s_os16_res641_400k_coco_train
--num_gpus=1

execution results

2021-09-15 05:26:54.602208: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic >library libcudart.so.11.0
I0915 05:26:56.178199 139653727340352 train.py:67] Reading the config file.
I0915 05:26:56.181392 139653727340352 train.py:71] Starting the experiment.
2021-09-15 05:26:56.182713: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcuda.so.1
2021-09-15 05:26:56.223570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.224579: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-09-15 05:26:56.224627: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcudart.so.11.0
2021-09-15 05:26:56.229769: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcublas.so.11
2021-09-15 05:26:56.229851: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcublasLt.so.11
2021-09-15 05:26:56.232521: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcufft.so.10
2021-09-15 05:26:56.232933: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcurand.so.10
2021-09-15 05:26:56.233980: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcutensor.so.1
2021-09-15 05:26:56.234695: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcusolver.so.11
2021-09-15 05:26:56.235845: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcusparse.so.11
2021-09-15 05:26:56.236159: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcudnn.so.8
2021-09-15 05:26:56.236387: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.237526: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.238509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-09-15 05:26:56.239783: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.240771: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1734] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-09-15 05:26:56.241016: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.242099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.243114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1872] Adding visible gpu devices: 0
2021-09-15 05:26:56.243177: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcudart.so.11.0
2021-09-15 05:26:56.927719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-09-15 05:26:56.927789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-09-15 05:26:56.927823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-09-15 05:26:56.928101: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.929215: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.930264: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1050] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-09-15 05:26:56.931224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14652 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
2021-09-15 05:26:56.931653: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 8. Tune using inter_op_parallelism_threads for best performance.
I0915 05:26:56.932756 139653727340352 train_lib.py:107] Using strategy <class 'tensorflow.python.distribute.one_device_strategy.OneDeviceStrategy'> with 1 replicas
I0915 05:26:57.241827 139653727340352 deeplab.py:57] Synchronized Batchnorm is used.
I0915 05:26:57.242810 139653727340352 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': False, 'backbone_type': 'resnet_beta', 'use_axial_beyond_stride': 16, 'backbone_use_transformer_beyond_stride': 32, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 0.800000011920929, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'linear', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'tensorflow.python.keras.layers.normalization_v2.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0915 05:26:57.593706 139653727340352 deeplab.py:96] Setting pooling size to (41, 41)
I0915 05:26:57.593983 139653727340352 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0915 05:27:03.713764 139653727340352 controller.py:391] restoring or initializing model...
restoring or initializing model...
I0915 05:27:03.750207 139653727340352 controller.py:395] restored model from checkpoints/max_deeplab_s_os16_res641_400k_coco_train/$Panoptic_Test/ckpt-0.
restored model from checkpoints/max_deeplab_s_os16_res641_400k_coco_train/$Panoptic_Test/ckpt-0.
I0915 05:27:03.750401 139653727340352 controller.py:217] restored from checkpoint: checkpoints/max_deeplab_s_os16_res641_400k_coco_train/$Panoptic_Test/ckpt-0
restored from checkpoint: checkpoints/max_deeplab_s_os16_res641_400k_coco_train/$Panoptic_Test/ckpt-0
I0915 05:27:04.950217 139653727340352 api.py:446] Eval with scales ListWrapper([1.0])
I0915 05:27:06.379980 139653727340352 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0915 05:27:06.411498 139653727340352 api.py:446] Eval scale 1.0; setting pooling size to [41, 41]
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The validate_indices argument has no effect. Indices are always validated on CPU and never validated on GPU.
W0915 05:27:09.029129 139653727340352 deprecation.py:528] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The validate_indices argument has no effect. Indices are always validated on CPU and never validated on GPU.
I0915 05:27:17.629784 139653727340352 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0915 05:27:18.935303 139653727340352 controller.py:236] train | step: 0 | training until step 400000...
train | step: 0 | training until step 400000...
2021-09-15 05:27:19.169520: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-09-15 05:27:19.188923: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2300015000 Hz
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:206: calling foldl_v2 (from tensorflow.python.ops.functional_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.foldl(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldl(fn, elems))
W0915 05:27:32.651105 139653727340352 deprecation.py:596] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py:206: calling foldl_v2 (from tensorflow.python.ops.functional_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.foldl(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldl(fn, elems))
WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py:463: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))
W0915 05:27:33.151738 139653727340352 deprecation.py:596] From /usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py:463: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))
2021-09-15 05:28:59.687466: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcudnn.so.8
2021-09-15 05:29:00.411789: I tensorflow/stream_executor/cuda/cuda_dnn.cc:380] Loaded cuDNN version 8202
2021-09-15 05:29:01.223826: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcublas.so.11
2021-09-15 05:29:01.834288: I tensorflow/stream_executor/platform/default/dso_loader.cc:54] Successfully opened dynamic library libcublasLt.so.11`

The following commands in the installation worked well.

python deeplab2/tensorflow_ops/python/kernel_tests/merge_semantic_and_instance_maps_op_test.py
python deeplab2/model/deeplab_test.py
python deeplab2/trainer/evaluator_test.py

I also referred to problem #36 , but it did not work.
Please let me know the solution.

environment
- NVIDIA Driver Version: 470.57.02 CUDA Version: 11.4
- nvidia docker installed on AWS (p3.xlarge)
  The docker file I wrote is as follows.

FROM nvcr.io/nvidia/tensorflow:21.08-tf2-py3

RUN mkdir /projects
WORKDIR /projects

RUN apt-get update
&& apt-get install -y git
cmake
protobuf-compiler
&& apt-get update
&& apt clean

RUN pip install --upgrade pip

RUN pip install numpy
&& pip install Pillow
&& pip install matplotlib
&& pip install cython
&& pip install jupyter

RUN git clone https://github.com/google-research/deeplab2.git

RUN git clone https://github.com/tensorflow/models.git

RUN git clone https://github.com/cocodataset/cocoapi.git
&& cd cocoapi/PythonAPI
&& python setup.py install
&& make

ENV PYTHONPATH $PYTHONPATH:pwd:pwd/models:pwd/cocoapi/PythonAPI
RUN echo $PYTHONPATH

RUN protoc deeplab2/*.proto --python_out=.

MOT-STEP transfer error

2021-07-13 18:53:38.368250: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
I0713 18:53:39.327753 140106723893440 build_step_data.py:292] Starts to processing STEP dataset split train.
Traceback (most recent call last):
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/data/build_step_data.py", line 298, in
app.run(main)
File "/home/ubuntu/anaconda3/envs/deeplab2/lib/python3.6/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/ubuntu/anaconda3/envs/deeplab2/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/data/build_step_data.py", line 294, in main
FLAGS.use_two_frames)
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/data/build_step_data.py", line 283, in _convert_dataset
use_two_frames, is_testing)
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/data/build_step_data.py", line 247, in _create_panoptic_tfexample
prev_label_data=prev_label_data)
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/data/data_utils.py", line 224, in create_video_tfexample
label_format)
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/data/data_utils.py", line 124, in create_features
raise ValueError('Unsupported image format: %s' % image_format)
ValueError: Unsupported image format: jpg

Process finished with exit code 1

following the guide:https://github.com/google-research/deeplab2/blob/main/g3doc/setup/kitti_step.md, I transfer the KITTI dataset to KITTI_STEP, while I transfer MOT-Challenge following the guide:https://github.com/google-research/deeplab2/blob/main/g3doc/setup/motchallenge_step.md, I get the above error.

Cannot place MergeSemanticAndInstanceMaps op on GPU

OS: Ubuntu 21.04
Docker image: tensorflow/tensorflow:2.6.0-gpu with CUDA 11.2
GCC version 7.5.0
NVCC version V11.2.152
GPU 1: NVIDIA GeForce GTX 1080 Ti 11GB
GPU 2: NVIDIA GeForce RTX 2080 Ti 11GB

I was benchmarking my exported PanopticDeeplab SavedModel on the GPUs (tried both). After stripping all unnecessary outputs from the model (e.g. everything but "panoptic_pred"), I got an inference speed of about 7 FPS. But when I tried the export with the --merge_with_tf_op parameter which uses the MergeSemanticAndInstanceMaps custom op, the speed was cut in half. Then I profiled my runs (100 inferences in a simple loop) and noticed that Tensorflow was always placing this custom op on the CPU while everything else was running on the GPU.

If I try to force GPU placement with tf.config.set_soft_device_placement(False) and with tf.device('/gpu:0') I get an error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Could not satisfy device specification '/job:localhost/replica:0/task:0/device:GPU:0'. enable_soft_placement=0. Supported device types [CPU]. All available devices [/job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:CPU:0]. [Op:MergeSemanticAndInstanceMaps]

I didn't get any errors when compiling the custom op for GPU, and I did not compile the CPU version at all. Here's what the kernels folder looks like:

# ls -lah deeplab2/tensorflow_ops/kernels/
total 848K
drwxrwxr-x 1 1001 1001  520 Sep 16 14:12 .
drwxrwxr-x 1 1001 1001   26 Aug  6 17:01 ..
-rw-rw-r-- 1 1001 1001 3.9K Aug  6 17:01 merge_semantic_and_instance_maps_op.cc
-rwxr-xr-x 1 root root 537K Sep 16 14:12 merge_semantic_and_instance_maps_op.so
-rw-rw-r-- 1 1001 1001  12K Aug  6 17:01 merge_semantic_and_instance_maps_op_kernel.cc
-rw-rw-r-- 1 1001 1001  13K Aug  6 17:01 merge_semantic_and_instance_maps_op_kernel.cu.cc
-rw-r--r-- 1 root root 269K Sep 16 14:10 merge_semantic_and_instance_maps_op_kernel.cu.o
-rw-rw-r-- 1 1001 1001 2.0K Aug  6 17:01 merge_semantic_and_instance_maps_op_kernel.h

Also, the output from the model is correct so the custom op is executed successfully, but very slowly because I can't run it on the GPU. Is there something I'm doing wrong or can I provide some further information (e.g. a Dockerfile and a script that can reproduce the issue) to help track this down?

Question about bounding boxes in labels for MaxDeepLab

Thank you for this amazing work,

I have a doubt regarding MaxDeepLab: as far as I have understood there is no need in the labels for the bounding boxes ground truth, is it correct?
For example in the Coco dataset, the format for the panoptic labels expects, in every segment of each image, a field bbox (annotations[n].segments_info[m].bbox). So, if I want to train MaxDeepLab on the Coco dataset, it makes no difference if I set every bbox field to [0,0,0,0], right?

Some question about the eval codes on on Cityscapes test datasets?

Hello, thank you for you codes,firstly.
Now I want to use this code(axial_deeplab) to generate panoptic segmentation results on unlabeled data such as test set, and then generate test tfRecord, but when I use the following command to test,

python trainer/train.py --config_file="/home/sr6/yi0813.zhou/pingjun/deeplab2/configs/cityscapes/axial_deeplab/axial_swidernet_1_1_4.5_os16.textproto" --mode='eval' --model_dir="/home/sr6/yi0813.zhou/pingjun/deeplab2/init_checkpoints/axial_swidernet_1_1_4_5/" --num_gpus=1

I get the following error:
- ***** TopKCrossEntropyLoss, gt_key: semantic_gt, pred_key: semantic_logits, weight_key: semantic_loss_weight, self._dynamic_weight: False ****** TopKCrossEntropyLoss, gt: (1, 1025, 2049, 19), pred: (1, 1025, 2049, 19), weights: (1, 1025, 2049), _top_k_percent_pixels: 0.20000000298023224 2021-09-16 17:29:34.622250: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found. 2021-09-16 17:29:34.624420: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found. Traceback (most recent call last): File "trainer/train.py", line 83, in <module> 2021-09-16 17:29:34.625410: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found. 2021-09-16 17:29:34.626488: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found. app.run(main) File "/home/sr6/yi0813.zhou/pingjun/python3.7.3/python3.7.3/lib/python3.7/site-packages/absl/app.py", line 312, in run 2021-09-16 17:29:34.627536: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found. 2021-09-16 17:29:34.628701: W tensorflow/core/framework/op_kernel.cc:1692] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
So I think the prediction process for unlabel data shouldn't use this command, could you guide me or provide the correct test code?
Someone have same question or have solved ? thank you teach.

Thank you again!

Mismatch between train and eval loss results

Hello,

I've trained a deeplabv3+ model on a custom dataset. It seems to be training properly, and the loss converges, in this case to about 0.03.
The issue is that when I run the evaluation, on the exact same data (not the evaluation set), the loss value returned is different from the previous one.
More precisely, it goes from 'losses/train_total_loss': 0.032853287 to 'losses/eval_total_loss': 0.3922176
I figure that the train value is computed on just one loop (in this case 100 samples) and not the entire dataset, and there could be extreme cases that could skew the mean loss, but I would have seen huge loss spikes during training if that was the case.
So what I don't understand is how could the loss be so different between training and evaluation, on the same data?

As an additional note, I noticed that the more evaluation steps I make, the higher the returned loss is, but the IoU does not follow the inverse trend.

restored model from /app/data/Modulation_recognition/models/Training_006/ckpt-70000.
restored from checkpoint: /app/data/Modulation_recognition/models/Training_006/ckpt-70000
eval | step: 70000 | running 100 steps of evaluation...
eval | step: 70000 | eval time: 89.5 sec | output:
{'evaluation/iou/IoU': 0.35476902,
'losses/eval_semantic_loss': 0.035643984,
'losses/eval_total_loss': 0.035643984}
restoring or initializing model...
restored model from /app/data/Modulation_recognition/models/Training_006/ckpt-70000.
restored from checkpoint: /app/data/Modulation_recognition/models/Training_006/ckpt-70000
eval | step: 70000 | running 500 steps of evaluation...
eval | step: 70000 | eval time: 194.9 sec | output:
{'evaluation/iou/IoU': 0.25706673,
'losses/eval_semantic_loss': 0.049148366,
'losses/eval_total_loss': 0.049148366}
restoring or initializing model...
restored model from /app/data/Modulation_recognition/models/Training_006/ckpt-70000.
restored from checkpoint: /app/data/Modulation_recognition/models/Training_006/ckpt-70000
eval | step: 70000 | running 1000 steps of evaluation...
eval | step: 70000 | eval time: 249.3 sec | output:
{'evaluation/iou/IoU': 0.5440319,
'losses/eval_semantic_loss': 0.13150834,
'losses/eval_total_loss': 0.13150834}
restoring or initializing model...
restored model from /app/data/Modulation_recognition/models/Training_006/ckpt-70000.
restored from checkpoint: /app/data/Modulation_recognition/models/Training_006/ckpt-70000
eval | step: 70000 | running 10000 steps of evaluation...
eval | step: 70000 | eval time: 1709.6 sec | output:
{'evaluation/iou/IoU': 0.73578227,
'losses/eval_semantic_loss': 0.23771212,
'losses/eval_total_loss': 0.23771212}

If it helps, here is the config file I'm using:

# Use a unique experiment_name for each experiment.
experiment_name: "Training_006"
model_options {
  # Update the path to the initial checkpoint (e.g., ImageNet
  # pretrained checkpoint).
  initial_checkpoint: ""
  backbone {
    name: "resnet50"
    output_stride: 32
  }
  decoder {
    feature_key: "res5"
    decoder_channels: 256
    aspp_channels: 256
    atrous_rates: 3
    atrous_rates: 6
    atrous_rates: 9
  }
  deeplab_v3_plus {
    low_level {
      feature_key: "res2"
      channels_project: 32
    }
    num_classes: 15
  }
}
trainer_options {
  save_checkpoints_steps: 1000
  save_summaries_steps: 100
  steps_per_loop: 100
  loss_options {
    semantic_loss {
      name: "softmax_cross_entropy"
      weight: 1.0
    }
  }
  solver_options {
    base_learning_rate: 0.0006
    training_number_of_steps: 30000
  }
}
train_dataset_options {
  dataset: "spawc_mod_reco"
  # Update the path to training set.
  file_pattern: "/app/data/Modulation_recognition/spawc21_wideband_dataset/tfrecords/train/*.tfrecord"
  # Adjust the batch_size accordingly to better fit your GPU/TPU memory.
  # Also see Q1 in g3doc/faq.md.
  batch_size: 4
  crop_size: 2049
  crop_size: 513
  # Skip resizing.
  min_resize_value: 0
  max_resize_value: 0
  augmentations {
    min_scale_factor: 1
    max_scale_factor: 1
    scale_factor_step_size: 1
  }
}
eval_dataset_options {
  dataset: "spawc_mod_reco"
  # Update the path to validation set.
  file_pattern: "/app/data/Modulation_recognition/spawc21_wideband_dataset/tfrecords/train/*.tfrecord"
  batch_size: 1
  crop_size: 2049
  crop_size: 600
  # Skip resizing.
  min_resize_value: 0
  max_resize_value: 0
} 
evaluator_options {
  eval_steps: 10000
  continuous_eval_timeout: -1
  save_predictions: true
  num_vis_samples: 200
  save_raw_predictions: false
}

Could it be something to do with the way the metrics are aggregated before printing?

Thanks a lot

Issues about KITTI-STEP dataset

Hi @markweberdev ,

Thanks for your work on deeplab2 and the STEP paper.

We are trying to follow the KITTI-STEP dataset in our own project. Howver, we found that for the class ids [12,14,15,16,17,18], the instance id annotations are 0. We acknowldge that they do not have the sequence-consistent annotations, but our question is that:

does the curent annotation means that each object of the above class labels is unique in the specific image?
if 1 is false, do I need to consider [12,14,15,16,17,18] as thing?

Thansk again.

Possible MaX-DeepLab-L-Backbone ImageNet checkpoint issue

Hey guys,

Possible issue with the supplied Max-DeepLab-L backbone Imagenet checkpoint.

On a working Max-DL-S config, I swapped out the following to attempt a Max-DL-L train:

name: "max_deeplab_l"
initial_checkpoint: "max_deeplab_l_backbone_imagenet1k_strong_training_strategy/ckpt-350"

And got the following error: (I think this checkpoint is actually from a specific dataset?)

ValueError: The initial value's shape ((1, 1, 2048, 512)) is not compatible with the explicitly supplied `shape` argument ((1, 1, 2048, 1024)).

Full stack trace below:

I0923 13:32:21.470141 140322436347712 api.py:446] Eval with scales ListWrapper([1.0])
I0923 13:32:22.874432 140322436347712 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0923 13:32:22.898995 140322436347712 api.py:446] Eval scale 1.0; setting pooling size to [21, 21]
Traceback (most recent call last):
  File "trainer/train.py", line 80, in <module>
    app.run(main)
  File "ai-env-deeplab2/lib/python3.6/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "ai-env-deeplab2/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "trainer/train.py", line 76, in main
    FLAGS.num_gpus)
  File "deeplab2/trainer/train_lib.py", line 210, in run_experiment
    build_deeplab_model(deeplab_model, crop_size)
  File "deeplab2/trainer/train_lib.py", line 76, in build_deeplab_model
    tf.keras.Input(input_shape, batch_size=batch_size), training=False)
  File "ai-env-deeplab2/lib/python3.6/site-packages/keras/engine/base_layer.py", line 977, in __call__
    input_list)
  File "ai-env-deeplab2/lib/python3.6/site-packages/keras/engine/base_layer.py", line 1115, in _functional_construction_call
    inputs, input_masks, args, kwargs)
  File "ai-env-deeplab2/lib/python3.6/site-packages/keras/engine/base_layer.py", line 848, in _keras_tensor_symbolic_call
    return self._infer_output_signature(inputs, args, kwargs, input_masks)
  File "ai-env-deeplab2/lib/python3.6/site-packages/keras/engine/base_layer.py", line 888, in _infer_output_signature
    outputs = call_fn(inputs, *args, **kwargs)
  File "ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 695, in wrapper
    raise e.ag_error_metadata.to_exception(e)
ValueError: in user code:

    deeplab2/model/deeplab.py:159 call  *
        pred_dict = self._decoder(
    deeplab2/model/encoder/axial_resnet.py:763 call  *
        memory_feature, high_resolution_outputs, backbone_output, endpoints = (
    deeplab2/model/encoder/axial_resnet.py:665 call_stacked_decoder  *
        activated_output = getattr(self, current_name)(
    deeplab2/model/layers/resized_fuse.py:154 call  *
        feature = getattr(self, current_name)(feature, training=training)
    deeplab2/model/layers/convolutions.py:287 call  *
        x = self._conv(x)
    ai-env-deeplab2/lib/python3.6/site-packages/keras/engine/base_layer.py:1030 __call__  **
        self._maybe_build(inputs)
    ai-env-deeplab2/lib/python3.6/site-packages/keras/engine/base_layer.py:2659 _maybe_build
        self.build(input_shapes)  # pylint:disable=not-callable
    ai-env-deeplab2/lib/python3.6/site-packages/keras/layers/convolutional.py:204 build
        dtype=self.dtype)
    ai-env-deeplab2/lib/python3.6/site-packages/keras/engine/base_layer.py:663 add_weight
        caching_device=caching_device)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/training/tracking/base.py:818 _add_variable_with_custom_getter
        **kwargs_for_getter)
    ai-env-deeplab2/lib/python3.6/site-packages/keras/engine/base_layer_utils.py:129 make_variable
        shape=variable_shape if variable_shape else None)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py:266 __call__
        return cls._variable_v1_call(*args, **kwargs)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py:227 _variable_v1_call
        shape=shape)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py:67 getter
        return captured_getter(captured_previous, **kwargs)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py:2127 creator_with_resource_vars
        created = self._create_variable(next_creator, **kwargs)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py:530 _create_variable
        distribute_utils.VARIABLE_POLICY_MAPPING, **kwargs)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_utils.py:308 create_mirrored_variable
        value_list = real_mirrored_creator(**kwargs)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py:522 _real_mirrored_creator
        v = next_creator(**kwargs)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py:205 <lambda>
        previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py:2626 default_variable_creator
        shape=sha/var/sensen/ai/pe)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/variables.py:270 __call__
        return super(VariableMetaclass, cls).__call__(*args, **kwargs)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1613 __init__
        distribute_strategy=distribute_strategy)
    ai-env-deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py:1753 _init_from_args
        (initial_value.shape, shape))

    ValueError: The initial value's shape ((1, 1, 2048, 512)) is not compatible with the explicitly supplied `shape` argument ((1, 1, 2048, 1024)).

Got stuck during training

Hi,
Thanks for sharing this great work. I successfully run the evaluating code for max-deeplab but have issues during training. I use two P40 GPU to sanity check the training code with batchsize=2. I didn't change other configs. After I ran the code, I got stuck at "shuffle buffer filled".

The GPU utility is so low so I don't know whether or not it is running and tensorborad keeps blank.
I am not familiar with TF2 (especially for this pastiche...), could anyone help to figure out what's the problem? Thank you.

BTW, is there any way to make a progress bar like tqdm in pytorch?

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

I changed _SHUFFLE_BUFFER_SIZE=1000 and set it to 50. The "shuffle buffer filled" is ok now.
But still, GPU utility very low & blank tensorboard

I set the summary writer to work every step (maybe summary writer? I used TF1 many years ago)

save_checkpoints_steps: 1000
save_summaries_steps: 1 #100
steps_per_loop: 1 #100

And...I am very confused that my GPU util is related to the GPU number e.g. 8% for gpu_num=2 and 16% for gpu_num=1...While the GPU memory are fully used no matter what buffer size is

I also tried input size 241x241, it doesn't work. The memory is still full. I think this should be an easy problem, but I am not familiar with TF....

(py37tf) mcg@msratiranda:~/deeplab2$ python3 trainer/train.py --config_file=configs/coco/max_deeplab/max_deeplab_s_os16_res1025_200k.textproto --mode=train --model_dir=output --num_gpus=2
2021-06-25 06:55:27.787843: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
I0625 06:55:29.205785 140604240011456 train.py:65] Reading the config file.
I0625 06:55:29.208885 140604240011456 train.py:69] Starting the experiment.
2021-06-25 06:55:29.210546: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2021-06-25 06:55:31.068027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2021-06-25 06:55:31.069245: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
pciBusID: 0002:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2021-06-25 06:55:31.069291: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-25 06:55:31.072768: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-25 06:55:31.072829: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-06-25 06:55:31.074202: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2021-06-25 06:55:31.074513: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2021-06-25 06:55:31.077970: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2021-06-25 06:55:31.078721: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2021-06-25 06:55:31.078880: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-25 06:55:31.083367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2021-06-25 06:55:31.083814: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-06-25 06:55:31.468479: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0001:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2021-06-25 06:55:31.469669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 1 with properties:
pciBusID: 0002:00:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 22.38GiB deviceMemoryBandwidth: 323.21GiB/s
2021-06-25 06:55:31.474170: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0, 1
2021-06-25 06:55:31.474253: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2021-06-25 06:55:32.357293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-06-25 06:55:32.357388: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 1
2021-06-25 06:55:32.357413: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N N
2021-06-25 06:55:32.357428: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 1:   N N
2021-06-25 06:55:32.363370: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22149 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0001:00:00.0, compute capability: 6.1)
2021-06-25 06:55:32.365513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 22149 MB memory) -> physical GPU (device: 1, name: Tesla P40, pci bus id: 0002:00:00.0, compute capability: 6.1)
WARNING:tensorflow:Collective ops is not configured at program startup. Some performance features may not be enabled.
W0625 06:55:32.369957 140604240011456 mirrored_strategy.py:379] Collective ops is not configured at program startup. Some performance features may not be enabled.
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
I0625 06:55:32.867475 140604240011456 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
I0625 06:55:32.868017 140604240011456 train_lib.py:105] Using strategy <class 'tensorflow.python.distribute.mirrored_strategy.MirroredStrategy'> with 2 replicas
I0625 06:55:32.875228 140604240011456 deeplab.py:57] Synchronized Batchnorm is used.
I0625 06:55:32.876093 140604240011456 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 4, 6, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': False, 'backbone_type': 'resnet_beta', 'use_axial_beyond_stride': 16, 'backbone_use_transformer_beyond_stride': 32, 'extra_decoder_use_transformer_beyond_stride': 32, 'backbone_decoder_num_stacks': 0, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 0, 'extra_decoder_blocks_per_stage': 1, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 256, 'base_transformer_expansion': 1.0, 'global_feed_forward_network_channels': 256, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 2, 'drop_path_keep_prob': 0.800000011920929, 'drop_path_beyond_stride': 16, 'drop_path_schedule': 'linear', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 1, 'value_expansion': 2, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'tensorflow.python.keras.layers.normalization_v2.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I0625 06:55:33.157844 140604240011456 deeplab.py:96] Setting pooling size to (65, 65)
I0625 06:55:33.158083 140604240011456 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
decode finish
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.530962 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.532213 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.534660 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.535581 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.538797 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.539653 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.541773 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.542600 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.545866 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I0625 06:55:42.546801 140604240011456 cross_device_ops.py:621] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
######### 100
I0625 06:55:42.571589 140604240011456 controller.py:391] restoring or initializing model...
restoring or initializing model...
I0625 06:55:42.608021 140604240011456 controller.py:395] restored model from output/Eval/ckpt-0.
restored model from output/Eval/ckpt-0.
I0625 06:55:42.608137 140604240011456 controller.py:217] restored from checkpoint: output/Eval/ckpt-0
restored from checkpoint: output/Eval/ckpt-0
I0625 06:55:43.796573 140604240011456 api.py:446] Eval with scales ListWrapper([1.0])
I0625 06:55:45.063524 140604240011456 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0625 06:55:45.090902 140604240011456 api.py:446] Eval scale 1.0; setting pooling size to [65, 65]
WARNING:tensorflow:From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
W0625 06:55:48.688872 140604240011456 deprecation.py:534] From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py:5049: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
I0625 06:56:01.794970 140604240011456 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I0625 06:56:03.112913 140604240011456 controller.py:236] train | step:      0 | training until step 200000...
train | step:      0 | training until step 200000...
2021-06-25 06:56:04.121265: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-06-25 06:56:04.122489: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2593990000 Hz
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:05.927121 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:05.949938 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:05.972526 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:06.089528 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:06.111567 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:06.133234 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:06.252249 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:06.278362 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:06.300985 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
I0625 06:56:06.431849 140604240011456 cross_device_ops.py:903] batch_all_reduce: 1 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py:206: calling foldl_v2 (from tensorflow.python.ops.functional_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.foldl(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldl(fn, elems))
W0625 06:56:43.346125 140596987537152 deprecation.py:601] From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/util/dispatch.py:206: calling foldl_v2 (from tensorflow.python.ops.functional_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.foldl(fn, elems, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.foldl(fn, elems))
WARNING:tensorflow:From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py:463: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))
W0625 06:56:43.658312 140596987537152 deprecation.py:601] From /home/mcg/miniconda3/envs/py37tf/lib/python3.7/site-packages/tensorflow/python/autograph/impl/api.py:463: calling while_loop_v2 (from tensorflow.python.ops.control_flow_ops) with back_prop=False is deprecated and will be removed in a future version.
Instructions for updating:
back_prop=False is deprecated. Consider using tf.stop_gradient instead.
Instead of:
results = tf.while_loop(c, b, vars, back_prop=False)
Use:
results = tf.nest.map_structure(tf.stop_gradient, tf.while_loop(c, b, vars))
2021-06-25 07:01:32.667195: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2021-06-25 07:01:33.971927: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-25 07:01:34.548444: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2021-06-25 07:01:34.911529: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2021-06-25 07:01:36.659327: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2021-06-25 07:01:46.261119: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 561 of 1000
2021-06-25 07:02:00.735113: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 593 of 1000
2021-06-25 07:02:02.728721: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 619 of 1000
2021-06-25 07:02:15.017214: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 625 of 1000
2021-06-25 07:02:22.714957: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 668 of 1000
2021-06-25 07:02:34.510389: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 718 of 1000
2021-06-25 07:02:42.780139: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 776 of 1000
2021-06-25 07:02:52.867365: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 813 of 1000
2021-06-25 07:03:04.207901: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 874 of 1000
2021-06-25 07:03:12.664182: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 923 of 1000
2021-06-25 07:03:23.321355: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 971 of 1000
2021-06-25 07:03:28.421338: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:230] Shuffle buffer filled.

How to use the PNG format in my own dataset?

On the "Convert your own dataset" page it says that the values in the panoptic map should use this format:
panoptic_label = semantic_label * label_divisor + instance_id

I assume this applies to both the "raw" and "png" format? If so, do I need to create single channel 32-bit PNG images, and how do I do that?

Of course it's straightforward to use the "raw" format but that results in massive TFRecord dataset size.

Cocoapi doesnt support windows

E:\BraTS Challenge\vessel_env\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(14) : Warning Msg: Using deprecated NumPy API, disable it with #define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
pycocotools\_mask.c(3280): warning C4244: '=': conversion from 'Py_ssize_t' to 'siz', possible loss of data
pycocotools\_mask.c(3690): warning C4244: 'function': conversion from 'npy_intp' to 'siz', possible loss of data
pycocotools\_mask.c(3690): warning C4244: 'function': conversion from 'npy_intp' to 'siz', possible loss of data
pycocotools\_mask.c(3690): warning C4244: 'function': conversion from 'npy_intp' to 'siz', possible loss of data
pycocotools\_mask.c(6697): warning C4244: '=': conversion from 'npy_intp' to 'siz', possible loss of data
pycocotools\_mask.c(7455): warning C4244: '=': conversion from 'Py_ssize_t' to 'siz', possible loss of data
creating E:\BraTS Challenge\RedTinSaintBernard-for-BraTS2021-challenge\cocoapi\PythonAPI\build\lib.win-amd64-3.9
creating E:\BraTS Challenge\RedTinSaintBernard-for-BraTS2021-challenge\cocoapi\PythonAPI\build\lib.win-amd64-3.9\pycocotools
d:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\bin\HostX86\x64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO /LIBPATH:E:\BraTS Challenge\vessel_env\libs /LIBPATH:C:\Python39\libs /LIBPATH:C:\Python39 /LIBPATH:E:\BraTS Challenge\vessel_env\PCbuild\amd64 /LIBPATH:d:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.28.29910\lib\x64 /LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.19041.0\ucrt\x64 /LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.19041.0\um\x64 /EXPORT:PyInit__mask build\temp.win-amd64-3.9\Release\../common/maskApi.obj build\temp.win-amd64-3.9\Release\pycocotools\_mask.obj /OUT:build\lib.win-amd64-3.9\pycocotools\_mask.cp39-win_amd64.pyd /IMPLIB:build\temp.win-amd64-3.9\Release\../common\_mask.cp39-win_amd64.lib
   Creating library build\temp.win-amd64-3.9\Release\../common\_mask.cp39-win_amd64.lib and object build\temp.win-amd64-3.9\Release\../common\_mask.cp39-win_amd64.exp
Generating code
Finished generating code
copying build\lib.win-amd64-3.9\pycocotools\_mask.cp39-win_amd64.pyd -> pycocotools
rm -rf build
process_begin: CreateProcess(NULL, rm -rf build, ...) failed.
make (e=2): The system cannot find the file specified.
make: *** [Makefile:4: all] Error 2

cocodataset/cocoapi#9

Questions about batch_norm_on_an_extra_axis

Hi,

Thanks for your great work~
I just have some questions about the "batch_norm_on_an_extra_axis" function in the "max_deeplab.py" file. I am wondering about how much this operation affects the results. And could you please have a check that whether the following Pytorch-based implementation is able to achieve the same effect as the current deeplab2 version? Thanks a lot~

self.bn = nn.BatchNorm2d(1)

batch_size, slot_num = pixel_space_mask_logits.size(0), pixel_space_mask_logits.size(1)
pixel_space_mask_logits = self.bn(rearrange(pixel_space_mask_logits, "b l h w -> (b l) a h w", a=1))
pixel_space_mask_logits = rearrange(pixel_space_mask_logits, "(b l) a h w -> b l h w", b=batch_size, l=slot_num)

Best Regards,

About the performance on cityscapes panoptic segmentation of MaX-DeepLab

Hi,

Thanks for your great work on creating this codebase. I am trying to use the MaX-DeepLab in my own project. I want to ask about what the results of MaX-DeepLab on the cityscapes panoptic segmentation val set would be like if you have tested before. I would really appreciate it if it is possible to report this even if it is just an unofficial approximate result.

Thanks.

How exactly are the predictions made for ignore_label handled in the panoptic deeplab model?

While analyzing the files related to panoptic deeplab model for my research, I noticed that the output_channels variable of the segmentation head of the panoptic deeplab models is configured so that the resulting semantic map has dimensions equal to the num_classes of the respective dataset, which includes the ignore_label. From that, it seems like the model is actually computing a score for the ignore_label as well.

To make it more concrete, if we take the panoptic deeplab model with the COCO panoptic dataset as an example, there are overall 134 classes in this dataset, including the 'unknown' class with ignore_label=0, and the semantic head of the model is configured to give 134-channel output in the proto files.

I understand how the loss is configured to ignore the 'unknown' class during training but I couldn't find where the label prediction for the ignore_label is explicitly avoided. Again, in the example of panoptic deeplab, as far as I see, the model has a postprocessing stage that converts the logits to semantic label predictions by taking argmax on the logits as usual. However, I checked the dimensions of this logits vector and it also has num_channels dimensions. Based on this observation, it seems to me like it is actually possible to predict the ignore_label but when I check the semantic predictions made by the model, I never observed such a thing. So, I believe the predictions for class 'unknown' is avoided somehow but I wonder how this is done. Is this happening during the training procedure implicitly and automatically (by simply learning not to assign high scores to this class)? Or, is there particular post-processing that you apply explicitly (e.g. if the label is ignore_label, take the second-highest value from the argmax)?

Is this tensorflow lite compatible?

Hi there,
just wondering whether one could convert this to a tflite model and also use hardware acceleration, eg no SELECT OPS from tensorflow.

error during retraining of the custom dataset(config file)

Hello, thanks for sharing the code. I'm currently trying to retrain the model using custom dataset. However, I got stuck due to the error shown in the following, could you please help me to figure out what the problem could be? It seems that something goes wrong in the config file.

python trainer/train.py \ --config_file=/home/caixiaoni/Desktop/project/deeplab2/configs/metal_part/panoptic_deeplab/resnet50_os32_semseg.textproto \ --mode=eval \ --model_dir=/home/caixiaoni/Desktop/project/metal_part_retrain_1 \ --num_gpus=0

2022-01-18 23:36:30.240123: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
I0118 23:36:31.325175 139693994784576 train.py:65] Reading the config file.
Traceback (most recent call last):
  File "trainer/train.py", line 76, in <module>
    app.run(main)
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "trainer/train.py", line 67, in main
    config = text_format.ParseLines(proto_file, config_pb2.ExperimentOptions())
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/google/protobuf/text_format.py", line 759, in ParseLines
    return parser.ParseLines(lines, message)
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/google/protobuf/text_format.py", line 812, in ParseLines
    self._ParseOrMerge(lines, message)
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/google/protobuf/text_format.py", line 835, in _ParseOrMerge
    tokenizer = Tokenizer(str_lines)
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/google/protobuf/text_format.py", line 1255, in __init__
    self._SkipWhitespace()
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/google/protobuf/text_format.py", line 1283, in _SkipWhitespace
    self._PopLine()
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/google/protobuf/text_format.py", line 1272, in _PopLine
    self._current_line = next(self._lines)
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/google/protobuf/text_format.py", line 832, in <genexpr>
    str_lines = (
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 206, in __next__
    retval = self.readline()
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 170, in readline
    self._preread_check()
  File "/home/caixiaoni/anaconda3/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 79, in _preread_check
    self._read_buf = _pywrap_file_io.BufferedInputStream(
TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
    1. tensorflow.python.lib.io._pywrap_file_io.BufferedInputStream(filename: str, buffer_size: int, token: tensorflow.python.lib.io._pywrap_file_io.TransactionToken = None)

Invoked with: None, 524288

The config file looks like:

experiment_name: "metal_part_retrain_1"
model_options {
  # Update the path to the initial checkpoint (e.g., ImageNet
  # pretrained checkpoint).
  initial_checkpoint: "/home/caixiaoni/Desktop/project/resnet50_imagenet1k/ckpt-100"
  backbone {
    name: "resnet50"
    output_stride: 32
  }
  decoder {
    feature_key: "res5"
    decoder_channels: 256
    aspp_channels: 256
    atrous_rates: 3
    atrous_rates: 6
    atrous_rates: 9
  }
  panoptic_deeplab {
    low_level {
      feature_key: "res3"
      channels_project: 64
    }
    low_level {
      feature_key: "res2"
      channels_project: 32
    }
    instance {
      enable: false
    }
    semantic_head {
      output_channels: 19
      head_channels: 256
    }
  }
}
trainer_options {
  save_checkpoints_steps: 1000
  save_summaries_steps: 100
  steps_per_loop: 100
  loss_options {
    semantic_loss {
      name: "softmax_cross_entropy"
      weight: 1.0
      top_k_percent: 0.2
    }
  }
  solver_options {
    base_learning_rate: 0.0005
    training_number_of_steps: 60000
  }
}
train_dataset_options {
  dataset: "metal_part"
  # Update the path to training set.
  file_pattern: "/home/caixiaoni/Desktop/project/part-TFRecord/train*.tfrecord"
  # Adjust the batch_size accordingly to better fit your GPU/TPU memory.
  # Also see Q1 in g3doc/faq.md.
  batch_size: 8
  crop_size: 513
  crop_size: 513
  # Skip resizing.
  min_resize_value: 0
  max_resize_value: 0
  augmentations {
    min_scale_factor: 0.5
    max_scale_factor: 2.0
    scale_factor_step_size: 0.1
  }
}
eval_dataset_options {
  dataset: "metal_part"
  # Update the path to validation set.
  file_pattern: "/home/caixiaoni/Desktop/project/part-TFRecord/val*.tfrecord"
  batch_size: 1
  crop_size: 513
  crop_size: 513
  # Skip resizing.
  min_resize_value: 0
  max_resize_value: 0
}
evaluator_options {
  continuous_eval_timeout: -1
  save_predictions: true
  save_raw_predictions: false
}

In dataset.py for registering the dataset:

_METAL_PART = 'metal_part'

METAL_PART_INFORMATION = DatasetDescriptor(
    dataset_name = _METAL_PART,
    splits_to_sizes={'train': 1200, 'val':300, 'test': 255},
    num_classes = 2, #including background + metal_part, only 2 classes
    ignore_label=255, 
    panoptic_label_divisor= 20, #values should be larger than the max. num of instances that could appear per image in your dataset
    class_has_instances_list = (0,),  # specifies which class belongs to the thing class (i.e., countable objects such as people, cars).
    colormap=_COCO_COLORMAP,
    is_video_dataset=False, 
    is_depth_dataset=False, 
    ignore_depth=None,     
)

MAP_NAME_TO_DATASET_INFO = {
    _CITYSCAPES_PANOPTIC: CITYSCAPES_PANOPTIC_INFORMATION,
    _KITTI_STEP: KITTI_STEP_INFORMATION,
    _MOTCHALLENGE_STEP: MOTCHALLENGE_STEP_INFORMATION,
    _CITYSCAPES_DVPS: CITYSCAPES_DVPS_INFORMATION,
    _COCO_PANOPTIC: COCO_PANOPTIC_INFORMATION,
    _SEMKITTI_DVPS: SEMKITTI_DVPS_INFORMATION,
    _METAL_PART: METAL_PART_INFORMATION,
}

I highly appreciate for any hints! Thanks in advance!

DeepLabv3+ weights

In the example notebooks there are lists with checkpoints, but those lists do not include DeepLabv3+.
Do you have weights for DeepLabv3+ with MobilenetV3 publicly available?
If so, can you give me the URL?

Thank you

Conda compatibility issues and how to fix

Hi, I tried to setup deeplab2 with anaconda (I recognized your note that there are dependency issues but it felt like a challenge :) ).

The only issue I found is the following:
ModuleNotFoundError: No module named 'pycocotools._mask' when running the all in one compile script.

Fix: execute make in cocoapi/PythonAPI

My environment is a python 3.8 env on conda 4.10.1

Max-DeepLab - [thing-ID mask here must not contain stuff or crowd region]

Hi, Thanks so much for the awesome work and thank you for the new deeplab2 repository.
I've been trying to train Max-DeepLab with my own custom dataset for a few days now and i'm a little bit stuck.

As an initial test I've limited the number of classes to a single class person.
I've then created a tf-record conversion script that builds a 2D Panoptic mask as follows.

Build an initial mask of some pixel value i.e 255 (same value as ignore_label ) with equal dimensions to the input image.
Then for every instance of person in the image I build a submask as follows semantic_label * label_divisor + instance_id
I then convert this mask to panoptic_mask.astype(np.int32).tostring() and then follow a similar work flow as defined in the example conversion scripts that have been provided, saving this as a raw label_format

In dataset.py I add my DatasetDescriptor as follows (Note: the semantic label for person is 1)

    num_classes=2,
    ignore_label=255,
    panoptic_label_divisor=256,
    class_has_instances_list=(1,),

This presumes background is a class as well as person. If i do this i am met with the following assertion.

tensorflow.python.framework.errors_impl.InvalidArgumentError:  assertion failed: [thing-ID mask here must not contain stuff or crowd region.]
	 [[{{node while_1/body/_103/while_1/cond_2/then/_328/while_1/cond_2/Assert/Assert}}]]
	 [[MultiDeviceIteratorGetNextFromShard]]
	 [[RemoteCall]]
	 [[while/body/_1/IteratorGetNext]] [Op:__inference_loop_fn_127631]

If i make class_has_instances_list = (2,), assuming person is the second class then i am met with the following error.

ValueError: Dimensions must be equal, but are 3 and 2 for '{{node DeepLabFamilyLoss/MaXDeepLabLoss/einsum_3/Einsum}} = Einsum[N=2, T=DT_FLOAT, equation="bij,bkj->bik"](DeepLabFamilyLoss/MaXDeepLabLoss/GatherV2_1, DeepLabFamilyLoss/MaXDeepLabLoss/strided_slice_2)' with input shapes: [4,130,3], [4,128,2].

However I also tried the following. Which assumes num_classes means the number of panoptic classes.

    num_classes=1,
    ignore_label=255,
    panoptic_label_divisor=256,
    class_has_instances_list=(1,),

This gives a similar error as above.

` ValueError: Dimensions must be equal, but are 2 and 1 for '{{node DeepLabFamilyLoss/MaXDeepLabLoss/einsum_3/Einsum}} = Einsum[N=2, T=DT_FLOAT, equation="bij,bkj->bik"](DeepLabFamilyLoss/MaXDeepLabLoss/GatherV2_1, DeepLabFamilyLoss/MaXDeepLabLoss/strided_slice_2)' with input shapes: [4,129,2], [4,128,1].

Finally i have gone to this assertion. And read the comment that has been placed around it.

      # Filter out IDs that are not thing instances (i.e., IDs for ignore_label,
      # stuff classes or crowd). Stuff classes and crowd regions both have IDs
      # of the form panoptic_id = semantic_id * label_divisor (i.e., instance id
      # = 0)

I've also ensured that the instance id for every object in an image starts at at least 1.

Any advice/help on the matter would be most welcome.

Thanks.

Panoptic-DeepLab: unable to load pretrained checkpoints

Hi,

I am trying to evaluate a pre-trained checkpoint file given here. But the checkpoint file cannot be loaded:

tensorflow.python.framework.errors_impl.DataLossError: Unable to open table file 
/media/xxx/ExtremeSSD/resnet50_os32_panoptic_deeplab_cityscapes_trainfine.tar.gz: Data loss: not an sstable (bad magic number): perhaps your file is in a different file format and you need to use a different restore operator?

The configuration in config/cityscapes/panoptic_deeplab/resnet50_os32_merge_with_pure_tf_func.textproto is as follows:

experiment_name: "panoptic-cs"
model_options {
  # Update the path to the initial checkpoint (e.g., ImageNet
  # pretrained checkpoint).
  initial_checkpoint: "/media/xxx/ExtremeSSD/resnet50_os32_panoptic_deeplab_cityscapes_trainfine.tar.gz"
...

The checkpoint tar file is downloaded using the link in the guide panoptic_deeplab.md.

Thank you for the help

Error for eval for KITTI-STEP Video Panoptic Segmentation

Follow the deeplab2/g3doc/projects/motion_deeplab.md, I try eval for kitti step dataset in panoptic deeplab. The order is a follows:'python train.py --config_file=../configs/kitti/panoptic_deeplab/resnet50_os32_trainval.textproto --mode=eval
--model_dir=/home/ubuntu/dev/sda1/model/model_yunjian/motion_deeplab/single_frame_ckpt --num_gpus=0'

while got the following error;
Traceback (most recent call last):
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/trainer/train.py", line 76, in
app.run(main)
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/absl/app.py", line 312, in run
2021-07-07 21:58:00.839820: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
_run_main(main, args)
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/trainer/train.py", line 72, in main
FLAGS.num_gpus)
File "/home/ubuntu/yunjian/motion_deeplab/deeplab2/trainer/train_lib.py", line 200, in run_experiment
controller.evaluate(steps=config.evaluator_options.eval_steps)
File "/home/ubuntu/yunjian/motion_deeplab/models/orbit/controller.py", line 282, in evaluate
2021-07-07 21:58:00.840081: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
eval_output = self.evaluator.evaluate(steps_tensor)
File "/home/ubuntu/yunjian/motion_deeplab/models/orbit/standard_runner.py", line 344, in evaluate
eval_iter, num_steps, state=outputs, reduce_fn=self.eval_reduce)
File "/home/ubuntu/yunjian/motion_deeplab/models/orbit/utils/loop_fns.py", line 74, in loop_fn
outputs = step_fn(iterator)
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 780, in call
2021-07-07 21:58:00.840692: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
result = self._call(*args, **kwds)
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 840, in _call
return self._stateless_fn(*args, **kwds)
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2829, in call
2021-07-07 21:58:00.841029: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.841384: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1848, in _filtered_call
2021-07-07 21:58:00.841919: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.842237: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
cancellation_manager=cancellation_manager)
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1924, in _call_flat
2021-07-07 21:58:00.842551: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.842904: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
ctx, args, cancellation_manager=cancellation_manager))
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 550, in call
2021-07-07 21:58:00.843215: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
ctx=ctx)
File "/home/ubuntu/anaconda3/envs/motion_deeplab/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
[[{{node ParseSingleExample/ParseExample/ParseExampleV2}}]]
[[MultiDeviceIteratorGetNextFromShard]]
[[RemoteCall]]
[[IteratorGetNext]] [Op:__inference_eval_step_11410]

Function call stack:
eval_step

2021-07-07 21:58:00.843554: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.843883: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.844309: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.844679: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.844835: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.845372: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.845480: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.845700: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.846071: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.846429: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.846806: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.847188: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.847566: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.847957: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.848308: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.848663: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.849024: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.849314: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.849651: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.850048: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.850433: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.850796: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.851174: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.851542: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.851872: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.852309: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.852655: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.853005: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.853372: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.853777: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.854121: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.854466: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.854830: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.855111: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.855504: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.855864: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.856161: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.856533: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.856867: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.857240: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.857590: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.857939: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.858282: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.858656: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.859006: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.859322: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.859705: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.860125: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.860466: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.860779: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.861175: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.861490: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.861879: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.862221: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.862602: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.862925: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.
2021-07-07 21:58:00.863304: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at example_parsing_ops.cc:94 : Invalid argument: Feature: image/segmentation/class/encoded (data type: string) is required but could not be found.

Process finished with exit code 1

The computer configuration is as follows：
cuda=10.0.130 cudnn=7.6.5
tensorflow=2.3.0
while I run 'deeplab2/compile.sh gpu'，It returns 'Done with configuration!'.

How to config when using my own dataset with diffrent class numbers?

Thank you for the great work.
I use it to train my own datasets. My config file is:

experiment_name: "bzjs"
model_options {
initial_checkpoint: "/tf/bzjs/deeplab2/mzoo/resnet50_os32_panoptic_deeplab_coco_train_2/ckpt-200000"
backbone {
name: "resnet50"
output_stride: 32
}
decoder {
feature_key: "res5"
decoder_channels: 256
aspp_channels: 256
atrous_rates: 3
atrous_rates: 6
atrous_rates: 9
}
panoptic_deeplab {
low_level {
feature_key: "res3"
channels_project: 64
}
low_level {
feature_key: "res2"
channels_project: 32
}
instance {
enable: false
}
semantic_head {
output_channels: 6
head_channels: 256
}
}
}
trainer_options {
save_checkpoints_steps: 1000
save_summaries_steps: 100
steps_per_loop: 100
loss_options {
semantic_loss {
name: "softmax_cross_entropy"
weight: 1.0
top_k_percent: 0.2
}
center_loss {
name: "mse"
weight: 200
}
regression_loss {
name: "l1"
weight: 0.01
}
}
solver_options {
base_learning_rate: 0.0005
training_number_of_steps: 200000
warmup_steps: 2000
}
}
train_dataset_options {
dataset: "bzjs"

file_pattern: "/tf/bzjs/bzjsdata/tfrecords/train-**-of-00010.tfrecord"

batch_size: 64
crop_size: 257
crop_size: 257
min_resize_value: 257
max_resize_value: 257
augmentations {
min_scale_factor: 0.5
max_scale_factor: 1.5
scale_factor_step_size: 0.1
autoaugment_policy_name: "simple_classification_policy_magnitude_scale_0.2"
}
increase_small_instance_weights: true
small_instance_weight: 3.0
}
eval_dataset_options {
dataset: "bzjs"

file_pattern: "/tf/bzjs/bzjsdata/tfrecords/val-**-of-00010.tfrecord"
batch_size: 1
crop_size: 257
crop_size: 257
min_resize_value: 257
max_resize_value: 257

increase_small_instance_weights: true
small_instance_weight: 3.0
}
evaluator_options {
continuous_eval_timeout: -1
stuff_area_limit: 4096
center_score_threshold: 0.1
nms_kernel: 41
save_predictions: true
save_raw_predictions: false

merge_semantic_and_instance_with_tf_op: true
}

What i have changed is : output_channels: 6 , because my datasets has 6 classes.

When I train the model, I got an error:

ValueError: Tensor's shape (1, 1, 256, 134) is not compatible with supplied shape (1, 1, 256, 6)

I want to know how to avoid this error.

Thank you!

What is your environment for testing your model?

Dear authors:
Hi! Thanks for opensourcing this repo.
I meet several problems for runing this repo.

I got stuck when performing evaluation process according to the issue. #58

I make such change to run you model ().

I download the MotionDeeplab ckpt from this repo. I perform the evaluation process.
but the results are nearly zero.

Has anyone successfully run this repo????

I doubt it maybe enviroment problems.

I use RTX-3090 with tf2.5 cuda 11.1.

whitout config_pd2

Hi
i download the source code, but i don't find the config_pd2 file.

CUDA_ERROR_ILLEGAL_ADDRESS error when running cityscapes training

I am trying to run the cityscapes training with the information provided. Prepared the data as indicated.
Using resnet50_beta_os32.textproto with cityscapes_panoptic.

tensorflow 2.5.0
tensorflow-gpu 2.5.0
NVIDIA-SMI 460.84
CUDA Version: 11.2
cuDNN version 8.1.01GeForce RTX 2080 (10MiB / 11019MiB)

I changed the batch size from 32 to 8 in case I was running out of memory, but I got the same error
CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Any suggestions?

Commad:
python trainer/train.py \ --config_file="configs/cityscapes/panoptic_deeplab/resnet50_beta_os32.textproto"\ --mode=train \ --model_dir="output" \ --num_gpus=1

Output:
....
2021-07-28 12:18:26.707841: I tensorflow/core/common_runtime/bfc_allocator.cc:1066] Stats:
Limit: 10117644288
InUse: 9929398016
MaxInUse: 9995489536
NumAllocs: 1320
MaxAllocSize: 2153779200
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0

2021-07-28 12:18:26.707916: W tensorflow/core/common_runtime/bfc_allocator.cc:467] ***************************************************************************************************_
2021-07-28 12:18:26.707968: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:662 : Resource exhausted: OOM when allocating tensor with shape[8,256,257,513] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
restoring or initializing model...
restored model from /home/mimi/code/deeplab2_proj/deeplab2/output/CITY_PANOPTIC_EXAMPLE/ckpt-0.
restored from checkpoint: /home/mimi/code/deeplab2_proj/deeplab2/output/CITY_PANOPTIC_EXAMPLE/ckpt-0
train | step: 0 | training until step 60000...
Traceback (most recent call last):
File "trainer/train.py", line 81, in
app.run(main)
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "trainer/train.py", line 77, in main
FLAGS.num_gpus)
File "/home/mimi/code/deeplab2_proj/deeplab2/trainer/train_lib.py", line 191, in run_experiment
steps=config.trainer_options.solver_options.training_number_of_steps)
File "/home/mimi/code/deeplab2_proj/deeplab2/models/orbit/controller.py", line 240, in train
self._train_n_steps(num_steps)
File "/home/mimi/code/deeplab2_proj/deeplab2/models/orbit/controller.py", line 439, in _train_n_steps
train_output = self.trainer.train(num_steps_tensor)
File "/home/mimi/code/deeplab2_proj/deeplab2/models/orbit/standard_runner.py", line 146, in train
self._train_loop_fn(self._train_iter, num_steps)
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 889, in call
result = self._call(*args, **kwds)
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 3024, in call
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1961, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 596, in call
ctx=ctx)
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[8,64,513,1025] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node while/body/_1/DeepLab/resnet50_beta/stage2/block1/conv1_bn_act/conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[while/merge/_973/_13]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[8,64,513,1025] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node while/body/_1/DeepLab/resnet50_beta/stage2/block1/conv1_bn_act/conv/Conv2D}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_loop_fn_46399]

Function call stack:
loop_fn -> loop_fn

2021-07-28 12:18:28.097617: E tensorflow/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2021-07-28 12:18:28.097660: F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1
Fatal Python error: Aborted

Thread 0x00007f85d6c43740 (most recent call first):
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 1264 in delete_iterator
File "/home/mimi/Envs/deeplab2/lib/python3.6/site-packages/tensorflow/python/data/ops/iterator_ops.py", line 546 in del

VertexAi integration

Hello team,

As part of an internal project, we successfully adapted deeplab2 to be integrated into VertexAI, allowing massively parallel hyperparameter model tuning sessions on Google Cloud, and at the same time we produced a custom container with the whole preconfigured packages, useful creation and rapid recycling of development and production environments.

In case this is integration could be helpful, we could talk about producing an special scope PR.

Many thanks in advance.

Random input order

Hi,

I found a bug in the dataloader which results in random input order.

The dataset is initialized by dataset = tf.data.Dataset.list_files(self._file_pattern) while it is decorated by eval_dataset = orbit.utils.make_distributed_dataset(self._strategy, eval_dataset). In distributed running, the input order will be broken. To keep correct order, it should be initialized by dataset = tf.data.Dataset.list_files(self._file_pattern, shuffle=False).

cudnn issue

HI, I can't solve this problem, I have already upgrade the cudnn to 8.1.0, but it don't work.
2021-10-13 17:19:29.385716: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.4 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.
2021-10-13 17:19:29.390987: E tensorflow/stream_executor/cuda/cuda_dnn.cc:362] Loaded runtime CuDNN library: 8.0.4 but source was compiled with: 8.1.0. CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library. If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

I would appreciate it very much if someboay can slove this problem!!!

google-research / deeplab2 Goto Github PK

deeplab2's Issues

Description

Recommend Projects

Recommend Topics

Recommend Org

Jobs