Hey, I can't solve this problem. Training MaX-DeepLab-L on COCO with deeplab2 starts up fine on 8 GPUs, but crashes with a FailedPreconditionError the moment the controller tries to save a checkpoint. Full command and log below:
ssh://[email protected]:22/data1/nqx/anaconda3/bin/python3 -u /data1/nqx/Max-deeplab/deeplab2/trainer/train.py --config_file=/data1/nqx/Max-deeplab/deeplab2/configs/coco/max_deeplab/max_deeplab_l_os16_res1025_400k.textproto --mode=train --model_dir=/data1/nqx/Max-deeplab/init_checkpoint/max_deeplab_l_backbone_imagenet1k_strong_training_strategy.tar.gz --num_gpus=8
I1012 11:38:41.230723 139697198446400 train.py:65] Reading the config file.
I1012 11:38:41.233097 139697198446400 train.py:69] Starting the experiment.
2021-10-12 11:38:41.234652: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-10-12 11:38:46.108032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 20504 MB memory: -> device: 0, name: GeForce RTX 3090, pci bus id: 0000:1a:00.0, compute capability: 8.6
2021-10-12 11:38:46.110730: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 20504 MB memory: -> device: 1, name: GeForce RTX 3090, pci bus id: 0000:1b:00.0, compute capability: 8.6
2021-10-12 11:38:46.113205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 20526 MB memory: -> device: 2, name: GeForce RTX 3090, pci bus id: 0000:3d:00.0, compute capability: 8.6
2021-10-12 11:38:46.115386: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 20506 MB memory: -> device: 3, name: GeForce RTX 3090, pci bus id: 0000:3e:00.0, compute capability: 8.6
2021-10-12 11:38:46.117429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:4 with 20504 MB memory: -> device: 4, name: GeForce RTX 3090, pci bus id: 0000:88:00.0, compute capability: 8.6
2021-10-12 11:38:46.119449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:5 with 20506 MB memory: -> device: 5, name: GeForce RTX 3090, pci bus id: 0000:89:00.0, compute capability: 8.6
2021-10-12 11:38:46.121430: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:6 with 20526 MB memory: -> device: 6, name: GeForce RTX 3090, pci bus id: 0000:b1:00.0, compute capability: 8.6
2021-10-12 11:38:46.123378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:7 with 20506 MB memory: -> device: 7, name: GeForce RTX 3090, pci bus id: 0000:b2:00.0, compute capability: 8.6
I1012 11:38:48.733316 139697198446400 mirrored_strategy.py:369] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3', '/job:localhost/replica:0/task:0/device:GPU:4', '/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
I1012 11:38:48.734684 139697198446400 train_lib.py:104] Using strategy <class 'tensorflow.python.distribute.mirrored_strategy.MirroredStrategy'> with 8 replicas
I1012 11:38:48.757744 139697198446400 deeplab.py:57] Synchronized Batchnorm is used.
I1012 11:38:48.759390 139697198446400 axial_resnet_instances.py:144] Axial-ResNet final config: {'num_blocks': [3, 6, 3, 3], 'backbone_layer_multiplier': 1.0, 'width_multiplier': 1.0, 'stem_width_multiplier': 1.0, 'output_stride': 16, 'classification_mode': False, 'backbone_type': 'wider_resnet', 'use_axial_beyond_stride': 16, 'backbone_use_transformer_beyond_stride': 16, 'extra_decoder_use_transformer_beyond_stride': 16, 'backbone_decoder_num_stacks': 1, 'backbone_decoder_blocks_per_stage': 1, 'extra_decoder_num_stacks': 1, 'extra_decoder_blocks_per_stage': 3, 'max_num_mask_slots': 128, 'num_mask_slots': 128, 'memory_channels': 512, 'base_transformer_expansion': 2.0, 'global_feed_forward_network_channels': 512, 'high_resolution_output_stride': 4, 'activation': 'relu', 'block_group_config': {'attention_bottleneck_expansion': 4, 'drop_path_keep_prob': 0.800000011920929, 'drop_path_beyond_stride': 4, 'drop_path_schedule': 'linear', 'positional_encoding_type': None, 'use_global_beyond_stride': 0, 'use_sac_beyond_stride': -1, 'use_squeeze_and_excite': False, 'conv_use_recompute_grad': False, 'axial_use_recompute_grad': True, 'recompute_within_stride': 0, 'transformer_use_recompute_grad': False, 'axial_layer_config': {'query_shape': (129, 129), 'key_expansion': 2, 'value_expansion': 4, 'memory_flange': (32, 32), 'double_global_attention': False, 'num_heads': 8, 'use_query_rpe_similarity': True, 'use_key_rpe_similarity': True, 'use_content_similarity': True, 'retrieve_value_rpe': True, 'retrieve_value_content': True, 'initialization_std_for_query_key_rpe': 1.0, 'initialization_std_for_value_rpe': 1.0, 'self_attention_activation': 'softmax'}, 'dual_path_transformer_layer_config': {'num_heads': 8, 'bottleneck_expansion': 2, 'key_expansion': 1, 'value_expansion': 2, 'feed_forward_network_channels': 2048, 'use_memory_self_attention': True, 'use_pixel2memory_feedback_attention': True, 'transformer_activation': 'softmax'}}, 'bn_layer': functools.partial(<class 'keras.layers.normalization.batch_normalization.SyncBatchNormalization'>, momentum=0.9900000095367432, epsilon=0.0010000000474974513), 'conv_kernel_weight_decay': 0.0}
I1012 11:38:49.366679 139697198446400 deeplab.py:96] Setting pooling size to (65, 65)
I1012 11:38:49.366883 139697198446400 aspp.py:135] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I1012 11:38:55.613164 139697198446400 cross_device_ops.py:619] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
[... the identical "Reduce to ... then broadcast" line repeats nine more times; trimmed ...]
I1012 11:38:55.679392 139697198446400 controller.py:391] restoring or initializing model...
W1012 11:38:55.741179 139697198446400 deprecation.py:339] From /data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py:1359: NameBasedSaverStatus.__init__ (from tensorflow.python.training.tracking.util) is deprecated and will be removed in a future version.
Instructions for updating:
Restoring a name-based tf.train.Saver checkpoint using the object-based restore API. This mode uses global names to match variables, and so is somewhat fragile. It also adds new restore ops to the graph each time it is called when graph building. Prefer re-encoding training checkpoints in the object-based format: run save() on the object-based saver (the same one this message is coming from) and use that checkpoint in the future.
I1012 11:38:55.788634 139697198446400 controller.py:397] initialized model.
I1012 11:38:56.717390 139697198446400 api.py:446] Eval with scales ListWrapper([1.0])
I1012 11:38:57.717404 139697198446400 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
I1012 11:38:57.743686 139697198446400 api.py:446] Eval scale 1.0; setting pooling size to [65, 65]
I1012 11:39:26.300476 139697198446400 api.py:446] Global average pooling in the ASPP pooling layer was replaced with tiled average pooling using the provided pool_size. Please make sure this behavior is intended.
Traceback (most recent call last):
  File "/data1/nqx/Max-deeplab/deeplab2/trainer/train.py", line 76, in <module>
    app.run(main)
  File "/data1/nqx/anaconda3/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/data1/nqx/anaconda3/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/data1/nqx/Max-deeplab/deeplab2/trainer/train.py", line 71, in main
    train_lib.run_experiment(FLAGS.mode, config, combined_model_dir, FLAGS.master,
  File "/data1/nqx/Max-deeplab/deeplab2/trainer/train_lib.py", line 188, in run_experiment
    controller.save_checkpoint()
  File "/data1/nqx/Max-deeplab/models/orbit/controller.py", line 412, in save_checkpoint
    self._maybe_save_checkpoint(check_interval=False)
  File "/data1/nqx/Max-deeplab/models/orbit/controller.py", line 482, in _maybe_save_checkpoint
    ckpt_path = self.checkpoint_manager.save(
  File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/checkpoint_management.py", line 817, in save
    save_path = self._checkpoint.write(prefix)
  File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 2071, in write
    output = self._saver.save(file_prefix=file_prefix, options=options)
  File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/training/tracking/util.py", line 1261, in save
    file_io.recursive_create_dir(os.path.dirname(file_prefix))
  File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 499, in recursive_create_dir
    recursive_create_dir_v2(dirname)
  File "/data1/nqx/anaconda3/lib/python3.8/site-packages/tensorflow/python/lib/io/file_io.py", line 514, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.FailedPreconditionError: /data1/nqx/Max-deeplab/init_checkpoint/max_deeplab_l_backbone_imagenet1k_strong_training_strategy.tar.gz is not a directory
Exception ignored in: <function Pool.__del__ at 0x7f0d55dbeca0>
Traceback (most recent call last):
  File "/data1/nqx/anaconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
  File "/data1/nqx/anaconda3/lib/python3.8/multiprocessing/queues.py", line 362, in put
AttributeError: 'NoneType' object has no attribute 'dumps'
Process finished with exit code 1
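
If I read the traceback right, TensorFlow's CheckpointManager calls recursive_create_dir on whatever --model_dir points to, and I pointed --model_dir at the pretrained backbone tarball (a .tar.gz file), so it fails with "is not a directory". My plan is to unpack the tarball, pass a fresh writable directory as --model_dir, and reference the unpacked weights from the config instead. Here is a minimal sketch of the preparation step; the two target directories (ckpt_dir, model_dir) are names I made up, not anything deeplab2 requires:

import os
import tarfile

# Paths from my command above; the two target directories are my own choices.
ckpt_tar = "/data1/nqx/Max-deeplab/init_checkpoint/max_deeplab_l_backbone_imagenet1k_strong_training_strategy.tar.gz"
ckpt_dir = "/data1/nqx/Max-deeplab/init_checkpoint/max_deeplab_l_backbone"    # unpacked pretrained weights
model_dir = "/data1/nqx/Max-deeplab/experiments/max_deeplab_l_os16_res1025"   # fresh, writable output dir

# Unpack the pretrained backbone once so the checkpoint files are plain files on disk.
os.makedirs(ckpt_dir, exist_ok=True)
with tarfile.open(ckpt_tar, "r:gz") as tar:
    tar.extractall(ckpt_dir)

# Create the experiment directory that --model_dir should point to.
os.makedirs(model_dir, exist_ok=True)

Then I would rerun train.py with --model_dir set to that experiment directory and, if I understand the textproto format correctly, point the initial_checkpoint field in max_deeplab_l_os16_res1025_400k.textproto at the unpacked checkpoint prefix. Does that sound like the right fix, or am I misreading the error?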