
xiuqhou / salience-detr

105 stars, 3 watchers, 7 forks, 6.49 MB

[CVPR 2024] Official implementation of the paper "Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement"

Home Page: https://arxiv.org/abs/2403.16131

License: Apache License 2.0

Languages: Python 46.82%, Jupyter Notebook 50.42%, Cuda 2.76%
Topics: detr, salience-detr, attention, detection, object-detection, transformer, transformers

salience-detr's Introduction

salience-detr's People

Contributors

xiuqhou


salience-detr's Issues

Question about the Salience-guided Supervision in the paper

Question

First of all, thanks to the author for sharing the complete code of this project!
After reading the paper and the code, I have two questions about the salience supervision:
1. The motivation and principle of the salience-guided supervision are easy to understand, but I do not see how it connects with the methods proposed later. For example, the hierarchical query filtering does not seem to use the scores from the salience-guided supervision to decide what gets filtered.
2. I also could not find where the salience confidence is computed in the code. According to the framework figure in the paper, it should be in the encoder pre-processing after level filtering, but I cannot locate it. Could you tell me in which file and where it is implemented? Thanks!
Thanks again for your work!

Additional

No response

Help needed: runtime error EOFError: Ran out of input

for epoch in range(cfg.starting_epoch, cfg.num_epochs):
    train_one_epoch_acc(
        model=model,
        optimizer=optimizer,
        data_loader=train_loader,
        epoch=epoch,
        print_freq=cfg.print_freq,
        max_grad_norm=cfg.max_norm,
        accelerator=accelerator,
    )
    lr_scheduler.step()

The main.py script raises an error when entering the train_one_epoch_acc loop.
The error message is as follows:
_pickle.PicklingError: Can't pickle <function at 0x0000026F6BAAF790>: attribute lookup on transforms.presets failed
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
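
This error pattern usually means the DataLoader worker processes (spawned on Windows) cannot pickle a locally defined transform or lambda from transforms.presets. A minimal sketch of the usual workaround, assuming that is the cause (the dataset, collate function, and shapes below are toy placeholders, not the repo's code): keep top-level, picklable functions, guard the entry point with __main__, or set num_workers=0 while debugging.

import torch
from torch.utils.data import DataLoader, TensorDataset

def collate(batch):
    # a top-level function is picklable, unlike a lambda or a closure
    return torch.stack([b[0] for b in batch])

if __name__ == "__main__":            # required on Windows so spawned workers don't re-run the script
    dataset = TensorDataset(torch.randn(8, 3, 32, 32))
    loader = DataLoader(dataset, batch_size=2, num_workers=2, collate_fn=collate)
    # num_workers=0 also sidesteps the pickling step entirely while debugging
    for batch in loader:
        print(batch.shape)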

Visualizing attention heat maps

Bug

Many thanks to the author for this excellent project. How should the code be used to visualize attention heat maps? Could you provide an example, so that I can see how much each region of the image influences the model's predictions? Looking forward to your reply!
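
A generic sketch of one way to do this, not tied to Salience-DETR's own attention modules: register a forward hook on an attention layer, capture the returned attention weights, and render them with matplotlib (the toy nn.MultiheadAttention below stands in for a real layer of the model).

import torch
import torch.nn as nn
import matplotlib.pyplot as plt

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
captured = {}

def save_weights(module, inputs, output):
    # nn.MultiheadAttention returns (attn_output, attn_weights) when need_weights=True
    captured["weights"] = output[1].detach()

attn.register_forward_hook(save_weights)
tokens = torch.randn(1, 10, 32)               # stand-in for a sequence of image tokens
attn(tokens, tokens, tokens, need_weights=True)

plt.imshow(captured["weights"][0].numpy(), cmap="viridis")
plt.colorbar()
plt.title("attention weights (query x key)")
plt.savefig("attention_heatmap.png")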


Environment

No response

Additional

No response

Problem during training: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Question

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 8, 1092, 1092]], which is output 0 of ReluBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Additional

Traceback (most recent call last):
File "main.py", line 222, in
train()
File "main.py", line 195, in train
train_one_epoch_acc(
File "/home/xx/DETR/Relation-DETR/util/engine.py", line 58, in train_one_epoch_acc
accelerator.backward(losses)
File "/home/xx/.conda/envs/rdetr/lib/python3.8/site-packages/accelerate/accelerator.py", line 2151, in backward
loss.backward(**kwargs)
File "/home/xx/.conda/envs/rdetr/lib/python3.8/site-packages/torch/_tensor.py", line 363, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/xx/.conda/envs/rdetr/lib/python3.8/site-packages/torch/autograd/init.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
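
A minimal reproduction of this class of error, together with the debugging step the message itself suggests; the tensors here are toy values, not the model's, but the mechanism (an in-place edit of a tensor that ReluBackward0 still needs) is the same.

import torch

torch.autograd.set_detect_anomaly(True)   # enable before the training loop; debug only, it is slow

x = torch.randn(4, 8, requires_grad=True)
y = torch.relu(x)
y += 1                                    # in-place change of a tensor needed by ReluBackward0
try:
    y.sum().backward()                    # anomaly mode prints the traceback of the offending op
except RuntimeError as err:
    print("caught:", err)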

Shape mismatch when training on my own dataset

Hello! Thank you very much for providing this excellent code. When training on my own dataset I found that some layers cannot be loaded correctly. It seems to be because the number of classes is different; I tried changing num_class in the code, but the problem remains.

There is also a warning that a module I use has no weight and bias. How can these two problems be solved? Looking forward to your reply.
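
A generic sketch (the toy two-layer model below is a placeholder, not Salience-DETR) of the usual way to fine-tune with a different number of classes: drop the weights whose shapes no longer match and load the rest with strict=False.

import torch
import torch.nn as nn

def make_detector(num_classes: int) -> nn.Module:
    # toy stand-in: a shared "backbone" plus a class head whose size depends on num_classes
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, num_classes))

pretrained = make_detector(num_classes=91)   # e.g. COCO-pretrained weights
model = make_detector(num_classes=21)        # the new dataset's class count

state_dict = pretrained.state_dict()
model_dict = model.state_dict()
filtered = {k: v for k, v in state_dict.items()
            if k in model_dict and v.shape == model_dict[k].shape}   # drop mismatched heads
model.load_state_dict(filtered, strict=False)
print("loaded:", sorted(filtered), "skipped:", sorted(set(state_dict) - set(filtered)))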


Modifying image_size

Question

I want to change image_size to 1248x832; where should I make the modification?

Additional

No response

Learning rate for training

Question

Hi, @xiuqhou
Thanks for your enlightening work. I came across some questions while reproducing your work.

  1. How many gpus did you use to train the model?

  2. Do I need to change the initial learning rate if I adopt a different total batch size (num_gpus * batchsize_per_gpu)? Is there a policy to make the final performance insensitive to the total batch size?

In my personal experiments, model performance is not consistent across different total_batch_size settings. I experimented with 1x2 (1 GPU, 2 images per GPU) and 4x4 (4 GPUs, 4 images per GPU) settings using the same initial learning rate, but the results show a non-trivial gap between them (the 4x4 setting lags behind the 1x2 setting by 2 AP).
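
Not an authoritative answer, but the heuristic many detection codebases use is the linear scaling rule: scale the learning rate in proportion to the total batch size (the numbers below are placeholders, and the rule is only an approximation, not a guarantee of identical AP).

base_lr = 1e-4          # assumed learning rate tuned for the base setting
base_total_batch = 2    # e.g. 1 GPU x 2 images
new_total_batch = 16    # e.g. 4 GPUs x 4 images
scaled_lr = base_lr * new_total_batch / base_total_batch
print(f"scaled lr: {scaled_lr:.1e}")   # 8.0e-04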

Best regards

Additional

No response

Training configuration used in the paper

Hello, what hardware did you use to train the models with the different backbones in the paper? How was the batch size set, and did you use mixed precision? When I train the FocalNet variant on a V100 (32 GB) with a per-GPU batch size of 2, I can only avoid OOM by using fp16. Is this expected?

I am fine-tuning on my own data and would like to reuse the best learning rate; I am asking so that I can adjust the lr according to the batch size.

[Howto] From ONNX to TensorRT, if possible. Thanks.

Question

Hi there,

Thanks for sharing this great work.

We're trying to evaluate TensorRT performance with Salience-DETR (based on FocalNet).
Here are the experiments we have run so far; please kindly advise, and thanks for your help.

Using ONNX packages

onnx                    1.16.1
onnx-graphsurgeon       0.3.12
onnxruntime-gpu         1.18.1
onnxsim                 0.4.36

Python      3.10.9
cuda     11.7

Based on #17 and the section [Export an ONNX model], we get the messages below:

================ Diagnostic Run torch.onnx.export version 2.0.0 ================
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

WARNING: failed to run "SequenceEmpty" op (name is "/transformer/SequenceEmpty"), skip...
WARNING: failed to run "SequenceEmpty" op (name is "/transformer/SequenceEmpty"), skip...
Traceback (most recent call last):
  File "/workspace/Salience-DETR/tools/pytorch2onnx.py", line 133, in <module>
    pytorch2onnx()
  File "/workspace/Salience-DETR/tools/pytorch2onnx.py", line 108, in pytorch2onnx
    model_ops, check_ok = onnxsim.simplify(args.save_file)
  File "/opt/conda/lib/python3.10/site-packages/onnxsim/onnx_simplifier.py", line 199, in simplify
    model_opt_bytes = C.simplify(
onnx.onnx_cpp2py_export.checker.ValidationError: Nodes in a graph must be topologically sorted, however input '/transformer/SequenceEmpty_output_0' of node: 
name: /transformer/Loop OpType: Loop
 is not output of any previous nodes.

If we skip --simplify during export, the ONNX model is exported, but with warnings:

================ Diagnostic Run torch.onnx.export version 2.0.0 ================
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================

Successfully exported ONNX model: torchconverted.onnx
2024-07-22 18:06:28.479503543 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 328 Memcpy nodes are added to the graph torch_jit for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-22 18:06:28.481116649 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 1 Memcpy nodes are added to the graph torch_jit13 for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-07-22 18:06:31.742070878 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2024-07-22 18:06:31.746102699 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
2024-07-22 18:06:31.746114767 [W:onnxruntime:Default, scatter_nd.h:51 ScatterNDWithAtomicReduction] ScatterND with reduction=='none' only guarantees to be correct if indices are not duplicated.
Traceback (most recent call last):
  File "/workspace/Salience-DETR/tools/pytorch2onnx.py", line 133, in <module>
    pytorch2onnx()
  File "/workspace/Salience-DETR/tools/pytorch2onnx.py", line 128, in pytorch2onnx
    np.testing.assert_allclose(onnx_res, pytorch_res, rtol=1e-3, atol=1e-5, err_msg=err_msg)
  File "/opt/conda/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/opt/conda/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/conda/lib/python3.10/site-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
    raise AssertionError(msg)
AssertionError: 
Not equal to tolerance rtol=0.001, atol=1e-05
The numerical values are different between Pytorch and ONNX. But it does not necessarily mean the exported ONNX is problematic.
Mismatched elements: 19 / 300 (6.33%)
Max absolute difference: 3.3614924e-05
Max relative difference: 0.01449186
 x: array([0.002411, 0.002407, 0.0024  , 0.002384, 0.002363, 0.002363,
       0.002358, 0.002353, 0.002345, 0.002333, 0.002331, 0.002331,
       0.002323, 0.00232 , 0.002318, 0.002309, 0.002307, 0.0023  ,...
 y: array([0.002411, 0.00241 , 0.002398, 0.00239 , 0.002369, 0.002367,
       0.002363, 0.00235 , 0.002349, 0.002349, 0.002346, 0.002336,
       0.002325, 0.002325, 0.002323, 0.002322, 0.002321, 0.00232 ,...

But when we try to convert the ONNX model to TensorRT with trtexec for a quick test:

[07/22/2024-18:07:37] [E] Error[4]: /eval_transform/eval_transform.0/If_OutputLayer: IIfConditionalOutputLayer inputs must have the same shape. Shapes are [3,-1,-1] and [1,3,-1,-1].
[07/22/2024-18:07:37] [E] [TRT] ModelImporter.cpp:771: While parsing node number 113 [If -> "/eval_transform/eval_transform.0/If_output_0"]:
[07/22/2024-18:07:37] [E] [TRT] ModelImporter.cpp:772: --- Begin node ---
[07/22/2024-18:07:37] [E] [TRT] ModelImporter.cpp:773: input: "/eval_transform/eval_transform.0/Equal_output_0"
output: "/eval_transform/eval_transform.0/If_output_0"
name: "/eval_transform/eval_transform.0/If"
op_type: "If"
attribute {
  name: "then_branch"
  g {
    node {
      output: "/eval_transform/eval_transform.0/Constant_15_output_0"
      name: "/eval_transform/eval_transform.0/Constant_15"
      op_type: "Constant"
      attribute {
        name: "value"
        t {
          dims: 1
          data_type: 7
          raw_data: "\000\000\000\000\000\000\000\000"
        }
        type: TENSOR
      }
    }
    node {
      input: "/eval_transform/eval_transform.0/Resize_output_0"
      input: "/eval_transform/eval_transform.0/Constant_15_output_0"
      output: "/eval_transform/eval_transform.0/Squeeze_2_output_0"
      name: "/eval_transform/eval_transform.0/Squeeze_2"
      op_type: "Squeeze"
    }
    name: "torch_jit1"
    output {
      name: "/eval_transform/eval_transform.0/Squeeze_2_output_0"
      type {
        tensor_type {
          elem_type: 1
          shape {
            dim {
              dim_param: "Squeeze/eval_transform/eval_transform.0/Squeeze_2_output_0_dim_0"
            }
            dim {
              dim_param: "Squeeze/eval_transform/eval_transform.0/Squeeze_2_output_0_dim_1"
            }
            dim {
              dim_param: "Squeeze/eval_transform/eval_transform.0/Squeeze_2_output_0_dim_2"
            }
          }
        }
      }
    }
  }
  type: GRAPH
}
attribute {
  name: "else_branch"
  g {
    node {
      input: "/eval_transform/eval_transform.0/Resize_output_0"
      output: "/eval_transform/eval_transform.0/Identity_output_0"
      name: "/eval_transform/eval_transform.0/Identity"
      op_type: "Identity"
    }
    name: "torch_jit2"
    output {
      name: "/eval_transform/eval_transform.0/Identity_output_0"
      type {
        tensor_type {
          elem_type: 1
          shape {
            dim {
              dim_param: "Identity/eval_transform/eval_transform.0/Identity_output_0_dim_0"
            }
            dim {
              dim_param: "Squeeze/eval_transform/eval_transform.0/Squeeze_2_output_0_dim_0"
            }
            dim {
              dim_param: "Squeeze/eval_transform/eval_transform.0/Squeeze_2_output_0_dim_1"
            }
            dim {
              dim_param: "Squeeze/eval_transform/eval_transform.0/Squeeze_2_output_0_dim_2"
            }
          }
        }
      }
    }
  }
  type: GRAPH
}

[07/22/2024-18:07:37] [E] [TRT] ModelImporter.cpp:774: --- End node ---
[07/22/2024-18:07:37] [E] [TRT] ModelImporter.cpp:777: ERROR: ModelImporter.cpp:195 In function parseGraph:
[6] Invalid Node - /eval_transform/eval_transform.0/If
/eval_transform/eval_transform.0/If_OutputLayer: IIfConditionalOutputLayer inputs must have the same shape. Shapes are [3,-1,-1] and [1,3,-1,-1].
[07/22/2024-18:07:37] [E] Failed to parse onnx file
[07/22/2024-18:07:37] [I] Finished parsing network model. Parse time: 1.04033
[07/22/2024-18:07:37] [E] Parsing model failed
[07/22/2024-18:07:37] [E] Failed to create engine from model or file.
[07/22/2024-18:07:37] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=torchconverted.onnx

Referring to tools/benchmark_model.py, we assign model.eval_transform = None before model.eval().to(args.device), and then get:

[07/22/2024-18:24:34] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[07/22/2024-18:24:35] [E] Error[4]: [graph.cpp::symbolicExecute::539] Error Code 4: Internal Error ((Unnamed Layer* 10573) [LoopOutput]: an ILoopOutputLayer cannot be used to compute a shape tensor)
[07/22/2024-18:24:35] [E] [TRT] ModelImporter.cpp:771: While parsing node number 4882 [Slice -> "/transformer/Slice_21_output_0"]:
[07/22/2024-18:24:35] [E] [TRT] ModelImporter.cpp:772: --- Begin node ---
[07/22/2024-18:24:35] [E] [TRT] ModelImporter.cpp:773: input: "/transformer/enc_output_norm/LayerNormalization_output_0"
input: "/transformer/Unsqueeze_68_output_0"
input: "/transformer/Constant_289_output_0"
input: "/transformer/Constant_290_output_0"
input: "/transformer/Constant_291_output_0"
output: "/transformer/Slice_21_output_0"
name: "/transformer/Slice_21"
op_type: "Slice"

[07/22/2024-18:24:35] [E] [TRT] ModelImporter.cpp:774: --- End node ---
[07/22/2024-18:24:35] [E] [TRT] ModelImporter.cpp:777: ERROR: ModelImporter.cpp:195 In function parseGraph:
[6] Invalid Node - /transformer/Slice_21
[graph.cpp::symbolicExecute::539] Error Code 4: Internal Error ((Unnamed Layer* 10573) [LoopOutput]: an ILoopOutputLayer cannot be used to compute a shape tensor)
[07/22/2024-18:24:35] [E] Failed to parse onnx file
[07/22/2024-18:24:35] [I] Finished parsing network model. Parse time: 1.77338
[07/22/2024-18:24:35] [E] Parsing model failed
[07/22/2024-18:24:35] [E] Failed to create engine from model or file.
[07/22/2024-18:24:35] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=torchconverted.onnx

Additional

No response

Error reported during inference

/home/yjd/anaconda3/envs/python3.8/bin/python3.8 /home/yjd/yjd_software/item/Salience-DETR-main/inference.py
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Using /home/yjd/.cache/torch_extensions/py38_cu121 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/yjd/.cache/torch_extensions/py38_cu121/MultiScaleDeformableAttention/build.ninja...
Building extension module MultiScaleDeformableAttention...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module MultiScaleDeformableAttention...
ninja: no work to do.
[2024-05-13 11:39:56 det.models.backbones.base_backbone]: Backbone architecture: resnet50
[2024-05-13 11:39:57 det.util.utils]:
/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/overrides.py:110: UserWarning: 'has_cuda' is deprecated, please use 'torch.backends.cuda.is_built()'
torch.has_cuda,
/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/overrides.py:111: UserWarning: 'has_cudnn' is deprecated, please use 'torch.backends.cudnn.is_available()'
torch.has_cudnn,
/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/overrides.py:117: UserWarning: 'has_mps' is deprecated, please use 'torch.backends.mps.is_built()'
torch.has_mps,
/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/overrides.py:118: UserWarning: 'has_mkldnn' is deprecated, please use 'torch.backends.mkldnn.is_available()'
torch.has_mkldnn,
[2024-05-13 11:39:58 det.util.utils]:
0%| | 0/1 [00:00<?, ?it/s]/home/yjd/yjd_software/item/Salience-DETR-main/models/bricks/position_encoding.py:50: UserWarning: cumsum_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True, warn_only=True)'. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation. (Triggered internally at ../aten/src/ATen/Context.cpp:71.)
y_embed = not_mask.cumsum(1, dtype=torch.float32)
/home/yjd/yjd_software/item/Salience-DETR-main/models/bricks/position_encoding.py:51: UserWarning: cumsum_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True, warn_only=True)'. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation. (Triggered internally at ../aten/src/ATen/Context.cpp:71.)
x_embed = not_mask.cumsum(2, dtype=torch.float32)
/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/nn/modules/linear.py:114: UserWarning: Deterministic behavior was enabled with either torch.use_deterministic_algorithms(True) or at::Context::setDeterministicAlgorithms(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility (Triggered internally at ../aten/src/ATen/Context.cpp:156.)
return F.linear(input, self.weight, self.bias)
/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/nn/functional.py:5405: UserWarning: Deterministic behavior was enabled with either torch.use_deterministic_algorithms(True) or at::Context::setDeterministicAlgorithms(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility (Triggered internally at ../aten/src/ATen/Context.cpp:156.)
attn_output_weights = torch.bmm(q_scaled, k.transpose(-2, -1))
/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/nn/functional.py:5410: UserWarning: Deterministic behavior was enabled with either torch.use_deterministic_algorithms(True) or at::Context::setDeterministicAlgorithms(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility (Triggered internally at ../aten/src/ATen/Context.cpp:156.)
attn_output = torch.bmm(attn_output_weights, v)
/home/yjd/yjd_software/item/Salience-DETR-main/models/bricks/basic.py:52: UserWarning: Deterministic behavior was enabled with either torch.use_deterministic_algorithms(True) or at::Context::setDeterministicAlgorithms(true), but this operation is not deterministic because it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this case, you must set an environment variable before running your PyTorch application: CUBLAS_WORKSPACE_CONFIG=:4096:8 or CUBLAS_WORKSPACE_CONFIG=:16:8. For more information, go to https://docs.nvidia.com/cuda/cublas/index.html#cublasApi_reproducibility (Triggered internally at ../aten/src/ATen/Context.cpp:156.)
context = torch.matmul(input_x, context_mask)
100%|██████████| 1/1 [00:04<00:00, 4.10s/it]
0%| | 0/1 [00:00<?, ?it/s]/home/yjd/yjd_software/item/Salience-DETR-main/util/visualize.py:126: FutureWarning: The input object of type 'Tensor' is an array-like implementing one of the corresponding protocols (__array__, __array_interface__ or __array_struct__); but not a sequence (or 0-D). In the future, this object will be coerced as if it was first converted using np.array(obj). To retain the old behaviour, you have to either modify the type 'Tensor', or assign to an empty array created with np.empty(correct_shape, dtype=object).
boxes = np.array(boxes, dtype=np.int32)
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/yjd/yjd_software/item/Salience-DETR-main/inference.py", line 150, in
inference()
File "/home/yjd/yjd_software/item/Salience-DETR-main/inference.py", line 146, in inference
[None for _ in tqdm(data_loader)]
File "/home/yjd/yjd_software/item/Salience-DETR-main/inference.py", line 146, in
[None for _ in tqdm(data_loader)]
File "/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/tqdm/std.py", line 1181, in iter
for obj in iterable:
File "/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/accelerate/data_loader.py", line 454, in iter
current_batch = next(dataloader_iter)
File "/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 630, in next
data = self._next_data()
File "/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/_utils.py", line 694, in reraise
raise exception
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/yjd/anaconda3/envs/python3.8/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/home/yjd/yjd_software/item/Salience-DETR-main/inference.py", line 145, in
data_loader.collate_fn = lambda x: visualize_single_image(**x[0])
File "/home/yjd/yjd_software/item/Salience-DETR-main/inference.py", line 125, in visualize_single_image
image = plot_bounding_boxes_on_image_cv2(
File "/home/yjd/yjd_software/item/Salience-DETR-main/util/visualize.py", line 126, in plot_bounding_boxes_on_image_cv2
boxes = np.array(boxes, dtype=np.int32)
ValueError: only one element tensors can be converted to Python scalars

Process finished with exit code 1
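
A minimal sketch (not the repo's code) of the usual fix for this final ValueError: when boxes is a list of multi-element tensors, convert through torch.stack(...).cpu().numpy() instead of calling np.array on the list directly.

import numpy as np
import torch

boxes = [torch.tensor([10.2, 20.7, 110.9, 220.4]),
         torch.tensor([15.0, 25.0, 95.0, 200.0])]     # toy predicted boxes

# np.array(boxes, dtype=np.int32) on a list of multi-element tensors can raise the
# ValueError above; stacking into one tensor and converting through numpy avoids it.
boxes_np = torch.stack(boxes).detach().cpu().numpy().astype(np.int32)
print(boxes_np)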

Why use enc_outputs_class instead of foreground_score to select the top-k initial reference_points?

Question

Thanks to the author for open-sourcing the code of this excellent work! Based on my understanding of the paper, I have a question about how the initial reference_points (enc_outputs_coord) are selected; I would appreciate your clarification, thanks!

According to the paper, the foreground_score (salience_score) learned through salience-guided supervision is good at distinguishing foreground queries from background queries. However, looking at the implementation in salience_transformer.py, the initial reference_points fed into the decoder are still selected according to the enc_outputs_class predicted by encoder_class_head. Why not select the top-k reference_points according to foreground_score instead?

Looking forward to your reply.

Additional

No response

inference.py: how to run inference on the CPU

Question

I want to run inference.py on the CPU, but it keeps failing. Besides setting CUDA_VISIBLE_DEVICES=-1, is there another way to modify the code directly so that it runs on the CPU?
This is the current error: RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
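
A generic sketch of CPU-only inference (the toy model and checkpoint name are placeholders, not the repo's CLI): load the checkpoint with map_location="cpu" and keep both the model and the inputs on the CPU device.

import torch
import torch.nn as nn

device = torch.device("cpu")
model = nn.Linear(4, 2)                         # toy stand-in for the detector
torch.save(model.state_dict(), "cpu_demo.pth")  # toy checkpoint for the demo

state = torch.load("cpu_demo.pth", map_location=device)  # keep the weights on the CPU
model.load_state_dict(state)
model.eval().to(device)

with torch.no_grad():
    print(model(torch.randn(1, 4, device=device)).shape)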

Additional

No response

Program hangs: no error, no training progress

The program hangs after running a few batches: GPU memory stays allocated, but no computation happens.
The environment log is as follows:


sys.platform linux
Python 3.8.18 | packaged by conda-forge | (default, Dec 23 2023, 17:21:28) [GCC 12.3.0]
numpy 1.24.4
PyTorch 1.12.1+cu113 @/home/ubuntu22/anaconda3/envs/sl/lib/python3.8/site-packages/torch
PyTorch debug build False
torch._C._GLIBCXX_USE_CXX11_ABI False
GPU available Yes
GPU 0 NVIDIA GeForce RTX 3090 (arch=8.6)
Driver version 546.17
CUDA_HOME /usr/local/cuda-11.3
Pillow 10.3.0
torchvision 0.13.1+cu113 @/home/ubuntu22/anaconda3/envs/sl/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.9.0


PyTorch built with:

  • GCC 9.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.3
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.9.7 (built against CUDA 11.8)
    • Built with CuDNN 8.3.2
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

Many people say this is a deadlock or out-of-memory problem, but the program still hangs even with num_workers=1 and batch_size=2. Is there any way to solve this?

Hi, when checking the dataset with python tools/visualize_datasets.py --coco-img data/coco/val2017 --coco-ann data/coco/annotations/instances_val2017.json --show-dir /tools/visualize_dataset, I got the error below. Could you please advise?

PS D:\GitGit\Salience-DETR> python tools/visualize_datasets.py --coco-img data/coco/val2017 --coco-ann data/coco/annotations/instances_val2017.json --show-dir /tools/visualize_dataset
loading annotations into memory...
Done (t=0.74s)
creating index...
index created!
0%| | 0/5000 [00:02<?, ?it/s]
Traceback (most recent call last):
File "tools/visualize_datasets.py", line 96, in
visualize_datasets()
File "tools/visualize_datasets.py", line 72, in visualize_datasets
visualize_coco_bounding_boxes(
File "D:\GitGit\Salience-DETR\util\visualize.py", line 243, in visualize_coco_bounding_boxes
[None for _ in tqdm(data_loader)]
File "D:\GitGit\Salience-DETR\util\visualize.py", line 243, in
[None for _ in tqdm(data_loader)]
File "D:\Anaconda3\envs\salience_detr\lib\site-packages\tqdm\std.py", line 1181, in iter
for obj in iterable:
File "D:\Anaconda3\envs\salience_detr\lib\site-packages\torch\utils\data\dataloader.py", line 368, in iter
return self._get_iterator()
File "D:\Anaconda3\envs\salience_detr\lib\site-packages\torch\utils\data\dataloader.py", line 314, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "D:\Anaconda3\envs\salience_detr\lib\site-packages\torch\utils\data\dataloader.py", line 927, in init
w.start()
File "D:\Anaconda3\envs\salience_detr\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "D:\Anaconda3\envs\salience_detr\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "D:\Anaconda3\envs\salience_detr\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "D:\Anaconda3\envs\salience_detr\lib\multiprocessing\popen_spawn_win32.py", line 93, in init
reduction.dump(process_obj, to_child)
File "D:\Anaconda3\envs\salience_detr\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'visualize_coco_bounding_boxes..'
Traceback (most recent call last):
File "", line 1, in
File "D:\Anaconda3\envs\salience_detr\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "D:\Anaconda3\envs\salience_detr\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

[Bug]: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

Bug

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 8, 1092, 1092]], which is output 0 of ReluBackward0, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Environment

No response

Additional

No response

Environment setup

Following the readme, when I reach the step conda install --file requirements.txt, the process keeps spinning at Solving environment: /
and repeatedly prints:
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.

I tried switching to the Tsinghua mirror and back to the default channels, but it is still the same.
Is this normal? Is there a good way to solve it?

Exporting the operator 'aten::_upsample_bilinear2d_aa' to ONNX opset version 17 is not supported

raise errors.UnsupportedOperatorError(

torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::_upsample_bilinear2d_aa' to ONNX opset version 17 is not supported. Please feel free to request support or submit a pull request on PyTorch GitHub: https://github.com/pytorch/pytorch/issues.
============= Diagnostic Run torch.onnx.export version 2.0.1+cu118 =============
verbose: False, log level: Level.ERROR
======================= 0 NONE 0 NOTE 0 WARNING 1 ERROR ========================
ERROR: missing-standard-symbolic-function

Exporting the operator 'aten::_upsample_bilinear2d_aa' to ONNX opset version 17 is not supported. Please feel free to request support or submit a pull request on PyTorch GitHub: https://github.com/pytorch/pytorch/issues.
None

This error occurs when exporting the ONNX model. How can it be solved?
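
One common workaround, sketched below with a toy module rather than the repo's export script: aten::_upsample_bilinear2d_aa comes from interpolation with antialias=True, which has no ONNX symbolic in opset 17, so disabling antialiasing before export avoids the error (at a small cost in resize quality).

import torch
import torch.nn as nn
import torch.nn.functional as F

class Resize(nn.Module):
    def __init__(self, antialias: bool):
        super().__init__()
        self.antialias = antialias

    def forward(self, x):
        return F.interpolate(x, size=(224, 224), mode="bilinear",
                             align_corners=False, antialias=self.antialias)

x = torch.randn(1, 3, 448, 448)
# antialias=True produces aten::_upsample_bilinear2d_aa and fails as above;
# antialias=False exports as a plain ONNX Resize node.
torch.onnx.export(Resize(antialias=False), x, "resize_demo.onnx", opset_version=17)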

selected index k out of range

Question

Hi,

I encountered an error while training the model on a custom dataset:

select_tgt_index = torch.topk(mc_score, self.topk_sa, dim=1)[1]
RuntimeError: selected index k out of range
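
A minimal illustration of what usually triggers this, with a generic guard rather than the authors' fix: torch.topk is asked for more elements than the score tensor has along that dimension, which can happen on small custom datasets; clamping k avoids the RuntimeError.

import torch

mc_score = torch.rand(2, 5)     # toy scores: batch of 2 with only 5 candidates
topk_sa = 300                   # configured k larger than the number of candidates

k = min(topk_sa, mc_score.shape[1])            # avoids "selected index k out of range"
select_tgt_index = torch.topk(mc_score, k, dim=1)[1]
print(select_tgt_index.shape)                  # torch.Size([2, 5])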

Additional

No response

Visualizing feature maps

Many thanks to the author for contributing this excellent project. Could you add some code for visualizing the model's feature maps?

raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

The first time I ran accelerate main.py, the program got as far as downloading the resnet50 pretrained model, but the download did not finish, and it failed with RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory.
It was probably a network problem, but after I quit the terminal and ran it again, the program no longer tries to download the file and instead reports:

[2024-05-08 08:47:13 det.models.backbones.base_backbone]: Backbone architecture: resnet50
Loading extension module MultiScaleDeformableAttention...
Traceback (most recent call last):
File "main.py", line 205, in
train()
File "main.py", line 124, in train
model = Config(cfg.model_path).model
File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in init
exec(code, name_space)
File "", line 34, in
File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in new
weights = load_checkpoint(default_weight if weights is None else weights)
File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in init
super().init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
File "main.py", line 205, in
train()
File "main.py", line 124, in train
model = Config(cfg.model_path).model
File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in init
exec(code, name_space)
File "", line 34, in
File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in new
weights = load_checkpoint(default_weight if weights is None else weights)
File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in init
super().init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
File "main.py", line 205, in
train()
File "main.py", line 124, in train
model = Config(cfg.model_path).model
File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in init
exec(code, name_space)
File "", line 34, in
File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in new
weights = load_checkpoint(default_weight if weights is None else weights)
File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in init
super().init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
Traceback (most recent call last):
File "main.py", line 205, in
train()
File "main.py", line 124, in train
model = Config(cfg.model_path).model
File "/home/ubuntu/Zhu/python project/Salience-DETR/util/lazy_load.py", line 24, in init
exec(code, name_space)
File "", line 34, in
File "/home/ubuntu/Zhu/python project/Salience-DETR/models/backbones/resnet.py", line 412, in new
weights = load_checkpoint(default_weight if weights is None else weights)
File "/home/ubuntu/Zhu/python project/Salience-DETR/util/utils.py", line 373, in load_checkpoint
return torch.hub.load_state_dict_from_url(file_name, map_location=map_location)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/hub.py", line 770, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location, weights_only=weights_only)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 1005, in load
with _open_zipfile_reader(opened_file) as opened_zipfile:
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/serialization.py", line 457, in init
super().init(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
[2024-05-08 08:47:31,260] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5937 closing signal SIGTERM
[2024-05-08 08:47:31,261] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 5938 closing signal SIGTERM
[2024-05-08 08:47:31,877] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 2 (pid: 5939) of binary: /home/ubuntu/anaconda3/envs/salience_detr/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/salience_detr/bin/accelerate", line 8, in
sys.exit(main())
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2024-05-08_08:47:31
host : ubuntu-X640-G30
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 5940)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-05-08_08:47:31
host : ubuntu-X640-G30
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 5939)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

What should I do now?
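
A standard remedy for "failed finding central directory", sketched here under the assumption that the interrupted download left a truncated resnet50 file in the torch hub cache: delete the partial checkpoint so the next run downloads it again.

import os
import torch

cache_dir = os.path.join(torch.hub.get_dir(), "checkpoints")
if os.path.isdir(cache_dir):
    for name in os.listdir(cache_dir):
        if "resnet50" in name:                 # the interrupted download
            path = os.path.join(cache_dir, name)
            print("removing truncated checkpoint:", path)
            os.remove(path)
# re-running `accelerate launch main.py` will then download the weights again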

[Bug]: ModuleNotFoundError: No module named 'fvcore.common'

Bug

After setting up the environment, I get ModuleNotFoundError: No module named 'fvcore.common'.
misc.py reports that the reference 'common' cannot be resolved in '__init__.py', along with unresolved references 'PathManager' (line 16), 'pygments' (line 16), 'Terminal256Formatter' (line 113), 'pygments' (line 113), 'Python3Lexer' (line 114), and 'YamlLexer' (line 114).

Environment

The environment was set up following the official instructions, on an RTX 4060 Windows system.

Additional

No response

How to resume training from a checkpoint

Hello, thank you very much for open-sourcing this work. I am training Salience DETR with FocalNet as the backbone and ran into two problems: 1. CUDA out of memory: I train on a 3090 with batch_size=2. 2. How do I resume training from a checkpoint? I tried setting "resume_from_checkpoint = /hy-tmp/2024-05-09-22_34_34/best_ap.pth" in the train_config file, but it reports a syntax error. Thanks for your help.
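
A guess at the cause of the syntax error, assuming the train_config file is ordinary Python: a bare path is not valid Python, so the checkpoint path has to be written as a quoted string.

# resume_from_checkpoint = /hy-tmp/2024-05-09-22_34_34/best_ap.pth    # SyntaxError: bare path
resume_from_checkpoint = "/hy-tmp/2024-05-09-22_34_34/best_ap.pth"    # quoted Python string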

Training on my own dataset

Question

The official dataset structure is:

coco/
├── train2017/
├── val2017/
└── annotations/
├── instances_train2017.json
└── instances_val2017.json
The json files store the annotations for all the images.

In my own dataset, each image has its own json file, with the following directory structure:
coco/
├── train/
├── val/
└── annotations/
├── instances_train
├── 1.json
├── 2.json
├── instances_val
└── a.json
└── a.json

How should I modify the original code to handle this?
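
A rough sketch of one option: instead of changing the loader, merge the per-image json files into a single COCO-style annotation file. This assumes each per-image json already contains COCO-style "images", "annotations" and "categories" lists (your files may be structured differently, in which case the field handling below has to be adapted).

import json
from pathlib import Path

def merge_coco(per_image_dir: str, out_file: str) -> None:
    merged = {"images": [], "annotations": [], "categories": None}
    ann_id = 0
    for json_path in sorted(Path(per_image_dir).glob("*.json")):
        data = json.loads(json_path.read_text())
        merged["images"].extend(data["images"])
        if merged["categories"] is None:
            merged["categories"] = data["categories"]
        for ann in data["annotations"]:
            ann_id += 1
            ann["id"] = ann_id               # re-number annotation ids so they stay unique
            merged["annotations"].append(ann)
    Path(out_file).write_text(json.dumps(merged))

merge_coco("coco/annotations/instances_train", "coco/annotations/instances_train.json")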

Additional

backbone shape

Thanks to the author for this excellent project. I have a question: I want to print the shape of the output of the first convolution in the ResNet backbone, but what gets printed is Proxy(getattr_1), and I cannot see the (B, C, H, W) shape. How can I inspect it? Looking forward to your reply.
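
Proxy(...) is what appears when the module is being traced symbolically (for example by torch.fx or torchvision's feature extractor), where no real tensors exist yet. A generic sketch, using plain torchvision resnet50 rather than the repo's backbone wrapper: register a forward hook on the convolution and run a real image through the model to see concrete shapes.

import torch
from torchvision.models import resnet50

model = resnet50(weights=None).eval()

def print_shape(module, inputs, output):
    print(type(module).__name__, "output shape:", tuple(output.shape))

handle = model.conv1.register_forward_hook(print_shape)
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))   # prints: Conv2d output shape: (1, 64, 112, 112)
handle.remove()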


dist._broadcast_coalesced(

Question

I modified resume_from_checkpoint to 'checkpoints/salience_detr_resnet50_800_1333_coco_2x.pth'.

When I use the dual-GPU training command CUDA_VISIBLE_DEVICES=0,1 accelerate launch main.py, I get:
RuntimeError: Tensors must be CUDA and dense
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 96188 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 96189) of binary: /home/kb535/anaconda3/envs/salience_detr/bin/python

But when I use the single-GPU training command CUDA_VISIBLE_DEVICES=0 accelerate launch main.py,
no errors occur and training proceeds normally.

Additional

No response

Augmentation

Hello,
In presets.py I could not find any annotation compensation related to the geometric augmentations.
During training, after geometric transforms are applied to the images in the augmentation pipeline, is a corresponding compensation mechanism applied to the annotations?
Thanks.
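
Not a statement about what presets.py actually does, but a generic illustration of how such compensation is usually handled: with torchvision's transforms.v2 API (torchvision >= 0.16), geometric transforms are applied jointly to the image and its bounding boxes, so the boxes are adjusted inside the transform itself.

import torch
from torchvision import tv_tensors
from torchvision.transforms import v2

image = tv_tensors.Image(torch.zeros(3, 100, 200, dtype=torch.uint8))
boxes = tv_tensors.BoundingBoxes([[10, 10, 50, 40]], format="XYXY", canvas_size=(100, 200))

transform = v2.RandomHorizontalFlip(p=1.0)   # always flip, to make the demo deterministic
out_image, out_boxes = transform(image, boxes)
print(out_boxes)                             # x-coordinates are mirrored together with the image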

Distributed multi-node multi-GPU training hangs and then fails with a timeout

After finishing the first epoch, the program hangs during the second epoch and then fails with a timeout error.
Roughly where could this problem come from?
[2024-05-09 01:12:34 accelerate.tracking]: Successfully logged to TensorBoard
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601336 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601338 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 600851 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 600851 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f7657580d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f7604ac04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f7604ac3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f7604ac4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7f76506dbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7f7659e94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7f7659f26850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E ProcessGroupNCCL.cpp:1182] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601338 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fb489380d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fb4386c04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fb4386c3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fb4386c4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7fb4842dbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7fb48da94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7fb48db26850 in /lib/x86_64-linux-gnu/libc.so.6)

[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2665572, OpType=ALLREDUCE, NumelIn=11100676, NumelOut=11100676, Timeout(ms)=600000) ran for 601336 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403382592/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7ff8f5980d87 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7ff8a2ec04d6 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7ff8a2ec3a2d in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7ff8a2ec4629 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xdbbf4 (0x7ff8eeadbbf4 in /home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x94ac3 (0x7ff8f8294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: + 0x126850 (0x7ff8f8326850 in /lib/x86_64-linux-gnu/libc.so.6)

[2024-05-09 01:28:29,323] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4024 closing signal SIGTERM
[2024-05-09 01:28:31,494] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 4022) of binary: /home/ubuntu/anaconda3/envs/salience_detr/bin/python
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/salience_detr/bin/accelerate", line 8, in
sys.exit(main())
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1073, in launch_command
multi_gpu_launcher(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
distrib_run.run(args)
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/anaconda3/envs/salience_detr/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:
[1]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 4023)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4023
[2]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 4025)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4025

Root Cause (first observed failure):
[0]:
time : 2024-05-09_01:28:29
host : ubuntu-X640-G30
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 4022)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 4022
