diffusiondet's Issues

Sample step

cfg.MODEL.DiffusionDet.SAMPLE_STEP = 1

Why is the time step set to 1? In that case, the time pairs become {999, -1}. In the paper, the schedule is described as [{T-1, T-2}, {T-2, T-3}, ...]. Please explain this.

About the fps reported in the paper

Hi, thank you for your excellent work.
I have some questions about the inference time reported in the paper. I evaluated with 100 and 300 boxes and observed a big difference in inference time, yet the FPS reported in the paper is almost the same.
Could you give more details about how you benchmark the FPS in the paper?
Thanks
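For reference, a common way to time a detectron2-style model is sketched below. This is only a generic measurement loop with assumed names and warm-up counts, not necessarily the protocol used in the paper.

import time
import torch

def measure_fps(model, inputs, n_warmup=50, n_iters=200):
    # inputs: a batched-inputs list in detectron2 format for a single image
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):        # warm up kernels and caches
            model(inputs)
        torch.cuda.synchronize()         # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(n_iters):
            model(inputs)
        torch.cuda.synchronize()         # wait for the timed iterations to finish
        elapsed = time.perf_counter() - start
    return n_iters / elapsed             # frames per second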

training on custom dataset

Really loving this breakthrough! Could you kindly explain the steps, or make a Colab notebook, for training DiffusionDet on a custom dataset? (Maybe you could use the balloon dataset as used in detectron2.) This would really help beginners like me. Thanks :)

Questions about DDIM's performance

Hi,

Thanks for sharing your wonderful work. I have trouble figuring out the effectiveness of the DDIM process discussed in the paper. Since there is no related ablation study in the paper, I conducted experiments following the instructions, using the provided checkpoints. For example, I chose the diffdet.coco.res50.yaml config and the COCO Res50 checkpoint.

  1. The model is evaluated with four iterations, and the results are 46.34 mAP.
    diffdet_step4.log
  2. The model is evaluated with four iterations, with the time step fixed to initial values (999, for example). This setting gives 46.32 mAP.
    diffdet_step4_fix999_749.log
  3. The model is evaluated with four iterations, with totally new random boxes in each iteration. This setting gives 46.29 mAP.
    diffdet_step4_fix999_749_random.log

The modified detector.py is available in detector_FixandRand.zip

It seems that the performance gain introduced by the DDIM process is less than 0.05 mAP, which does not look significant for object detection.

I further used six iterations with the time steps fixed to the same initial values as in the four-iteration setting (time=999, time_next=749), and obtained 46.44 mAP. However, when using the DDIM process with dynamic time steps, the result is worse: just 46.35 mAP.
diffdet_step6.log
diffdet_step6_fix999_2749.log

Please correct me if there is something wrong with these experiments; this really confuses me.
Many thanks!

Stuck in training

Thanks for your excellent work!
I've found that training will sometimes get stuck; once it is stuck it basically stays still, and this happens randomly. I ran on a single V100-SXM2-32GB and it got stuck, and two RTX 3090s also got stuck. How should I solve this problem?

[Fixed] Clone in DETR Is Fully Wrong!

Hi~
Thanks for your excellent work. But I think what you do in Fig. 4 for DETR is fully wrong.

In Fig. 4, you conduct dynamic experiments with DETR, and you use clone as the padding method.

Actually, cloning the query embeddings means cloning the output bounding boxes. The drop in mAP is because all repeated bounding boxes are treated as false positives, since each ground truth can only be matched once. In other words, the predicted bounding boxes are the same as before, just repeated several times!

In this case, I think you cannot say that the performance of DETR has degenerated.

Also, for random padding, there may be some bounding boxes that are nearly duplicates of existing ones, too. If you pad the query embeddings for DETR, it is only fair to apply NMS as a post-processing step.

Attached:
Clone does not change the results of nn.MultiheadAttention. Also, obviously, clone does not change the results of the MLP. Therefore, clone does not change the results of DETR.

import torch
import torch.nn as nn

q = torch.rand((100, 256))   # 100 query embeddings, dim 256 (unbatched 2D input needs a recent PyTorch)
k = torch.rand((16, 256))    # 16 key/value tokens
v = k
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)
res1 = attn(q, k, v)
q2 = torch.cat([q] * 2, dim=0)   # "clone" padding: duplicate every query
res2 = attn(q2, k, v)
# the duplicated queries produce numerically identical outputs
print((res1[0] - res2[0][:100]).max()) # tensor(-5.3644e-07, grad_fn=<MinBackward1>)
print((res1[0] - res2[0][100:]).max()) # tensor(-5.9605e-07, grad_fn=<MinBackward1>)

How to use different steps during inference?

Thanks for your excellent work!

I notice that your experimental results show detection performance can be improved by increasing the number of sampling steps from 1 to 8. However, there seems to be no option to change the number of sampling steps in the config files. What should I modify if I want to use a different number of steps at inference?
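The first issue on this page quotes a cfg.MODEL.DiffusionDet.SAMPLE_STEP key. Assuming that key controls the number of sampling steps, it should be possible to override it the same way MODEL.WEIGHTS is overridden in the demo command further down this page, via detectron2's --opts mechanism (a sketch, not verified against the released configs):

python demo.py --config-file configs/diffdet.coco.res50.yaml \
    --input image.jpg --output output \
    --opts MODEL.WEIGHTS checkpoints/diffdet_coco_res50.pth MODEL.DiffusionDet.SAMPLE_STEP 4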

Questions about once-for-all property

Hi~
Thanks for your excellent work. I still have some questions about the once-for-all property that need more explanation in the paper.

In terms of progressive refinement, I assume that the performance gains come from model ensembling rather than from DDIM or diffusion training. We can regard each step of progressive refinement as an instantiation of a fixed-initial-box model, since the initial boxes at each step are totally random after DDIM and box renewal. This hypothesis can be validated by simply throwing away all boxes in box renewal: doing so gives 46.0 AP with the released checkpoint diffdet_coco_res50_300boxes.pth and 5 refinement steps, which is the same as the unmodified setting. The discussion in Issue 16 also suggests that DDIM contributes little.

As for dynamic boxes, it would be better to compare against Deformable DETR + iterative refinement + two-stage, since that variant of Deformable DETR does not use learnable queries. In my experiments, this Deformable DETR variant achieves 46.2 AP, 46.9 AP, 47.0 AP, and 47.0 AP with 100, 300, 500, and 1000 top-k queries. Although the gains are minor, dynamic boxes do not degrade performance.

Please correct me if there is something wrong with these experiments. I hope more in-depth analyses will be provided in the future. Many thanks!

Some questions about equation (2) in your paper

Thanks for your excellent and inspiring work. There is something that confuses me.

As we all know, after the forward diffusion process, a common diffusion model like DDPM uses the posterior distribution q(z_{t-1} | z_t, z_0), obtained via Bayes' theorem, to guide the learning of the prior distribution p(z_{t-1} | z_t). However, equation (2) in your paper seems to use z_0 directly as the optimization target to guide the learning of the prior distribution. In my opinion, such a difference means the proposed DiffusionDet is really just using the diffusion model to perform data augmentation, which raises a new question: would choosing another data-augmentation technique bring similar performance?
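For reference, the two training targets being contrasted can be written as follows (standard DDPM/DDIM notation; the exact parameterization used in the paper may differ):

\[
\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{t}\Big[\,\mathrm{KL}\big(\,q(z_{t-1}\mid z_t, z_0)\;\|\;p_\theta(z_{t-1}\mid z_t)\,\big)\Big],
\qquad
q(z_{t-1}\mid z_t, z_0) = \mathcal{N}\big(z_{t-1};\,\tilde{\mu}_t(z_t, z_0),\,\tilde{\beta}_t \mathbf{I}\big),
\]
\[
\mathcal{L}_{z_0\text{-pred}} = \mathbb{E}_{t,\,z_0,\,\epsilon}\Big[\,\big\| f_\theta(z_t, t) - z_0 \big\|^2\Big].
\]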

Sincerely hope to receive your reply.

About training loss

In DDIM or DDPM, there are losses (KL-divergence terms) that constrain the diffused outputs during training to be Gaussian; I thought this was the basis for DDIM sampling (the reverse process). However, in DiffusionDet, only the set-prediction loss is used. So how can DDIM work without this training constraint?

What's the usage of GaussianFourierProjection?

Hi, thank you for your interesting work!

I wonder what the usage of this function is.

I notice SinusoidalPositionEmbeddings is used to encode time embeddings.

Then, if I want to apply time embeddings to spatial feature maps (i.e., N*C*H*W), how can I achieve that? It seems I cannot directly use SinusoidalPositionEmbeddings.
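One common pattern (used in typical diffusion U-Nets, not necessarily in DiffusionDet itself) is to project the sinusoidal time embedding to C channels with a small MLP and add it to the feature map by broadcasting over H and W. A minimal sketch with illustrative module names:

import math
import torch
import torch.nn as nn

def sinusoidal_time_embedding(t, dim):
    # t: (N,) timestep tensor; returns a (N, dim) embedding
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

class TimeToSpatial(nn.Module):
    # project a time embedding to C channels and add it to an N*C*H*W feature map
    def __init__(self, time_dim, channels):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(time_dim, channels), nn.SiLU(), nn.Linear(channels, channels))

    def forward(self, feat, t_emb):
        # feat: (N, C, H, W), t_emb: (N, time_dim)
        shift = self.proj(t_emb)               # (N, C)
        return feat + shift[:, :, None, None]  # broadcast over H and W

feat = torch.randn(2, 256, 32, 32)
t = torch.tensor([10, 500])
t_emb = sinusoidal_time_embedding(t, 128)
out = TimeToSpatial(128, 256)(feat, t_emb)
print(out.shape)  # torch.Size([2, 256, 32, 32])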

Log / training curves

Hi! Could you please publish logs / training curves / wall-clock time? Did you use 8 × A100 GPUs for training?

This would be helpful for follow-up work and for smaller reproduction setups.

Thank you!

A question about proposal boxes initialize.

Thank you for your brilliant work! I still have a question about initialising proposal boxes.

During training, have you ever tried using completely random proposal boxes instead of generating noisy boxes from the ground truth via the diffusion process? I have done a similar experiment on another task, and found that fully random boxes may not work best, but they still work.

So I am wondering if you have made a similar attempt.

What may cause the AP result difference

I ran the evaluation command but got different AP results. Note that I only changed the command argument --num-gpus 8 to --num-gpus 1, since I only have one GPU device. The results are slightly different from the paper's experiments. I evaluated with your pretrained weights; any idea why the difference occurred?

The scores in parentheses are from the paper; the scores outside the parentheses are my evaluation results.

model | AP | AP50 | AP75 | APs | APm | APl
coco.res50 | 45.776 (45.5) | 65.417 (65.1) | 49.313 (48.7) | 27.809 (27.5) | 48.298 (48.1) | 61.726 (61.2)
coco.res101 | 46.528 (46.6) | 66.290 (66.3) | 49.969 (50.0) | 29.977 (30.0) | 49.468 (49.3) | 62.079 (62.8)
coco.swinbase | 52.301 (52.3) | 72.812 (72.7) | 56.473 (56.3) | 35.481 (34.8) | 56.004 (56.0) | 68.613 (68.5)

log about training

Thanks for your great work!
I want to reproduce the results, but I don't know roughly what values the losses (l1, giou, ...) should reach by the end of training. Could you please provide your training log file? Thanks!

Detectron2 Installation

Thank you for the great work. I was trying to run LVIS model training and got the following error:

ImportError: cannot import name 'get_fed_loss_cls_weights' from 'detectron2.data.detection_utils'

I am using PyTorch==1.10.0 and Detectron2==0.6 with CUDA 11.1. I installed Detectron2 from the pre-built libraries.
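If get_fed_loss_cls_weights was added to detectron2 after the 0.6 release (an assumption based on the ImportError), the pre-built 0.6 wheels would not contain it, and installing detectron2 from source should pick up the newer symbol, e.g.:

python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'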

checkpoint

How do I change the save frequency of checkpoints?
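DiffusionDet is built on detectron2, whose default config exposes a checkpoint period in iterations. Assuming DiffusionDet does not override this mechanism, something like the following should change the save frequency (the value is illustrative); the same key can also be set under SOLVER in the yaml config:

cfg.SOLVER.CHECKPOINT_PERIOD = 10000   # save a checkpoint every 10,000 iterations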

Question about confidence

First of all, thanks for your work; the paper is super interesting and I would like to employ something similar to your box renewal in my research. But from the paper and the code, it is not intuitive to me how it works.
My intuition is that GT boxes have high confidence at the beginning, while padded boxes should have low confidence.
What is not clear is how the confidence is handled in the diffusion process: does the confidence also get noised by the process, and do you try to reverse it? If so, could it be that in the reverse process, due to its fluctuations, you reject some boxes that would have been fine?

Thanks in advance!

The signal scaling in the training stage

Hi,

Thanks for sharing your wonderful work. I can't understand the signal-scaling equation pb = (pb * 2 - 1) * scale in the training stage. Could you explain in more detail why pb is transformed this way?
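For context, here is a minimal sketch of the transform and of the inverse that appears later on this page (x = ((x / self.scale) + 1) / 2.), under the assumption that pb holds box coordinates normalized to [0, 1] and with scale = 2.0 as an illustrative value:

import torch

scale = 2.0                                # signal scale (illustrative value)
pb = torch.tensor([0.0, 0.25, 0.5, 1.0])   # normalized box coordinates in [0, 1]

x = (pb * 2 - 1) * scale                   # map [0, 1] -> [-1, 1], then stretch to [-scale, scale]
pb_back = ((x / scale) + 1) / 2            # inverse mapping back to [0, 1]

print(x)        # tensor([-2., -1.,  0.,  2.])
print(pb_back)  # tensor([0.0000, 0.2500, 0.5000, 1.0000])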

Many thanks!

question about sample step

In the pseudo-code infer part of DiffusionDet paper, it shows

times = reversed(torch.linspace(-1, 1000, steps))

and according to the experimental setting, steps is set to 1, so I tried this:

times = reversed(torch.linspace(-1, 1000, steps=1))
print(times)
time_pairs = list(zip(times[:-1], times[1:]))
print(time_pairs)

and the output print:

tensor([-1.])
[]

I change the step to 4:

times = reversed(torch.linspace(-1, 1000, steps=4))
print(times)
time_pairs = list(zip(times[:-1], times[1:]))
print(time_pairs)

Then I get:

tensor([1000.0000,  666.3334,  332.6667,   -1.0000])
[(tensor(1000.), tensor(666.3334)), (tensor(666.3334), tensor(332.6667)), (tensor(332.6667), tensor(-1.))]

So when steps is 1, we only get tensor([-1.]) and an empty list of time pairs, which seems weird. Or have I misunderstood the process?
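One way to reconcile this with the (999, -1) pair quoted in the first issue on this page is to build the schedule from steps + 1 points, as common DDIM sampling implementations do. A sketch of that construction (not necessarily the exact code in this repository):

import torch

total_timesteps = 1000
sampling_steps = 1

# steps + 1 points from T-1 down to -1, then consecutive pairs
times = torch.linspace(-1, total_timesteps - 1, steps=sampling_steps + 1)
times = list(reversed(times.int().tolist()))   # [999, -1]
time_pairs = list(zip(times[:-1], times[1:]))  # [(999, -1)]
print(time_pairs)

# with sampling_steps = 4 the same code gives [999, 749, 499, 249, -1],
# i.e. the pairs (999, 749), (749, 499), (499, 249), (249, -1)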

Thought about Box corruption.

It looks like only a scaling operation is applied during box corruption in the training stage. By only scaling the boxes, their centers remain the same; but at inference the boxes are generated completely at random, which means the model has to adjust the centers (perhaps drastically) to place a box correctly.
Are there any ablation studies on the corruption method? (Or are they already in the paper and I just missed them?)

Training time

Hi,

Thanks for sharing your great work. May I know the training time of each model if possible?

Many thanks!

How to train DiffusionDet on my own dataset?

Thanks for sharing this amazing work, but I have some confusion about training this model on my own dataset.
If my data format is just like COCO, which parameters should I change? Could you please show the training command line for DiffusionDet? I did not see any quick-start guide for training this model.
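Since DiffusionDet is built on detectron2, a COCO-format dataset can usually be registered with detectron2's register_coco_instances and then referenced from the config. A minimal sketch (dataset names, paths, and the config keys in the comments are illustrative assumptions):

from detectron2.data.datasets import register_coco_instances

# register a COCO-format dataset (names and paths are illustrative)
register_coco_instances("my_dataset_train", {},
                        "datasets/my_dataset/annotations/train.json",
                        "datasets/my_dataset/images/train")
register_coco_instances("my_dataset_val", {},
                        "datasets/my_dataset/annotations/val.json",
                        "datasets/my_dataset/images/val")

# then point the config at these names, e.g. in the yaml:
# DATASETS:
#   TRAIN: ("my_dataset_train",)
#   TEST: ("my_dataset_val",)
# and set the number of classes (the exact key name in DiffusionDet's config is assumed).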

A question about the time in Sparse head

Thank you for your brilliant work!
The model uses 6 sparse heads, and each head predicts an offset. After a head (e.g. the 1st head), the noise level of the boxes should decrease. Do you think it would be reasonable, or better, to pass a different, smaller time value to the next head?

OSError occurred when saving the model .pth file using 8 GPUs

Hello.

Thanks for the excellent detection algorithm.

I faced an OSError when saving the model during 8-GPU training.

Is this a problem with my environment, or a known error?

[12/20 18:16:09 d2.utils.events]: eta: 19:46:55 iter: 4959 total_loss: 10.31 loss_ce: 0.8514 loss_bbox: 0.3865 loss_giou: 0.4012 loss_ce_0: 0.902 loss_bbox_0: 0.4758 loss_giou_0: 0.4936 loss_ce_1: 0.8713 loss_bbox_1: 0.4325 loss_giou_1: 0.4167 loss_ce_2: 0.8526 loss_bbox_2: 0.4106 loss_giou_2: 0.4151 loss_ce_3: 0.8237 loss_bbox_3: 0.409 loss_giou_3: 0.391 loss_ce_4: 0.8287 loss_bbox_4: 0.4046 loss_giou_4: 0.4085 time: 0.2815 data_time: 0.0115 lr: 2.5e-05 max_mem: 5418M
[12/20 18:16:14 d2.utils.events]: eta: 19:46:57 iter: 4979 total_loss: 9.924 loss_ce: 0.7974 loss_bbox: 0.3912 loss_giou: 0.3875 loss_ce_0: 0.8377 loss_bbox_0: 0.4803 loss_giou_0: 0.448 loss_ce_1: 0.8489 loss_bbox_1: 0.3942 loss_giou_1: 0.3934 loss_ce_2: 0.8181 loss_bbox_2: 0.4009 loss_giou_2: 0.3799 loss_ce_3: 0.7867 loss_bbox_3: 0.3836 loss_giou_3: 0.38 loss_ce_4: 0.7726 loss_bbox_4: 0.3905 loss_giou_4: 0.3784 time: 0.2814 data_time: 0.0108 lr: 2.5e-05 max_mem: 5418M
[12/20 18:16:20 fvcore.common.checkpoint]: Saving checkpoint to ./output/model_0004999.pth
ERROR [12/20 18:16:20 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/torch/serialization.py", line 423, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/torch/serialization.py", line 650, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 7] Argument list too long

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 128, in save
torch.save(data, f)
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/torch/serialization.py", line 424, in save
return
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/torch/serialization.py", line 299, in exit
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:325] . unexpected pos 361728768 vs 361728656

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 150, in train
self.after_step()
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 180, in after_step
h.after_step()
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/detectron2/engine/hooks.py", line 206, in after_step
self.step(self.trainer.iter)
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 430, in step
self.checkpointer.save(
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 128, in save
torch.save(data, f)
OSError: [Errno 7] Argument list too long
[12/20 18:16:20 d2.engine.hooks]: Overall training speed: 4997 iterations in 0:23:26 (0.2815 s / it)
[12/20 18:16:20 d2.engine.hooks]: Total training time: 0:23:38 (0:00:11 on hooks)
[12/20 18:16:20 d2.utils.events]: eta: 19:46:44 iter: 4999 total_loss: 9.931 loss_ce: 0.8132 loss_bbox: 0.3718 loss_giou: 0.3842 loss_ce_0: 0.881 loss_bbox_0: 0.476 loss_giou_0: 0.4806 loss_ce_1: 0.859 loss_bbox_1: 0.407 loss_giou_1: 0.4112 loss_ce_2: 0.8139 loss_bbox_2: 0.3961 loss_giou_2: 0.399 loss_ce_3: 0.8251 loss_bbox_3: 0.3842 loss_giou_3: 0.3866 loss_ce_4: 0.818 loss_bbox_4: 0.3894 loss_giou_4: 0.3854 time: 0.2814 data_time: 0.0117 lr: 2.5e-05 max_mem: 5418M

How the detection heads work?

Why use so many RCNN heads (6 by default) here? If I use fewer heads, or just one head, would it still work well? Are there any references about this part that I can learn from?

Why permute pro_features and why `[0]` here?

pro_features = pro_features.view(N, nr_boxes, self.d_model).permute(1, 0, 2)
pro_features2 = self.self_attn(pro_features, pro_features, value=pro_features)[0]

  1. Before passing the proposal features into self.self_attn, you permute them so their size becomes nr_boxes*N*C, which confused me.

For a normal transformer block, the feature should have size N*num_tokens*C, but yours is nr_boxes*N*C. Why?

  2. Does [0] make the feature N*C by pulling out the first box?
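For what it's worth, the shapes follow PyTorch's default nn.MultiheadAttention conventions; a minimal sketch (not the DiffusionDet code itself):

import torch
import torch.nn as nn

N, nr_boxes, d_model = 2, 100, 256
pro_features = torch.randn(N, nr_boxes, d_model)

# nn.MultiheadAttention defaults to sequence-first input (L, N, E) when batch_first=False,
# so permuting makes the boxes the sequence dimension: (nr_boxes, N, d_model)
x = pro_features.permute(1, 0, 2)

self_attn = nn.MultiheadAttention(d_model, num_heads=8)
out, attn_weights = self_attn(x, x, value=x)

# the module returns a tuple (attn_output, attn_weights); [0] selects the attended
# features and keeps the full (nr_boxes, N, d_model) shape -- it does not drop boxes
print(out.shape)           # torch.Size([100, 2, 256])
print(attn_weights.shape)  # torch.Size([2, 100, 100])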

show your command line

Thanks for sharing this amazing work. If I use the official COCO dataset to train this model, could you please show me your command line and which config file you chose?
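A typical detectron2-style launch would look like the following, assuming the repository's training entry point is named train_net.py (an assumption) and using the diffdet.coco.res50.yaml config referenced elsewhere on this page:

python train_net.py --num-gpus 8 \
    --config-file configs/diffdet.coco.res50.yaml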

Inference Speed

Hi,
I have used the pretrained model, but its inference speed is not good (running on Colab with a Tesla T4 GPU).
Video dimensions: 4096 × 1080.
I am getting a speed of 1.20 s/it, i.e. 1.20 seconds per frame.

Using this:

!python3 demo.py --config-file configs/diffdet.coco.res50.yaml \
    --video-input input.mp4 \
    --output output \
    --opts MODEL.WEIGHTS checkpoints/diffdet_coco_res50.pth

Is there any way to increase the inference speed?
What dimensions should the video have to get higher speed?
Thanks
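One generic knob in detectron2-based models is the test-time resize, controlled by INPUT.MIN_SIZE_TEST and INPUT.MAX_SIZE_TEST. Assuming DiffusionDet keeps these defaults, they can be lowered through the same --opts mechanism used above (a sketch; the values are illustrative and trade accuracy for speed):

!python3 demo.py --config-file configs/diffdet.coco.res50.yaml \
    --video-input input.mp4 \
    --output output \
    --opts MODEL.WEIGHTS checkpoints/diffdet_coco_res50.pth INPUT.MIN_SIZE_TEST 480 INPUT.MAX_SIZE_TEST 853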

[A Bug?] The diffused boxes as input may have negative coordinates.

x = torch.clamp(x, min=-1 * self.scale, max=self.scale)
x = ((x / self.scale) + 1) / 2.
diff_boxes = box_cxcywh_to_xyxy(x)

Though you clamp the coordinates (x) at line 400, they can become negative when converted from (cx, cy, w, h) to (x, y, x, y) format. Here is an example:

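A minimal numeric sketch of the situation (the values are illustrative):

import torch

# a (cx, cy, w, h) box whose center is close to the image border after clamping
cxcywh = torch.tensor([0.01, 0.5, 0.20, 0.30])
cx, cy, w, h = cxcywh

xyxy = torch.tensor([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
print(xyxy)  # tensor([-0.0900,  0.3500,  0.1100,  0.6500]) -- x1 is negative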

It happens when cx or cy is close to zero (or clamped to zero): computing cx - w/2 or cy - h/2 then yields a negative number.

Is it acceptable to pass negative coordinates to the RCNN head? Will it cause any unexpected behavior when extracting RoI features?

Question about the targets format / prepare_targets / self.head

Hi!

Could you please elaborate a bit on the format of outputs_coord returned by self.head? and the format of targets returned by prepare_targets?

In training, it seems the model predicts the new box coordinates in absolute coordinates, given the previous iteration's estimate. The loss is applied to those outputs_coord, and it seems to compare them not to the noise, but always to the ground-truth boxes: https://github.com/ShoufaChen/DiffusionDet/blob/main/diffusiondet/loss.py#L180-L181. The noises return value from prepare_targets seems to be discarded.

Am I correct? I am basically trying to understand how the box prediction is parameterized. Is it because the RCNNHead re-applies the predicted deltas/shifts before returning its value?

Another question is about the DynamicHead: am I correct that at every diffusion step the head already performs NUM_HEADS = 6 rounds of refinement via self.head_series, which consists of RCNNHead instances?

Thanks!
