diffusiondet's Issues

Sample step

cfg.MODEL.DiffusionDet.SAMPLE_STEP = 1

Why is the time step set to 1? In that case, the time pairs become {999, -1}. In the paper, the schedule is described as [{T-1, T-2}, {T-2, T-3}, ...]. Please explain this.

About the fps reported in the paper

Hi, thank you for your excellent work.
I have some questions about the inference time reported in the paper. I evaluated with 100 and 300 boxes and observed a big difference in inference time, yet the FPS reported in the paper is almost the same.
Could you give more details about how you benchmark the FPS in the paper?
Thanks
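For reference, a common way to time a detectron2-style model is sketched below. This is only a generic measurement loop with assumed names and warm-up counts, not necessarily the protocol used in the paper.

import time
import torch

def measure_fps(model, inputs, n_warmup=50, n_iters=200):
    # inputs: a batched-inputs list in detectron2 format for a single image
    model.eval()
    with torch.no_grad():
        for _ in range(n_warmup):        # warm up kernels and caches
            model(inputs)
        torch.cuda.synchronize()         # wait for queued GPU work before timing
        start = time.perf_counter()
        for _ in range(n_iters):
            model(inputs)
        torch.cuda.synchronize()         # wait for the timed iterations to finish
        elapsed = time.perf_counter() - start
    return n_iters / elapsed             # frames per second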

training on custom dataset

Really loving this breakthrough! Could you kindly explain the steps, or make a Colab notebook, for training DiffusionDet on a custom dataset? (Maybe you could use the balloon dataset as used in detectron2.) This would really help beginners like me. Thanks :)

Questions about DDIM's performance

Hi,

Thanks for sharing your wonderful work. I have trouble figuring out the effectiveness of the DDIM process discussed in the paper. Since there is no related ablation study in the paper, I conducted experiments following the instructions, using the provided checkpoints. For example, I chose the diffdet.coco.res50.yaml config and the COCO Res50 checkpoint.

  1. The model is evaluated with four iterations, and the results are 46.34 mAP.
    diffdet_step4.log
  2. The model is evaluated with four iterations, with the time step fixed to initial values (999, for example). This setting gives 46.32 mAP.
    diffdet_step4_fix999_749.log
  3. The model is evaluated with four iterations, with totally new random boxes in each iteration. This setting gives 46.29 mAP.
    diffdet_step4_fix999_749_random.log

The modified detector.py is available in detector_FixandRand.zip

It seems that the performance gain introduced by the DDIM process is less than 0.05 mAP, which does not look significant for object detection.

I further used six iterations with the time steps fixed to the same initial values as in the four-iteration setting (time=999, time_next=749), and obtained 46.44 mAP. However, when using the DDIM process with dynamic time steps, the result is worse: just 46.35 mAP.
diffdet_step6.log
diffdet_step6_fix999_2749.log

Please correct me if there is something wrong with these experiments; this really confuses me.
Many thanks!

Stuck in training

Thanks for your excellent work!
I've found that training will sometimes get stuck; once it is stuck it basically stays still, and this happens randomly. I ran on a single V100-SXM2-32GB and it got stuck, and two RTX 3090s also got stuck. How should I solve this problem?

[Fixed] Clone in DETR Is Fully Wrong!

Hi~
Thanks for your excellent work. But I think what you do in Fig. 4 for DETR is fully wrong.

In Fig. 4, you conduct dynamic experiments with DETR, and you use clone as the padding method.

Actually, cloning the query embeddings means cloning the output bounding boxes. The drop in mAP is because all repeated bounding boxes are treated as false positives, since each ground truth can only be matched once. In other words, the predicted bounding boxes are the same as before, just repeated several times!

In this case, I think you cannot say that the performance of DETR has degenerated.

Also, for random padding, there may be some bounding boxes that are nearly duplicates of existing ones, too. If you pad the query embeddings for DETR, it is only fair to apply NMS as a post-processing step.

Attached:
Clone does not change the results of nn.MultiheadAttention. Also, obviously, clone does not change the results of the MLP. Therefore, clone does not change the results of DETR.

import torch
import torch.nn as nn

q = torch.rand((100, 256))   # 100 query embeddings, dim 256 (unbatched 2D input needs a recent PyTorch)
k = torch.rand((16, 256))    # 16 key/value tokens
v = k
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)
res1 = attn(q, k, v)
q2 = torch.cat([q] * 2, dim=0)   # "clone" padding: duplicate every query
res2 = attn(q2, k, v)
# the duplicated queries produce numerically identical outputs
print((res1[0] - res2[0][:100]).max()) # tensor(-5.3644e-07, grad_fn=<MinBackward1>)
print((res1[0] - res2[0][100:]).max()) # tensor(-5.9605e-07, grad_fn=<MinBackward1>)

How to use different steps during inference?

Thanks for your excellent work!

I notice that your experimental results show detection performance can be improved by increasing the number of sampling steps from 1 to 8. However, there seems to be no option to change the number of sampling steps in the config files. What should I modify if I want to use a different number of steps at inference?
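The first issue on this page quotes a cfg.MODEL.DiffusionDet.SAMPLE_STEP key. Assuming that key controls the number of sampling steps, it should be possible to override it the same way MODEL.WEIGHTS is overridden in the demo command further down this page, via detectron2's --opts mechanism (a sketch, not verified against the released configs):

python demo.py --config-file configs/diffdet.coco.res50.yaml \
    --input image.jpg --output output \
    --opts MODEL.WEIGHTS checkpoints/diffdet_coco_res50.pth MODEL.DiffusionDet.SAMPLE_STEP 4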

Questions about once-for-all property

Hi~
Thanks for your excellent work. I still have some questions about the once-for-all property that need more explanation in the paper.

In terms of progressive refinement, I assume that the performance gains come from model ensembling rather than from DDIM or diffusion training. We can regard each step of progressive refinement as an instantiation of a fixed-initial-box model, since the initial boxes at each step are totally random after DDIM and box renewal. This hypothesis can be validated by simply throwing away all boxes in box renewal: doing so gives 46.0 AP with the released checkpoint diffdet_coco_res50_300boxes.pth and 5 refinement steps, which is the same as the unmodified setting. The discussion in Issue 16 also suggests that DDIM contributes little.

As for dynamic boxes, it would be better to compare against Deformable DETR + iterative refinement + two-stage, since that variant of Deformable DETR does not use learnable queries. In my experiments, this Deformable DETR variant achieves 46.2 AP, 46.9 AP, 47.0 AP, and 47.0 AP with 100, 300, 500, and 1000 top-k queries. Although the gains are minor, dynamic boxes do not degrade performance.

Please correct me if there is something wrong with these experiments. I hope more in-depth analyses will be provided in the future. Many thanks!

Some questions about equation (2) in your paper

Thanks for your excellent and inspiring work. There is something that confuses me.

As we all know, after the forward diffusion process, a common diffusion model like DDPM uses the posterior distribution q(z_{t-1} | z_t, z_0), obtained via Bayes' theorem, to guide the learning of the prior distribution p(z_{t-1} | z_t). However, equation (2) in your paper seems to use z_0 directly as the optimization target to guide the learning of the prior distribution. In my opinion, such a difference means the proposed DiffusionDet is really just using the diffusion model to perform data augmentation, which raises a new question: would choosing another data-augmentation technique bring similar performance?
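For reference, the two training targets being contrasted can be written as follows (standard DDPM/DDIM notation; the exact parameterization used in the paper may differ):

\[
\mathcal{L}_{\mathrm{DDPM}} = \mathbb{E}_{t}\Big[\,\mathrm{KL}\big(\,q(z_{t-1}\mid z_t, z_0)\;\|\;p_\theta(z_{t-1}\mid z_t)\,\big)\Big],
\qquad
q(z_{t-1}\mid z_t, z_0) = \mathcal{N}\big(z_{t-1};\,\tilde{\mu}_t(z_t, z_0),\,\tilde{\beta}_t \mathbf{I}\big),
\]
\[
\mathcal{L}_{z_0\text{-pred}} = \mathbb{E}_{t,\,z_0,\,\epsilon}\Big[\,\big\| f_\theta(z_t, t) - z_0 \big\|^2\Big].
\]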

Sincerely hope to receive your reply.

About training loss

In DDIM or DDPM, there are losses (KL-divergence terms) that constrain the diffused outputs during training to be Gaussian; I thought this was the basis for DDIM sampling (the reverse process). However, in DiffusionDet, only the set-prediction loss is used. So how can DDIM work without this training constraint?

What's the usage of GaussianFourierProjection?

Hi, thank you for your interesting work!

I wonder what the usage of this function is.

I notice SinusoidalPositionEmbeddings is used to encode time embeddings.

Then, if I want to apply time embeddings to spatial feature maps (i.e., N*C*H*W), how can I achieve that? It seems I cannot directly use SinusoidalPositionEmbeddings.
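One common pattern (used in typical diffusion U-Nets, not necessarily in DiffusionDet itself) is to project the sinusoidal time embedding to C channels with a small MLP and add it to the feature map by broadcasting over H and W. A minimal sketch with illustrative module names:

import math
import torch
import torch.nn as nn

def sinusoidal_time_embedding(t, dim):
    # t: (N,) timestep tensor; returns a (N, dim) embedding
    half = dim // 2
    freqs = torch.exp(-math.log(10000) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

class TimeToSpatial(nn.Module):
    # project a time embedding to C channels and add it to an N*C*H*W feature map
    def __init__(self, time_dim, channels):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(time_dim, channels), nn.SiLU(), nn.Linear(channels, channels))

    def forward(self, feat, t_emb):
        # feat: (N, C, H, W), t_emb: (N, time_dim)
        shift = self.proj(t_emb)               # (N, C)
        return feat + shift[:, :, None, None]  # broadcast over H and W

feat = torch.randn(2, 256, 32, 32)
t = torch.tensor([10, 500])
t_emb = sinusoidal_time_embedding(t, 128)
out = TimeToSpatial(128, 256)(feat, t_emb)
print(out.shape)  # torch.Size([2, 256, 32, 32])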

Log / training curves

Hi! Could you please publish logs / training curves / wall-clock time? Did you use 8 × A100 GPUs for training?

This would be helpful for follow-up work and for smaller reproduction setups.

Thank you!

A question about proposal boxes initialize.

Thank you for your brilliant work! I still have a question about initialising proposal boxes.

During training, have you ever tried using completely random proposal boxes instead of generating noisy boxes from the ground truth via the diffusion process? I have done a similar experiment on another task, and found that fully random boxes may not work best, but they still work.

So I am wondering if you have made a similar attempt.

What may cause the AP result difference

I ran the evaluation command but got different AP results. Note that I only changed the command argument --num-gpus 8 to --num-gpus 1, since I only have one GPU device. The results are slightly different from the paper's experiments. I evaluated with your pretrained weights; any idea why the difference occurred?

The scores in parentheses are from the paper; the scores outside the parentheses are my evaluation results.

model | AP | AP50 | AP75 | APs | APm | APl
coco.res50 | 45.776 (45.5) | 65.417 (65.1) | 49.313 (48.7) | 27.809 (27.5) | 48.298 (48.1) | 61.726 (61.2)
coco.res101 | 46.528 (46.6) | 66.290 (66.3) | 49.969 (50.0) | 29.977 (30.0) | 49.468 (49.3) | 62.079 (62.8)
coco.swinbase | 52.301 (52.3) | 72.812 (72.7) | 56.473 (56.3) | 35.481 (34.8) | 56.004 (56.0) | 68.613 (68.5)

log about training

Thanks for your great work!
I want to reproduce the results, but I don't know roughly what values the losses (l1, giou, ...) should reach by the end of training. Could you please provide your training log file? Thanks!

Detectron2 Installation

Thank you for the great work. I was trying to run LVIS model training and got the following error:

ImportError: cannot import name 'get_fed_loss_cls_weights' from 'detectron2.data.detection_utils'

I am using PyTorch==1.10.0 and Detectron2==0.6 with CUDA 11.1. I installed Detectron2 from the pre-built libraries.
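If get_fed_loss_cls_weights was added to detectron2 after the 0.6 release (an assumption based on the ImportError), the pre-built 0.6 wheels would not contain it, and installing detectron2 from source should pick up the newer symbol, e.g.:

python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'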

checkpoint

How do I change the save frequency of checkpoints?
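DiffusionDet is built on detectron2, whose default config exposes a checkpoint period in iterations. Assuming DiffusionDet does not override this mechanism, something like the following should change the save frequency (the value is illustrative); the same key can also be set under SOLVER in the yaml config:

cfg.SOLVER.CHECKPOINT_PERIOD = 10000   # save a checkpoint every 10,000 iterations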

Question about confidence

First of all, thanks for your work; the paper is super interesting and I would like to employ something similar to your box renewal in my research. But from the paper and the code, it is not intuitive to me how it works.
My intuition is that GT boxes have high confidence at the beginning, while padded boxes should have low confidence.
What is not clear is how the confidence is handled in the diffusion process: does the confidence also get noised by the process, and do you try to reverse it? If so, could it be that in the reverse process, due to its fluctuations, you reject some boxes that would have been fine?

Thanks in advance!

The signal scaling in the training stage

Hi,

Thanks for sharing your wonderful work. I can't understand the signal-scaling equation pb = (pb * 2 - 1) * scale in the training stage. Could you explain in more detail why pb is transformed this way?
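For context, here is a minimal sketch of the transform and of the inverse that appears later on this page (x = ((x / self.scale) + 1) / 2.), under the assumption that pb holds box coordinates normalized to [0, 1] and with scale = 2.0 as an illustrative value:

import torch

scale = 2.0                                # signal scale (illustrative value)
pb = torch.tensor([0.0, 0.25, 0.5, 1.0])   # normalized box coordinates in [0, 1]

x = (pb * 2 - 1) * scale                   # map [0, 1] -> [-1, 1], then stretch to [-scale, scale]
pb_back = ((x / scale) + 1) / 2            # inverse mapping back to [0, 1]

print(x)        # tensor([-2., -1.,  0.,  2.])
print(pb_back)  # tensor([0.0000, 0.2500, 0.5000, 1.0000])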

Many thanks!

question about sample step

In the pseudo-code infer part of DiffusionDet paper, it shows

times = reversed(torch.linspace(-1, 1000, steps))

and according to the experimental setting, steps is set to 1, so I tried this:

times = reversed(torch.linspace(-1, 1000, steps=1))
print(times)
time_pairs = list(zip(times[:-1], times[1:]))
print(time_pairs)

and the output print:

tensor([-1.])
[]

I change the step to 4:

times = reversed(torch.linspace(-1, 1000, steps=4))
print(times)
time_pairs = list(zip(times[:-1], times[1:]))
print(time_pairs)

Then I get:

tensor([1000.0000,  666.3334,  332.6667,   -1.0000])
[(tensor(1000.), tensor(666.3334)), (tensor(666.3334), tensor(332.6667)), (tensor(332.6667), tensor(-1.))]

So when steps is 1, we only get tensor([-1.]) and an empty list of time pairs, which seems weird. Or have I misunderstood the process?
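One way to reconcile this with the (999, -1) pair quoted in the first issue on this page is to build the schedule from steps + 1 points, as common DDIM sampling implementations do. A sketch of that construction (not necessarily the exact code in this repository):

import torch

total_timesteps = 1000
sampling_steps = 1

# steps + 1 points from T-1 down to -1, then consecutive pairs
times = torch.linspace(-1, total_timesteps - 1, steps=sampling_steps + 1)
times = list(reversed(times.int().tolist()))   # [999, -1]
time_pairs = list(zip(times[:-1], times[1:]))  # [(999, -1)]
print(time_pairs)

# with sampling_steps = 4 the same code gives [999, 749, 499, 249, -1],
# i.e. the pairs (999, 749), (749, 499), (499, 249), (249, -1)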

Thought about Box corruption.

It looks like only a scaling operation is applied during box corruption in the training stage. By only scaling the boxes, their centers remain the same; but at inference the boxes are generated completely at random, which means the model has to adjust the centers (perhaps drastically) to place a box correctly.
Are there any ablation studies on the corruption method? (Or are they already in the paper and I just missed them?)

Training time

Hi,

Thanks for sharing your great work. May I know the training time of each model if possible?

Many thanks!

How to train DiffusionDet on my own dataset?

Thanks for sharing this amazing work, but I have some confusion about training this model on my own dataset.
If my data format is just like COCO, which parameters should I change? Could you please show the training command line for DiffusionDet? I did not see any quick-start guide for training this model.
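Since DiffusionDet is built on detectron2, a COCO-format dataset can usually be registered with detectron2's register_coco_instances and then referenced from the config. A minimal sketch (dataset names, paths, and the config keys in the comments are illustrative assumptions):

from detectron2.data.datasets import register_coco_instances

# register a COCO-format dataset (names and paths are illustrative)
register_coco_instances("my_dataset_train", {},
                        "datasets/my_dataset/annotations/train.json",
                        "datasets/my_dataset/images/train")
register_coco_instances("my_dataset_val", {},
                        "datasets/my_dataset/annotations/val.json",
                        "datasets/my_dataset/images/val")

# then point the config at these names, e.g. in the yaml:
# DATASETS:
#   TRAIN: ("my_dataset_train",)
#   TEST: ("my_dataset_val",)
# and set the number of classes (the exact key name in DiffusionDet's config is assumed).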

A question about the time in Sparse head

Thank you for your brilliant work!
The model uses 6 sparse heads, and each head predicts an offset. After a head (e.g. the 1st head), the noise level of the boxes should decrease. Do you think it would be reasonable, or better, to pass a different, smaller time value to the next head?

OSError occurred when saving the model .pth file using 8 GPUs

Hello.

Thanks for the excellent detection algorithm.

I faced an OSError when saving the model during 8-GPU training.

Is this a problem with my environment, or a known error?

[12/20 18:16:09 d2.utils.events]: eta: 19:46:55 iter: 4959 total_loss: 10.31 loss_ce: 0.8514 loss_bbox: 0.3865 loss_giou: 0.4012 loss_ce_0: 0.902 loss_bbox_0: 0.4758 loss_giou_0: 0.4936 loss_ce_1: 0.8713 loss_bbox_1: 0.4325 loss_giou_1: 0.4167 loss_ce_2: 0.8526 loss_bbox_2: 0.4106 loss_giou_2: 0.4151 loss_ce_3: 0.8237 loss_bbox_3: 0.409 loss_giou_3: 0.391 loss_ce_4: 0.8287 loss_bbox_4: 0.4046 loss_giou_4: 0.4085 time: 0.2815 data_time: 0.0115 lr: 2.5e-05 max_mem: 5418M
[12/20 18:16:14 d2.utils.events]: eta: 19:46:57 iter: 4979 total_loss: 9.924 loss_ce: 0.7974 loss_bbox: 0.3912 loss_giou: 0.3875 loss_ce_0: 0.8377 loss_bbox_0: 0.4803 loss_giou_0: 0.448 loss_ce_1: 0.8489 loss_bbox_1: 0.3942 loss_giou_1: 0.3934 loss_ce_2: 0.8181 loss_bbox_2: 0.4009 loss_giou_2: 0.3799 loss_ce_3: 0.7867 loss_bbox_3: 0.3836 loss_giou_3: 0.38 loss_ce_4: 0.7726 loss_bbox_4: 0.3905 loss_giou_4: 0.3784 time: 0.2814 data_time: 0.0108 lr: 2.5e-05 max_mem: 5418M
[12/20 18:16:20 fvcore.common.checkpoint]: Saving checkpoint to ./output/model_0004999.pth
ERROR [12/20 18:16:20 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/torch/serialization.py", line 423, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/torch/serialization.py", line 650, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 7] Argument list too long

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 128, in save
torch.save(data, f)
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/torch/serialization.py", line 424, in save
return
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/torch/serialization.py", line 299, in exit
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:325] . unexpected pos 361728768 vs 361728656

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 150, in train
self.after_step()
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 180, in after_step
h.after_step()
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/detectron2/engine/hooks.py", line 206, in after_step
self.step(self.trainer.iter)
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 430, in step
self.checkpointer.save(
File "/home/acd14181ss/difdet/lib/python3.8/site-packages/fvcore/common/checkpoint.py", line 128, in save
torch.save(data, f)
OSError: [Errno 7] Argument list too long
[12/20 18:16:20 d2.engine.hooks]: Overall training speed: 4997 iterations in 0:23:26 (0.2815 s / it)
[12/20 18:16:20 d2.engine.hooks]: Total training time: 0:23:38 (0:00:11 on hooks)
[12/20 18:16:20 d2.utils.events]: eta: 19:46:44 iter: 4999 total_loss: 9.931 loss_ce: 0.8132 loss_bbox: 0.3718 loss_giou: 0.3842 loss_ce_0: 0.881 loss_bbox_0: 0.476 loss_giou_0: 0.4806 loss_ce_1: 0.859 loss_bbox_1: 0.407 loss_giou_1: 0.4112 loss_ce_2: 0.8139 loss_bbox_2: 0.3961 loss_giou_2: 0.399 loss_ce_3: 0.8251 loss_bbox_3: 0.3842 loss_giou_3: 0.3866 loss_ce_4: 0.818 loss_bbox_4: 0.3894 loss_giou_4: 0.3854 time: 0.2814 data_time: 0.0117 lr: 2.5e-05 max_mem: 5418M

How the detection heads work?

Why use so many RCNN heads (6 by default) here? If I use fewer heads, or just one head, would it still work well? Are there any references about this part that I can learn from?

Why permute pro_features and why `[0]` here?

pro_features = pro_features.view(N, nr_boxes, self.d_model).permute(1, 0, 2)
pro_features2 = self.self_attn(pro_features, pro_features, value=pro_features)[0]

  1. Before passing the proposal features into self.self_attn, you permute them so their size becomes nr_boxes*N*C, which confused me.

For a normal transformer block, the feature should have size N*num_tokens*C, but yours is nr_boxes*N*C. Why?

  2. Does [0] make the feature N*C by pulling out the first box?
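For what it's worth, the shapes follow PyTorch's default nn.MultiheadAttention conventions; a minimal sketch (not the DiffusionDet code itself):

import torch
import torch.nn as nn

N, nr_boxes, d_model = 2, 100, 256
pro_features = torch.randn(N, nr_boxes, d_model)

# nn.MultiheadAttention defaults to sequence-first input (L, N, E) when batch_first=False,
# so permuting makes the boxes the sequence dimension: (nr_boxes, N, d_model)
x = pro_features.permute(1, 0, 2)

self_attn = nn.MultiheadAttention(d_model, num_heads=8)
out, attn_weights = self_attn(x, x, value=x)

# the module returns a tuple (attn_output, attn_weights); [0] selects the attended
# features and keeps the full (nr_boxes, N, d_model) shape -- it does not drop boxes
print(out.shape)           # torch.Size([100, 2, 256])
print(attn_weights.shape)  # torch.Size([2, 100, 100])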

show your command line

Thanks for sharing this amazing work. If I use the official COCO dataset to train this model, could you please show me your command line and which config file you chose?
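A typical detectron2-style launch would look like the following, assuming the repository's training entry point is named train_net.py (an assumption) and using the diffdet.coco.res50.yaml config referenced elsewhere on this page:

python train_net.py --num-gpus 8 \
    --config-file configs/diffdet.coco.res50.yaml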

Inference Speed

Hi,
I have used the pretrained model, but its inference speed is not good (running on Colab with a Tesla T4 GPU).
Video dimensions: 4096 × 1080.
I am getting a speed of 1.20 s/it, i.e. 1.20 seconds per frame.

Using this:

!python3 demo.py --config-file configs/diffdet.coco.res50.yaml \
    --video-input input.mp4 \
    --output output \
    --opts MODEL.WEIGHTS checkpoints/diffdet_coco_res50.pth

Is there any way to increase the inference speed?
What dimensions should the video have to get higher speed?
Thanks
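One generic knob in detectron2-based models is the test-time resize, controlled by INPUT.MIN_SIZE_TEST and INPUT.MAX_SIZE_TEST. Assuming DiffusionDet keeps these defaults, they can be lowered through the same --opts mechanism used above (a sketch; the values are illustrative and trade accuracy for speed):

!python3 demo.py --config-file configs/diffdet.coco.res50.yaml \
    --video-input input.mp4 \
    --output output \
    --opts MODEL.WEIGHTS checkpoints/diffdet_coco_res50.pth INPUT.MIN_SIZE_TEST 480 INPUT.MAX_SIZE_TEST 853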

[A Bug?] The diffused boxes as input may have negative coordinates.

x = torch.clamp(x, min=-1 * self.scale, max=self.scale)
x = ((x / self.scale) + 1) / 2.
diff_boxes = box_cxcywh_to_xyxy(x)

Though you clamp the coordinates (x) at line 400, they can become negative when converted from (cx, cy, w, h) to (x, y, x, y) format. Here is an example:

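A minimal numeric sketch of the situation (the values are illustrative):

import torch

# a (cx, cy, w, h) box whose center is close to the image border after clamping
cxcywh = torch.tensor([0.01, 0.5, 0.20, 0.30])
cx, cy, w, h = cxcywh

xyxy = torch.tensor([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
print(xyxy)  # tensor([-0.0900,  0.3500,  0.1100,  0.6500]) -- x1 is negative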

It happens when cx or cy is close to zero (or clamped to zero): computing cx - w/2 or cy - h/2 then yields a negative number.

Is it acceptable to pass negative coordinates to the RCNN head? Will it cause any unexpected behavior when extracting RoI features?

Question about the targets format / prepare_targets / self.head

Hi!

Could you please elaborate a bit on the format of outputs_coord returned by self.head? and the format of targets returned by prepare_targets?

In training, it seems the model predicts the new box coordinates in absolute coordinates, given the previous iteration's estimate. The loss is applied to those outputs_coord, and it seems to compare them not to the noise, but always to the ground-truth boxes: https://github.com/ShoufaChen/DiffusionDet/blob/main/diffusiondet/loss.py#L180-L181. The noises return value from prepare_targets seems to be discarded.

Am I correct? I am basically trying to understand how the box prediction is parameterized. Is it because the RCNNHead re-applies the predicted deltas/shifts before returning its value?

Another question is about the DynamicHead: am I correct that at every diffusion step the head already performs NUM_HEADS = 6 rounds of refinement via self.head_series, which consists of RCNNHead instances?

Thanks!
