
plain-detr's Introduction

Plain-DETR

By Yutong Lin*, Yuhui Yuan*, Zheng Zhang*, Chen Li, Nanning Zheng and Han Hu*

This repo is the official implementation of "DETR Doesn’t Need Multi-Scale or Locality Design".

Introduction

We present an improved DETR detector that maintains a “plain” nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that re-introduce architectural inductive biases of multi-scale and locality into the decoder.

We show that two simple techniques are surprisingly effective within this plain design: 1) a box-to-pixel relative position bias (BoxRPB) term that guides each query to attend to its corresponding object region; 2) masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and remedies the dependency on multi-scale feature maps.
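
To make the BoxRPB idea concrete, below is a minimal PyTorch sketch (hypothetical module and names, not the repository's implementation): a small MLP encodes the offsets from each query box's two corners to every pixel of the single-scale feature map and produces a per-head bias that is added to the cross-attention logits. The paper additionally decomposes this computation along the x and y axes for efficiency; the dense version here only illustrates the idea.

import torch
import torch.nn as nn

class BoxRPB(nn.Module):
    """Sketch of a box-to-pixel relative position bias (names are illustrative)."""

    def __init__(self, num_heads, hidden_dim=256):
        super().__init__()
        # MLP that maps the (dx, dy) offsets to the two box corners into per-head biases.
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_heads),
        )

    def forward(self, boxes, pixel_coords):
        # boxes: (Q, 4) query boxes as normalized (x1, y1, x2, y2)
        # pixel_coords: (P, 2) normalized (x, y) centers of the single-scale feature map
        corners = boxes.view(-1, 2, 2)                            # (Q, 2, 2)
        delta = pixel_coords[None, None] - corners[:, :, None]    # (Q, 2, P, 2)
        delta = delta.permute(0, 2, 1, 3).flatten(-2)             # (Q, P, 4)
        bias = self.mlp(delta)                                    # (Q, P, num_heads)
        return bias.permute(2, 0, 1)                              # (num_heads, Q, P), added to attention logits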

Main Results

BoxRPB  MIM PT.  Reparam.  AP    Paper Position  CFG  CKPT
–       –        –         37.2  Tab2 Exp1       cfg  ckpt
✓       –        –         46.1  Tab2 Exp2       cfg  ckpt
✓       ✓        –         48.7  Tab2 Exp5       cfg  ckpt
✓       ✓        ✓         50.9  Tab2 Exp6       cfg  ckpt

Installation

Conda

# create conda environment
conda create -n plain_detr python=3.8 -y
conda activate plain_detr

# install pytorch (other versions may also work)
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt

Docker

We have tested with the Docker image superbench/dev:cuda11.8. Other Docker images may also work.

# run docker
sudo docker run -it -p 8022:22 -d --name=plain_detr --privileged --net=host --ipc=host --gpus=all -v /:/data superbench/dev:cuda11.8 bash
sudo docker exec -it plain_detr bash

# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt

Usage

Dataset preparation

Please download the COCO 2017 dataset and organize it as follows:

code_root/
└── data/
    └── coco/
        ├── train2017/
        ├── val2017/
        └── annotations/
            ├── instances_train2017.json
            └── instances_val2017.json
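
To sanity-check the layout before training, a small script along these lines can help (this helper is not part of the repository; the root path is assumed to match the tree above):

import os

def check_coco_layout(root="data/coco"):
    # Verify that the expected COCO 2017 folders and annotation files are present.
    expected = [
        "train2017",
        "val2017",
        "annotations/instances_train2017.json",
        "annotations/instances_val2017.json",
    ]
    for rel in expected:
        path = os.path.join(root, rel)
        print(("ok      " if os.path.exists(path) else "MISSING ") + path)

if __name__ == "__main__":
    check_coco_layout()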

Pretrained models preparation

Please run the following script to download the supervised and masked-image-modeling (MIM) pretrained models.

(We adopt Swin Transformer V2 as the default backbone. If you are interested in the pre-training, please refer to Swin Transformer V2 (paper, github) and SimMIM (paper, github) for more details.)

bash tools/prepare_pt_model.sh

Training

Training on single node

GPUS_PER_NODE=<num gpus> ./tools/run_dist_launch.sh <num gpus> <path to config file>

Training on multiple nodes

On each node, run the following script:

MASTER_ADDR=<master node IP address> GPUS_PER_NODE=<num gpus> NODE_RANK=<rank> ./tools/run_dist_launch.sh <num gpus> <path to config file> 

Evaluation

To evaluate a Plain-DETR model, please run the following script:

 <path to config file> --eval --resume <path to plain-detr model>

You could also use ./tools/run_dist_launch.sh to evaluate a model on multiple GPUs.

Limitation & Discussion

  • While we have eliminated multi-scale designs for the backbone output and decoder input, the generation of proposals still depends on multi-scale features.

    We have performed trials using single-scale features for proposals (not included in the paper), but this led to a drop of about 1 mAP.

Known issues

  • Most of our experiments are conducted on 16 GPUs with 1 image per GPU. We have tested our released checkpoints with a larger batch size and found that the performance of the first three models drops significantly.

    We are now reviewing our implementation and will update the code to support larger batch sizes for both training and inference.

Citing Plain-DETR

If you find Plain-DETR useful in your research, please consider citing:

@inproceedings{lin2023detr,
  title={DETR Does Not Need Multi-Scale or Locality Design},
  author={Lin, Yutong and Yuan, Yuhui and Zhang, Zheng and Li, Chen and Zheng, Nanning and Hu, Han},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6545--6554},
  year={2023}
}


plain-detr's Issues

Some questions about data preprocessing

Amazing performance! Congratulations! I couldn't agree with you more on the importance of exploring the architecture of single-level DETR and simplifying the current DETR research landscape.

Recently, I've been attempting to reproduce the results of Plain DETR incrementally. I have a few questions regarding the data preprocessing employed in PlainDETR. For the 1x training schedule, does the data preprocessing solely consist of RandomHflip and RandomResize(800)? If not, does PlainDETR utilize the same transforms as the original DETR, as shown in the following code snippet:

scales = [480, ..., 800]
transforms = [
                RandomHorizontalFlip(),
                RandomSelect(
                    RandomResize(scales, max_size=1333),
                    Compose([
                        RandomResize([400, 500, 600]),
                        RandomSizeCrop(384, 600),
                        RandomResize(scales, max_size=1333),
                    ])
                ),
                ToTensor(),
                Normalize(...)
            ]

By the way, could you tell me the detailed configuration of the optimizer and lr scheduler? I would greatly appreciate it if you could provide some clarification on this matter. Thank you!

Will code be released?

@impiga et al., thank you again for the inspiring research on DETR!

It's been a little over 12 weeks since the paper was published to arXiv (08/03), but there have not been any changes to this repo or engagement with users in other issues.

May we expect code to be released?

It doesn't need to be "poster-ready"; any working code is fine! 🙆🏻‍♂️

Thumbs up

The code is concise and clear!!!

Question about BoxRPB

Hi,

Thank you for your great work. :)

I am curious whether the weights of the MLP for BoxRPB are shared across all decoder layers, or whether the BoxRPB MLP is initialized independently for each layer?

Question about flatten in HungarianMatcher

In models/matcher.py, HungarianMatcher.forward contains:

out_delta = outputs["pred_deltas"].flatten(0, 1)
out_bbox_old = outputs["pred_boxes_old"].flatten(0, 1)

Predictions from different images in a batch are mixed together when flattening over the first two dims, and are later used to generate deltas together with the flattened ground truth. As a beginner, may I ask whether this is some kind of trick, or whether it is the reason for the drop observed when a larger batch size is used?
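
For reference, if the matcher follows the original DETR implementation, the flattening is only a vectorization step: the cost matrix is computed for all batch × query predictions at once, but the Hungarian assignment is then split back per image, so predictions are never matched to ground-truth boxes from a different image. A minimal sketch of that pattern (illustrative, not the repository's exact code):

import torch
from scipy.optimize import linear_sum_assignment

def match_per_image(cost_flat, bs, num_queries, targets):
    # cost_flat: (bs * num_queries, total_num_gt) cost over the flattened batch
    sizes = [len(t["boxes"]) for t in targets]        # number of GT boxes in each image
    C = cost_flat.view(bs, num_queries, -1).cpu()
    # Split the GT axis per image and solve one assignment problem per image,
    # so queries of image i are only matched against the ground truth of image i.
    return [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]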

Questions about content-related query and hybrid matching.

Thanks for your great work. After reading your paper and code, I have some questions about Content-related Query and Hybrid Matching.

  1. When two_stage is enabled, the query number equals two_stage_num_proposals, which is not equal to one2one + one2many. It seems that hybrid matching is disabled by two_stage.
  2. If one2one + one2many is passed as two_stage_num_proposals, should the one2many query part also be initialized from encoder proposals, and why?

if self.two_stage:
    (reference_points, max_shape, enc_outputs_class,
     enc_outputs_coord_unact, enc_outputs_delta, output_proposals) \
        = self.get_reference_points(memory, mask_flatten, spatial_shapes)
    init_reference_out = reference_points
    pos_trans_out = torch.zeros((bs, self.two_stage_num_proposals, 2*c), device=init_reference_out.device)
    pos_trans_out = self.pos_trans_norm(self.pos_trans(self.get_proposal_pos_embed(reference_points)))
    if not self.mixed_selection:
        query_embed, tgt = torch.split(pos_trans_out, c, dim=2)
    else:
        # query_embed here is the content embed for deformable DETR
        tgt = query_embed.unsqueeze(0).expand(bs, -1, -1)
        query_embed, _ = torch.split(pos_trans_out, c, dim=2)
else:
    query_embed, tgt = torch.split(query_embed, c, dim=1)
    query_embed = query_embed.unsqueeze(0).expand(bs, -1, -1)
    tgt = tgt.unsqueeze(0).expand(bs, -1, -1)
    reference_points = self.reference_points(query_embed).sigmoid()
    init_reference_out = reference_points

About only the MIM trick?

Hello, I just want to try the MIM trick with the APE decoder. My config is:
#!/usr/bin/env bash

set -x

FILE_NAME=$(basename $0)
EXP_DIR=./exps/${FILE_NAME%.*}
PY_ARGS=${@:1}

python -u main.py \
    --output_dir ${EXP_DIR} \
    --with_box_refine \
    --two_stage \
    --mixed_selection \
    --look_forward_twice \
    --num_queries_one2one 300 \
    --num_queries_one2many 1500 \
    --k_one2many 6 \
    --lambda_one2many 1.0 \
    --dropout 0.0 \
    --norm_type pre_norm \
    --backbone swin_v2_small_window12to16_2global \
    --drop_path_rate 0.1 \
    --upsample_backbone_output \
    --upsample_stride 16 \
    --num_feature_levels 1 \
    --decoder_type global_ape \
    --proposal_feature_levels 4 \
    --proposal_in_stride 16 \
    --pretrained_backbone_path ./pt_models/swinv2_small_1k_500k_mim_pt.pth \
    --epochs 12 \
    --lr_drop 11 \
    --warmup 1000 \
    --lr 2e-4 \
    --use_layerwise_decay \
    --lr_decay_rate 0.9 \
    --weight_decay 0.05 \
    --wd_norm_mult 0.0 \
    ${PY_ARGS}
But in the 5th epoch, there is an error:
2024-03-05 04:10 File "/gemini/data-2/Algorithm/Plain-DETR/util/box_ops.py", line 70, in generalized_box_iou
2024-03-05 04:10 assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
2024-03-05 04:10 AssertionError

My environment is 8x 3090 GPUs, and my command is:
bash tools/run_dist_launch.sh 8 configs/swinv2_small_mim_pt_ape.sh
Could you give me some advice? I would be grateful.

Pretrained weights

Hello, I would like to reproduce your code, but the page for downloading the pretrained weights is no longer accessible.

Question about the ablation of box relative position bias

Thanks for the great work!

I have a question on the BoxRPB:

In the ablation in the paper, when you use the center as the encoding point instead of the two corners, the performance seems very similar to DAB/Conditional/SMCA cross-attention. Have you also tried using the two-corners scheme in DAB/Conditional/SMCA? I just want to figure out whether the major improvement of BoxRPB over DAB/Conditional/SMCA comes from using two corners instead of a single center.

Thanks!

Question about the BoxRPB

Congratulations on the great work.
How do you get the coordinates of the K predicted boxes to generate BoxRPB before the cross-attention?

Expected code release

Hi @impiga,
Thanks and congrats on your great work; it was so nice talking with you at the ICCV conference.

Do you have an expected date for code release?

Thanks a lot!

About a lightweight swin_tiny

Hello, excuse me. I want to use swin_tiny for lightweight work, and I would like to know where I can find a swin_tiny_mae_pretrain checkpoint. Thank you.

Question about backbone

Hello,

Congratulations on the great work. I have some questions on the backbone used.

  1. Was the backbone pretrained and frozen for feature extraction, or was it fine-tuned in a supervised fashion on top of the self-supervised pretrained weights (it looks like the latter)?
  2. If fine-tuned, did you unfreeze all layers or only a few? Did you do an ablation on how many layers to unfreeze?
  3. Did you try using the frozen features and see how they perform with respect to localization? It would be helpful if you could shed some light on this.
  4. Why did you choose SimMIM, e.g., why not MAE? Did you try both and find that SimMIM works better?

I am sorry if I am asking any redundant questions. It would be helpful to have some insights into these aspects.

Thanks a lot, again!
