
plain-detr's Introduction

Plain-DETR

By Yutong Lin*, Yuhui Yuan*, Zheng Zhang*, Chen Li, Nanning Zheng and Han Hu*

This repo is the official implementation of "DETR Doesn’t Need Multi-Scale or Locality Design".

Introduction

We present an improved DETR detector that maintains a “plain” nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that re-introduce architectural inductive biases of multi-scale and locality into the decoder.

We show that two simple techniques are surprisingly effective within this plain design: 1) a box-to-pixel relative position bias (BoxRPB) term that guides each query to attend to its corresponding object region; 2) masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and remedies the dependency on multi-scale feature maps.
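
To make the BoxRPB idea concrete, below is a minimal PyTorch sketch (hypothetical module and names, not the repository's implementation): a small MLP encodes the offsets from each query box's two corners to every pixel of the single-scale feature map and produces a per-head bias that is added to the cross-attention logits. The paper additionally decomposes this computation along the x and y axes for efficiency; the dense version here only illustrates the idea.

import torch
import torch.nn as nn

class BoxRPB(nn.Module):
    """Sketch of a box-to-pixel relative position bias (names are illustrative)."""

    def __init__(self, num_heads, hidden_dim=256):
        super().__init__()
        # MLP that maps the (dx, dy) offsets to the two box corners into per-head biases.
        self.mlp = nn.Sequential(
            nn.Linear(4, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_heads),
        )

    def forward(self, boxes, pixel_coords):
        # boxes: (Q, 4) query boxes as normalized (x1, y1, x2, y2)
        # pixel_coords: (P, 2) normalized (x, y) centers of the single-scale feature map
        corners = boxes.view(-1, 2, 2)                            # (Q, 2, 2)
        delta = pixel_coords[None, None] - corners[:, :, None]    # (Q, 2, P, 2)
        delta = delta.permute(0, 2, 1, 3).flatten(-2)             # (Q, P, 4)
        bias = self.mlp(delta)                                    # (Q, P, num_heads)
        return bias.permute(2, 0, 1)                              # (num_heads, Q, P), added to attention logits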

Main Results

BoxRPB  MIM PT.  Reparam.  AP    Paper Position  CFG  CKPT
–       –        –         37.2  Tab2 Exp1       cfg  ckpt
✓       –        –         46.1  Tab2 Exp2       cfg  ckpt
✓       ✓        –         48.7  Tab2 Exp5       cfg  ckpt
✓       ✓        ✓         50.9  Tab2 Exp6       cfg  ckpt

Installation

Conda

# create conda environment
conda create -n plain_detr python=3.8 -y
conda activate plain_detr

# install pytorch (other versions may also work)
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt

Docker

We have tested with the Docker image superbench/dev:cuda11.8. Other Docker images may also work.

# run docker
sudo docker run -it -p 8022:22 -d --name=plain_detr --privileged --net=host --ipc=host --gpus=all -v /:/data superbench/dev:cuda11.8 bash
sudo docker exec -it plain_detr bash

# other requirements
git clone https://github.com/impiga/Plain-DETR.git
cd Plain-DETR
pip install -r requirements.txt

Usage

Dataset preparation

Please download the COCO 2017 dataset and organize it as follows:

code_root/
└── data/
    └── coco/
        ├── train2017/
        ├── val2017/
        └── annotations/
            ├── instances_train2017.json
            └── instances_val2017.json
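
To sanity-check the layout before training, a small script along these lines can help (this helper is not part of the repository; the root path is assumed to match the tree above):

import os

def check_coco_layout(root="data/coco"):
    # Verify that the expected COCO 2017 folders and annotation files are present.
    expected = [
        "train2017",
        "val2017",
        "annotations/instances_train2017.json",
        "annotations/instances_val2017.json",
    ]
    for rel in expected:
        path = os.path.join(root, rel)
        print(("ok      " if os.path.exists(path) else "MISSING ") + path)

if __name__ == "__main__":
    check_coco_layout()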

Pretrained models preparation

Please run the following script to download the supervised and masked-image-modeling (MIM) pretrained models.

(We adopt Swin Transformer V2 as the default backbone. If you are interested in the pre-training, please refer to Swin Transformer V2 (paper, github) and SimMIM (paper, github) for more details.)

bash tools/prepare_pt_model.sh

Training

Training on single node

GPUS_PER_NODE=<num gpus> ./tools/run_dist_launch.sh <num gpus> <path to config file>

Training on multiple nodes

On each node, run the following script:

MASTER_ADDR=<master node IP address> GPUS_PER_NODE=<num gpus> NODE_RANK=<rank> ./tools/run_dist_launch.sh <num gpus> <path to config file> 

Evaluation

To evaluate a Plain-DETR model, please run the following script:

 <path to config file> --eval --resume <path to plain-detr model>

You could also use ./tools/run_dist_launch.sh to evaluate a model on multiple GPUs.

Limitation & Discussion

  • While we have eliminated multi-scale designs for the backbone output and decoder input, the generation of proposals still depends on multi-scale features.

    We have performed trials using single-scale features for proposals (not included in the paper), but this led to a drop of about 1 mAP.

Known issues

  • Most of our experiments are conducted on 16 GPUs with 1 image per GPU. We have tested our released checkpoints with a larger batch size and found that the performance of the first three models drops significantly.

    We are now reviewing our implementation and will update the code to support larger batch sizes for both training and inference.

Citing Plain-DETR

If you find Plain-DETR useful in your research, please consider citing:

@inproceedings{lin2023detr,
  title={DETR Does Not Need Multi-Scale or Locality Design},
  author={Lin, Yutong and Yuan, Yuhui and Zhang, Zheng and Li, Chen and Zheng, Nanning and Hu, Han},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={6545--6554},
  year={2023}
}


plain-detr's Issues

Some questions about data preprocessing

Amazing performance! Congratulations! I couldn't agree with you more on the importance of exploring the architecture of single-level DETR and simplifying the current DETR research landscape.

Recently, I've been attempting to reproduce the results of Plain DETR incrementally. I have a few questions regarding the data preprocessing employed in PlainDETR. For the 1x training schedule, does the data preprocessing solely consist of RandomHflip and RandomResize(800)? If not, does PlainDETR utilize the same transforms as the original DETR, as shown in the following code snippet:

scales = [480, ..., 800]
transforms = [
                RandomHorizontalFlip(),
                RandomSelect(
                    RandomResize(scales, max_size=1333),
                    Compose([
                        RandomResize([400, 500, 600]),
                        RandomSizeCrop(384, 600),
                        RandomResize(scales, max_size=1333),
                    ])
                ),
                ToTensor(),
                Normalize(...)
            ]

By the way, could you tell me the detailed configuration of the optimizer and lr scheduler? I would greatly appreciate it if you could provide some clarification on this matter. Thank you!

Will code be released?

@impiga et al., thank you again for the inspiring research on DETR!

It's been a little over 12 weeks since the paper was published to arXiv (08/03), but there have not been any changes to this repo or engagement with users in other issues.

May we expect code to be released?

It doesn't need to be "poster-ready"; any working code is fine! 🙆🏻‍♂️

Thumbs up

The code is concise and clear!!!

Question about BoxRPB

Hi,

Thank you for your great work. :)

I am curious whether the weights of the MLP for BoxRPB are shared across all decoder layers, or whether the BoxRPB MLP is initialized independently for each layer?

Question about flatten in HungarianMatcher

In models/matcher.py, HungarianMatcher.forward contains:

out_delta = outputs["pred_deltas"].flatten(0, 1)
out_bbox_old = outputs["pred_boxes_old"].flatten(0, 1)

Predictions from different images in a batch are mixed together when flattening over the first two dims, and are later used to generate deltas together with the flattened ground truth. As a beginner, may I ask whether this is some kind of trick, or whether it is the reason for the drop observed when a larger batch size is used?
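
For reference, if the matcher follows the original DETR implementation, the flattening is only a vectorization step: the cost matrix is computed for all batch × query predictions at once, but the Hungarian assignment is then split back per image, so predictions are never matched to ground-truth boxes from a different image. A minimal sketch of that pattern (illustrative, not the repository's exact code):

import torch
from scipy.optimize import linear_sum_assignment

def match_per_image(cost_flat, bs, num_queries, targets):
    # cost_flat: (bs * num_queries, total_num_gt) cost over the flattened batch
    sizes = [len(t["boxes"]) for t in targets]        # number of GT boxes in each image
    C = cost_flat.view(bs, num_queries, -1).cpu()
    # Split the GT axis per image and solve one assignment problem per image,
    # so queries of image i are only matched against the ground truth of image i.
    return [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]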

Questions about content-related query and hybrid matching.

Thanks for your great work. After reading your paper and code, I have some questions about Content-related Query and Hybrid Matching.

  1. When two_stage is enabled, the query number equals two_stage_num_proposals, which is not equal to one2one + one2many. It seems that hybrid matching is disabled by two_stage.
  2. If one2one + one2many is passed as two_stage_num_proposals, should the one2many query part also be initialized from encoder proposals, and why?

if self.two_stage:
    (reference_points, max_shape, enc_outputs_class,
     enc_outputs_coord_unact, enc_outputs_delta, output_proposals) \
        = self.get_reference_points(memory, mask_flatten, spatial_shapes)
    init_reference_out = reference_points
    pos_trans_out = torch.zeros((bs, self.two_stage_num_proposals, 2*c), device=init_reference_out.device)
    pos_trans_out = self.pos_trans_norm(self.pos_trans(self.get_proposal_pos_embed(reference_points)))
    if not self.mixed_selection:
        query_embed, tgt = torch.split(pos_trans_out, c, dim=2)
    else:
        # query_embed here is the content embed for deformable DETR
        tgt = query_embed.unsqueeze(0).expand(bs, -1, -1)
        query_embed, _ = torch.split(pos_trans_out, c, dim=2)
else:
    query_embed, tgt = torch.split(query_embed, c, dim=1)
    query_embed = query_embed.unsqueeze(0).expand(bs, -1, -1)
    tgt = tgt.unsqueeze(0).expand(bs, -1, -1)
    reference_points = self.reference_points(query_embed).sigmoid()
    init_reference_out = reference_points

About only the MIM trick?

Hello, I just want to try the MIM trick with the APE decoder. My config is:
#!/usr/bin/env bash

set -x

FILE_NAME=$(basename $0)
EXP_DIR=./exps/${FILE_NAME%.*}
PY_ARGS=${@:1}

python -u main.py \
    --output_dir ${EXP_DIR} \
    --with_box_refine \
    --two_stage \
    --mixed_selection \
    --look_forward_twice \
    --num_queries_one2one 300 \
    --num_queries_one2many 1500 \
    --k_one2many 6 \
    --lambda_one2many 1.0 \
    --dropout 0.0 \
    --norm_type pre_norm \
    --backbone swin_v2_small_window12to16_2global \
    --drop_path_rate 0.1 \
    --upsample_backbone_output \
    --upsample_stride 16 \
    --num_feature_levels 1 \
    --decoder_type global_ape \
    --proposal_feature_levels 4 \
    --proposal_in_stride 16 \
    --pretrained_backbone_path ./pt_models/swinv2_small_1k_500k_mim_pt.pth \
    --epochs 12 \
    --lr_drop 11 \
    --warmup 1000 \
    --lr 2e-4 \
    --use_layerwise_decay \
    --lr_decay_rate 0.9 \
    --weight_decay 0.05 \
    --wd_norm_mult 0.0 \
    ${PY_ARGS}
But in the 5th epoch, there is an error:
2024-03-05 04:10 File "/gemini/data-2/Algorithm/Plain-DETR/util/box_ops.py", line 70, in generalized_box_iou
2024-03-05 04:10 assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
2024-03-05 04:10 AssertionError

My environment is 8x 3090 GPUs, and my command is:
bash tools/run_dist_launch.sh 8 configs/swinv2_small_mim_pt_ape.sh
Could you give me some advice? I would be grateful.

Pretrained weights

Hello, I would like to reproduce your code, but the page for downloading the pretrained weights is no longer accessible.

Question about the ablation of box relative position bias

Thanks for the great work!

I have a question on the BoxRPB:

In the ablation in the paper, when you use the center as the encoding point instead of the two corners, the performance seems very similar to DAB/Conditional/SMCA cross-attention. Have you also tried using the two-corners scheme in DAB/Conditional/SMCA? I just want to figure out whether the major improvement of BoxRPB over DAB/Conditional/SMCA comes from using two corners instead of a single center.

Thanks!

Question about the BoxRPB

Congratulations on the great work.
How do you get the coordinates of the K predicted boxes to generate BoxRPB before the cross-attention?

Expected code release

Hi @impiga,
Thanks and congrats on your great work; it was so nice talking with you at the ICCV conference.

Do you have an expected date for code release?

Thanks a lot!

About a lightweight swin_tiny

Hello, excuse me. I want to use swin_tiny for lightweight work, and I would like to know where I can find a swin_tiny_mae_pretrain checkpoint. Thank you.

Question about backbone

Hello,

Congratulations on the great work. I have some questions on the backbone used.

  1. Was the backbone pretrained and frozen for feature extraction, or was it fine-tuned in a supervised fashion on top of the self-supervised pretrained weights (it looks like the latter)?
  2. If fine-tuned, did you unfreeze all layers or only a few? Did you do an ablation on how many layers to unfreeze?
  3. Did you try using the frozen features and see how they perform with respect to localization? It would be helpful if you could shed some light on this.
  4. Why did you choose SimMIM, e.g., why not MAE? Did you try both and find that SimMIM works better?

I am sorry if I am asking any redundant questions. It would be helpful to have some insights into these aspects.

Thanks a lot, again!
