
UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

This is the official PyTorch implementation and models for the UP-DETR paper and its extended version:

@ARTICLE{9926201,
  author={Dai, Zhigang and Cai, Bolun and Lin, Yugeng and Chen, Junying},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence}, 
  title={Unsupervised Pre-Training for Detection Transformers}, 
  year={2022},
  volume={},
  number={},
  pages={1-11},
  doi={10.1109/TPAMI.2022.3216514}}

@InProceedings{Dai_2021_CVPR,
    author    = {Dai, Zhigang and Cai, Bolun and Lin, Yugeng and Chen, Junying},
    title     = {UP-DETR: Unsupervised Pre-Training for Object Detection With Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {1601-1610}
}

In UP-DETR, we introduce a novel pretext task named random query patch detection to pre-train transformers for object detection. UP-DETR inherits from DETR, with the same ResNet-50 backbone, the same Transformer encoder and decoder, and the same codebase. Combined with an unsupervised pre-trained CNN, the whole UP-DETR pre-training requires no human annotations. UP-DETR achieves 43.1 AP (or even higher) on COCO with 300 epochs of fine-tuning. The AP of the open-source version is a little higher than reported in the paper.
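As a rough illustration of this pretext task, here is a conceptual sketch only (not the repository's actual code; backbone, patch2query, and the crop-size range are placeholders): random patches are cropped from the input image, embedded with the frozen CNN plus global average pooling, projected to the transformer dimension, and added to the object queries, while their crop coordinates serve as pseudo ground-truth boxes.

import torch
import torch.nn as nn

# Conceptual sketch of random query patch detection (not the repository's actual code).
# Assumes `backbone` is a frozen CNN returning a (1, 2048, H', W') feature map.
hidden_dim = 256
gap = nn.AdaptiveAvgPool2d((1, 1))          # global average pooling over the patch feature map
patch2query = nn.Linear(2048, hidden_dim)   # project CNN features to the transformer dimension

def random_patch_queries(image, backbone, num_patches=10):
    """Crop random patches, keep their boxes as pseudo ground truth,
    and embed them as additional decoder queries."""
    _, h, w = image.shape
    boxes, queries = [], []
    for _ in range(num_patches):
        ph = torch.randint(h // 8, h // 2, (1,)).item()   # illustrative crop-size range
        pw = torch.randint(w // 8, w // 2, (1,)).item()
        y0 = torch.randint(0, h - ph, (1,)).item()
        x0 = torch.randint(0, w - pw, (1,)).item()
        patch = image[:, y0:y0 + ph, x0:x0 + pw]
        boxes.append([x0, y0, pw, ph])
        with torch.no_grad():                              # the CNN stays frozen during pre-training
            feat = backbone(patch.unsqueeze(0))
        queries.append(patch2query(gap(feat).flatten(1)))  # (1, hidden_dim)
    return torch.cat(queries), torch.tensor(boxes, dtype=torch.float)

# The patch embeddings are added to the object queries, and the model is trained to localize
# each patch in the original image (plus a feature reconstruction loss in the extended version).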

UP-DETR

Model Zoo

We provide pre-trained UP-DETR models and UP-DETR models fine-tuned on COCO, and plan to include more in the future. The evaluation metric is the same as DETR's.

Here is the UP-DETR model pre-trained on ImageNet without labels. The CNN weights are initialized from SwAV and kept fixed during the transformer pre-training:

name      backbone     epochs   url            size    md5
UP-DETR   R50 (SwAV)   60       model | logs   164Mb   49f01f8b
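Since the CNN is kept fixed during transformer pre-training, freezing it simply means disabling gradients for its parameters. A minimal sketch, assuming a torchvision ResNet-50 and a downloaded SwAV checkpoint (the file name and key layout are assumptions, not the repository's exact code):

import torch
from torchvision.models import resnet50

# Sketch: initialize a ResNet-50 from a SwAV checkpoint and freeze it.
backbone = resnet50()
state = torch.load("path/to/swav_800ep_pretrain.pth.tar", map_location="cpu")
state = {k.replace("module.", ""): v for k, v in state.items()}  # strip the DataParallel prefix if present
backbone.load_state_dict(state, strict=False)                    # ignore SwAV's projection-head keys

for p in backbone.parameters():
    p.requires_grad_(False)   # the CNN weights stay fixed during transformer pre-training
backbone.eval()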

The results of UP-DETR fine-tuned on COCO:

name      backbone (pre-train)   epochs   box AP   APS    APM    APL    url
DETR      R50 (Supervised)       500      42.0     20.5   45.8   61.1   -
DETR      R50 (SwAV)             300      42.1     19.7   46.3   60.9   -
UP-DETR   R50 (SwAV)             300      43.1     21.6   46.8   62.4   model | logs

COCO val5k evaluation results of UP-DETR can be found in this gist.

Usage - Object Detection

There are no extra compiled components in UP-DETR, and the package dependencies are the same as DETR's. We provide instructions for installing the dependencies via conda:

git clone tbd
conda install -c pytorch pytorch torchvision
conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
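Note: an issue below reports that the latest torchvision does not work out of the box and should be pinned to 0.9.0. If you hit such incompatibilities, pinning the versions is one option (the exact pairing below is a suggestion, not an official requirement):

conda install -c pytorch pytorch=1.8.0 torchvision=0.9.0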

UP-DETR follows two steps: pre-training and fine-tuning. We present the model pre-trained on ImageNet and then fine-tuned on COCO.

Unsupervised Pre-training

Data Preparation

Download and extract the ILSVRC2012 train dataset.

We expect the directory structure to be the following:

path/to/imagenet/
  n06785654/  # category directory
    n06785654_16140.JPEG # images
  n04584207/  # category directory
    n04584207_14322.JPEG # images

The images do not need to be organized by category, because our pre-training is unsupervised.
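For illustration, a minimal unlabeled dataset that simply globs all JPEGs under the ImageNet root could look like the sketch below (the repository ships its own dataset class; this is only an illustration of what "unsupervised" loading amounts to):

import glob
import os
from PIL import Image
from torch.utils.data import Dataset

class UnlabeledImages(Dataset):
    """Minimal unlabeled image dataset: recursively collects JPEGs and ignores the class folders."""
    def __init__(self, root, transform=None):
        self.paths = sorted(glob.glob(os.path.join(root, "**", "*.JPEG"), recursive=True))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(img) if self.transform else img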

Pre-training

To pre-train UP-DETR on a single node with 8 GPUs for 60 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --lr_drop 40 \
    --epochs 60 \
    --pre_norm \
    --num_patches 10 \
    --batch_size 32 \
    --feature_recon \
    --fre_cnn \
    --imagenet_path path/to/imagenet \
    --output_dir path/to/save_model

As the pre-training images are relatively small, we can use a large batch size.

It takes about 2 hours per epoch, so 60 epochs of pre-training take about 5 days with 8 V100 GPUs.

In a further ablation experiment, we found that object query shuffle is not helpful, so we removed it in the open-source version.

Fine-tuning

Data Preparation

Download and extract the COCO 2017 train and val datasets.

The directory structure is expected as follows:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images

Fine-tuning

To fine-tune UP-DETR with 8 GPUs for 300 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env detr_main.py \
    --lr_drop 200 \
    --epochs 300 \
    --lr_backbone 5e-5 \
    --pre_norm \
    --coco_path path/to/coco \
    --pretrain path/to/save_model/checkpoint.pth

The fine-tuning cost is exactly the same as DETR's: one epoch takes about 28 minutes with 8 V100 GPUs, so 300 epochs of training take about 6 days.
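When the pre-training checkpoint is loaded for fine-tuning, the detection head is re-initialized and the pre-training-only modules (patch2query, feature_align) are discarded, so non-strict loading is expected. A minimal sketch of what that looks like (model here is your UP-DETR/DETR instance, a placeholder):

import torch

checkpoint = torch.load("path/to/save_model/checkpoint.pth", map_location="cpu")
result = model.load_state_dict(checkpoint["model"], strict=False)
print("missing keys:", result.missing_keys)        # e.g. class_embed.*, newly initialized for detection
print("unexpected keys:", result.unexpected_keys)  # e.g. patch2query.*, feature_align.*, pre-training only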

The model can also be extended to panoptic segmentation; see DETR for more details.

Evaluation

python detr_main.py \
    --batch_size 2 \
    --eval \
    --no_aux_loss \
    --pre_norm \
    --coco_path path/to/coco \
    --resume path/to/save_model/checkpoint.pth

COCO val5k evaluation results of UP-DETR can be found in this gist.

Notebook

We provide a Colab notebook to reproduce the visualization results in the paper:

  • Visualization Notebook: This notebook shows how to perform query patch detection with the pre-trained model (without any fine-tuning on annotations).


License

UP-DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.


up-detr's Issues

_IncompatibleKeys

Hi, thanks for your work. I pre-trained on my own dataset starting from the ImageNet model you provided, then used my pre-trained model to fine-tune on my dataset. The code returns
_IncompatibleKeys(missing_keys=['class_embed.weight', 'class_embed.bias'], unexpected_keys=['patch2query.weight', 'patch2query.bias', 'feature_align.layers.0.weight', 'feature_align.layers.0.bias', 'feature_align.layers.1.weight', 'feature_align.layers.1.bias'])
Could you please offer some suggestions? Is this normal or abnormal? Thanks!

Random Crop

Can we randomly crop a patch from another image and paste it onto the training image, and also use the random crop as a pseudo-label, that is, find the cropped patch in the original image?

How to extract precision, recall and f1-score metrics

Hello, thank you for sharing the code.

I would like to know how to extract precision, recall and f1-score metrics. I already have the AP and AR metrics.

I am trying to use the following code but it gives me a numpy matrix:

precision = coco_eval.eval['precision']
recall = coco_eval.eval['recall']

Can you help me?
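For reference, a hedged sketch of how the pycocotools arrays are usually reduced to scalar precision/recall (and an F1 derived from them), assuming the same coco_eval object as above and COCOeval's [IoU thresholds, recall levels, classes, areas, maxDets] layout:

import numpy as np

# coco_eval.eval['precision'] has shape [T, R, K, A, M]:
#   T = 10 IoU thresholds (0.50:0.05:0.95), R = 101 recall levels,
#   K = classes, A = 4 area ranges (all/small/medium/large), M = 3 maxDets values.
# coco_eval.eval['recall'] has shape [T, K, A, M].
precision = coco_eval.eval['precision'][0, :, :, 0, 2]  # IoU=0.50, area=all, maxDets=100
recall = coco_eval.eval['recall'][0, :, 0, 2]

p = np.mean(precision[precision > -1])  # -1 marks classes without ground truth
r = np.mean(recall[recall > -1])
f1 = 2 * p * r / (p + r + 1e-12)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")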

How to get the inference time (speed) ?

❓ How to get the inference time or speed using UP-DETR?

I want to make some comparisons between models; how can I export (print) the inference time when testing or evaluating UP-DETR?
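A generic way to time a PyTorch detector is sketched below (model, data_loader_val, and device are placeholders; UP-DETR itself exposes no built-in timing flag as far as this README shows):

import time
import torch

model.eval()
times = []
with torch.no_grad():
    for samples, _ in data_loader_val:
        samples = samples.to(device)
        if torch.cuda.is_available():
            torch.cuda.synchronize()   # finish pending GPU work before starting the clock
        start = time.time()
        _ = model(samples)
        if torch.cuda.is_available():
            torch.cuda.synchronize()   # wait for the forward pass to complete
        times.append(time.time() - start)
print(f"mean inference time: {sum(times) / len(times) * 1000:.1f} ms per batch")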

Questions about the multi-query patch model

Dear author, why do you design a multi-query patch model? Can't you do a bipartite match between all random query patches and predictions, like DETR? Is it to maintain independence between patches?

Questions about the matching loss

Hi there, I have a few important questions about multiple-query-patch pre-training which I could not find answers to in the paper.

  1. Is there a reason why during multiple-query-patch pretraining, all predictions are matched with all query patches instead of matching each prediction only within its own group?
  2. Is the binary ground-truth (c_i) just always 1, or does it take different values, depending on which group was matched to it? In other words, will predictions from a given group be penalized for predicting the patch from a different group, if they turn out to be matched with that other group?
  3. Also what is the meaning/interpretation of $\widehat{c}_i=1$ during multiple-query-patch pretraining?

Thanks for the amazing work by the way.

How to support batch learning for one-shot object detection training?

So in the paper you suggest training UP-DETR for the task of one-shot object detection and provided interesting results on VOC.

As you don't seem to provide any code in this github related to the one-shot object detection training (please correct me if I'm mistaken), I tried to implement it myself. That being said, I confronted an obstacle when it came to supporting batch learning. This is because, if we have a minibatch of N target images, each of them will have a corresponding query patch, so a total of N query patches in this minibatch. How would you apply GAP and add the features of these N query patches to the object queries in the decoder? It doesn't seem to me that adding the features of the ith query patch to the object queries while forwarding a batch containing the jth target image through the decoder (where the jth target image isn't related to the ith query object) is the correct thing to do.

So, my question is, were you able to support batch learning for one-shot object detection? If so, how?

Detecting objects

In DETR, only the image is needed for object detection, but UP-DETR needs an extra list of patches to detect objects. Why? Could the list of patches be fixed for all images if we only want to detect objects? Thank you for the interesting work!

Cannot reproduce the author's results with the pre-trained models

Hi there,

I'm currently experimenting with some Few/One/Zero-Shot for object detection and classification. For one of the tasks, your paper has been experimented with.

Unfortunately, I haven't been able to reproduce your results with the pre-trained models you have made available. I also noticed that the inference code you made available does not work out of the box. To support my points, here are some details:

  1. At the moment it is not possible to use the latest PyTorch with the latest TorchVision. The latter should be pinned to version 0.9.0.
  2. For the ImageNet pre-trained model
  • In your code samples, you use only 6 patches, but the model has been trained with the default 100 queries and 10 patches. The README file needs adjustments.

Results:

  • ImageNet pre-trained model (I duplicated some patches to make sure I had 10; same kittens image used)

    Patches: (image)
    Detections: (image)

  • COCO pre-trained model (custom image used)

    Patches: (image)
    Detections: (image)

Hardware used

  • MacBook Air M1
  • NVIDIA GeForce RTX2080i

Yes, I tried with both the CPU and a CUDA-compliant device.

Are you sure you have uploaded the right checkpoint files?

Thanks in advance and looking to hear from you.

The trained model does not converge

(image)
I trained for 170 epochs on my own dataset (already converted to COCO format, single-class detection); the loss barely decreases and the validation AP is also 0.
Training command: python -m torch.distributed.launch --nproc_per_node=1 --use_env detr_main.py --lr_drop 200 --epochs 300 --lr_backbone 5e-4 --pre_norm --coco_path /home/work/mnt/project/up-detr/data/coco --pretrain /home/work/mnt/project/up-detr/checkpoints/up-detr-pre-training-60ep-imagenet.pth

Evaluate the fine-tuned model

Hi again, @dddzg

I'm trying to evaluate the model I fine-tuned on a separate dataset that doesn't contain annotations (as it is supposed to be). According to the docs in the readme, you did the evaluation step using COCO val5k, which contains only images.

When I try the same, I get an error due to the non-existence of the instances_val2017.json. For evaluation purposes, those should not be required.

The path below has been edited for simplicity.

FileNotFoundError: [Errno 2] No such file or directory: '.../annotations/instances_val2017.json'

Thanks in advance.

Class Loss

(image)
I didn't find this parameter in your code. Can you tell me which one it is?

As far as I know, the CNN backbone does not participate in training, but is only used to extract image features. Can the CNN and the transformer be separated? For example, first use ResNet to extract image features, and then randomly crop patches at the feature level. I mean starting from the features.
I plan to use this idea for video tasks, but I can't directly manipulate the video itself; I can only start from video features. I don't know if this is possible.

Evaluation problem

          I have the opposite situation.

RuntimeError: Error(s) in loading state_dict for UPDETR:
size mismatch for class_embed.weight: copying a param with shape torch.Size([3, 256]) from checkpoint, the shape in current model is torch.Size([92, 256]).
size mismatch for class_embed.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([92]).

How to fix it?

Originally posted by @liuchengying758650786 in #13 (comment)
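For context, this kind of size mismatch usually means the model was built with a different number of classes than the checkpoint was fine-tuned with (here: a 3-class checkpoint loaded into a 92-class COCO model, or vice versa). Two common workarounds, sketched with placeholder names (model is your DETR/UP-DETR instance):

import torch

checkpoint = torch.load("checkpoint.pth", map_location="cpu")

# Option 1: rebuild the model with the same number of classes the checkpoint was trained with.

# Option 2: drop the mismatched classification head, load the rest non-strictly,
# and let class_embed be re-initialized (then fine-tune on your own classes).
for k in ("class_embed.weight", "class_embed.bias"):
    checkpoint["model"].pop(k, None)
model.load_state_dict(checkpoint["model"], strict=False)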

SWAV

Is UP-DETR very dependent on the pre-trained CNN? Why choose SwAV? Can other unsupervised learning methods replace SwAV?

Unexpected key(s) in state_dict: "feature_align.layers.0.weight", "feature_align.layers.0.bias", "feature_align.layers.1.weight", "feature_align.layers.1.bias", "patch2query.weight", "patch2query.bias".

Excuse me, I fine-tuned on my own dataset and ran the evaluation.

This is my warning after evaluation in PyCharm (Windows 10), pytorch==1.12.1, torchvision==0.13.1, cuda==11.7, RTX 3070 Ti.

Unexpected key(s) in state_dict: "feature_align.layers.0.weight", "feature_align.layers.0.bias", "feature_align.layers.1.weight", "feature_align.layers.1.bias", "patch2query.weight", "patch2query.bias".

I don't know how to solve this problem. I tested the following methods:
1. Popping these weights and biases before fine-tuning, but the evaluation result is 0 (yes, all IoUs are 0).
2. Popping these weights and biases after fine-tuning: all IoUs are 0.

Please give me some advice

Error in notebook while loading up-detr-coco-fine-tuned-300ep.pth

RuntimeError: Error(s) in loading state_dict for UPDETR:
size mismatch for class_embed.weight: copying a param with shape torch.Size([92, 256]) from checkpoint, the shape in current model is torch.Size([3, 256]).
size mismatch for class_embed.bias: copying a param with shape torch.Size([92]) from checkpoint, the shape in current model is torch.Size([3]).

Code for VOC

Thanks for your interesting work,

Would you mind releasing the training code for PASCAL VOC?

Regards,

Unexpected keys in dict when running evaluation

How can I evaluate the provided model? I'm trying to use DETR's recipe and getting a checkpoint-loading error.

Thank you!

DATASETROOT=$PWD/path/to/coco/

UPDETRCKPTURL='https://drive.google.com/file/d/1_YNtzKKaQbgFfd6m2ZUCO6LWpKqd7o7X'
UPDETRCKPT='up-detr-coco-fine-tuned-300ep.pt'

git clone https://github.com/dddzg/up-detr
cd up-detr

GOOGLE_DRIVE_FILE_ID=$(echo $UPDETRCKPTURL | rev | cut -d'/' -f1 | rev)
CONFIRM=$(wget --quiet --save-cookies googlecookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$GOOGLE_DRIVE_FILE_ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
wget -O $UPDETRCKPT --load-cookies googlecookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$GOOGLE_DRIVE_FILE_ID"

python main_detr.py --batch_size 2 --no_aux_loss --eval --resume $UPDETRCKPT --coco_path $DATASETROOT
Not using distributed mode
git:
  sha: 00be9b996f52324335e0cc3fe7a59bfba9f43540, status: clean, branch: master

Namespace(aux_loss=False, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/specific/netapp5_2/gamir/lab/vadim/foo/../selfsupslots/data/common/coco/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=True, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, pretrain='', remove_difficult=False, resume='up-detr-coco-fine-tuned-300ep.pt', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
number of params: 41302368
loading annotations into memory...

Done (t=32.57s)
creating index...
index created!
loading annotations into memory...
Done (t=4.40s)
creating index...
index created!
Traceback (most recent call last):
  File "detr_main.py", line 267, in <module>
    main(args)
  File "detr_main.py", line 197, in main
    model_without_ddp.load_state_dict(checkpoint['model'])
  File ".../vadim/prefix/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DETR:
        Unexpected key(s) in state_dict: "transformer.encoder.norm.weight", "transformer.encoder.norm.bias".
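For context, the unexpected transformer.encoder.norm.* keys typically appear when the model is built without pre-norm transformer layers while the checkpoint was trained with them: the Namespace above shows pre_norm=False, whereas the evaluation command in the README passes --pre_norm. Re-running with that flag would be the first thing to try (a suggestion based on the README's own evaluation command, reusing the variables from the script above):

python detr_main.py --batch_size 2 --no_aux_loss --eval --pre_norm --resume $UPDETRCKPT --coco_path $DATASETROOT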

Getting access to the one-shot object detection training code

Hello there!

As the code for the one-shot object detection task is not available in this repository, would there be any way to access it? If not would it be possible for you to share with me this code?

I tried to re-implement the ideas presented in your paper on top of DETR, but was unsuccessful in replicating the results shown in the paper. In fact, I was not able to build a model that "learns", as the loss remains high throughout the training without ever showing a consistent downwards trend.

What I've done in detail is the following: I took DETR's architecture, added to it the queries as input, passed the queries through the same backbone CNN as the target image, forwarded the resulting embedding to an average pooling layer to reduce the H*W dimensions to 1 (nn.AdaptiveAvgPool2d((1, 1))), forwarded the resulting vector to a projection linear layer (nn.Linear(backbone.num_channels, hidden_dim)) to project the features from an N-dimensional space to an M-dimensional space (where N is the channels dimension of the CNN backbone and M is the dimension within the encoder-decoder transformers), and finally, repeated the resulting vector X times (X being the number of object queries in the architecture) and added that to the object queries vectors (according to our discussion in #24 ).

My goal was to replicate the results (shown below) of "DETR" (without pretraining) in your paper for one-shot object detection on PASCAL VOC.

(screenshot of the referenced table from the paper)

Unfortunately, I was not able to replicate these results, and in fact have not had a converging model that learned the task at all (the loss is always high and oscillating). I tried various backbone learning rates, such as 1e-4, 5e-5, 1e-5, and 0, and all resulted in approximately the same results. Lastly, I also tried to add your proposed feature reconstruction loss to my code (both with backbone lr = 0 and > 0), but that also didn't help.

Thank you for your time, and I'm looking forward to hearing back from you!

Some questions about your code

Hi, I'm very interested in your work on the new object queries in the Transformer decoder obtained from patches cropped from the original images, but when I debug the code I get an error like this:
(screenshot of the error)

In the code, I didn't find anything about the generation of patches or the call to the forward propagation process, since the forward function of UP-DETR needs the patch inputs.
Besides, I use the COCO 2017 train set as the pre-training dataset. I find the fine-tuning process is exactly the same as DETR's, so I want to study the pre-training process; in other words, I want to see how UP-DETR works, especially in the decoder part.

I sincerely hope you can give some solutions. Thanks!

Deformable DETR support

Hello!
I am really interested in your work as I think it is something necessary to successfully exploit DETR in real world applications.
At ICLR 2021, an improvement of DETR called "Deformable DETR" was proposed, with a number of modifications in the transformer part of the network that improve performance and reduce computational complexity.
Are you planning to support Deformable DETR and provide a pre-trained model for it as well? I certainly think this could improve the adoption of your pre-training approach, as more people could exploit it.

Code for Deformable DETR is available: https://github.com/fundamentalvision/Deformable-DETR

Thanks in advance

num_classes

Hi, why did you set the number of categories to 2 in the code? Can I set it to 1 or any other integer in the pre-training stage? Any advice is greatly appreciated.

if args.dataset_file == "ImageNet":
    num_classes = 2
