
[CVPR 2022] Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Home Page: https://arxiv.org/abs/2203.08481

License: Apache License 2.0

Python 93.73% Shell 6.27%
computer-vision visual-grounding cvpr2022 deep-learning pytorch multimodal-deep-learning vision-and-language

pseudo-q's Introduction

Pseudo-Q

This repository is the official PyTorch implementation of the CVPR 2022 paper Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding. (Primary contact: Haojun Jiang)

Links: arXiv | Poster | Video

Please leave a STAR ⭐ if you like this project!

News

  • Update on 2022/03/15: Release the training code.
  • Update on 2022/06/02: Provide the poster and presentation video.
  • Update on 2022/06/04: Release the pseudo-query generation code.
  • Update on 2022/08/25: Provide the detection results for all datasets.

Reference

If you find our project useful in your research, please consider citing:

@inproceedings{jiang2022pseudoq,
  title={Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding},
  author={Jiang, Haojun and Lin, Yuanze and Han, Dongchen and Song, Shiji and Huang, Gao},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2022}
}

Contents

  1. Introduction
  2. Usage
  3. Results
  4. Contacts
  5. Acknowledgments

Introduction

We present a novel method, named Pseudo-Q, to automatically generate pseudo language queries for supervised training. Our method leverages an off-the-shelf object detector to identify visual objects in unlabeled images, and language queries for these objects are then obtained in an unsupervised fashion with a pseudo-query generation module. Extensive experimental results demonstrate that our method has two notable benefits: (1) it can significantly reduce human annotation costs, e.g., by 31% on RefCOCO, without degrading the original model's performance under the fully supervised setting, and (2) without bells and whistles, it achieves superior or comparable performance to state-of-the-art weakly-supervised visual grounding methods on all five datasets we experimented on. For more details, please refer to our paper.

Usage

Dependencies

Data Preparation

1. You can download the images from their original sources and place them in the ./data/image_data folder:

Finally, the ./data/image_data folder will have the following structure:

|-- image_data
   |-- data
      |-- flickr
      |-- gref
      |-- gref_umd
      |-- referit
      |-- unc
      |-- unc+
   |-- Flickr30k
      |-- flickr30k-images
   |-- other
      |-- images
      |-- refcoco
      |-- refcoco+
      |-- refcocog
   |-- referit
      |-- images
      |-- mask
      |-- splits
  • ./data/image_data/data/xxx/: Take the Flickr30K dataset as an example: ./data/image_data/data/flickr/ should contain the dataset's validation/test annotations (bbox-query pairs, downloaded from Gdrive) and our generated pseudo-annotations (pseudo-samples) for this dataset. Uncompress the provided pseudo-sample files and put them in the corresponding folder. (A small layout-check sketch follows this list.)
  • ./data/image_data/Flickr30k/flickr30k-images/: Image data for the Flickr30K dataset; please download it from this link (fill in the form to obtain the images).
  • ./data/image_data/other/images/: Image data for RefCOCO/RefCOCO+/RefCOCOg.
  • ./data/image_data/referit/images/: Image data for ReferItGame.
  • Besides, I have noticed that the download links for the refcoco/refcoco+/refcocog/referit data have recently become unavailable. You can leave your email in Issues #2 and I will send you a download link.
  • ./data/image_data/other/refcoco/, ./data/image_data/other/refcoco+/, ./data/image_data/other/refcocog/, ./data/image_data/referit/mask/, ./data/image_data/referit/splits/: I follow TransVG to prepare the data, but these folders are actually not used in training.
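
To avoid a failed run caused by a misplaced folder, you can sanity-check the layout above with a few lines of Python. This is only an illustrative sketch based on the directory tree shown here, not part of the official code:

import os

# Expected folders, taken from the directory tree above (illustration only).
expected_dirs = [
    "data/image_data/data/flickr",
    "data/image_data/data/gref",
    "data/image_data/data/gref_umd",
    "data/image_data/data/referit",
    "data/image_data/data/unc",
    "data/image_data/data/unc+",
    "data/image_data/Flickr30k/flickr30k-images",
    "data/image_data/other/images",
    "data/image_data/referit/images",
]

missing = [d for d in expected_dirs if not os.path.isdir(d)]
if missing:
    print("Missing folders:\n  " + "\n  ".join(missing))
else:
    print("Image data layout looks complete.")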

2. The generated pseudo region-query pairs can be downloaded from Tsinghua Cloud, or you can generate them yourself by following the instructions.

mkdir data
mv pseudo_samples.tar.gz ./data/
tar -zxvf pseudo_samples.tar.gz

Note that to train the model with pseudo-samples for a given dataset, you should put the uncompressed pseudo-sample files under the corresponding folder ./data/image_data/data/xxx/. For example, put flickr_train_pseudo.pth under ./data/image_data/data/flickr/.
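
If you want to confirm that a pseudo-sample file landed in the right place and loads correctly, you can inspect it with torch.load. The snippet below is a minimal sketch; the assumption that the file holds a list of per-sample records is mine, so adjust it to whatever you actually see:

import torch

# Illustrative inspection of a pseudo-sample file (path from the example above).
# The record structure printed here is an assumption, not official documentation.
samples = torch.load("./data/image_data/data/flickr/flickr_train_pseudo.pth",
                     map_location="cpu")
print(type(samples), len(samples))
print(samples[0])  # one entry: expected to hold the image id, box, and pseudo query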

For generating pseudo-samples, we adopt the pretrained detector and attribute classifier from Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. A PyTorch implementation of this paper is available at bottom-up-attention.
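
As a rough illustration of how a detection can be turned into a pseudo query, consider the sketch below. It is a simplification of the idea, not the repository's actual generation logic; the query template and the thirds-based spatial rule are assumptions made up for this example:

# Simplified sketch: compose a pseudo query from one detection's attribute,
# category, and a coarse horizontal position. The real pipeline lives in the
# pseudo-query generation scripts of this repository.
def make_pseudo_query(category, attribute, box, image_width):
    x_center = (box[0] + box[2]) / 2.0
    if x_center < image_width / 3.0:
        position = "left"
    elif x_center > 2.0 * image_width / 3.0:
        position = "right"
    else:
        position = "middle"
    return f"{attribute} {category} on the {position}"

print(make_pseudo_query("car", "red", box=(20, 40, 180, 200), image_width=640))
# -> "red car on the left"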

Pretrained Checkpoints

1. You can download the DETR checkpoints from Tsinghua Cloud. These checkpoints should be downloaded and moved into the checkpoints directory.

mkdir checkpoints
mv detr_checkpoints.tar.gz ./checkpoints/
tar -zxvf detr_checkpoints.tar.gz

2. Checkpoints trained on our pseudo-samples can be downloaded from Tsinghua Cloud. You can evaluate these checkpoints following the instructions right below.

mv pseudoq_checkpoints.tar.gz ./checkpoints/
tar -zxvf pseudoq_checkpoints.tar.gz
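
A quick way to confirm that a downloaded checkpoint is intact is to load it on the CPU and look at its top-level keys. The path comes from the evaluation command below; what exactly is stored inside the file is an assumption here:

import torch

# Sanity check: make sure the checkpoint file deserializes.
# (The exact contents are an assumption; adjust as needed.)
ckpt = torch.load("./checkpoints/unc_best_checkpoint.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g., a model state dict plus training metadata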

Training and Evaluation

  1. Training on RefCOCO.

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port 28888 --use_env train.py --num_workers 8 --epochs 10 --batch_size 32 --lr 0.00025 --lr_bert 0.000025 --lr_visu_cnn 0.000025 --lr_visu_tra 0.000025 --lr_scheduler cosine --aug_crop --aug_scale --aug_translate --backbone resnet50 --detr_model checkpoints/detr-r50-unc.pth --bert_enc_num 12 --detr_enc_num 6 --dataset unc --max_query_len 20 --data_root ./data/image_data --split_root ./data/pseudo_samples/ --prompt "find the region that corresponds to the description {pseudo_query}" --output_dir ./outputs/unc/;
    

    Notably, if you use a smaller batch size, you should also use a smaller learning rate. The original learning rates are set for a batch size of 256 (8 GPUs × 32); see the scaling sketch after this list. Please refer to scripts/train.sh for training commands on other datasets.

  2. Evaluation on RefCOCO.

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node=8 --master_port 28888 --use_env eval.py --num_workers 4 --batch_size 128 --backbone resnet50 --bert_enc_num 12 --detr_enc_num 6 --dataset unc --max_query_len 20 --data_root ./data/image_data --split_root ./data/pseudo_samples/ --eval_model ./checkpoints/unc_best_checkpoint.pth --eval_set testA --prompt "find the region that corresponds to the description {pseudo_query}" --output_dir ./outputs/unc/testA/;
    

    Please refer to scripts/eval.sh for evaluation commands on other splits or datasets.
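
As noted in step 1, the learning rates above assume an effective batch size of 256. A common way to adapt them is linear scaling; this is a rule of thumb rather than something the repository enforces:

# Linear learning-rate scaling rule of thumb (an assumption, not enforced by this repo):
# scale every learning rate by (your effective batch size) / 256.
base_batch_size = 256  # 8 GPUs x 32 samples per GPU, as in the commands above
base_lrs = {"lr": 0.00025, "lr_bert": 0.000025,
            "lr_visu_cnn": 0.000025, "lr_visu_tra": 0.000025}

your_batch_size = 96   # e.g., 3 GPUs x 32 samples per GPU
scale = your_batch_size / base_batch_size
print({name: value * scale for name, value in base_lrs.items()})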

Results

1. Visualization of Pseudo-samples.

2. Experiments of Reducing the Manual Labeling Cost on RefCOCO.

3. Results on RefCOCO/RefCOCO+/RefCOCOg.

4. Results on ReferItGame/Flickr30K Entities.

Please refer to our paper for more details.

Contacts

jhj20 at mails dot tsinghua dot edu dot cn

Any discussions or concerns are welcome!

Acknowledgments

This codebase is built on TransVG, bottom-up-attention and Faster-R-CNN-with-model-pretrained-on-Visual-Genome. Please consider citing or starring these projects.

pseudo-q's People

Contributors

jianghaojun, leaplabthu


pseudo-q's Issues

can't find build_model

Hey, I can't find the code for build_model in the models folder. Which .py file is it in?

RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

Sorry for bothering you. When I run train.py, something goes wrong. Here is the output information:

E:\Users\JayLee\anaconda3\envs\myenv\python.exe E:/Pseudo-Q-main/train.py
Not using distributed mode
git:
sha: N/A, status: clean, branch: N/A

INFO ### torch.backends.cudnn.benchmark = False

number of params: 155559940
Missing keys when loading detr model:
[]
Start training
E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ..\c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at ..\aten\src\ATen\native\BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
Traceback (most recent call last):
File "E:\Pseudo-Q-main\train.py", line 310, in
main(args)
File "E:\Pseudo-Q-main\train.py", line 265, in main
train_stats = train_one_epoch(
File "E:\Pseudo-Q-main\engine.py", line 38, in train_one_epoch
output = model(img_data, text_data)
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Pseudo-Q-main\models\trans_vg_mlcma.py", line 36, in forward
visu_mask, visu_src = self.visumodel(img_data)
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Pseudo-Q-main\models\visual_model\detr.py", line 72, in forward
out = self.transformer(self.input_proj(src), mask, pos[-1], query_embed=None)
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Pseudo-Q-main\models\visual_model\transformer.py", line 56, in forward
memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Pseudo-Q-main\models\visual_model\transformer.py", line 118, in forward
output = layer(output, src_mask=mask,
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Pseudo-Q-main\models\visual_model\transformer.py", line 225, in forward
return self.forward_post(src, src_mask, src_key_padding_mask, pos)
File "E:\Pseudo-Q-main\models\visual_model\transformer.py", line 196, in forward_post
src2 = self.self_attn(q, k, value=src, attn_mask=src_mask,
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\modules\activation.py", line 1031, in forward
attn_output, attn_output_weights = F.multi_head_attention_forward(
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\functional.py", line 4969, in multi_head_attention_forward
q, k, v = _in_projection_packed(query, key, value, in_proj_weight, in_proj_bias)
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\functional.py", line 4734, in _in_projection_packed
return linear(q, w_q, b_q), linear(k, w_k, b_k), linear(v, w_v, b_v)
File "E:\Users\JayLee\anaconda3\envs\myenv\lib\site-packages\torch\nn\functional.py", line 1847, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle)

Process finished with exit code 1

I don't know how to fix it. Could you please help and give me some ideas? Thank you!

Regarding the issue of generating pseudo-labels

Could you please tell me whether you used different methods or hyperparameters when generating pseudo-labels for different datasets? I have noticed that the number of pseudo-labels generated for refcoco and refcoco+ differs significantly, but the number of images contained in refcoco and refcoco+ seems to be quite similar.

Evaluation on RefCOCO

Hi. Thanks for sharing your nice work!

When I run eval.sh on RefCOCO testA, I get the error "No such file 'unc_testA.pth'". I wonder why the unc_testA.pth file is needed during evaluation.

I also ran generate_pseudo_data_unc.sh before evaluation and got unc_train_pseudo_split.pth files, not unc_pseudo_val.pth or unc_pseudo_testA.pth.

Thanks.

Unable to download the Faster R-CNN results

Hi, thank you for your work!

I cannot download the 'detection_results.tar.gz' file following the instructions here. Is this a server issue? Could you please provide other download sources?

Best,
Yunzhong

Problem about the training.

Recently, several researchers have asked me questions about training. The symptom is that the training loss does not decrease, or the validation accuracy is very low.

The reason is usually that they adopted a smaller batch size, e.g., 96, but did not change the learning rate.

First of all, I strongly recommend using the same batch size to reproduce our work. Secondly, if you use a smaller batch size, please try a correspondingly smaller learning rate.

If you have any new problems with training, please post your questions in this issue or open a new one. It would be better to provide as much information as you can, which helps me understand your question more quickly.

Inference API

Hi, this is nice work!
Could you please provide an inference API so that, for example, the user only needs to provide the path to an image and the corresponding description?

Clarification about the loss

TL;DR

Can you explain what is loaded in the dataset along with the image data? I would especially like to understand the content of bbox.

Dear authors,

I'm trying to figure out how the training of your model works.

In particular, from this line

loss_dict = loss_utils.trans_vg_loss(output, target)

I noticed that the target is used to compute the loss. The function trans_vg_loss confirms it:

def trans_vg_loss(batch_pred, batch_target):
    """Compute the losses related to the bounding boxes,
    including the L1 regression loss and the GIoU loss
    """
    batch_size = batch_pred.shape[0]
    # world_size = get_world_size()
    num_boxes = batch_size

    loss_bbox = F.l1_loss(batch_pred, batch_target, reduction='none')
    loss_giou = 1 - torch.diag(generalized_box_iou(
        xywh2xyxy(batch_pred),
        xywh2xyxy(batch_target)
    ))

    losses = {}
    losses['loss_bbox'] = loss_bbox.sum() / num_boxes
    losses['loss_giou'] = loss_giou.sum() / num_boxes

    return losses

I tried to understand what target is, and from this line

img_data, text_data, target = batch

I checked the collate_fn used in the dataloader:

Pseudo-Q/utils/misc.py

Lines 294 to 308 in ce1688f

def collate_fn(raw_batch):
    raw_batch = list(zip(*raw_batch))

    img = torch.stack(raw_batch[0])
    img_mask = torch.tensor(raw_batch[1])
    img_data = NestedTensor(img, img_mask)

    word_id = torch.tensor(raw_batch[2])
    word_mask = torch.tensor(raw_batch[3])
    text_data = NestedTensor(word_id, word_mask)

    bbox = torch.tensor(raw_batch[4])

    if len(raw_batch) == 7:
        batch = [img_data, text_data, bbox, raw_batch[5], raw_batch[6]]
    else:
        batch = [img_data, text_data, bbox]

    return tuple(batch)

Is this using the ground truth bounding box from the dataset?

I checked the __getitem__ function of the dataset and ended up with these three lines:

imgset_file = '{0}_{1}.pth'.format(self.dataset, split)
imgset_path = osp.join(dataset_path, imgset_file)
self.images += torch.load(imgset_path)

Here a .pth file is loaded, and along with the image data something else is loaded. Can you explain exactly what the loaded bbox contains?

Thank you,
Luca

Could the author provide a more detailed description of the dataset folder?

|-- image_data
   |-- data
      |-- flickr
      |-- gref
      |-- gref_umd
      |-- referit
      |-- unc
      |-- unc+
   |-- Flickr30k
      |-- flickr30k-images
   |-- other
      |-- images
      |-- refcoco
      |-- refcoco+
      |-- refcocog
   |-- referit
      |-- images
      |-- mask
      |-- splits

The dataset directory structure given in the README above is only a brief listing, and many entries differ from what I actually have, without explanation. The 'other' folder should refer to the files from the original 'refer' repository, right? Also, the 'data' folder should refer to the downloaded 'pseudo_samples', right? The 'referit' folder is a bit puzzling: the original 'refer' repository contains 'RefCLEF', but I cannot find the 'images', 'mask', and 'splits' subfolders under it. How should this referit folder be constructed?
Thank you!

How do I change the number of epochs?

Where is the epoch setting? The original 13,643 iterations are too many; it takes several days to run on my computer. I want to reduce the number of epochs. Thanks!

statistics of datasets

Hi, thank you for your excellent work! I found that /data/statistic/ does not contain the .txt split files for the other datasets. Is there any way to access these files?
