aimagelab / show-control-and-tell

282 stars · 10 watchers · 61 forks · 1.75 MB

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. CVPR 2019

Home Page: https://arxiv.org/abs/1811.10652

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
image-captioning captioning-images caption-generation visual-semantic pytorch cvpr2019

show-control-and-tell's Introduction

Show, Control and Tell

This repository contains the reference code for the paper Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions (CVPR 2019).

Please cite with the following BibTeX:

@inproceedings{cornia2019show,
  title={{Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions}},
  author={Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2019}
}

[Figure: sample results]

Environment setup

Clone the repository and create the sct conda environment using the conda.yml file:

conda env create -f conda.yml
conda activate sct

Our code is based on SpeakSee, a Python package developed by us that provides utilities for working with visual-semantic data. The conda environment we provide already includes a beta version of this package.
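
As a quick, optional sanity check (a hedged sketch, not part of the repository), you can verify from inside the activated environment that PyTorch and the bundled SpeakSee beta import correctly:

# Hypothetical sanity check: run after "conda activate sct" to confirm the
# environment exposes the packages the code relies on.
import torch
import speaksee

print("torch", torch.__version__)
print("speaksee loaded from", speaksee.__file__)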

Data preparation

COCO Entities

Download the annotations and metadata file dataset_coco.tgz (~85.6 MB) and extract it in the code folder using tar -xzvf dataset_coco.tgz.

Download the pre-computed features file coco_detections.hdf5 (~53.5 GB) and place it under the datasets/coco folder, which gets created after decompressing the annotation file.

Flickr30k Entities

As before, download the annotations and metadata file dataset_flickr.tgz (~32.8 MB) and extract it in the code folder using tar -xzvf dataset_flickr.tgz.

Download the pre-computed features file flickr30k_detections.hdf5 (~13.1 GB) and place it under the datasets/flickr folder, which gets created after decompressing the annotation file.
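
Optionally, a hedged way to check that the features files downloaded and extracted correctly is to open them with h5py; the paths below assume the folder layout described above, and nothing is assumed about the internal key naming:

# Hypothetical check: open each pre-computed features file and peek at a few
# top-level keys, without assuming how the keys are named.
import h5py

for path in ('datasets/coco/coco_detections.hdf5',
             'datasets/flickr/flickr30k_detections.hdf5'):
    try:
        with h5py.File(path, 'r') as f:
            keys = list(f.keys())
            print(path, '->', len(keys), 'top-level keys, e.g.', keys[:3])
    except OSError as e:
        print(path, 'could not be opened:', e)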

Evaluation

To reproduce the results in the paper, download the pretrained model file saved_models.tgz (~4 GB) and extract it in the code folder with tar -xzvf saved_models.tgz.

Sequence controllability

Run python test_region_sequence.py using the following arguments:

Argument          Possible values
--dataset         coco, flickr
--exp_name        ours, ours_without_visual_sentinel, ours_with_single_sentinel
--sample_rl       If used, tests the model with CIDEr optimization
--sample_rl_nw    If used, tests the model with CIDEr + NW optimization
--batch_size      Batch size (default: 16)
--nb_workers      Number of workers (default: 0)

For example, to reproduce the results of our full model trained on COCO-Entities with CIDEr+NW optimization (Table 2, bottom right), use:

python test_region_sequence.py --dataset coco --exp_name ours --sample_rl_nw  

Set controllability

Run python test_region_set.py using the following arguments:

Argument          Possible values
--dataset         coco, flickr
--exp_name        ours, ours_without_visual_sentinel, ours_with_single_sentinel
--sample_rl       If used, tests the model with CIDEr optimization
--sample_rl_nw    If used, tests the model with CIDEr + NW optimization
--batch_size      Batch size (default: 16)
--nb_workers      Number of workers (default: 0)

For example, to reproduce the results of our full model trained on COCO-Entities with CIDEr+NW optimization (Table 4, bottom row), use:

python test_region_set.py --dataset coco --exp_name ours --sample_rl_nw  

Expected output

Under logs/, you may also find the expected output of all experiments.

Training procedure

Run python train.py using the following arguments:

Argument          Possible values
--exp_name        Experiment name
--batch_size      Batch size (default: 100)
--lr              Initial learning rate (default: 5e-4)
--nb_workers      Number of workers (default: 0)
--sample_rl       If used, the model will be trained with CIDEr optimization
--sample_rl_nw    If used, the model will be trained with CIDEr + NW optimization

For example, to train the model with cross entropy, use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-4 

To train the model with CIDEr optimization (after training the model with cross entropy), use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-5 --sample_rl

To train the model with CIDEr + NW optimization (after training the model with cross entropy), use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-5 --sample_rl_nw

Note: the current training code only supports the use of the COCO Entities dataset.

[Figure: model]

COCO Entities

If you want to use only the annotations of our COCO Entities dataset, you can download the annotation file coco_entities_release.json (~403 MB).

The annotation file contains a python dictionary structured as follows:

coco_entities_release.json
 └── <id_image>
      └── <caption>
           └── 'det_sequences'
           └── 'noun_chunks'
           └── 'detections'
           └── 'split'

In detail, for each image-caption pair, we provide the following information:

  • det_sequences, which contains a list of detection classes associated with each word of the caption (for an exact match with caption words, split the caption by spaces). None indicates words that are not part of noun chunks, while _ indicates noun chunk words for which no association with a detection in the image was possible.
  • noun_chunks, which is a list of tuples representing the noun chunks of the caption that are associated with a detection in the image. Each tuple is composed of two elements: the first is the noun chunk in the caption, while the second is the detection class associated with that noun chunk.
  • detections, which contains a dictionary with one entry for each detection class associated with at least one noun chunk in the caption. For each detection class, it provides a list of tuples representing the image regions detected by a Faster R-CNN re-trained on Visual Genome [1] and corresponding to that detection class. Each tuple is composed of the detection id and the corresponding bounding box in the form [x1, y1, x2, y2]. The detection id can be used to recover the detection feature vector from the pre-computed features file coco_detections.hdf5 (~53.5 GB). See the demo section below and the loading sketch below for more details.
  • split, which indicates the dataset split of that sample (i.e. train, val or test), following the COCO splits provided by [2].

Note that this annotation file includes all image-caption pairs for which at least one noun chunk-detection association has been found. However, during the validation and testing phases of our controllable captioning model, we dropped all captions with empty region sets (i.e. those captions with at least one _ in the det_sequences field).
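
For orientation, here is a minimal hedged sketch of walking the annotation file with plain json, using only the field names described above; the coco_entities_demo.ipynb notebook in the Demo section remains the authoritative example, including how detection ids index into coco_detections.hdf5:

# Hedged sketch: inspect one image-caption pair from coco_entities_release.json.
# Field names follow the description above; see coco_entities_demo.ipynb for the
# full pipeline, including how detection ids index coco_detections.hdf5.
import json

with open('coco_entities_release.json') as f:
    entities = json.load(f)

image_id, caption_dict = next(iter(entities.items()))
caption, ann = next(iter(caption_dict.items()))
print('image id:', image_id, '| split:', ann['split'])
print('caption :', caption)

# word-by-word detection classes (None = not part of a noun chunk,
# '_' = noun chunk word with no associated detection)
for word, det_class in zip(caption.split(' '), ann['det_sequences']):
    print('%15s -> %s' % (word, det_class))

# noun chunks and the detection class associated with each of them
for chunk, det_class in ann['noun_chunks']:
    print(chunk, '->', det_class)

# detection regions: class -> list of (detection id, [x1, y1, x2, y2])
for det_class, regions in ann['detections'].items():
    for det_id, box in regions:
        print(det_class, det_id, box)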

[Figure: COCO Entities]

By downloading the dataset, you declare that you will use it for research and educational purposes only; any commercial use is prohibited.

Demo

An example of how to use the COCO Entities annotations can be found in the coco_entities_demo.ipynb file.

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[2] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Contact

If you have any questions about our work, please use the public issues section of this GitHub repo. Alternatively, drop us an e-mail at marcella.cornia [at] unimore.it or lorenzo.baraldi [at] unimore.it.

show-control-and-tell's People

Contributors

baraldilorenzo, marcellacornia


show-control-and-tell's Issues

Couldn't open the links for the .tgz and hdf5 files! Help me please!

When I click the link, it returns the following:

Forbidden
You don't have permission to access /releases/show-control-and-tell/dataset_coco.tgz on this server.
Apache/2.4.29 (Ubuntu) Server at ailb-web.ing.unimore.it Port 80

conda.yml - openssl=1.1.1=h7b6447c_0

Creating the environment with conda.yml was throwing errors at - openssl=1.1.1=h7b6447c_0. I changed it to - openssl=1.1.1g=h7b6447c_0, which worked.

coco_detections.hdf5

I can't download coco_detections.hdf5 (~53.5 GB) from the Google Drive link; can you share a Baidu network disk link instead? Thank you very much.

Running on other images

It would be great if you could describe whether it is possible, and how, to run inference on new images (without loading all the training data) :)

How to filter the detected objects?

Hi, thanks for sharing. I have a question and hope I can get an answer.
We can usually get 36 detected objects from Faster R-CNN, but I see that det_sequences in coco_entities_release.json usually contains only a few objects. I'm not sure which mechanism in the model filters the 36 objects down to just a few.
Is my understanding correct that the sorting network plays this role? Since the sorting network ranks the more important regions first, and only the first few regions of the region-set sequence are used to generate the caption, many regions are left unused and are thereby filtered out.
Or is it the adaptive attention with visual sentinel?
Thank you.

Download speed is so slow!

Hi!
While downloading the COCO and Flickr30k files, the speed is very slow and the download sometimes breaks off. Is there another way for me to get them? I really want to try the code.

What is ctrl_det_idxs

I thought ctrl_det_idxs meant the detection ids, but I am confused about ctrl_det_idxs + prev_outputs[1]. Is prev_outputs[1] the gate weights?

NONE

To cancel the first

new image not in coco

How can I extract features for images that are not in the COCO or Flickr datasets, in order to generate captions for them?

Controllability through a set of detections

Hi,
could you please give more information about Figure 4 in the paper?
In my understanding, for controllability through a sequence of detections you choose regions based on the ground-truth captions in the dataset. In the experiment of Figure 4, how do you choose a set of regions for an image?

some questions regarding the paper

Hi. I've read through your paper and it's very interesting. Congrats on that amazing work!
I have a few doubts and would appreciate your kind help. I haven't gone through the code yet, so if questions 2 and 3 relate to the implementation, please ignore them.
1. In equation 11 (the objective), is there a typo in the chunk-level probability? As I understand it, your switching gate is Boolean (0/1), so this term is equivalent to a binary cross-entropy loss; shouldn't it be log(1-p) rather than 1-log(p)?
2. In equation 6, you take the normalized exponential of z_t by dividing by the sum of the elements of z_t^r added to the vector z_t^c. Isn't it supposed to be added to the sum of the elements of the vector z_t^c, rather than to the vector itself?
Thank you, and best wishes for your CVPR presentation!

RuntimeError: view size is not compatible with input tensor's size and stride

Hi,
I'm trying to run the code with the commands below,
python test_region_sequence.py --dataset coco --exp_name ours --sample_rl_nw
and
python test_region_set.py --dataset coco --exp_name ours --sample_rl_nw
which test COCO Entities with CIDEr + NW optimization.

However, both give me the same error:

Traceback (most recent call last):
  File "test_region_sequence.py", line 138, in <module>
    out, _ = model.beam_search((detections_i, ctrl_det_seqs_test_unique), eos_idxs=[text_field.vocab.stoi['<eos>'], -1], beam_size=5)
  File "/home/tejasrii/show_control_tell/models/CaptioningModel.py", line 151, in beam_search
    selected_logprob, selected_idx = torch.sort(seq_logprob.view(b_s, -1), -1, descending=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

What should I do here..?

Thank you for sharing the code! :)
And I hope you stay safe from COVID-19.
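
For what it's worth, the error message itself suggests the usual workaround: on a non-contiguous tensor .view() fails while .reshape() succeeds, so switching the call at the reported line of CaptioningModel.py to .reshape (or calling .contiguous() first) is the commonly reported fix, though not an official one from the authors. A standalone reproduction with hypothetical shapes:

# Standalone reproduction of the generic PyTorch behaviour behind the error
# (hypothetical shapes, not the repo's actual tensors): a transposed tensor is
# non-contiguous, so .view() raises while .reshape() succeeds.
import torch

b_s = 2
seq_logprob = torch.randn(b_s, 5, 3).transpose(1, 2)  # non-contiguous tensor

try:
    seq_logprob.view(b_s, -1)                          # raises RuntimeError
except RuntimeError as e:
    print('view failed:', e)

flat = seq_logprob.reshape(b_s, -1)                    # copies if needed, always works
selected_logprob, selected_idx = torch.sort(flat, -1, descending=True)
print(selected_logprob.shape, selected_idx.shape)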

Why are there multiple detection regions for the same instance

Hi

Why are there multiple detection regions for the same instance in coco_entities_release.json, such as:
u'women': [[17, [211.71670532226562, 21.305891036987305, 428.42095947265625, 386.89093017578125]],
           [23, [98.44943237304688, 94.96699523925781, 302.0435485839844, 318.7557678222656]],
           [38, [228.9897918701172, 72.67094421386719, 458.8382263183594, 418.9734191894531]]]

And is there any difference between them?

Thanks!

Lower CIDEr values on CIDEr Optimization

I am training my captioning model for 32 epochs (with CE loss) and then doing SCST training. But my rewards become increasingly negative and my CIDEr decreases. Are there any possible reasons for this?
Thanks for the help in advance.

[Attached image: reward]

Do the classes in 'detections' appear in the same order as they appear in the caption label?

For example, if there is a caption like 'A man walking his dog on a street', I assume its corresponding 'detections' looks like:
{'man': [...], 'dog': [...], 'street': [...]}
with the class 'man' in the first position, 'dog' in the second and 'street' in the last, i.e. the order of these three keys corresponds to the order of the words in the caption. Is this assumption correct?

How can I get Detection Sequence?

Hi, thanks for your great work!
I have a question and need your help. How can I get the detection sequence for all regions in each image? I also want to know how the detection sequences are guaranteed to have the same shape, given that the number of regions differs from image to image.
Moreover, I found this line in your code: _, _, ctrl_det_seqs_test, _, captions = values. Is ctrl_det_seqs_test the detection sequence for all region features in each image?
If so, can the detection sequences for the training data be obtained in the same way?
In short, I want to get the detection sequence for each COCO image.
Can you help me? Thanks a lot!

Can't find the features of test split in coco_detections.hdf5

Hi,

Why can't I find the entities' features for the test-split images in coco_detections.hdf5?
There are 4995 images annotated as 'test' in coco_entities_release.json, but I can't find their entities' features in coco_detections.hdf5.

Thanks!

loss

Why is loss = loss_cap + 4*loss_gate? I cannot understand where the 4 comes from.

how to get detection features

Could you please share the code showing how to extract the detection features? Right now we can only run on the test dataset by downloading the pre-computed detection features.

epoch

The number of training epochs is not pre-determined in the code. Moreover, when training the model, the number of epochs differs from run to run.

BrokenPipeError: [Errno 32] Broken pipe

python test_region_sequence.py --dataset flickr --exp_name ours --sample_rl_nw
Loading "ours" model trained with CIDEr + NW optimization.
Test: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 63/63 [02:37<00:00, 2.09s/it]
Computing sequence contrallabity results.
Blue_1 0.4043501599100964
Bleu_2 0.265110236768992
Bleu_3 0.17974917026202153
Bleu_4 0.1251760198819223
Traceback (most recent call last):
  File "test_region_sequence.py", line 175, in <module>
    val_meteor, _ = Meteor().compute_score(gts_t, gen_t)
  File "/root/anaconda3/lib/python3.7/site-packages/speaksee/evaluation/meteor/meteor.py", line 48, in compute_score
    stat = self._stat(res[i][0], gts[i])
  File "/root/anaconda3/lib/python3.7/site-packages/speaksee/evaluation/meteor/meteor.py", line 65, in _stat
    self.meteor_p.stdin.flush()
BrokenPipeError: [Errno 32] Broken pipe

Training of Sinkhorn Operator and Data Definition

Hi, may I know whether the code for training the Sinkhorn network is available? Currently, I found that only test_region_set.py uses a pretrained Sinkhorn network, but I'm interested in how it was trained from scratch.

Also, I'm a bit confused about all the data loaded into the project. I'll write down my understanding of the data below; please correct me if I'm wrong.

  • detections (shape = 100x20x20x2048): the first dim is the batch size, the second is the time step for each word in the caption, the third is all the bounding boxes related to the noun label of the current time step, and the last is the feature dimension of an image region.
  • captions (shape = 100x100x2048): the first dim is the batch size, the second is the word sequence, and the last is the embedding feature of each word. If this is right, why are there 100 word time steps per sample but only 20 in detections?
  • ctrl_det_seqs (shape = 100x20): the first dim is the batch size, while the second refers to the index associated with the list of nouns.
  • ctrl_det_gts (shape = 100x20x20x2048): same as detections. I'm actually confused about the difference between ctrl_det_gts and detections. Please help.

object_class_list.txt

FileNotFoundError: [Errno 2] No such file or directory: 'datasets/coco/object_class_list.txt'

I don't know where to download object_class_list.txt

Sorry, how can I get the flickr30k Entities dataset?

Thanks for your great work! I read the following in the README:
If you want to use only the annotations of our COCO Entities dataset, you can download the annotation file coco_entities_release.json (~403 MB).
I think it's very interesting! But I can't find a way to generate the Flickr30k Entities dataset; can you help me?

No such file or directory: 'java': 'java'

When I run test_region_sequence.py from the command line on Linux, I don't get an error, but running it in PyCharm reports this error. How can I solve it?
