aimagelab / meshed-memory-transformer

Meshed-Memory Transformer for Image Captioning. CVPR 2020

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%
Topics: image-captioning, transformer, captioning-images, caption-generation, visual-semantic, pytorch, cvpr2020

meshed-memory-transformer's Introduction

M²: Meshed-Memory Transformer

This repository contains the reference code for the paper Meshed-Memory Transformer for Image Captioning (CVPR 2020).

Please cite with the following BibTeX:

@inproceedings{cornia2020m2,
  title={{Meshed-Memory Transformer for Image Captioning}},
  author={Cornia, Marcella and Stefanini, Matteo and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Meshed-Memory Transformer

Environment setup

Clone the repository and create the m2release conda environment using the environment.yml file:

conda env create -f environment.yml
conda activate m2release

Then download spacy data by executing the following command:

python -m spacy download en

Note: Python 3.6 is required to run our code.

Data preparation

To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file annotations.zip and extract it.

Detection features are computed with the code provided by [1]. To reproduce our result, please download the COCO features file coco_detections.hdf5 (~53.5 GB), in which detections of each image are stored under the <image_id>_features key. <image_id> is the id of each COCO image, without leading zeros (e.g. the <image_id> for COCO_val2014_000000037209.jpg is 37209), and each value should be a (N, 2048) tensor, where N is the number of detections.
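
For reference, a minimal sketch of reading one image's detections from this file with h5py, using the image id from the example above:

import h5py

with h5py.File('coco_detections.hdf5', 'r') as f:
    feats = f['37209_features'][()]    # detections for COCO_val2014_000000037209.jpg
    print(feats.shape, feats.dtype)    # (N, 2048): one 2048-d vector per detected region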

Evaluation

To reproduce the results reported in our paper, download the pretrained model file meshed_memory_transformer.pth and place it in the code folder.

Run python test.py using the following arguments:

Argument Possible values
--batch_size Batch size (default: 10)
--workers Number of workers (default: 0)
--features_path Path to detection features file
--annotation_folder Path to folder with COCO annotations
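
For example, an evaluation run with the documented arguments might look like the following (both paths are placeholders):

python test.py --batch_size 10 --features_path /path/to/coco_detections.hdf5 --annotation_folder /path/to/annotations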

Expected output

Under output_logs/, you may also find the expected output of the evaluation code.

Training procedure

Run python train.py using the following arguments:

Argument Possible values
--exp_name Experiment name
--batch_size Batch size (default: 10)
--workers Number of workers (default: 0)
--m Number of memory vectors (default: 40)
--head Number of heads (default: 8)
--warmup Warmup value for learning rate scheduling (default: 10000)
--resume_last If used, the training will be resumed from the last checkpoint.
--resume_best If used, the training will be resumed from the best checkpoint.
--features_path Path to detection features file
--annotation_folder Path to folder with COCO annotations
--logs_folder Path to folder for tensorboard logs (default: "tensorboard_logs")

For example, to train our model with the parameters used in our experiments, use

python train.py --exp_name m2_transformer --batch_size 50 --m 40 --head 8 --warmup 10000 --features_path /path/to/features --annotation_folder /path/to/annotations

Sample Results

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

meshed-memory-transformer's People

Contributors

baraldilorenzo, marcellacornia, matteostefanini


meshed-memory-transformer's Issues

variance dtype issue

Traceback (most recent call last):
File "test.py", line 77, in
scores = predict_captions(model, dict_dataloader_test, text_field)
File "test.py", line 26, in predict_captions
out, _ = model.beam_search(images, 20, text_field.vocab.stoi[''], 5)
File "/home/mingjie/meshed-memory-transformer/models/captioning_model.py", line 70, in beam_search
return bs.apply(visual, out_size, return_probs, **kwargs)
File "/home/mingjie/meshed-memory-transformer/models/beam_search/beam_search.py", line 71, in apply
visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
File "/home/mingjie/meshed-memory-transformer/models/beam_search/beam_search.py", line 121, in iter
self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
File "/home/mingjie/meshed-memory-transformer/models/containers.py", line 30, in apply_to_states
self._buffers[name] = fn(self._buffers[name])
File "/home/mingjie/meshed-memory-transformer/models/beam_search/beam_search.py", line 27, in fn
beam.expand(*([self.b_s, self.beam_size] + shape[1:])))
RuntimeError: gather_out_cuda(): Expected dtype int64 for index

When I try to evaluate your model, this error comes up and I don't know how to fix it. Could you please give me some tips?
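
Not an official fix, but this error comes from torch.gather, which on recent PyTorch versions requires an int64 (long) index tensor. A hedged workaround sketch is to cast the beam index to long wherever it is used as a gather index:

import torch

# values stands in for a cached decoder state; selected_beam for the beam index
# built in beam_search.py. torch.gather insists on an int64 (long) index.
values = torch.randn(10, 5, 512)                      # (batch, beam, feature)
selected_beam = torch.randint(0, 5, (10, 1)).int()    # wrong dtype on purpose

index = selected_beam.long().unsqueeze(-1).expand(10, 1, 512)
gathered = torch.gather(values, 1, index)             # works once the index is long
print(gathered.shape)                                 # torch.Size([10, 1, 512])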

register_state or register_buffer ?

Hi,
Thank you for open-sourcing your code. I really enjoyed reading your paper.
I am having trouble understanding a couple of things:

  1. here. How does register_state work here? Is it any different from register_buffer?
  2. Do the enc_output and mask_enc defined in register_state have something to do with the output here?

Thank you for your time.

The purpose of saving different states during beam search

Hi,

I have a question about the code implementation. During inference, I see that you concatenate the running keys with the current key vector, the running values with the current value vector, and so on.

If I remove this, the performance becomes very low, so I am curious: what is the purpose behind the concatenation?

Thanks
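
Not an answer from the authors, but for background: this is the standard key/value caching trick for incremental decoding. At each step only the newest token is fed, and the running keys/values are concatenated so that the single new query can still attend over the whole prefix; removing the concatenation leaves the query attending only to the last token, which explains the performance drop. A minimal sketch of the idea, assuming a generic attention layer (the repository appears to cache projected keys/values via register_state, but the principle is the same):

import torch

def step_with_cache(attn, x_t, cache):
    """One decoding step. x_t: (batch, 1, d_model); cache: all inputs seen so far."""
    # The concatenation the issue asks about: append the current step to the
    # running sequence so the new query can attend over the whole prefix.
    cache = x_t if cache is None else torch.cat([cache, x_t], dim=1)
    out, _ = attn(query=x_t, key=cache, value=cache)   # out: (batch, 1, d_model)
    return out, cache

attn = torch.nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
cache = None
for t in range(3):
    x_t = torch.randn(10, 1, 512)           # only the newest token is fed at step t
    out, cache = step_with_cache(attn, x_t, cache)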

Parallelizing the Network

Hi, thanks for providing the implementation, it really helps.

I am getting the following error when trying to use Multiple GPUs with DataParallel. Please note the implementation works perfectly fine on a single GPU.
Here is the traceback:

Traceback (most recent call last):
File "train.py", line 247, in
train_loss = train_xe(model, dataloader_train, optim, text_field)
File "train.py", line 77, in train_xe
out = model(detections, captions)
File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 151, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 156, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 111, in replicate
buffer_copies_not_rg = _broadcast_coalesced_reshape(buffers_not_rg, devices, detach=True)
File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/nn/parallel/replicate.py", line 75, in _broadcast_coalesced_reshape
return comm.broadcast_coalesced(tensors, devices)
File "/opt/conda/envs/m2release/lib/python3.6/site-packages/torch/cuda/comm.py", line 39, in broadcast_coalesced
return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: tensors.size() == order.size() INTERNAL ASSERT FAILED at /pytorch/torch/csrc/utils/tensor_flatten.cpp:66, please report a bug to PyTorch. (reorder_tensors_like at /pytorch/torch/csrc/utils/tensor_flatten.cpp:66)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f71080e7273 in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::utils::reorder_tensors_like(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10::ArrayRef<at::Tensor>) + 0x139f (0x7f710c34c9cf in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: torch::cuda::broadcast_coalesced(c10::ArrayRef<at::Tensor>, c10::ArrayRef, unsigned long) + 0x1d96 (0x7f710c834d76 in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: + 0x5f422c (0x7f71528fd22c in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: + 0x1d3ef4 (0x7f71524dcef4 in /opt/conda/envs/m2release/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #48: __libc_start_main + 0xf0 (0x7f7160f72830 in /lib/x86_64-linux-gnu/libc.so.6)
Thanks :)

The input tensor shape of the self-attention in decoder when doing the beam search

Hello! Thank you for providing this good work.
When I run the evaluate_metrics function in train.py (which actually performs beam search), I find that at each time step the input tensor shape of MeshedDecoderLayer is (10, 1, 512), where 10 is the batch size and 512 is d_model.
Why isn't the shape (10, t, 512), where t is the current time step? By feeding a (10, t, 512)-shaped tensor, the decoder could make use of information from all previous time steps instead of only the last one, which in my opinion is also how the Transformer normally works.

the caption is incomplete

I used https://github.com/peteanderson80/bottom-up-attention/ for feature extraction on my own images and then ran the image captioning model, but the resulting captions are incomplete.

e.g.
[image]

caption: "a view of a city with a building in the"

[image]

caption: "a view of a city with a view of a river and a"

[image]

caption: "a woman in a yellow dress walking on a"

It seems the result is truncated.

Unable to open file (file read failed)

Traceback (most recent call last):
File "test.py", line 77, in
scores = predict_captions(model, dict_dataloader_test, text_field)
File "test.py", line 23, in predict_captions
for it, (images, caps_gt) in enumerate(iter(dataloader)):
File "/home/bg/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 560, in next
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/home/bg/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 560, in
batch = self.collate_fn([self.dataset[i] for i in indices])
File "/home/bg/BG/caption_space/meshed-memory-transformer/data/dataset.py", line 128, in getitem
return self.key_dataset[i], self.value_dataset[i]
File "/home/bg/BG/caption_space/meshed-memory-transformer/data/dataset.py", line 42, in getitem
data.append(field.preprocess(getattr(example, field_name)))
File "/home/bg/BG/caption_space/meshed-memory-transformer/data/field.py", line 109, in preprocess
f = h5py.File(self.detections_path, 'r')
File "/home/bg/anaconda3/envs/m2release/lib/python3.6/site-packages/h5py/_hl/files.py", line 312, in init
fid = make_fid(name, mode, userblock_size, fapl, swmr=swmr)
File "/home/bg/anaconda3/envs/m2release/lib/python3.6/site-packages/h5py/_hl/files.py", line 142, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 78, in h5py.h5f.open

I ran into this problem. How can I fix it?

Conda Env Create Failed

Hi~ when I create the conda env, I got this error 👇

Could you tell me how to fix it? It seems like some of the channels used to install these libraries are not in my conda config.

Confirming Training Time/Memory Information

Hello, thank you for your great work!

I just wanted to confirm some simple information about training time and memory usage, as I didn't see them in the paper/repo, and I wanted to make sure that the code is running correctly on my machine.

I am running on a single V100 with your parameters: --batch_size 50 --m 40 --head 8. I find that this consumes around 6GB of GPU memory and that each epoch takes around 3 hours (so around 30 epochs should take around ~90h = ~4 days). Does this match your training time/memory usage?

I see from the paper that you are training with a (single?) 2080TI, and I see in the code that you stop training dynamically using the patience variable. Do you know how many epochs it took for you to stop training on your final run (around 130 CIDEr) and how long this took?

Thank you again for your work!

Flag for CPU-only evaluation

I tried running the evaluation script "test.py" with the documented arguments, but an error occurred as follows:

Meshed-Memory Transformer Evaluation
Traceback (most recent call last):
  File "test.py", line 69, in <module>
    model = Transformer(text_field.vocab.stoi['<bos>'], encoder, decoder).to(device)
  File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
    return self._apply(convert)
  File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    module._apply(fn)
  [Previous line repeated 3 more times]
  File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
    param.data = fn(param.data)
  File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 384, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
    _check_driver()
  File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

It looks like the code uses CUDA somewhere.
Does the code support CPU-only execution?
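
Not an official feature of the repository, but a minimal workaround sketch, assuming one patches the device selection in test.py (names mirror typical PyTorch usage, not necessarily the exact code): choose the device dynamically instead of hard-coding 'cuda'.

import torch

# Fall back to CPU when no NVIDIA driver / GPU is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# model = Transformer(...).to(device)
# data = torch.load('meshed_memory_transformer.pth', map_location=device)
# model.load_state_dict(data['state_dict'])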

Unable to open file (file signature not found)

Epoch 0 - train: 0%| | 0/11328 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train.py", line 239, in
train_loss = train_xe(model, dataloader_train, optim, text_field)
File "train.py", line 75, in train_xe
for it, (detections, captions) in enumerate(dataloader):
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in next
data = self._next_data()
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 385, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/meshed-memory-transformer-master/data/dataset.py", line 42, in getitem
data.append(field.preprocess(getattr(example, field_name)))
File "/root/meshed-memory-transformer-master/data/field.py", line 109, in preprocess
f = h5py.File(self.detections_path, 'r')
File "/opt/conda/lib/python3.7/site-packages/h5py/_hl/files.py", line 408, in init
swmr=swmr)
File "/opt/conda/lib/python3.7/site-packages/h5py/_hl/files.py", line 173, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 88, in h5py.h5f.open

About the HDF5 file boxes shape

<HDF5 dataset "544046_boxes": shape (25, 4), type "<f4">
[[ 0. 0. 405.69147 342.30185 ]
[109.02886 216.21622 499.09998 419.3 ]
[ 0. 98.41356 175.93782 311.00922 ]
.......[162.85463 284.66254 215.16975 341.75537 ]]
what does [a, b, c, d] mean?
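
For context (an assumption, not an official answer): the *_boxes entries come from the bottom-up attention detector [1], whose boxes are conventionally stored as (x1, y1, x2, y2) pixel corner coordinates of each detected region. A small check under that assumed convention:

import h5py

with h5py.File('coco_detections.hdf5', 'r') as f:
    boxes = f['544046_boxes'][()]          # shape (25, 4): one row per detected region

# Under the assumed (x1, y1, x2, y2) convention, widths and heights are positive.
x1, y1, x2, y2 = boxes.T
print('widths :', (x2 - x1).min(), '-', (x2 - x1).max())
print('heights:', (y2 - y1).min(), '-', (y2 - y1).max())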

The model runs quite slowly

Hi @marcellacornia,

I have tried to build a demo using the M2 transformer because I found that the model works very well. Unfortunately, when I run inference on CPU, it takes about 30 seconds for a 64-character sequence; I guess the beam-search algorithm is the reason the inference takes so long. Do you have any ideas for how I could improve the performance of this implementation?

Error on testing the network on Windows 10

I'm trying to test the network on my Windows 10 notebook. I configured all the packages, but when the test starts it gives me the following error:

Traceback (most recent call last):
File "", line 1, in
File "C:\Program Files\JetBrains\PyCharm 2020.2.3\plugins\python\helpers\pydev_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2020.2.3\plugins\python\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "E:/DeepLearning/Fashion/FeaturesExtraction/FashionFeaturesExtraction/captioning/meshed-memory-transformer/test.py", line 77, in
scores = predict_captions(model, dict_dataloader_test, text_field)
File "E:/DeepLearning/Fashion/FeaturesExtraction/FashionFeaturesExtraction/captioning/meshed-memory-transformer/test.py", line 26, in predict_captions
out, _ = model.beam_search(images, 20, text_field.vocab.stoi[''], 5, out_size=1)
File "E:\DeepLearning\Fashion\FeaturesExtraction\FashionFeaturesExtraction\captioning\meshed-memory-transformer\models\captioning_model.py", line 70, in beam_search
return bs.apply(visual, out_size, return_probs, **kwargs)
File "E:\DeepLearning\Fashion\FeaturesExtraction\FashionFeaturesExtraction\captioning\meshed-memory-transformer\models\beam_search\beam_search.py", line 71, in apply
visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
File "E:\DeepLearning\Fashion\FeaturesExtraction\FashionFeaturesExtraction\captioning\meshed-memory-transformer\models\beam_search\beam_search.py", line 121, in iter
self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
File "E:\DeepLearning\Fashion\FeaturesExtraction\FashionFeaturesExtraction\captioning\meshed-memory-transformer\models\containers.py", line 30, in apply_to_states
self._buffers[name] = fn(self._buffers[name])
File "E:\DeepLearning\Fashion\FeaturesExtraction\FashionFeaturesExtraction\captioning\meshed-memory-transformer\models\beam_search\beam_search.py", line 26, in fn
s = torch.gather(s.view(*([self.b_s, cur_beam_size] + shape[1:])), 1,
RuntimeError: gather_out_cuda(): Expected dtype int64 for index

Q/A visual for coding

Hi @baraldilorenzo,

I'm trying to improve the speed of beam_search. While doing so, I found this call:
visual = self._expand_visual(visual, cur_beam_size, selected_beam)
in the iter function of beam_search.py.

Could you please tell me what it does?

T.T.T

Performance without reinforce optimization stage

Hi,

Thanks for providing the implementation.

I checked the paper and observed that only the performance with reinforce optimization is reported. I wonder if you could also report the performance of your model before the reinforce optimization stage.

Thanks.

about batch grouping

Hello. Thanks for your work on M2.
I would like to ask regarding the data.
Are the batches grouped according to the lengths of the captions/image features? That is, does each batch contain captions of similar length? For example, a batch of 5 might have caption lengths [16, 16, 16, 16, 16], meaning all captions in that batch have the same length. Do you do this in your code? (I'm asking because the original TensorFlow implementation of the Transformer does this, so I'm wondering whether it is important and affects performance.)
Thanks

Feature for COCO on-line Test images

Thanks for sharing the code of this brilliant work!

I'm wondering whether it would be possible to make the detection file for the COCO online test images available, like the train/val HDF5 file, or whether there is some online resource that I have missed.

Thanks in advance!

Architecture related doubt

Are you feeding the raw image regions (y × y × 3) or some deep-CNN-generated embeddings to your encoder-decoder network?

features (N, 2048) = graph node features? & How to switch mode='teacher_forcing' to mode='feedback' when testing

Thanks for your quick response before! It's really helpful! I have some other questions. Hope you can kindly help me. (That issue is closed.)

  1. The features (N, 2048) for each image in coco_detections.hdf5 are region (graph node) features, and you didn't use the whole image as context. Is that correct?
  2. From my understanding, Mk and Mv are related to "the a priori knowledge on relationships between image regions"?
  3. I tried to save the predicted captions with "json.dump(gen, open('predict_caption/predict_caption_val.json', 'w'))". Have you ever encountered the problem of all the predicted captions being the same sentence?
  4. During training, we use teacher forcing and feed the true word at the next time step. During testing, though, we have to feed the predicted word to generate the sentence. In the code, I saw mode='teacher_forcing' and mode='feedback'. But how do you switch between them for training and testing? (See the sketch after this list.)
  5. If you don’t mind, could you publish the script for the "visualization of attention states" (Integrated Gradients approach)?
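
For point 4, a generic sketch of the two decoding modes, using a hypothetical ToyDecoder with a step() interface (this is illustrative, not the repository's actual mode-switching code):

import torch

class ToyDecoder(torch.nn.Module):
    """Hypothetical stand-in for a captioning decoder exposing a step() interface."""
    def __init__(self, vocab_size=100, d_model=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, d_model)
        self.cell = torch.nn.GRUCell(d_model, d_model)
        self.out = torch.nn.Linear(d_model, vocab_size)

    def step(self, token, state):
        h = self.cell(self.embed(token), state)
        return self.out(h), h

def decode(decoder, bos_idx, batch_size, targets=None, max_len=20):
    """Teacher forcing when ground-truth targets are given, feedback otherwise."""
    token = torch.full((batch_size,), bos_idx, dtype=torch.long)
    state, outputs = None, []
    steps = targets.size(1) if targets is not None else max_len
    for t in range(steps):
        logits, state = decoder.step(token, state)
        outputs.append(logits)
        # teacher forcing: feed the ground-truth word; feedback: feed the prediction.
        token = targets[:, t] if targets is not None else logits.argmax(dim=-1)
    return torch.stack(outputs, dim=1)

dec = ToyDecoder()
train_logits = decode(dec, bos_idx=1, batch_size=4, targets=torch.randint(0, 100, (4, 12)))
test_logits = decode(dec, bos_idx=1, batch_size=4)    # free-running generation at test time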

Is it possible to run this on a new input image?

Hello,

Thank you for open-sourcing the code for the awesome projects you've done including M^2 and Show, Control and Tell.

I read an issue on Show, Control and Tell saying that it is not possible to use it on new images, since the control signals are extracted from COCO/Flickr. Is this the case for this approach as well?

In case one wanted to apply it to new images (not in the COCO features), how should one proceed?

Thanks!

I met an error that I can't solve, please help

PS D:\meshed-memory-transformer-master> python train.py --exp_name m2_transformer --batch_size 50 --m 40 --head 8 --warmup 10000 --features_path meshed_memory_transformer.pth --annotation_folder annotations
Namespace(annotation_folder='annotations', batch_size=50, exp_name='m2_transformer', features_path='meshed_memory_transformer.pth', head=8, logs_folder='tensorboard_logs', m=40, resume_best=False, resume_last=False, warmup=10000, workers=0)
Meshed-Memory Transformer Training
Traceback (most recent call last):
File "train.py", line 186, in
cider_train = Cider(PTBTokenizer.tokenize(ref_caps_train))
File "D:\meshed-memory-transformer-master\evaluation\tokenizer.py", line 48, in tokenize
p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname,stdout=subprocess.PIPE, stderr=open(os.devnull, 'w'))
File "D:\soft\Anaconda3\lib\subprocess.py", line 709, in init
restore_signals, start_new_session)
File "D:\soft\Anaconda3\lib\subprocess.py", line 997, in _execute_child
startupinfo)

Learned a priori knowledge & New dataset which is very different from MSCOCO

Hi, in the paper you mention "encodes relationships between image regions exploiting learned a priori knowledge", and I am confused about it. Does the learned a priori knowledge exist before you train the model? In which part of the code is the learned a priori knowledge provided as input? How would one obtain the learned a priori knowledge for a new dataset that is very different from MSCOCO?

Reg. Training time

Hi,

Thanks for sharing your code here.

Can you please tell us what type of GPUs you trained your model on, how much time it took to complete one epoch, and for how many epochs you ran the model?

Regards
Deepak Mittal

Unable to replicate results after retraining

Hello and thank you for this fantastic repo!

I am trying to retrain your model using COCO features I have extracted myself using the bottom-up attention repo as you have suggested in #2. I am currently on epoch 15 and the highest CIDEr score on the test set has been 1.13. This is much less than the 1.31 that I get when using your pretrained model. Other than the new features, I am using your default values for all hyperparameters.

Could you give me some guidance in order to better replicate your results?

Should mask for padding be used on the attention weights?

Hello! In the decoder part of this code, it seems that the padding mask is only applied after each attention module, while inside the attention module only the self-attention or cross-attention masks are used. Should the padding mask also be applied to the attention weights? That way, padding information would be prevented from leaking into the attended features.
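
As background (a generic sketch, not a statement about what this repository should do): a key padding mask is usually applied to the attention logits before the softmax, so that padded positions receive exactly zero attention weight. Assuming scores of shape (batch, heads, query_len, key_len) and a boolean pad mask of shape (batch, key_len):

import torch

def masked_attention_weights(scores, key_padding_mask):
    """scores: (B, H, Lq, Lk) raw attention logits.
    key_padding_mask: (B, Lk) boolean, True where the key position is padding."""
    scores = scores.masked_fill(key_padding_mask[:, None, None, :], float('-inf'))
    return torch.softmax(scores, dim=-1)   # padded keys get zero weight

scores = torch.randn(2, 8, 5, 7)
pad = torch.zeros(2, 7, dtype=torch.bool)
pad[:, 5:] = True                          # the last two key positions are padding
attn = masked_attention_weights(scores, pad)
print(attn[..., 5:].sum())                 # ~0: no attention mass on padded keys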

Beamsearch

Hello, thanks for your work.
(1) Why is the input at time step x only the array of words [5, 1] generated at that step (beam size 5), rather than the whole generated sequence [5, x], from which the log-prob of the last word would be taken?
(2) With your beam search code, decoding only stops after running all time steps; this seems unreasonable, since some of the generated sentences have already finished.

These problems came up when using your beam search code in our work.

Please help me, thanks.

evaluation error

When I run the eval script test.py, it reports this error:

RuntimeError: Expected object of scalar type Byte but got scalar type Bool for sequence element 1 in sequence argument at position #1 'tensors'

Is there any mistake in the evaluation?

Can captions be generated for new images outside the COCO dataset?

Hi,

Can captions be generated for new images outside the COCO dataset?

Say, for example, I want to generate a caption for my profile picture; is that possible with this code?

The documentation doesn't seem to be of much help for passing an image directly to test.py.

Regards,
Vinod

About reproducing the CIDEr value

I ran into some trouble reproducing your work. I followed the project instructions exactly and trained the model several times (on one 2080 Ti GPU). However, the test-set score is always around 129 CIDEr, which is lower than your released model. Could you please point me to possible solutions for reproducing your results?

Generating HDF5 detections from custom dataset or bottom-up-attention TSV

I have a custom dataset,

I have generated the detections TSV using : https://github.com/airsplay/py-bottom-up-attention
But the model requires HDF5.

The TSV has these fields for each example:

{
   'image_id': image_id,
   'image_h': np.size(im, 0),
   'image_w': np.size(im, 1),
   'num_boxes' : len(keep_boxes),
   'boxes': base64.b64encode(cls_boxes[keep_boxes]),
   'features': base64.b64encode(pool5[keep_boxes])
}  

When examining the COCO dataset examples, I see the following, for example:

>>> dts["35368_boxes"]
<HDF5 dataset "35368_boxes": shape (37, 4), type "<f4">
>>> dts["35368_features"]
<HDF5 dataset "35368_features": shape (37, 2048), type "<f4">
>>> dts["35368_cls_prob"]
<HDF5 dataset "35368_cls_prob": shape (37, 1601), type "<f4">
>>> dts["35368_boxes"][36]
array([349.57147, 154.07967, 420.0327 , 408.64462], dtype=float32)

I'll try to figure out how to convert my TSV to the required HDF5 myself from the code, but guidance would be appreciated.

Thank you.
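
Not an official converter, just a sketch of how one might pack such a TSV into the <image_id>_features / <image_id>_boxes layout described in the README. The column order follows the dict above; the float32 dtype, and whether a <image_id>_cls_prob dataset is also needed, are assumptions (and base64 decoding details can differ depending on how the TSV was written under Python 2 vs 3):

import base64
import csv
import sys

import h5py
import numpy as np

# Columns as written by the bottom-up-attention TSV scripts (see the dict above).
FIELDNAMES = ['image_id', 'image_h', 'image_w', 'num_boxes', 'boxes', 'features']

csv.field_size_limit(sys.maxsize)

with open('detections.tsv') as tsv, h5py.File('my_detections.hdf5', 'w') as out:
    for row in csv.DictReader(tsv, delimiter='\t', fieldnames=FIELDNAMES):
        n = int(row['num_boxes'])
        # base64-encoded float32 buffers, reshaped to one row per detected region.
        feats = np.frombuffer(base64.b64decode(row['features']), dtype=np.float32).reshape(n, -1)
        boxes = np.frombuffer(base64.b64decode(row['boxes']), dtype=np.float32).reshape(n, 4)
        out.create_dataset('%s_features' % row['image_id'], data=feats)
        out.create_dataset('%s_boxes' % row['image_id'], data=boxes)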

Cannot reproduce the results during RL training stage

Thanks for your work. We are using the fixed seeds below, and the result is reproducible for each run until the RL part. Specifically, the results from the XE training stage are reproducible, but the results from the RL training stage are not. Is there any way to make the RL-stage results reproducible?

seed = 1234
random.seed(seed)
torch.manual_seed(seed)
np.random.seed(seed)
torch.cuda.manual_seed_all(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

From features of new images to M2 transformer

First of all, congrats for your work and thanks for releasing the code! 😄

Following #2 and #5, I'm trying to run the network on a new set of images. To get the image features I went to the bottom-up attention repo you suggested here, using the Faster-R-CNN-ResNet101 model with these weights.

My problem is the following: how to transform the outputs of this feature extractor into the format you require?

Following the README and code, I understand that the features need to be expressed as an N×2048 tensor. Following this line, I understand that a cls_prob vector is also needed to sort the feature vectors.

Now, I took the blob res5c for the features and cls_prob for the probabilities, but the dimensions are not quite what I expected: res5c has dimension N×2048×14×14, so the 14×14 spatial dimensions presumably need to be reduced to a single value, and cls_prob has shape N×1061, which is not consistent with the rest.

Am I missing something?

Thanks!
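
Not an authoritative answer, but on the dimensionality question: the 2048-d per-region vectors in bottom-up-attention style pipelines are typically obtained by average-pooling the final convolutional feature map over its spatial dimensions. A sketch of that reduction, assuming res5c-like activations of shape (N, 2048, 14, 14):

import torch

res5c = torch.randn(36, 2048, 14, 14)    # N detected regions
pooled = res5c.mean(dim=(2, 3))          # average over the 14x14 spatial grid
print(pooled.shape)                      # torch.Size([36, 2048]) -> the required (N, 2048) format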

Meshed decoder for a 6-layer Transformer

I tried to use the meshed decoder in a 6-layer Transformer model, but it performs very poorly. Did you test the meshed structure with more than 3 layers?

Scripts about "out-of-domain" captioning

Thanks for your work! Could you please release the scripts for the "out-of-domain" captioning / describing novel objects / constrained beam search? If not, I still appreciate your kind explanations of my earlier questions.

Reproduce results with test.py

Q1: In test.py:
data = torch.load('meshed_memory_transformer.pth')

data = torch.load('saved_models/m2_transformer_best.pth')

model.load_state_dict(data['state_dict'])
print("Epoch %d" % data['epoch'])
print(data['best_cider'])

Error: KeyError: 'epoch', KeyError: 'best_cider'

Is the provided 'meshed_memory_transformer.pth' not saved from train.py? When I use a model saved during training in test.py, there is no error. Where does the provided 'meshed_memory_transformer.pth' come from?

Also, for my own dataset, when I load the saved model in test.py, why does the performance drop compared with the evaluation metrics recorded in train.py?

Q2: dict_dataset_val = val_dataset.image_dictionary({'image': image_field, 'text': RawField()})
What does "image_dictionary" do? What is the difference between dict_dataset_val and val_dataset? I printed them out and observed that their captions differ, and that len(dict_dataset_val) is different from len(val_dataset). Why is that?

Thanks for your help!
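
For Q1, a small defensive-loading sketch, assuming nothing about the released checkpoint beyond the 'state_dict' key that test.py already uses: inspect the keys and only print the optional training metadata when it is present.

import torch

data = torch.load('meshed_memory_transformer.pth', map_location='cpu')
print('checkpoint keys:', list(data.keys()))

# model.load_state_dict(data['state_dict'])   # as in test.py

# Checkpoints written by train.py carry extra bookkeeping ('epoch', 'best_cider', ...);
# a weights-only release may not, so guard the optional fields.
if 'epoch' in data:
    print('Epoch %d' % data['epoch'])
if 'best_cider' in data:
    print(data['best_cider'])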

questions regarding the paper

Hello. Congratulations on your brilliant work!
I'd like to ask some questions regarding the paper:
In section 4.3, you mentioned that

we firstly introduce a reduced version of our approach in which the i-th decoder layer is only connected to the corresponding i-th encoder layer (1-to-1), instead of being connected to all encoders.

Is that step included in your code? From what I can see, the query is the same for all of the visual attention in the decoder, since you are doing:

        enc_att1 = self.enc_att(self_att, enc_output[:, 0], enc_output[:, 0], mask_enc_att) * mask_pad
        enc_att2 = self.enc_att(self_att, enc_output[:, 1], enc_output[:, 1], mask_enc_att) * mask_pad
        enc_att3 = self.enc_att(self_att, enc_output[:, 2], enc_output[:, 2], mask_enc_att) * mask_pad

Do you mean that you set the alphas to 1 (simply taking the sum over all encoder layers)? Because if the i-th decoder layer were connected only to the i-th encoder layer, the queries would be different. May I also ask whether you have examined the importance of taking a weighted sum rather than a plain sum of the encoder layers?
