
pytorch-vqa's Introduction

Strong baseline for visual question answering

This is a re-implementation of Vahid Kazemi and Ali Elqursh's paper Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering in PyTorch.

The paper shows that with a relatively simple model, using only common building blocks in Deep Learning, you can get better accuracies than the majority of previously published work on the popular VQA v1 dataset.

This repository is intended to provide a straightforward implementation of the paper for other researchers to build on. The results closely match the reported results, as the majority of details should be exactly the same as the paper. (Thanks to the authors for answering my questions about some details!) This implementation seems to consistently converge to about 0.1% better results – there are two main implementation differences:

  • Instead of setting a limit on the maximum number of words per question and cutting off all words beyond this limit, this code uses per-example dynamic unrolling of the language model (see the sketch right after this list).
  • An issue with the official evaluation code makes some questions unanswerable. This code does not normalize machine-given answers, which avoids this problem. As the vast majority of questions are not affected by this issue, it's very unlikely that this will have any significant impact on accuracy.
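For illustration, here is a minimal sketch of what per-example dynamic unrolling looks like with pack_padded_sequence; the sizes and variable names are illustrative, not the repo's exact code:

import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

embedding = nn.Embedding(num_embeddings=5000, embedding_dim=300, padding_idx=0)
lstm = nn.LSTM(input_size=300, hidden_size=1024, batch_first=True)

q = torch.randint(1, 5000, (8, 23))                          # padded question tokens [batch, max_len]
q_len, _ = torch.randint(1, 24, (8,)).sort(descending=True)  # true per-example question lengths

packed = pack_padded_sequence(embedding(q), q_len, batch_first=True)
_, (h, c) = lstm(packed)   # the LSTM is only unrolled up to each question's true length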

A fully trained model (convergence shown below) is available for download.

Graph of convergence of implementation versus paper results

Note that the model in my other VQA repo performs better than the model implemented here.

Running the model

  • Clone this repository with:
git clone https://github.com/Cyanogenoid/pytorch-vqa --recursive
  • Set the paths to your downloaded questions, answers, and MS COCO images in config.py (a sketch of the relevant entries appears at the end of this section).
    • qa_path should contain the files OpenEnded_mscoco_train2014_questions.json, OpenEnded_mscoco_val2014_questions.json, mscoco_train2014_annotations.json, mscoco_val2014_annotations.json.
    • train_path, val_path, test_path should contain the train, validation, and test .jpg images respectively.
  • Pre-process the images using ResNet152 weights ported from Caffe (93 GiB of free disk space required when storing features at float16 precision) and build the question and answer vocabularies with:
python preprocess-images.py
python preprocess-vocab.py
  • Train the model in model.py with:
python train.py

This will alternate between one epoch of training on the train split and one epoch of validation on the validation split while printing the current training progress to stdout and saving logs in the logs directory. The logs contain the name of the model, training statistics, contents of config.py, model weights, evaluation information (per-question answer and accuracy), and question and answer vocabularies.

  • During training (which takes a while), plot the training progress with:
python view-log.py <path to .pth log>
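For reference, the config.py entries mentioned in the steps above might look roughly like the following. The real file contains more settings; the paths here are placeholders and the preprocessing values are the defaults quoted in an issue further down, so treat this as a sketch rather than the canonical file.

# config.py (sketch; adjust the paths to your own machine)
qa_path = 'vqa'                  # directory with the OpenEnded_*_questions.json and *_annotations.json files
train_path = 'mscoco/train2014'  # directory of train2014 .jpg images
val_path = 'mscoco/val2014'      # directory of val2014 .jpg images
test_path = 'mscoco/test2015'    # directory of test .jpg images

preprocess_batch_size = 64
image_size = 448                 # scale shorter edge of the image to this size, then centre crop
output_size = image_size // 32   # spatial size of the feature maps coming out of the network
output_features = 2048           # number of feature maps
central_fraction = 0.875         # fraction of the image kept when centre cropping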

Python 3 dependencies (tested on Python 3.6.2)

  • torch
  • torchvision
  • h5py
  • tqdm

pytorch-vqa's People

Contributors

cyanogenoid, gipster, guoyang9, mantasbandonis, pplantinga, redcontritio


pytorch-vqa's Issues

maximum q len

The paper says questions are capped at 15 words, but I can't find any mention of 15 in the code.

About attention showing in the pic

Hi, I recently reimplemented your code and I have a problem: I don't know how to show the attention over the picture like this.
[attention overlay image]
I can only show the attention like this.
[raw attention map image]
Thank you so much!
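For anyone hitting the same question: one common way to get that kind of overlay is to upsample the low-resolution attention map (e.g. 14x14) to the image size and draw it over the photo with some transparency. A rough sketch, where the function arguments and shapes are assumptions rather than this repo's API:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def show_attention(image, att):
    """image: HxWx3 numpy array in [0, 1]; att: 2D attention tensor, e.g. 14x14."""
    h, w = image.shape[:2]
    att = att.detach().float().reshape(1, 1, *att.shape)
    att = F.interpolate(att, size=(h, w), mode='bilinear', align_corners=False)
    att = att.squeeze().cpu().numpy()
    att = (att - att.min()) / (att.max() - att.min() + 1e-8)  # normalise for display
    plt.imshow(image)
    plt.imshow(att, cmap='jet', alpha=0.5)                    # semi-transparent heat map on top
    plt.axis('off')
    plt.show()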

When running preprocess-images.py, "size mismatch" occurred

The error log is as follows:

/home/mmc_xhma/software/anconda3/bin/python3.6 /home/mmc_xhma/code/TMM_2017/pytorch-vqa-master/preprocess-images.py
/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/transforms/transforms.py:156: UserWarning: The use of the transforms.Scale transform is deprecated, please use transforms.Resize instead.
found 82783 images in mscoco/train2014
found 40504 images in mscoco/val2014
  0%|          | 0/123287 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/mmc_xhma/code/TMM_2017/pytorch-vqa-master/preprocess-images.py", line 79, in <module>
    main()
  File "/home/mmc_xhma/code/TMM_2017/pytorch-vqa-master/preprocess-images.py", line 70, in main
    out = net(imgs)
  File "/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mmc_xhma/code/TMM_2017/pytorch-vqa-master/preprocess-images.py", line 31, in forward
    self.model(x)
  File "/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/torchvision-0.2.0-py3.6.egg/torchvision/models/resnet.py", line 151, in forward
  File "/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 55, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 835, in linear
    return torch.addmm(bias, input, weight.t())
RuntimeError: size mismatch at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCTensorMathBlas.cu:247

config.py

preprocess_batch_size = 64
image_size = 448  # scale shorter end of image to this size and centre crop
output_size = image_size // 32  # size of the feature maps after processing through a network
output_features = 2048  # number of feature maps thereof
central_fraction = 0.875  # only take this much of the centre when scaling and centre cropping

When the parameters are set to their defaults, the error occurs.

I have checked the input of the last fc layer in ResNet152: the input shape is [64, 131072], while the weight matrix shape is [2048, 1000] and the bias is None.
File "/home/mmc_xhma/software/anconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 835, in linear return torch.addmm(bias, input, weight.t())
Obviously, the sizes mismatch. How can I fix this error?
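Judging from the traceback, the forward pass is going through torchvision's models/resnet.py rather than the Caffe-ported ResNet submodule the README asks for (cloned with --recursive), which may be the root cause: with a 448x448 input, torchvision's 7x7 average pool leaves 8x8x2048 = 131072 values, exactly the mismatch shown, and the 2048-input classifier cannot consume them. Since only the 14x14 feature maps are wanted anyway, one hedged workaround (illustrative, not necessarily the repo's exact mechanism) is to truncate the network before the pooling and classifier layers:

import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet152(pretrained=True)              # or the Caffe-ported weights from the submodule
trunk = nn.Sequential(*list(resnet.children())[:-2])    # drop the average pool and fc classifier
trunk.eval()

with torch.no_grad():
    x = torch.randn(2, 3, 448, 448)
    features = trunk(x)   # [2, 2048, 14, 14] -- the feature maps the rest of the pipeline expects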

Metric computation in training phase

Hi,
I read your code from start to finish and found it very clear and modular.
Based on your implementation, with some minor modifications, I observed an overall accuracy of 60.16 (the original paper reports 59.67) on the VQA v2.0 validation set, as well as 64.99 (the paper reports 64.5) on VQA v1.0 test-dev! It's very impressive and promising, and it means we can easily design our own algorithms on top of this code.
But there is a small issue confusing me (maybe a stupid one): the computation of the two metrics (loss and accuracy) in the training phase doesn't make much sense to me. What does the momentum do? How about just using the batch loss and accuracy?
Thanks.
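For what it's worth, that kind of momentum is usually just an exponential moving average used to smooth the noisy per-batch loss and accuracy for display; a minimal sketch of the idea (not the repo's actual Tracker class):

class RunningMean:
    """Exponentially weighted running average, e.g. for smoothing batch loss/accuracy."""
    def __init__(self, momentum=0.99):
        self.momentum = momentum
        self.value = None

    def update(self, x):
        if self.value is None:
            self.value = x
        else:
            self.value = self.momentum * self.value + (1 - self.momentum) * x
        return self.value

# usage: smoothed = RunningMean(); smoothed.update(batch_loss.item())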

Large memory consumption

I get the warning pytorch-vqa/model.py:96: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().

Even though I call self.lstm.flatten_parameters() before _, (_, c) = self.lstm(packed), the program consumes almost all of my memory (16 GB), which is abnormal. In an earlier issue you state that you can run an epoch in 7 minutes, which I guess is because you have an SSD.

Let's check the code to see what causes the memory leak :)

expanded size of the tensor (2048) must match the existing size (128)

Attempting to run this code without any modifications sometimes results in this error:

will save to logs/2017-09-26_17:30:02.pth
train E000:   0% 0/3396 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 128, in <module>
    main()
  File "train.py", line 109, in main
    _ = run(net, train_loader, optimizer, tracker, train=True, prefix='train', epoch=i)
  File "train.py", line 55, in run
    out = net(v, q, q_len)
  File "/usr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 58, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/lib/python3.6/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/peter/Documents/Repositories/pytorch-vqa/model.py", line 51, in forward
    v = v / (v.norm(p=2, dim=1).expand_as(v) + 1e-8)
  File "/usr/lib/python3.6/site-packages/torch/autograd/variable.py", line 725, in expand_as
    return Expand.apply(self, (tensor.size(),))
  File "/usr/lib/python3.6/site-packages/torch/autograd/_functions/tensor.py", line 111, in forward
    result = i.expand(*new_size)
RuntimeError: The expanded size of the tensor (2048) must match the existing size (128) at non-singleton dimension 1. at /tmp/yaourt-tmp-peter/aur-python-pytorch/src/pytorch-0.2.0/torch/lib/THC/generic/THCTensor.c:323
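This looks like the PyTorch 0.2 behaviour change where reductions such as norm stopped keeping the reduced dimension by default, so the result of v.norm(p=2, dim=1) no longer has a singleton channel dimension and expand_as fails. A hedged sketch of the usual fix, with illustrative shapes:

import torch

v = torch.randn(128, 2048, 14, 14)       # batch of image feature maps (shapes are illustrative)

norm = v.norm(p=2, dim=1, keepdim=True)  # [128, 1, 14, 14]; without keepdim it would be [128, 14, 14]
v = v / (norm.expand_as(v) + 1e-8)       # L2-normalise along the channel dimension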

EOFError: Ran out of input when training

Hi, I got the following issue when I ran train.py. Could you please help me to fix it?

will save to logs\2019-02-21_13:49:31.pth
D:\blondie\pytorch-vqa\model.py:90: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
init.xavier_uniform(w)
D:\blondie\pytorch-vqa\model.py:86: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
init.xavier_uniform(self.embedding.weight)
D:\blondie\pytorch-vqa\model.py:44: UserWarning: nn.init.xavier_uniform is now deprecated in favor of nn.init.xavier_uniform_.
init.xavier_uniform(m.weight)
train E000: 0% 0/3396 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 134, in <module>
    main()
  File "train.py", line 115, in main
    _ = run(net, train_loader, optimizer, tracker, train=True, prefix='train', epoch=i)
  File "train.py", line 45, in run
    for v, q, a, idx, q_len in tq:
  File "C:\Users\Konstantinos\Miniconda3\lib\site-packages\tqdm\_tqdm.py", line 1002, in __iter__
    for obj in iterable:
  File "C:\Users\Konstantinos\Miniconda3\lib\site-packages\torch\utils\data\dataloader.py", line 819, in __iter__
    return _DataLoaderIter(self)
  File "C:\Users\Konstantinos\Miniconda3\lib\site-packages\torch\utils\data\dataloader.py", line 560, in __init__
    w.start()
  File "C:\Users\Konstantinos\Miniconda3\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\Users\Konstantinos\Miniconda3\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Konstantinos\Miniconda3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\Konstantinos\Miniconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Konstantinos\Miniconda3\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Users\Konstantinos\Miniconda3\lib\site-packages\torch\multiprocessing\reductions.py", line 286, in reduce_storage
    metadata = storage._share_filename_()
RuntimeError: Couldn't map view of shared file <torch_3844_2531614511>, error code: <5>

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Konstantinos\Miniconda3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\Konstantinos\Miniconda3\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
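On Windows, DataLoader workers are started with spawn rather than fork, so they re-import the script and rely on shared-memory mapping; errors like this usually disappear if the entry point is guarded with if __name__ == '__main__' and/or the worker count is dropped to zero (e.g. via the worker-count setting in config.py, if one is exposed). A minimal sketch of the idea, with a placeholder dataset rather than the VQA one:

import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    dataset = TensorDataset(torch.arange(10).float())   # placeholder dataset
    # num_workers=0 keeps loading in the main process, sidestepping spawn/shared-memory issues on Windows
    loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=0)
    for (batch,) in loader:
        print(batch)

if __name__ == '__main__':   # required on Windows whenever worker processes are spawned; harmless otherwise
    main()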

concat or sum?

Hello,

Thanks for the implementation. The paper does not detail how the LSTM encoding and the feature maps are fused; it only provides Figure 2, where it says "the concatenated image features and the final state of the LSTM are then used to compute multiple attention distributions over image features". It also draws a Concat box in the diagram that receives the tiled LSTM state and the spatial feature maps as input.

I was experimenting with this idea for another task, where I did the fusion by concatenation across the channel dimension, but looking at your code I see that after tiling the q vector you simply do self.relu(v + q). Did you have some insight about this step, maybe from discussions with the authors?

Thanks!
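For concreteness, a sketch of the two fusion variants being discussed, with illustrative shapes: tiling the question vector over the spatial grid and either summing with the feature maps (the relu(v + q) route mentioned above) or concatenating along the channel dimension (what Figure 2 of the paper suggests):

import torch
import torch.nn.functional as F

v = torch.randn(8, 512, 14, 14)   # image feature maps (channel count is illustrative)
q = torch.randn(8, 512)           # question encoding projected to the same number of channels

q_tiled = q.view(8, 512, 1, 1).expand_as(v)

fused_sum = F.relu(v + q_tiled)               # sum fusion, as in model.py's relu(v + q)
fused_cat = torch.cat([v, q_tiled], dim=1)    # concat fusion along channels: [8, 1024, 14, 14]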

answer normalization

The answer normalization code is different from the official VQA code.

  1. Can we assume this?
    "Normalization is not needed, assuming that the human answers are already normalized."

  2. The official code removes the articles (a, the), but this code doesn't.
    There actually are answers that contain "the".
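For comparison, a minimal sketch of the article-stripping part of the official normalisation; the real evaluation script also handles punctuation, contractions and number words in more detail, so this is only a simplified illustration:

import re

_ARTICLES = {'a', 'an', 'the'}

def normalize_answer(ans):
    ans = ans.lower().strip()
    ans = re.sub(r'[^\w\s]', '', ans)                       # drop punctuation (simplified)
    words = [w for w in ans.split() if w not in _ARTICLES]  # remove the articles a/an/the
    return ' '.join(words)

# normalize_answer('The cat!') -> 'cat'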

Mismatch in Computing Accuracy

According to the code of the official VQA API, an agreeing answer only counts towards accuracy when it is not the same as the discarded one, because in line 98 it compares against the other GT answers that are not the same as the discarded one. But this differs from what you have done: you assume the prediction is matched against all answers except the single discarded one, while in their API it is matched against all answers except the discarded one and any others that are the same as the discarded answer.
So it seems it should be (10 - agreeing) * min(agreeing / 3, 1).

Please correct me if I am wrong.

@Cyanogenoid @pplantinga @guoyang9
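For reference, the accuracy as usually described for VQA averages over the ten leave-one-annotator-out subsets, scoring min(#matches / 3, 1) within each subset; a plain, unvectorised sketch of that definition, which may help when checking whether the batched computation here is equivalent:

def vqa_accuracy(predicted, human_answers):
    """predicted: a single answer string; human_answers: list of the 10 annotator answers."""
    total = 0.0
    for i in range(len(human_answers)):
        others = human_answers[:i] + human_answers[i + 1:]   # leave one annotator out
        matches = sum(a == predicted for a in others)
        total += min(matches / 3.0, 1.0)
    return total / len(human_answers)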

Runtime error with preprocess-images

Hi, I am trying to replicate your work in a Jupyter notebook with PyTorch, but when I try to run preprocess-images.py it keeps giving me this error.
[screenshot of the error, 2019-05-28 16:15]
[screenshot of the error, 2019-05-28 16:32]
I really don't know what's wrong. I am following the instructions in the README exactly, except that I used the ResNet from torch instead of the Caffe one, but I don't think that is the reason.

Training time

Hi,

Thank you so much for providing the code. Can you please give me some indication of the training time? On my Google VM with a Tesla K80 it takes around 35 minutes for one training epoch. Is that what you observed?

Thanks

run without CUDA

Is there a way to convert preprocess-images.py to a version that doesn't require CUDA?
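For the GPU-only parts, the usual pattern is to pick the device at runtime and fall back to the CPU, replacing the .cuda() calls; a tiny sketch of the idea with stand-in modules (feature extraction will of course be much slower on CPU):

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

net = nn.Linear(8, 4).to(device)          # stand-in for the feature-extraction network
imgs = torch.randn(2, 8, device=device)   # stand-in for a batch of inputs
out = net(imgs)                           # runs on whichever device was selected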

Test the model

Can you specify how exactly I can test the model, i.e. given an image and a question, the model is expected to return answers with confidences?
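There is no dedicated test script shown here, but the last step of such an evaluation is just a softmax plus top-k over the model's output scores; a minimal sketch of that step with dummy values (loading the trained weights, the image features and the question encoding is left out, and the answer vocabulary below is made up for illustration):

import torch
import torch.nn.functional as F

def top_answers(logits, index_to_answer, k=5):
    probs = F.softmax(logits, dim=0)   # turn the model's scores into confidences
    conf, idx = probs.topk(k)
    return [(index_to_answer[i.item()], c.item()) for i, c in zip(idx, conf)]

# dummy example with 3000 answer classes
logits = torch.randn(3000)
vocab = {i: 'answer_%d' % i for i in range(3000)}
print(top_answers(logits, vocab))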

ssd create issue

I have this issue:
OSError: Unable to create file (unable to open file: name = '/ssd/resnet-14x14.h5', errno = 2, error message = 'No such file or directory', flags = 15, o_flags = c2)

Issue with train_loader and val_loader in train.py

Hello,

I wanted a way to look at the images in the main training/validation loop of train.py

I wasn't able to do this.

On the one hand, the images from tq.iterable.dataset.coco_ids do not sync with the questions and answers (which do sync with each other).

On the other hand, the dictionary tq.iterable.dataset.coco_id_to_index contains 123287 keys, which means that it covers both the train and validation sets (82783 + 40504).

When I "reverse" this dictionary, the mapping from idx (where idx comes from for v, q, a, idx, q_len in tq:) to coco_id isn't correct (for example, idx from the val_loader maps to coco_ids in the training set).

Could you take a look at this?

Working with abstract scenes VQA v1

In place of the MS COCO images, would the code work as-is with the abstract scene images? We would place the images in the same folders as specified in config.py.
