facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)

Home Page: https://mmf.sh/

License: Other

Languages: Python 98.89%, Shell 0.11%, C 0.20%, JavaScript 0.55%, CSS 0.16%, MDX 0.10%
Topics: pytorch, vqa, pretrained-models, multimodal, deep-learning, captioning, dialog, textvqa, hateful-memes, multi-tasking

mmf's Introduction


MMF is a modular framework for vision and language multimodal research from Facebook AI Research. MMF contains reference implementations of state-of-the-art vision and language models and has powered multiple research projects at Facebook AI Research. See the full list of projects inside or built on MMF here.

MMF is powered by PyTorch, supports distributed training, and is un-opinionated, scalable and fast. Use MMF to bootstrap your next vision and language multimodal research project by following the installation instructions. Take a look at the list of MMF features here.

MMF also acts as a starter codebase for challenges around vision and language datasets (the Hateful Memes, TextVQA, TextCaps and VQA challenges). MMF was formerly known as Pythia. For an overview of how datasets and models work inside MMF, check out MMF's video overview.

Installation

Follow installation instructions in the documentation.

Documentation

Learn more about MMF here.

Citation

If you use MMF in your work or use any models published in MMF, please cite:

@misc{singh2020mmf,
  author =       {Singh, Amanpreet and Goswami, Vedanuj and Natarajan, Vivek and Jiang, Yu and Chen, Xinlei and Shah, Meet and
                 Rohrbach, Marcus and Batra, Dhruv and Parikh, Devi},
  title =        {MMF: A multimodal framework for vision and language research},
  howpublished = {\url{https://github.com/facebookresearch/mmf}},
  year =         {2020}
}

License

MMF is licensed under the BSD license, available in the LICENSE file.

mmf's People

Contributors

amyreese, ananthsub, ankitade, antonk52, apsdehal, dependabot[bot], deviparikh, ebsmothers, endernewton, four4fish, huaizhengzhang, jjenniferdai, jknoxville, lichengunc, meetps, ninginthecloud, pushkalkatara, rizavelioglu, ronghanghu, ryan-qiyu-jiang, shirgur, shubhamagarwal92, simran2905, stanislavglebik, stmugisha, suzyahyah, tsungyu, ultrons, vedanuj, yujiang01


mmf's Issues

some errors about loading pretrained model

Hi, when I loaded the detectron_100_resnet_most_data model, I ran into the following problems:

While copying the parameter named "module.image_embedding_models_list.0.0.image_attention_model.modal_combine.Fa_image.main.0.weight_g", whose dimensions in the model are torch.Size([]) and whose dimensions in the checkpoint are torch.Size([1]). While copying the parameter named "module.image_embedding_models_list.0.0.image_attention_model.modal_combine.Fa_txt.main.0.weight_g", whose dimensions in the model are torch.Size([]) and whose dimensions in the checkpoint are torch.Size([1]).

I think weight_norm might have caused it, but I don't know the reason. Can you give me some advice?
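
For anyone hitting the same mismatch, here is a minimal, hedged sketch (not an official fix) of one way to work around it: squeeze the 1-element weight_g tensors in the checkpoint so they match the scalar shapes the freshly built model expects. The checkpoint filename is a placeholder, and the variable model stands for the already constructed network.

import torch

checkpoint = torch.load("detectron_100_resnet_most_data_model.pth", map_location="cpu")  # hypothetical path
state_dict = checkpoint.get("state_dict", checkpoint)  # checkpoint layout assumed; adjust to the real file

for name, tensor in state_dict.items():
    if name.endswith("weight_g") and tensor.dim() == 1 and tensor.numel() == 1:
        # torch.Size([1]) in the checkpoint vs torch.Size([]) expected by the model
        state_dict[name] = tensor.squeeze()

model.load_state_dict(state_dict)  # `model` is the already built Pythia model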

Training Lorra on VQA2

I'm trying to train the LoRRA model on the VQA2 dataset, but I'm getting the following error:
ValueError: /raid/saransh/pythia/pythia/.vector_cache/wiki.en.bin cannot be opened for loading!

What is the average accuracy?

> avg_accuracy += (1 - accuracy_decay) * (accuracy - avg_accuracy)

I tried to search for "average accuracy" but didn't find anything useful, and I couldn't find it in the paper either.
Can anyone tell me what this line does, and what the "average accuracy" is? Does it go by other names in the literature? I haven't seen anything similar before.
I am not a deep learning expert, so maybe I still need to learn these things 😅.
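
For what it's worth, the quoted line is an exponential moving average (EMA, also called a running or smoothed average) of the reported accuracy rather than a separate metric from the paper. A standalone sketch of the same update rule, with an assumed decay value:

accuracy_decay = 0.99   # assumed value; the real one comes from the config
avg_accuracy = 0.0

for accuracy in [0.30, 0.35, 0.40, 0.42]:
    # identical to: avg_accuracy = accuracy_decay * avg_accuracy + (1 - accuracy_decay) * accuracy
    avg_accuracy += (1 - accuracy_decay) * (accuracy - avg_accuracy)
    print(round(avg_accuracy, 5))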

Getting Object Labels

Is there a way to get the object labels for the bounding boxes produced by the (fine-tuned) Detectron model?

docker demo doesn't run

Ran the instructions at https://github.com/facebookresearch/pythia#docker-demo :

git clone https://github.com/facebookresearch/pythia.git
nvidia-docker build pythia -t pythia:latest
nvidia-docker run -ti --net=host pythia:latest

I then loaded localhost:8888 in my web browser, which showed a file listing containing 'vqa_demo' and 'vqa_standalone_image_demo'. I tried opening both of these and doing 'restart and run all'. Both gave errors, though different ones:

vqa_demo:

FileNotFoundError: [Errno 2] No such file or directory: 'data/imdb/imdb_test2015.npy'

vqa_standalone_image_demo:

FileNotFoundError: [Errno 2] No such file or directory: '/private/home/nvivek/VQA/pythia/vqa_detectron_master/config.yaml'

md5sum checksum target values in wrong order?

Hi, yesterday I downloaded the coco.tar.gz features file (240 GB), and when I computed its md5sum I got ab7947b04f3063c774b87dfbf4d0e981 instead of the target value b22e80997b2580edaf08d7e3a896e324. The funny thing is that ab7947b04f3063c774b87dfbf4d0e981 is the target value listed for the OpenImages features file, so I believe the target values were misplaced by accident. Is that correct? Thanks.
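
In case it helps others verify their download, a small sketch for computing the archive's md5 in chunks (the file is far too large to read at once) and comparing it against both published values:

import hashlib

def md5sum(path, chunk_size=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

digest = md5sum("coco.tar.gz")
print("matches coco target:", digest == "b22e80997b2580edaf08d7e3a896e324")
print("matches OpenImages target:", digest == "ab7947b04f3063c774b87dfbf4d0e981")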

Out of memory issue

Hello there,

I get a CUDA runtime error after iteration 3000 when running the following command:

$ python train.py --config config/keep/detectron_100_resnet_most_data.yaml

I am running PyTorch 0.3.1 on a single V100 (32 GB memory):

>>> torch.__version__
'0.3.1'

Has anyone else encountered this error?

Thanks for the help

Traceback:

i_epoch: 1 i_iter: 2000 val_loss:3.4700 val_acc:0.6148 runtime: 67.63 min
iter: 2100 train_loss: 2.8821  train_score: 0.6225  avg_train_score: 0.6105 val_score: 0.6008 val_loss: 3.4801 time(s): 561.6 s
iter: 2200 train_loss: 2.6174  train_score: 0.6195  avg_train_score: 0.6135 val_score: 0.6803 val_loss: 3.1912 time(s): 218.7 s
iter: 2300 train_loss: 2.7957  train_score: 0.6426  avg_train_score: 0.6190 val_score: 0.6205 val_loss: 3.3420 time(s): 412.5 s
iter: 2400 train_loss: 2.4924  train_score: 0.6453  avg_train_score: 0.6207 val_score: 0.6117 val_loss: 3.4666 time(s): 192.8 s
iter: 2500 train_loss: 2.7591  train_score: 0.6234  avg_train_score: 0.6243 val_score: 0.6293 val_loss: 3.4114 time(s): 190.6 s
iter: 2600 train_loss: 2.9420  train_score: 0.5928  avg_train_score: 0.6237 val_score: 0.6400 val_loss: 3.2718 time(s): 185.9 s
iter: 2700 train_loss: 2.6800  train_score: 0.6441  avg_train_score: 0.6247 val_score: 0.6590 val_loss: 3.0637 time(s): 176.4 s
iter: 2800 train_loss: 2.7028  train_score: 0.6506  avg_train_score: 0.6303 val_score: 0.6828 val_loss: 3.1584 time(s): 189.8 s
iter: 2900 train_loss: 2.6380  train_score: 0.6432  avg_train_score: 0.6326 val_score: 0.6340 val_loss: 3.3097 time(s): 183.0 s
iter: 3000 train_loss: 2.7275  train_score: 0.6227  avg_train_score: 0.6311 val_score: 0.6725 val_loss: 3.1253 time(s): 187.7 s
THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.py", line 230, in <module>
    scheduler=scheduler,best_val_accuracy=best_accuracy)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 159, in one_stage_train
    data_reader_eval)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 87, in save_a_snapshot
    loss_criterion=loss_criterion)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 204, in one_stage_eval_model
    score, loss, n_sample = compute_a_batch(batch, myModel, eval_mode=True, loss_criterion=loss_criterion)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 191, in compute_a_batch
    logit_res = one_stage_run_model(batch, my_model, add_graph, log_dir, eval_mode)
  File "/home/rcadene/pythia/train_model/Engineer.py", line 249, in one_stage_run_model
    image_feat_variables=image_feat_variables)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/pythia/top_down_bottom_up/top_down_bottom_up_model.py", line 103, in forward
    question_embedding_total, image_dim_variable_use)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/pythia/top_down_bottom_up/image_embedding.py", line 39, in forward
    image_feat_variable, question_embedding, image_dims)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/pythia/top_down_bottom_up/image_attention.py", line 140, in forward
    joint_feature = self.modal_combine(image_feat, question_embedding)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/pythia/top_down_bottom_up/multi_modal_combine.py", line 142, in forward
    joint_feature = self.dropout(joint_feature)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/modules/dropout.py", line 46, in forward
    return F.dropout(input, self.p, self.training, self.inplace)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/functional.py", line 526, in dropout
    return _functions.dropout.Dropout.apply(input, p, training, inplace)
  File "/home/rcadene/.conda/envs/pythia/lib/python3.6/site-packages/torch/nn/_functions/dropout.py", line 32, in forward
    output = input.clone()
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58
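
Since the failure is raised inside the validation pass (one_stage_eval_model), one common cause on old PyTorch versions is that the autograd graph is still being built during evaluation. A hedged sketch of the usual workaround, not the repository's code (model and batch are placeholders): mark the inputs volatile on PyTorch 0.3.x, or wrap the pass in torch.no_grad() on 0.4 and later.

import torch
from torch.autograd import Variable

def eval_step(model, batch):
    # `batch` is assumed to be a dict of input tensors keyed by argument name.
    if hasattr(torch, "no_grad"):                 # PyTorch >= 0.4
        with torch.no_grad():
            return model(**batch)
    # PyTorch 0.3.x: volatile inputs prevent graph construction and save memory
    volatile_batch = {k: Variable(v, volatile=True) for k, v in batch.items()}
    return model(**volatile_batch)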

Some questions about the demo

Hi, I am trying to run the demo, but when I load the pretrained model something goes wrong:

RuntimeError: invalid argument 2: sizes do not match at /pytorch/torch/lib/THC/THCTensorCopy.cu:31
During handling of the above exception, another exception occurred:
RuntimeError: While copying the parameter named question_embedding_models.0.embedding.weight, whose dimensions in the model are torch.Size([25541, 300]) and whose dimensions in the checkpoint are torch.Size([17871, 300]).

If you have time, I would really appreciate your help. Thank you very much!
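
A size mismatch like 25541 vs 17871 rows in question_embedding_models.0.embedding.weight usually means the vocabulary file the demo builds the model from differs from the one the checkpoint was trained with (the embedding has one row per vocabulary entry). A quick, hedged sanity check; the file and checkpoint names are examples and the checkpoint layout is assumed:

import torch

with open("data/vocabulary_vqa.txt") as f:      # the vocab file the demo config points to
    vocab_size = sum(1 for _ in f)

ckpt = torch.load("pythia_demo_model.pth", map_location="cpu")   # hypothetical checkpoint name
state_dict = ckpt.get("state_dict", ckpt)
emb_rows = state_dict["question_embedding_models.0.embedding.weight"].shape[0]

print("vocabulary size:", vocab_size, "| checkpoint embedding rows:", emb_rows)
# If these disagree (e.g. 25541 vs 17871), the model was built from a different
# vocabulary_vqa.txt than the one used to train the checkpoint.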

Colab demo fails

The Colab demo fails. I believe the problem is that the CUDA version currently installed on Colab is not the one this version of PyTorch was built against. I noticed that, in general, the Colab Python package versions are ahead of what this demo uses.

If you want a permanently working Colab demo, I suspect you need to nuke and pave the standard Colab runtime: remove everything installed by pip, the CUDA runtime, and whatever else you can think of, and install from the repos. For example, for the Python packages you would need:
!pip freeze > /tmp/all_packages.txt
!pip uninstall -r /tmp/all_packages.txt

Also, the demo demands a GPU and will not work in CPU-only mode. I have not tried the TPU runtime, but I suspect it will not work either.

Stack trace:

/content/pythia/pythia/.vector_cache/glove.6B.zip: 862MB [01:03, 13.5MB/s]
100%|█████████▉| 399163/400000 [00:50<00:00, 7829.19it/s]

RuntimeError Traceback (most recent call last)
in <module>()
----> 1 demo = PythiaDemo()

8 frames
in __init__(self)
40 def __init__(self):
41 self._init_processors()
---> 42 self.pythia_model = self._build_pythia_model()
43 self.detection_model = self._build_detection_model()
44 self.resnet_model = self._build_resnet_model()

in _build_pythia_model(self)
82
83 model.load_state_dict(state_dict)
---> 84 model.to("cuda")
85 model.eval()
86

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
379 return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
380
--> 381 return self._apply(convert)
382
383 def register_backward_hook(self, hook):

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _apply(self, fn)
185 def _apply(self, fn):
186 for module in self.children():
--> 187 module._apply(fn)
188
189 for param in self._parameters.values():

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in _apply(self, fn)
115 def _apply(self, fn):
116 ret = super(RNNBase, self)._apply(fn)
--> 117 self.flatten_parameters()
118 return ret
119

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py in flatten_parameters(self)
111 all_weights, (4 if self.bias else 2),
112 self.input_size, rnn.get_cudnn_mode(self.mode), self.hidden_size, self.num_layers,
--> 113 self.batch_first, bool(self.bidirectional))
114
115 def _apply(self, fn):

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
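
A small diagnostic sketch that can be run in a Colab cell before the demo, to compare the CUDA version this PyTorch build expects with what the VM actually provides; a mismatch there is consistent with the CUDNN_STATUS_EXECUTION_FAILED above:

import subprocess
import torch

print("torch version:", torch.__version__)
print("built against CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
# The first lines of nvidia-smi output include the driver and CUDA runtime versions
for line in subprocess.check_output(["nvidia-smi"]).decode().splitlines()[:3]:
    print(line)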

The training time increases every iteration, especially after 1000 iterations

Hi guys,
I don't know why the training time increases with every iteration, especially after 1000 iterations. Is this a bug?

BEGIN TRAINING...
iter: 100 train_loss: 3.6883  train_score: 0.2910  avg_train_score: 0.1358 val_score: 0.2867 val_loss: 3.3840 time(s): 307.7 s
iter: 200 train_loss: 3.2849  train_score: 0.3172  avg_train_score: 0.2345 val_score: 0.2809 val_loss: 3.0615 time(s): 268.5 s
iter: 300 train_loss: 2.7410  train_score: 0.3990  avg_train_score: 0.2964 val_score: 0.3617 val_loss: 2.9113 time(s): 282.1 s
iter: 400 train_loss: 2.4218  train_score: 0.4266  avg_train_score: 0.3522 val_score: 0.3578 val_loss: 2.6786 time(s): 299.6 s
iter: 500 train_loss: 2.1596  train_score: 0.4867  avg_train_score: 0.3994 val_score: 0.3617 val_loss: 2.5952 time(s): 285.3 s
iter: 600 train_loss: 2.1245  train_score: 0.4711  avg_train_score: 0.4355 val_score: 0.3736 val_loss: 2.5149 time(s): 329.9 s
iter: 700 train_loss: 2.0011  train_score: 0.4941  avg_train_score: 0.4629 val_score: 0.3406 val_loss: 2.7058 time(s): 356.2 s
iter: 800 train_loss: 1.9192  train_score: 0.5035  avg_train_score: 0.4865 val_score: 0.3521 val_loss: 2.5427 time(s): 324.3 s
iter: 900 train_loss: 1.8463  train_score: 0.5258  avg_train_score: 0.5033 val_score: 0.3668 val_loss: 2.4809 time(s): 335.9 s
iter: 1000 train_loss: 1.7650  train_score: 0.5379  avg_train_score: 0.5152 val_score: 0.3984 val_loss: 2.5709 time(s): 351.2 s
i_epoch: 1 i_iter: 1000 val_loss:2.5339 val_acc:0.3811 runtime: 58.30 min
iter: 1100 train_loss: 1.6619  train_score: 0.5543  avg_train_score: 0.5284 val_score: 0.3645 val_loss: 2.7063 time(s): 761.0 s
iter: 1200 train_loss: 1.6474  train_score: 0.5939  avg_train_score: 0.5371 val_score: 0.3746 val_loss: 2.4699 time(s): 412.1 s
iter: 1300 train_loss: 1.7397  train_score: 0.5527  avg_train_score: 0.5425 val_score: 0.3492 val_loss: 2.4378 time(s): 464.8 s
iter: 1400 train_loss: 1.6970  train_score: 0.5656  avg_train_score: 0.5492 val_score: 0.3955 val_loss: 2.4459 time(s): 506.1 s
iter: 1500 train_loss: 1.6235  train_score: 0.5543  avg_train_score: 0.5565 val_score: 0.3902 val_loss: 2.3059 time(s): 518.4 s
iter: 1600 train_loss: 1.6067  train_score: 0.5924  avg_train_score: 0.5603 val_score: 0.3059 val_loss: 2.6804 time(s): 515.8 s
iter: 1700 train_loss: 1.5721  train_score: 0.5746  avg_train_score: 0.5633 val_score: 0.3906 val_loss: 2.5084 time(s): 555.8 s
iter: 1800 train_loss: 1.4407  train_score: 0.5904  avg_train_score: 0.5684 val_score: 0.4020 val_loss: 2.3391 time(s): 577.6 s
iter: 1900 train_loss: 1.8080  train_score: 0.5508  avg_train_score: 0.5692 val_score: 0.4139 val_loss: 2.3122 time(s): 630.0 s
iter: 2000 train_loss: 1.5992  train_score: 0.5533  avg_train_score: 0.5716 val_score: 0.3705 val_loss: 2.6689 time(s): 1204.8 s
i_epoch: 1 i_iter: 2000 val_loss:2.4600 val_acc:0.3718 runtime: 113.33 min
iter: 2100 train_loss: 1.5500  train_score: 0.5908  avg_train_score: 0.5771 val_score: 0.3785 val_loss: 2.5471 time(s): 2415.6 s
iter: 2200 train_loss: 1.6981  train_score: 0.5525  avg_train_score: 0.5822 val_score: 0.3852 val_loss: 2.5870 time(s): 1327.5 s
iter: 2300 train_loss: 1.4888  train_score: 0.5959  avg_train_score: 0.5826 val_score: 0.4281 val_loss: 2.3235 time(s): 1194.0 s
iter: 2400 train_loss: 1.5351  train_score: 0.6010  avg_train_score: 0.5838 val_score: 0.4047 val_loss: 2.3183 time(s): 1595.9 s
iter: 2500 train_loss: 1.5369  train_score: 0.5912  avg_train_score: 0.5889 val_score: 0.3975 val_loss: 2.3280 time(s): 2286.7 s
iter: 2600 train_loss: 1.5912  train_score: 0.5668  avg_train_score: 0.5916 val_score: 0.4189 val_loss: 2.2325 time(s): 3049.3 s
iter: 2700 train_loss: 1.5094  train_score: 0.5900  avg_train_score: 0.5932 val_score: 0.3729 val_loss: 2.3636 time(s): 2395.5 s
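
To narrow down where the extra time goes, here is a hedged diagnostic sketch (a hypothetical loop, not the code in Engineer.py) that times data loading and the optimization step separately; note that the time(s) column in the log covers everything that happens between two reports, including any evaluation:

import time

def timed_training(data_loader, step_fn, num_iters, report_every=100):
    data_time, step_time = 0.0, 0.0
    start = time.time()
    for i, batch in enumerate(data_loader, start=1):
        data_time += time.time() - start
        t0 = time.time()
        step_fn(batch)                      # forward + backward + optimizer step
        step_time += time.time() - t0
        if i % report_every == 0:
            print("iter %d: data %.1f s, step %.1f s" % (i, data_time, step_time))
            data_time, step_time = 0.0, 0.0
        if i >= num_iters:
            break
        start = time.time()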

Are the different kinds of attention in "image_attention.py" redundant?

Hi 😅
In image_attention.py there are three classes:

concatenate_attention
project_attention
double_project_attention

But they are not used anywhere; no other file or function ever calls them. (I think we don't need them because we use the MFH model.)
Only the top_down_attention class is used, in the build_image_attention_module function.
My questions are: are they redundant? And if we wanted to use the plain concatenate_attention or project_attention, should I modify the build_image_attention_module function to

return concatenate_attention(image_feat_dim, txt_rnn_embeding_dim, hidden_size)

?
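
For reference, a purely hypothetical sketch of the kind of dispatch being asked about, mirroring the pattern of build_question_encoding_module quoted further down in this page; the real build_image_attention_module signature and constructor arguments may well differ:

def build_image_attention_module(method, image_feat_dim, txt_rnn_embeding_dim, hidden_size):
    # Hypothetical dispatcher; argument names follow the snippet in the question above.
    if method == "top_down_attention":
        return top_down_attention(image_feat_dim, txt_rnn_embeding_dim, hidden_size)
    elif method == "concatenate_attention":
        return concatenate_attention(image_feat_dim, txt_rnn_embeding_dim, hidden_size)
    elif method == "project_attention":
        return project_attention(image_feat_dim, txt_rnn_embeding_dim, hidden_size)
    else:
        raise NotImplementedError("unknown image attention method %s" % method)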

fc7_w.pkl not found

I am trying to run vqa_standalone_image_demo. I have followed the steps to preprocess the data, but there are some files that are apparently necessary and there is no mention of how to get them.
fc7_w.pkl and fc7_b.pkl are the ones I am stuck on.

data_preprocess

In the step Preprocess dataset, the first command is cd ../../VQA_suite.
But there is no such directory, so I created it and ran the following commands.
First, I ran python data_prep/vqa_v2.0/extract_vocabulary.py --input_files ../orig_data/vqa_v2.0/v2_OpenEnded_mscoco_train2014_questions.json ../orig_data/vqa_v2.0/v2_OpenEnded_mscoco_val2014_questions.json ../orig_data/vqa_v2.0/v2_OpenEnded_mscoco_test2015_questions.json --out_dir data/
directly, and the result was "python: can't open file 'data_prep/vqa_v2.0/extract_vocabulary.py': [Errno 2] No such file or directory".

So I changed the command and ran python ../data_prep/vqa_v2.0/extract_vocabulary.py --input_files ../orig_data/vqa_v2.0/v2_OpenEnded_mscoco_train2014_questions.json ../orig_data/vqa_v2.0/v2_OpenEnded_mscoco_val2014_questions.json ../orig_data/vqa_v2.0/v2_OpenEnded_mscoco_test2015_questions.json --out_dir data/
and the result was "Traceback (most recent call last):
File "../data_prep/vqa_v2.0/extract_vocabulary.py", line 13, in <module>
from dataset_utils.text_processing import tokenize
ImportError: No module named dataset_utils.text_processing"

I don't know where my problem is and need your help. Thanks in advance!
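
A hedged workaround sketch for the ImportError: it suggests the repository root (the directory containing dataset_utils/) is not on the Python path when the script is run from another directory. Adding it explicitly at the top of extract_vocabulary.py, or exporting it via PYTHONPATH before running, should let the import succeed; the relative path below assumes the script stays at data_prep/vqa_v2.0/.

import os
import sys

# The repository root is two levels up from data_prep/vqa_v2.0/extract_vocabulary.py
REPO_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), "..", ".."))
sys.path.insert(0, REPO_ROOT)

from dataset_utils.text_processing import tokenize  # the import that previously failed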

Inference on K40m GPU

I tried running inference with the pre-trained Pythia model on a K40m. It didn't start for quite some time, and then the ETA oscillated around 10-15 hours.
So I enabled multi-GPU training using the dataparallel flag. But now it doesn't start at all; I waited around 30 minutes before stopping it. On stopping, I got the following error:
"
File "/mnt/data_g/saransh/anaconda3/lib/python3.7/threading.py", line 1048, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
".
I've tried re-running it, but the problem persists. Earlier I tried it on a V100 and it worked fine there.
Could you suggest something?

Question on TextVQA - Pythia vs LoRRA

Hey

I have a few questions about the ablation studies conducted for the LoRRA model.

In Table 2 of the TextVQA paper, what is the difference between Pythia + O + C and Pythia + LoRRA? Is it that the second one also gets to choose its answers from a fixed lexicon (either SA or LA)? Is that the only difference between the two?

While reading section 3 of the paper I get the impression that LoRRA has a VQA part (image + question), a reading part (OCR tokens + question) and an answering module. This suggests that by LoRRA you mean the complete system. But in the experiments, Pythia + LoRRA is used to denote the best-performing model. This nomenclature is a bit confusing. Does it mean that to an existing Pythia-style model you add LoRRA, which on its own has VQA + reading + answering modules?

Broken urls

I tried on two of my computers, but the links after wget in "README/Quick Start" seem to be broken. Please check.

Also mkdir data seems unnecessary in "README/Quick Start".

Also in README, https://www.continuum.io/downloads seems to be an obsolete link to Anaconda.

size of rcnn_10_100.tar.gz

I downloaded rcnn_10_100.tar.gz twice and found that its size is about 33.8 GB.
But it should be 71.0 GB according to the AWS S3 dataset summary.

Using detectron features

Hi,

I am trying to run the model, but I am unable to download detectron or detectron_fix_100 (gunzip detectron_fix_100.tar.gz outputs gzip: detectron_fix_100.tar.gz: Input/output error). Is there a different link for the detectron features?

Thanks!

Why do we need to split config['data']['image_feat_train'][0]?

In train_model/helper.py

> num_image_feat = len(config['data']['image_feat_train'][0].split(','))

although in the config there is
__C.data.image_feat_train = ["rcnn_10_100/vqa/train2014", "rcnn_10_100/vqa/val2014"]
so config['data']['image_feat_train'][0] is equal to "rcnn_10_100/vqa/train2014".
Why do we need to split that string using .split(',')?

Also, what does this if-condition mean?

> if hasattr(my_model, 'module'):
>    model = my_model.module

This if-condition is false in my case (I mean hasattr(my_model, 'module') is false), but I don't know what those two lines mean.
Thank you so much for answering my last two questions, by the way 😊 your code and your paper are great guides for me ☺️
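
On the second question, a minimal sketch of what the hasattr(my_model, 'module') check is about: wrapping a model in nn.DataParallel stores the original network under the .module attribute, so code that needs the raw model (for example, to save a state_dict without the "module." prefix) unwraps it first.

import torch.nn as nn

net = nn.Linear(4, 2)
wrapped = nn.DataParallel(net)

print(hasattr(net, "module"))      # False: a plain model has no .module attribute
print(hasattr(wrapped, "module"))  # True: DataParallel keeps the original under .module

model = wrapped.module if hasattr(wrapped, "module") else wrapped
print(model is net)                # True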

What happens when you register a duplicate class name?

Let's say you register the following classes:

@registry.register_model("lorra")
class LoRRA(Pythia):
     ...


@registry.register_model("lorra")
class LoRRA_mod(Pythia):
     ...

I assume the class LoRRA_mod will override the class LoRRA. Is my assumption correct?
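
A minimal registry sketch (not MMF's actual implementation, which might warn or raise on duplicates): if the mapping is a plain dict, the later @register_model("lorra") simply overwrites the earlier entry, so only the last registered class is returned for that name.

class Registry:
    model_registry = {}

    @classmethod
    def register_model(cls, name):
        def wrap(model_cls):
            cls.model_registry[name] = model_cls   # silently replaces an existing entry
            return model_cls
        return wrap


@Registry.register_model("lorra")
class LoRRA:
    pass


@Registry.register_model("lorra")
class LoRRA_mod:
    pass


print(Registry.model_registry["lorra"].__name__)   # LoRRA_mod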

Pretrained model performance?

Hi. Thanks for the resourceful repository.

I was wondering if you could share the pre-trained model performances for the validation and the test sets for the various datasets.

Did you experience an increase in time while training?

In the code, the time is printed in the save_a_report and save_a_snapshot functions. I found that the time increases while training (it starts small and keeps increasing with more iterations), and sometimes it increases dramatically.
Example in save_a_snapshot:
from [iteration 6000]

i_epoch: 2 i_iter: 6000 val_loss:2.4745 val_acc:0.3974 runtime: 33.43 min

to [iteration 7000]

i_epoch: 3 i_iter: 7000 val_loss:2.4929 val_acc:0.3963 runtime: 260.19 min

Example in save_a_report:
from [iteration 9200]

iter: 9200 train_loss: 1.1883  train_score: 0.6186  avg_train_score: 0.6580 val_score: 0.3975 val_loss: 2.6476 time(s): 195.1 s

to [iterations 9300, 9400]

iter: 9300 train_loss: 1.1513  train_score: 0.6588  avg_train_score: 0.6578 val_score: 0.3871 val_loss: 2.5129 time(s): 1371.4 s
iter: 9400 train_loss: 1.1269  train_score: 0.6826  avg_train_score: 0.6573 val_score: 0.4008 val_loss: 2.6888 time(s): 205.5 s

So in general it increases continuously (by small steps) while training, and sometimes it increases dramatically (by big steps).
Another example:
from [the first thousand iterations]

BEGIN TRAINING...
iter: 100 train_loss: 3.6458  train_score: 0.3031  avg_train_score: 0.1377 val_score: 0.3258 val_loss: 3.4013 time(s): 199.5 s
iter: 200 train_loss: 3.1655  train_score: 0.3158  avg_train_score: 0.2387 val_score: 0.3410 val_loss: 2.9842 time(s): 228.1 s
iter: 300 train_loss: 2.6502  train_score: 0.3777  avg_train_score: 0.3034 val_score: 0.3326 val_loss: 2.8516 time(s): 192.2 s
iter: 400 train_loss: 2.3548  train_score: 0.4258  avg_train_score: 0.3544 val_score: 0.3467 val_loss: 2.5927 time(s): 193.4 s
iter: 500 train_loss: 2.1484  train_score: 0.4705  avg_train_score: 0.4003 val_score: 0.3934 val_loss: 2.5520 time(s): 215.1 s
iter: 600 train_loss: 2.1211  train_score: 0.4840  avg_train_score: 0.4367 val_score: 0.3977 val_loss: 2.4975 time(s): 183.0 s
iter: 700 train_loss: 2.0060  train_score: 0.4648  avg_train_score: 0.4661 val_score: 0.3475 val_loss: 2.6645 time(s): 182.8 s
iter: 800 train_loss: 1.8998  train_score: 0.5230  avg_train_score: 0.4891 val_score: 0.3543 val_loss: 2.5015 time(s): 187.2 s
iter: 900 train_loss: 1.8344  train_score: 0.5258  avg_train_score: 0.5037 val_score: 0.3783 val_loss: 2.4491 time(s): 185.4 s
iter: 1000 train_loss: 1.7774  train_score: 0.5184  avg_train_score: 0.5165 val_score: 0.3938 val_loss: 2.5243 time(s): 183.8 s
i_epoch: 1 i_iter: 1000 val_loss:2.4742 val_acc:0.3838 runtime: 34.87 min

to [the thirteenth thousand iterations]

i_epoch: 5 i_iter: 13000 val_loss:2.7267 val_acc:0.3917 runtime: 54.67 min
iter: 13100 train_loss: 1.0795  train_score: 0.6867  avg_train_score: 0.6843 val_score: 0.3723 val_loss: 2.8208 time(s): 1550.9 s
iter: 13200 train_loss: 1.1232  train_score: 0.6627  avg_train_score: 0.6836 val_score: 0.4021 val_loss: 2.8624 time(s): 196.6 s
iter: 13300 train_loss: 1.0556  train_score: 0.6756  avg_train_score: 0.6826 val_score: 0.4186 val_loss: 2.5904 time(s): 210.9 s
iter: 13400 train_loss: 1.0774  train_score: 0.6979  avg_train_score: 0.6825 val_score: 0.4125 val_loss: 2.5742 time(s): 207.8 s
iter: 13500 train_loss: 1.0958  train_score: 0.6840  avg_train_score: 0.6843 val_score: 0.4084 val_loss: 2.5981 time(s): 201.2 s
iter: 13600 train_loss: 1.0693  train_score: 0.6816  avg_train_score: 0.6870 val_score: 0.4365 val_loss: 2.5409 time(s): 202.8 s
iter: 13700 train_loss: 1.1302  train_score: 0.6598  avg_train_score: 0.6871 val_score: 0.3939 val_loss: 2.7158 time(s): 197.0 s
iter: 13800 train_loss: 1.0662  train_score: 0.6736  avg_train_score: 0.6859 val_score: 0.3746 val_loss: 2.7563 time(s): 197.9 s
iter: 13900 train_loss: 1.0325  train_score: 0.6984  avg_train_score: 0.6857 val_score: 0.3762 val_loss: 2.9416 time(s): 214.2 s
iter: 14000 train_loss: 0.9614  train_score: 0.7232  avg_train_score: 0.6857 val_score: 0.3832 val_loss: 2.6673 time(s): 270.2 s
i_epoch: 5 i_iter: 14000 val_loss:2.6989 val_acc:0.3935 runtime: 60.91 min

I tried PyTorch 0.4 and PyTorch 1.0.
PS: I am training with the datasets [imdb_train2014.npy, imdb_val2train2014.npy, imdb_genome.npy, imdb_vdtrain.npy], but I don't think this makes any difference.

LR hyperparameters tuning method

Hi,

Thanks again for your code. Unfortunately, I ran into a little issue. I can't reproduce some of your results because I am obliged to reduce my batch size (from 512 (yours) to 75). Thus I need to change the hyperparameters related to the learning rate.

Finding the right learning rate can easily be done with a small grid search. However, I would like to know how you tuned the hyperparameters related to the scheduler.
Especially:

  • __C.training_parameters.wu_factor = 0.2
  • __C.training_parameters.wu_iters = 1000
  • __C.training_parameters.lr_steps = [5000, 7000, 9000, 11000]
  • __C.training_parameters.lr_ratio = 0.1

Sharing your method would be awesome :)

Thanks for your help!
Remi
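
One common heuristic (the linear scaling rule; not necessarily the procedure the authors used) is to scale the learning rate with the batch size and stretch the warmup and decay schedule so it covers the same number of examples. A sketch with an assumed base learning rate:

base_batch_size = 512
base_lr = 0.01                      # assumed; take the real base LR from the config

new_batch_size = 75
scale = new_batch_size / base_batch_size

new_lr = base_lr * scale
new_wu_iters = int(1000 / scale)                           # warmup sees the same number of examples
new_lr_steps = [int(s / scale) for s in (5000, 7000, 9000, 11000)]

print(new_lr, new_wu_iters, new_lr_steps)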

404 File not found when downloading VQA2.0

Hi, guys!
I was wondering why the Annotations and Questions for VQA 2.0 are inaccessible to me. I get
ERROR 404: Not Found.
when running the file download_vqa_2.0.sh.
Any advice? Thanks!

Adding mirror data makes training worse

data:
  batch_size: 512
  data_root_dir: data
  dataset: vqa_2.0
  image_depth_first: false
  image_fast_reader: false
  image_feat_test:
  - detectron_fix_100/fc6/vqa/val2014/
  image_feat_train:
  - detectron_fix_100/fc6/vqa/train2014/
  - detectron_fix_100/fc6/vqa_mirror/train2014
  image_feat_val:
  - detectron_fix_100/fc6/vqa/val2014/
  image_max_loc: 100
  imdb_file_test:
  - imdb/imdb_minival2014.npy
  imdb_file_train:
  - imdb/imdb_train2014.npy
  - imdb/imdb_mirror_train2014.npy
  imdb_file_val:
  - imdb/imdb_val2014.npy
  num_workers: 20
  question_max_len: 14
  vocab_answer_file: answers_vqa.txt
  vocab_question_file: vocabulary_vqa.txt
exp_name: intrainter_baseline

Hi, when I add the mirror data to training, the validation performance drops a lot.
Do you have any suggestions?

Broken link in README

The link for BAN in the README is broken

Model Zoo: Reference implementations for state-of-the-art vision and language models including LoRRA (SoTA on VQA and TextVQA), Pythia model (VQA 2018 challenge winner) and BAN.

Is a GRU used in question embedding?

In the paper, you mention that "we used 300D GloVe [11] vectors to initialize the word embeddings and then passed it to a GRU network and a question attention module to extract attentive text features".

However, in question_embeding.py there are two methods for question embedding. As I see in the config, you're using att_que_embed, which does not pass through a GRU layer.

def build_question_encoding_module(method, par, num_vocab):
    if method == "default_que_embed":
        return QuestionEmbeding(num_vocab, **par)
    elif method == "att_que_embed":
        return AttQuestionEmbedding(num_vocab, **par)
    else:
        raise NotImplementedError(
            "unknown question encoding model %s" % method)

class QuestionEmbeding(nn.Module):
    def __init__(self, **kwargs):
        super(QuestionEmbeding, self).__init__()
        self.text_out_dim = kwargs['LSTM_hidden_size']
        self.num_vocab = kwargs['num_vocab']
        self.embedding_dim = kwargs['embedding_dim']
        self.embedding = nn.Embedding(
            kwargs['num_vocab'], kwargs['embedding_dim'])
        self.gru = nn.GRU(
            input_size=kwargs['embedding_dim'],
            hidden_size=kwargs['LSTM_hidden_size'],
            num_layers=kwargs['lstm_layer'],
            dropout=kwargs['lstm_dropout'],
            batch_first=True)
        self.batch_first = True

        if 'embedding_init' in kwargs and kwargs['embedding_init'] is not None:
            self.embedding.weight.data.copy_(
                torch.from_numpy(kwargs['embedding_init']))

    def forward(self, input_text):
        embeded_txt = self.embedding(input_text)
        out, hidden_state = self.gru(embeded_txt)
        res = out[:, -1]
        return res


class AttQuestionEmbedding(nn.Module):
    def __init__(self, num_vocab, **kwargs):
        super(AttQuestionEmbedding, self).__init__()
        self.embedding = nn.Embedding(num_vocab, kwargs['embedding_dim'])
        self.LSTM = nn.LSTM(input_size=kwargs['embedding_dim'],
                            hidden_size=kwargs['LSTM_hidden_size'],
                            num_layers=kwargs['LSTM_layer'],
                            batch_first=True)
        self.Dropout = nn.Dropout(p=kwargs['dropout'])
        self.conv1 = nn.Conv1d(
            in_channels=kwargs['LSTM_hidden_size'],
            out_channels=kwargs['conv1_out'],
            kernel_size=kwargs['kernel_size'],
            padding=kwargs['padding'])
        self.conv2 = nn.Conv1d(
            in_channels=kwargs['conv1_out'],
            out_channels=kwargs['conv2_out'],
            kernel_size=kwargs['kernel_size'],
            padding=kwargs['padding'])
        self.text_out_dim = kwargs['LSTM_hidden_size'] * kwargs['conv2_out']

        if 'embedding_init_file' in kwargs \
                and kwargs['embedding_init_file'] is not None:
            if os.path.isabs(kwargs['embedding_init_file']):
                embedding_file = kwargs['embedding_init_file']
            else:
                embedding_file = os.path.join(
                    cfg.data.data_root_dir, kwargs['embedding_init_file'])
            embedding_init = np.load(embedding_file)
            self.embedding.weight.data.copy_(torch.from_numpy(embedding_init))

    def forward(self, input_text):
        batch_size, _ = input_text.data.shape
        embed_txt = self.embedding(input_text)          # N * T * embedding_dim

        # self.LSTM.flatten_parameters()
        lstm_out, _ = self.LSTM(embed_txt)  # N * T * LSTM_hidden_size
        lstm_drop = self.Dropout(lstm_out)  # N * T * LSTM_hidden_size
        lstm_reshape = lstm_drop.permute(0, 2, 1)  # N * LSTM_hidden_size * T

        qatt_conv1 = self.conv1(lstm_reshape)  # N x conv1_out x T
        qatt_relu = F.relu(qatt_conv1)
        qatt_conv2 = self.conv2(qatt_relu)  # N x conv2_out x T

        qtt_softmax = F.softmax(qatt_conv2, dim=2)
        # N * conv2_out * LSTM_hidden_size
        qtt_feature = torch.bmm(qtt_softmax, lstm_drop)
        # N * (conv2_out * LSTM_hidden_size)
        qtt_feature_concat = qtt_feature.view(batch_size, -1)

        return qtt_feature_concat

Performance of pre-trained model

Using the downloaded pretrained Pythia model, I'm only getting 66.7% overall accuracy on test-dev, which is much lower than the reported single-model accuracy. Am I doing something wrong?
I tried downloading the train+dev model from https://dl.fbaipublicfiles.com/pythia/pretrained_models/textvqa/pythia_train_val.pth
but the link seems broken.

Also, how much GPU memory is needed to train the model without any changes? I have a V100 but I'm getting out-of-memory errors.
Thanks

Error during setup

Hi, I keep hitting this error and cannot figure out what is happening. Can you give me some suggestions on how to solve it?

Below is the screen output.

running develop
Checking .pth file support in /usr/local/lib/python3.5/dist-packages/
/usr/bin/python3 -E -c pass
TEST PASSED: /usr/local/lib/python3.5/dist-packages/ appears to support .pth files
running egg_info
writing pythia.egg-info/PKG-INFO
writing dependency_links to pythia.egg-info/dependency_links.txt
writing requirements to pythia.egg-info/requires.txt
writing top-level names to pythia.egg-info/top_level.txt
reading manifest file 'pythia.egg-info/SOURCES.txt'
writing manifest file 'pythia.egg-info/SOURCES.txt'
running build_ext
Creating /usr/local/lib/python3.5/dist-packages/pythia.egg-link (link to .)
pythia 0.3 is already the active version in easy-install.pth

Installed /home/victor/VQA/Pythia
Processing dependencies for pythia==0.3
Searching for fastText
Best match: fastText [unknown version]
Downloading https://github.com/facebookresearch/fastText/tarball/master#egg=fastText

Processing master
Writing /tmp/easy_install-37swd4hf/facebookresearch-fastText-6dd2e11/setup.cfg
Running facebookresearch-fastText-6dd2e11/setup.py -q bdist_egg --dist-dir /tmp/easy_install-37swd4hf/facebookresearch-fastText-6dd2e11/egg-dist-tmp-htcy2yd7
warning: no files found matching 'PATENTS'
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
error: Setup script exited with error: SandboxViolation: mkdir('/home/victor/.local/lib', 448) {}

The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.

This package cannot be safely installed by EasyInstall, and may not
support alternate installation locations even if you run its setup
script by hand.  Please inform the package's author and the EasyInstall
maintainers to find out if a fix or workaround is available.

What does "imdb" refer to?

I know it's a silly question 😅 Also, what are layout_max_len, vocab_layout_file and has_gt_layout? I mean, what is a "layout"? 😅
