vqa-mfb's Introduction

MFB and MFH for VQA

This project is deprecated! The PyTorch implementation of MFB(MFH)+CoAtt with pre-trained models, along with several other state-of-the-art VQA models, is maintained in our OpenVQA project, which is much more convenient to use!

This project is the implementation of the papers Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering (MFB) and Beyond Bilinear: Generalized Multi-modal Factorized High-order Pooling for Visual Question Answering (MFH). Compared with existing state-of-the-art approaches such as MCB and MLB, our MFB models achieve superior performance on the large-scale VQA-1.0 and VQA-2.0 datasets. Moreover, MFH, the high-order extension of MFB, is also provided and delivers even better VQA performance. The MFB(MFH)+CoAtt network architecture for VQA is illustrated in Figure 1.

Figure 1: The MFB+CoAtt Network architecture for VQA.
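
To make the core operation concrete, here is a minimal NumPy sketch of MFB pooling as described in the paper: project both modalities into a shared k*o-dimensional space, fuse them with an element-wise product, sum-pool over windows of size k, then apply power and L2 normalization. The function and variable names are illustrative only; they are not the layer names used in the Caffe code of this repository.

import numpy as np

def mfb_pool(x, y, U, V, k):
    """Sketch of MFB pooling (illustrative names, not the repo's Caffe layers).

    x: image feature vector (m,); y: question feature vector (n,)
    U: (m, k*o) and V: (n, k*o) learned projections; k: sum-pooling window
    """
    fused = (U.T @ x) * (V.T @ y)             # element-wise product in k*o space
    z = fused.reshape(-1, k).sum(axis=1)      # sum pooling with window k -> o dims
    z = np.sign(z) * np.sqrt(np.abs(z))       # power (signed square-root) normalization
    return z / (np.linalg.norm(z) + 1e-12)    # L2 normalization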

Update Dec. 2nd, 2017

A third-party PyTorch implementation of MFB(MFH) is released here. Great thanks, Liam!

Update Sep. 5th, 2017

Using the Bottom-up and Top-Down (BUTD) image features (the model with adaptive K ranging from 10 to 100) here, our single MFH+CoAtt+GloVe model achieved an overall accuracy of 68.76% on the test-dev set of the VQA-2.0 dataset. With an ensemble of 8 models, we achieved the new state-of-the-art performance on the VQA-2.0 leaderboard with an overall accuracy of 70.92%.

Update Aug. 1st, 2017

Our solution for the VQA Challenge 2017 has been updated!

We proposed a high-order extension of MFB, i.e., the Multi-modal Factorized High-order Pooling (MFH). See the flowchart in Figure 2 and the implementations in the mfh_baseline and mfh-coatt-glove folders. With an ensemble of 9 MFH+CoAtt+GloVe(+VG) models, we won the 2nd place (tied with another team) in the VQA Challenge 2017. The detailed information can be found in our paper (the second paper in the Citation section at the bottom of this page).

Figure 2: The high-order MFH model which consists of p MFB blocks (without sharing parameters).
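
In the same illustrative spirit (reusing the conventions and NumPy import from the mfb_pool sketch above), MFH cascades p MFB blocks with unshared parameters: each block feeds its pre-pooling "expand" output into the next block, and the final MFH feature concatenates the pooled outputs of all blocks.

def mfh_pool(x, y, Us, Vs, k):
    """Sketch of MFH: p cascaded MFB blocks with unshared Us[i], Vs[i]."""
    outputs, expand_prev = [], 1.0
    for U, V in zip(Us, Vs):
        expand = expand_prev * ((U.T @ x) * (V.T @ y))  # feed previous expand stage in
        z = expand.reshape(-1, k).sum(axis=1)           # sum pooling, as in MFB
        z = np.sign(z) * np.sqrt(np.abs(z))             # power normalization
        outputs.append(z / (np.linalg.norm(z) + 1e-12)) # L2 normalization
        expand_prev = expand
    return np.concatenate(outputs)                      # concat of all p block outputs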

Prerequisites

Our code is implemented based on the high-quality vqa-mcb project. The data preprocessing and other prerequisites are the same as theirs. Before running our scripts to train or test the MFB model, see the Prerequisites and Data Preprocessing sections in the README of the vqa-mcb project first.

  • The Caffe version required for our MFB is slightly different from the one for MCB. We add some layers, e.g., sum pooling, permute, and KLD loss layers, to the feature/20160617_cb_softattention branch of Caffe for MCB. Please check out our Caffe version here and compile it. Note that cuDNN is currently not compatible with the sum pooling layer; you should switch it off to run the code correctly.

Pretrained Models

We release the pretrained single models "MFB(or MFH)+CoAtt+GloVe+VG" from the papers. To the best of our knowledge, our MFH+CoAtt+GloVe+VG model reports the best single-model result (test-dev) on both the VQA-1.0 and VQA-2.0 datasets (train + val + Visual Genome). The corresponding results are shown in the table below. The results JSON files (results.zip for VQA-1.0) are also included in the model folders and can be uploaded to the evaluation servers directly. Note that the models were trained with an old version of the GloVe embeddings in spacy. If you use the latest version, the embeddings may be inconsistent, leading to inferior performance. We suggest training the models from scratch yourself.

Datasets\Models | MCB        | MFB               | MFH                          | MFH (BUTD img features)
VQA-1.0         | 65.38%     | 66.87% (BaiduYun) | 67.72% (BaiduYun or Dropbox) | 69.82%
VQA-2.0         | 62.33% [1] | 65.09% (BaiduYun) | 66.12% (BaiduYun or Dropbox) | 68.76% [2]

[1] The MCB result on VQA-2.0 was provided by the VQA Challenge organizers and does not use the GloVe embedding.

[2] Overall: 68.76%, yes/no: 84.27%, number: 49.56%, other: 59.89%.

Training from Scratch

We provide scripts for training two MFB models from scratch, in the mfb-baseline and mfb-coatt-glove folders. Simply run the Python scripts train_*.py to train the models from scratch.

  • Most of the hyper-parameters and configurations, with comments, are defined in the config.py file.
  • The solver configurations are defined in the get_solver function in the train_*.py scripts.
  • A pretrained GloVe word embedding model (via the spacy library) is required to train the mfb-coatt-glove model; a hedged usage sketch follows this list. The installation instructions for spacy and the GloVe model can be found here.
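
As a rough idea of what the GloVe dependency does, here is a hedged sketch of looking up per-token vectors with spacy. The exact spacy version and vector package this repo expects are given in the linked instructions; the original code was built against an old spacy release whose API differs from current ones.

import spacy

# 'en' is a placeholder model name; install whatever GloVe vector package
# the linked instructions specify for this repo.
nlp = spacy.load('en')

doc = nlp(u'what color is the umbrella')
embeddings = [token.vector for token in doc]  # one GloVe vector per question token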

Evaluation

To generate an answers JSON file in the format expected by the VQA evaluation code and VQA test server, you can use eval/ensemble.py. This code can also ensemble multiple models. Running python ensemble.py will print out a help message telling you what arguments to use.
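
For reference, the VQA evaluation servers accept a JSON array of {question_id, answer} records, which is the format ensemble.py produces. A minimal sketch of writing such a file by hand (illustrative IDs and file name) looks like this:

import json

# Each record pairs a question ID from the test split with a predicted answer string.
results = [
    {"question_id": 1, "answer": "yes"},
    {"question_id": 2, "answer": "2"},
]
with open("vqa_results.json", "w") as f:  # illustrative file name
    json.dump(results, f)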

License

This code is distributed under MIT LICENSE. The released models are only allowed for non-commercial use.

Citation

If the code is helpful for your research, please cite:

@inproceedings{yu2017mfb,
  title={Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Fan, Jianping and Tao, Dacheng},
  booktitle={IEEE International Conference on Computer Vision (ICCV)},
  pages={1839--1848},
  year={2017}
}

@article{yu2018beyond,
  title={Beyond Bilinear: Generalized Multimodal Factorized High-Order Pooling for Visual Question Answering},
  author={Yu, Zhou and Yu, Jun and Xiang, Chenchao and Fan, Jianping and Tao, Dacheng},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  volume={29},
  number={12},
  pages={5947--5959},
  year={2018}
}

Contact

Zhou Yu [yuz(AT)hdu.edu.cn]


vqa-mfb's Issues

error on building caffe

[ RUN ] LayerFactoryTest/1.TestCreateLayer
src/caffe/test/test_layer_factory.cpp:47: Failure
Value of: layer->type()
Actual: "MatMul2"
Expected: iter->first
Which is: "MatMul"
[ FAILED ] LayerFactoryTest/1.TestCreateLayer, where TypeParam = caffe::CPUDevice (195 ms)

Any advice? I guess the error is due to the custom layer.

Question about MFB Baseline

This is a confirmation question about the MFB baseline. According to the paper, there should be two LSTM layers with 1024-D hidden units each, but in the implementation only one LSTM layer with 1024-D hidden units is used. Kindly confirm.

Thanks

Question about dataset

Hi Yu, when trying to run the code, I noticed that the config.py in mfh_coatt_glove refers to some data files such as 'OpenEnded_mscoco_train2014_questions.json', but I have not found these files at the link you provided (http://visualqa.org/download.html). Could you please tell me where I can find these files? Thanks a lot.

Resuming training from last iteration

The original vqa-mcb didn't have built-in support for resuming training, due to the compact bilinear layer. Can this Caffe implementation resume training from the last iteration?

An issue about preprocessing the VQA-2.0 dataset

@yuzcccc Hi, Yu. When I run the preprocessing to extract features, I use a workstation with 2 TITAN GPUs, but the code uses only 4 GB of GPU memory; the remaining 20 GB is unused and the code runs very slowly. Could you tell me how to make full use of the 2 TITANs and improve the running speed?
