This project forked from aurooj/wsg-vqa-vltransformers

Weakly Supervised Grounding for VQA in Vision-Language Transformers

License: MIT License

Shell 7.14% Python 92.86%

Weakly Supervised Grounding for VQA in Vision-Language Transformers [ECCV 2022]

Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels Da Vitoria Lobo, Mubarak Shah

Website | arXiv | BibTeX

Official PyTorch implementation and pre-trained models for Weakly Supervised Grounding for VQA in Vision-Language Transformers (coming soon).

Abstract

Transformers for visual-language representation learning have attracted a lot of interest and have shown tremendous performance on visual question answering (VQA) and grounding. However, most systems that perform well on those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, the paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. The approach leverages capsules by grouping each visual token in the visual encoder and uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA dataset as well as the VQA-HAT dataset for VQA grounding. Our experiments show that, while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field.

(a) Proposed Architecture, (b) Proposed Capsule Encoding layer, (c) Proposed Capsule Layer
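
To make the text-guided selection idea from the abstract more concrete, here is a minimal, framework-free sketch. It is not the code from this repository: the function names, the use of a softmax to turn text scores into a soft mask, and the toy inputs are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mask_capsules(visual_tokens, text_scores):
    """Scale each capsule of every visual token by a text-derived weight.

    visual_tokens: list of tokens; each token is a list of K capsules;
                   each capsule is a list of floats.
    text_scores:   K raw scores, assumed to come from the language encoder
                   (hypothetical input; one score per capsule).
    """
    weights = softmax(text_scores)
    return [
        [[w * x for x in capsule] for w, capsule in zip(weights, token)]
        for token in visual_tokens
    ]


# One visual token with two 2-dimensional capsules (toy values).
tokens = [[[1.0, 2.0], [3.0, 4.0]]]
# A much larger score for capsule 0 keeps it and suppresses capsule 1.
masked = mask_capsules(tokens, text_scores=[10.0, -10.0])
print(masked)
```

In this sketch the same mask is applied to every visual token, so the text decides which capsules survive and the masked capsules are what would be passed on to the next layer.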

Qualitative Results

Figure: qualitative results on GQA (image: gqa-qualitative)

Code

This code is built upon the LXMERT code base. Thanks to Hao Tan for providing excellent code for their model.

Datasets

For pretraining, we used MSCOCO and VG for image-caption pairs and Viz7W, VQA v2.0, and GQA for question-image pairs. We followed the instructions provided by LXMERT to prepare the data, with a few changes:

  1. We removed the GQA validation set from the pretraining data, as we use it for grounding evaluation.
  2. We validate our pretraining on the mscoco-minival split.
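
The first change above amounts to filtering held-out images out of the pretraining records. A minimal sketch of that step follows. The record layout (a list of dicts with an `img_id` key), the function name `drop_heldout`, and the example IDs are assumptions for illustration, not the actual preprocessing used in this repository.

```python
def drop_heldout(records, heldout_img_ids):
    """Return only the records whose image is not in the held-out set.

    records:         list of dicts, each with an "img_id" key (assumed layout).
    heldout_img_ids: iterable of image IDs reserved for evaluation,
                     e.g. the images of a validation split.
    """
    heldout = set(heldout_img_ids)  # set membership keeps the filter linear
    return [r for r in records if r["img_id"] not in heldout]


# Toy pretraining records (hypothetical IDs and sentences).
pretrain = [
    {"img_id": "img_1", "sent": "a dog on a couch"},
    {"img_id": "img_2", "sent": "what color is the car?"},
    {"img_id": "img_3", "sent": "two people riding bikes"},
]
validation_ids = ["img_2"]

kept = drop_heldout(pretrain, validation_ids)
print([r["img_id"] for r in kept])  # ['img_1', 'img_3']
```

Filtering by image ID rather than by question ID removes every caption and question attached to a held-out image, which is the stricter way to avoid leakage into the evaluation set.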

Pretraining

To pretrain the backbone, use the following command:

bash run/pretrain_2stage_fulldata_no_init_16_caps.bash

Finetuning on downstream tasks

GQA

See run/gqa_finetune_caps.bash for finetuning on the GQA dataset.

VQA-HAT

Finetuning on VQA-HAT is similar to finetuning on GQA. I will keep adding more concrete details in the next few days.

Citation

If this work is useful for your research, please cite our paper.

@inproceedings{Khan2022WeaklySG,
  title={Weakly Supervised Grounding for VQA in Vision-Language Transformers},
  author={Aisha Urooj Khan and Hilde Kuehne and Chuang Gan and Niels da Vitoria Lobo and Mubarak Shah},
  year={2022}
}

Questions?

Please contact '[email protected]'
