doc-doc / NExT-OE

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR'21)

License: MIT License

Languages: Python 99.95%, Shell 0.05%
Topics: videoqa, vision-language, video-comprehension, multi-object-interaction, causal-temporal-action-reasoning

next-oe's Introduction

We reproduce some SOTA VideoQA methods to provide benchmark results for our NExT-QA dataset, accepted to CVPR 2021.

NExT-QA is a VideoQA benchmark targeting the explanation of video contents. It challenges QA models to reason about causal and temporal actions and to understand the rich object interactions in daily activities. We set up both multi-choice and open-ended QA tasks on the dataset. This repo provides resources for open-ended QA; resources for multi-choice QA can be found in NExT-QA. For more details, please refer to our dataset page.

Todo

  1. Raw videos are the same as in NExT-QA(MC).
  2. Open the online evaluation server and release the test data.
  3. RoI features are the same as in NExT-QA(MC).

Environment

Anaconda 4.8.4, Python 3.6.8, PyTorch 1.6, and CUDA 10.2. For other libraries, please refer to requirements.txt.

Install

Please create an environment for this project using Anaconda (install Anaconda first if it is not already available):

>conda create -n videoqa python==3.6.8
>conda activate videoqa
>git clone https://github.com/doc-doc/NExT-OE.git
>cd NExT-OE
>pip install -r requirements.txt
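
If the installation succeeded, a quick check like the one below (a minimal sketch, not part of the repo) should report PyTorch 1.6 with CUDA 10.2 available:

# Sanity-check the environment (run inside the 'videoqa' env).
import torch

print(torch.__version__)          # expect 1.6.x
print(torch.version.cuda)         # expect '10.2'
print(torch.cuda.is_available())  # expect True on a CUDA-enabled machine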

Data Preparation

Please download the pre-computed features and QA annotations from here. There are 3 zip files:

  • ['vid_feat.zip']: Appearance and motion features for video representation (same as multi-choice QA).
  • ['nextqa.zip']: Annotations of QAs and GloVe Embeddings (open-ended version).
  • ['models.zip']: HGA model (open-ended version).

After downloading the data, please create a folder ['data/feats'] in the same directory as ['NExT-OE'], then unzip the video features into it. You will then have directories like ['data/feats/vid_feat/'] and ['NExT-OE/'] in your workspace. Please unzip the files in ['nextqa.zip'] into ['NExT-OE/dataset/nextqa/'] and ['models.zip'] into ['NExT-OE/models/'].
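
If you want to double-check the layout, the snippet below is a minimal sketch (not part of the repo) that verifies the expected directories exist, assuming it is run from inside ['NExT-OE/']:

# Layout check following the instructions above (run from inside NExT-OE/).
import os

expected = [
    '../data/feats/vid_feat',   # video appearance/motion features (vid_feat.zip)
    'dataset/nextqa',           # QA annotations and GloVe files (nextqa.zip)
    'models',                   # pretrained HGA model (models.zip)
]
for path in expected:
    print(path, 'OK' if os.path.isdir(path) else 'MISSING')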

Usage

Once the data is ready, you can run the code. First, to verify the environment and the code, we provide the prediction file and the model of the SOTA approach (i.e., HGA) on NExT-QA. You can get the results reported in the paper by running:

>python eval_oe.py

The command above will load the prediction file under ['results/'] and evaluate it. You can also generate the prediction file yourself by running:

>./main.sh 0 val #Test the model with GPU id 0

The command above will load the model under ['models/'] and generate the prediction file. If you want to train the model, please run

>./main.sh 0 train # Train the model with GPU id 0

It will train the model and save it to ['models/']. (Results may differ slightly depending on the environment.)
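
The scores reported below are WUPS values (based on Wu-Palmer similarity). As background, here is a minimal sketch of a WUPS-style score for single-word answers using NLTK's WordNet interface; the actual ['eval_oe.py'] may differ in detail (e.g., the per-type breakdown into WUPS_C/WUPS_T/WUPS_D and multi-word handling), so treat this as an illustration only:

# Illustrative WUPS-style scoring (not the repo's eval_oe.py).
# Requires NLTK with the WordNet corpus (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def wup(word_a, word_b, threshold=0.9):
    """Max Wu-Palmer similarity over synset pairs, down-weighted (x0.1)
    when the best score falls below the threshold, as in standard WUPS."""
    if word_a == word_b:
        return 1.0
    best = 0.0
    for sa in wn.synsets(word_a):
        for sb in wn.synsets(word_b):
            sim = sa.wup_similarity(sb)
            if sim is not None and sim > best:
                best = sim
    return best if best >= threshold else 0.1 * best

# Average the thresholded WUP over (prediction, ground-truth) pairs.
preds = {'q1': 'dog', 'q2': 'kitchen'}
gts   = {'q1': 'puppy', 'q2': 'bedroom'}
print(round(100 * sum(wup(preds[q], gts[q]) for q in gts) / len(gts), 2))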

Results on Val

| Methods | Text Rep. | WUPS_C | WUPS_T | WUPS_D | WUPS |
| :-- | :--: | :--: | :--: | :--: | :--: |
| BlindQA | GloVe | 12.14 | 14.85 | 40.41 | 18.88 |
| STVQA (CVPR17) | GloVe | 12.52 | 14.57 | 45.64 | 20.08 |
| UATT (TIP17) | GloVe | 13.62 | 16.23 | 43.41 | 20.65 |
| HME (CVPR19) | GloVe | 12.83 | 14.76 | 45.13 | 20.18 |
| HCRN (CVPR20) | GloVe | 12.53 | 15.37 | 45.29 | 20.25 |
| HGA (AAAI20) | GloVe | 14.76 | 14.90 | 46.60 | 21.48 |

Please refer to our paper for results on the test set.

Multi-choice QA vs. Open-ended QA

[Figure mc_oe: example comparison of multi-choice QA vs. open-ended QA on NExT-QA]

Some Latest Results

| Methods | Publication | Highlight | Val (WUPS@All) | Test (WUPS@All) |
| :-- | :--: | :-- | :--: | :--: |
| Emu (0-shot) by BAAI | arXiv'23 | VL foundation model | - | 23.4 |
| Flamingo (0-shot) by DeepMind | NeurIPS'22 | VL foundation model | - | 26.7 |
| KcGA by Baidu | AAAI'23 | Knowledge base, GPT-2 | - | 28.2 |
| Flamingo (32-shot) by DeepMind | NeurIPS'22 | VL foundation model | - | 33.5 |
| PaLI-X by Google Research | arXiv'23 | VL foundation model | - | 38.3 |

Citation

@InProceedings{xiao2021next,
    author    = {Xiao, Junbin and Shang, Xindi and Yao, Angela and Chua, Tat-Seng},
    title     = {NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {9777-9786}
}

Acknowledgement

Our reproduction of the methods is based on the respective official repositories, and we thank the authors for releasing their code. If you use the related parts, please cite the corresponding papers noted in the code comments.


next-oe's Issues

How to get the word representations

Hi, I have a simple question: how can we obtain glove_embed.npy and vocab.pkl for a new dataset? To get glove_embed.npy, do we need to train new word vectors on a vocabulary that we build ourselves? And if possible, could you release the NLP pre-processing code? Thanks very much.
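
One common recipe for producing such files (a sketch only, not the authors' preprocessing code; the GloVe file name and tokenization below are assumptions) is to collect the vocabulary from the QA text and look each word up in pre-trained GloVe vectors:

# Sketch: build vocab.pkl and glove_embed.npy from the QA text.
# Not the authors' code; 'glove.6B.300d.txt' is an assumed local file.
import pickle
import numpy as np

def build_vocab_and_embeddings(sentences, glove_path='glove.6B.300d.txt', dim=300):
    # 1) Vocabulary from tokenized questions/answers.
    vocab = ['<pad>', '<unk>'] + sorted({w for s in sentences for w in s.lower().split()})
    word2idx = {w: i for i, w in enumerate(vocab)}

    # 2) Fill an embedding matrix with pre-trained GloVe vectors where available.
    embed = np.random.uniform(-0.1, 0.1, (len(vocab), dim)).astype(np.float32)
    embed[0] = 0.0  # <pad> row
    with open(glove_path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if parts[0] in word2idx:
                embed[word2idx[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)

    np.save('glove_embed.npy', embed)
    with open('vocab.pkl', 'wb') as fp:
        pickle.dump(word2idx, fp)
    return word2idx, embed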

Evaluation results on testing set

I have trained the HGA model and evaluated it on the test set, but the WUPS score is only 23-24.

I also evaluated the generated answers provided in this repository (HGA-same-att-qns23ans7-test.json), and the WUPS is 24.01.

So how can I reproduce the test results reported in the paper?

Besides, I trained my own blind QA model and its results reach 23, which is similar to the VideoQA models. The visual information seems NOT to be helpful for this task.

Inquiry about the video feature extraction code and the BERT fine-tuning code for QA

Hi! Thanks for sharing the code and for the excellent work! I can now reproduce the same results with the OE model code. However, I wonder how you extracted the video features and how you fine-tuned BERT; these parts do not appear in the GitHub code. I checked the issues on the NExT-QA pages, but the BERT link no longer seems to work, and you mentioned that you would add the code for extracting the video features. Could you explain how to extract the video features and fine-tune BERT on QA, or could you release the code? Thanks for your reading and time!
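
For reference, frame-level appearance features are often extracted roughly as below (a sketch only, not the authors' pipeline; the backbone choice and frame sampling are assumptions, and motion features would additionally need a 3D backbone):

# Sketch: frame-level appearance features with a pretrained ResNet
# (illustrative only; not the authors' feature-extraction pipeline).
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

device = 'cuda' if torch.cuda.is_available() else 'cpu'
resnet = models.resnet101(pretrained=True)
resnet.fc = torch.nn.Identity()     # keep the 2048-d pooled features
resnet = resnet.eval().to(device)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(frame_paths):
    """frame_paths: list of sampled frame image files from one video."""
    batch = torch.stack([preprocess(Image.open(p).convert('RGB')) for p in frame_paths])
    return resnet(batch.to(device)).cpu()   # shape: (num_frames, 2048)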
