thaolmk54 / hcrn-videoqa

Implementation for the paper "Hierarchical Conditional Relation Networks for Video Question Answering" (Le et al., CVPR 2020, Oral)

License: Apache License 2.0

Language: Python 100.00%
Topics: tgif-qa, videoqa, question-answering, vqa

hcrn-videoqa's Introduction

Hierarchical Conditional Relation Networks for Video Question Answering (HCRN-VideoQA)

We introduce a general-purpose, reusable neural unit called the Conditional Relation Network (CRN) that encapsulates and transforms an array of tensorial objects into a new array of the same kind, conditioned on a contextual feature. The flexibility of CRN units is then examined by applying them to Video Question Answering, a challenging problem requiring joint comprehension of video content and natural language.
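For intuition only, here is a minimal sketch of the CRN idea in PyTorch. It is not the authors' implementation: the aggregation and conditioning functions are placeholders, and the paper samples subsets rather than enumerating all of them.

import itertools
import torch
import torch.nn as nn

class CRNSketch(nn.Module):
    # Toy CRN: turns an array of objects into a new array of the same kind,
    # conditioned on a context feature. Illustration only.
    def __init__(self, dim):
        super().__init__()
        self.g = nn.Linear(dim, dim)      # summarizes a subset of objects
        self.p = nn.Linear(2 * dim, dim)  # fuses the summary with the context

    def forward(self, objects, context):
        # objects: list of (dim,) tensors; context: a (dim,) conditioning tensor
        outputs = []
        for k in range(2, len(objects)):  # relations of each intermediate order
            for subset in itertools.combinations(objects, k):
                agg = self.g(torch.stack(subset).mean(dim=0))
                outputs.append(self.p(torch.cat([agg, context], dim=-1)))
        return outputs  # an array of the same kind as the input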

Illustrations of the CRN unit and the resulting HCRN model for VideoQA:

[Figure: the CRN unit (left) and the HCRN architecture (right)]

Check out our paper for details.

Setups

  1. Clone the repository:
 git clone https://github.com/thaolmk54/hcrn-videoqa.git
  2. Download the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets, then edit the absolute paths in preprocess/preprocess_features.py and preprocess/preprocess_questions.py according to where your data is located (a hypothetical example follows this list). The default paths are /ceph-g/lethao/datasets/{dataset_name}/.

  3. Install dependencies:

conda create -n hcrn_videoqa python=3.6
conda activate hcrn_videoqa
conda install -c conda-forge ffmpeg
conda install -c conda-forge scikit-video
pip install -r requirements.txt
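As a rough illustration of the path edit in step 2 (the variable name below is hypothetical; adapt the actual assignments you find in the two scripts):

# Hypothetical example only: point the dataset root at your own location.
dataset_root = '/path/to/your/datasets/tgif-qa/'  # default: /ceph-g/lethao/datasets/tgif-qa/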

Experiments with TGIF-QA

Depending on the task, choose question_type from 4 options: action, transition, count, frameqa.

Preprocessing visual features

  1. To extract appearance features:
python preprocess/preprocess_features.py --gpu_id 2 --dataset tgif-qa --model resnet101 --question_type {question_type}
  2. To extract motion features:

     Download the ResNeXt-101 pretrained model (resnext-101-kinetics.pth) and place it in data/preprocess/pretrained/.

python preprocess/preprocess_features.py --dataset tgif-qa --model resnext101 --image_height 112 --image_width 112 --question_type {question_type}

Note: Extracting visual features takes a long time. You can download our pre-extracted features from here and save them in data/tgif-qa/{question_type}/. Use the following command to join the split files:

cat tgif-qa_{question_type}_appearance_feat.h5.part* > tgif-qa_{question_type}_appearance_feat.h5
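After joining, a quick sanity check (assuming h5py is installed; the path below uses the action task) confirms the merged file is readable:

import h5py

# Open the joined HDF5 file and list its datasets and their shapes.
with h5py.File('data/tgif-qa/action/tgif-qa_action_appearance_feat.h5', 'r') as f:
    for name, item in f.items():
        print(name, getattr(item, 'shape', '(group)'))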

Preprocess linguistic features

  1. Download the GloVe pretrained 300d word vectors to data/glove/ and process them into a pickle file (a sketch of this conversion appears after this list):
python txt2pickle.py
  2. Preprocess train/test questions:
python preprocess/preprocess_questions.py --dataset tgif-qa --question_type {question_type} --glove_pt data/glove/glove.840.300d.pkl --mode train

python preprocess/preprocess_questions.py --dataset tgif-qa --question_type {question_type} --mode test
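For reference, a minimal sketch of what txt2pickle.py presumably does (the actual script may differ): read the GloVe text file into a {word: vector} dictionary and pickle it.

import pickle
import numpy as np

# Build a word -> 300d vector map from the downloaded GloVe text file.
glove = {}
with open('data/glove/glove.840B.300d.txt', encoding='utf-8') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        glove[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# Save it under the name the preprocessing commands expect.
with open('data/glove/glove.840.300d.pkl', 'wb') as f:
    pickle.dump(glove, f)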

Training

Choose the config file configs/{task}.yml for one of the 4 tasks (action, transition, count, frameqa) to train the model. For example, to train on the action task, run:

python train.py --cfg configs/tgif_qa_action.yml
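To inspect or tweak a config before training, the yml files can be read with PyYAML (a convenience snippet, not part of the repo):

import yaml

# Load a task config and print its settings.
with open('configs/tgif_qa_action.yml') as f:
    cfg = yaml.safe_load(f)
print(cfg)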

Evaluation

To evaluate the trained model, run the following:

python validate.py --cfg configs/tgif_qa_action.yml

Note: A pretrained model for the action task is available here. Save the file in results/expTGIF-QAAction/ckpt/ for evaluation.

Experiments with MSRVTT-QA and MSVD-QA

The following commands run experiments on the MSRVTT-QA dataset; replace msrvtt-qa with msvd-qa to run on the MSVD-QA dataset.

Preprocessing visual features

  1. To extract appearance features:
python preprocess/preprocess_features.py --gpu_id 2 --dataset msrvtt-qa --model resnet101
  2. To extract motion features:
python preprocess/preprocess_features.py --dataset msrvtt-qa --model resnext101 --image_height 112 --image_width 112

Preprocess linguistic features

Preprocess train/val/test questions:

python preprocess/preprocess_questions.py --dataset msrvtt-qa --glove_pt data/glove/glove.840.300d.pkl --mode train
    
python preprocess/preprocess_questions.py --dataset msrvtt-qa --question_type {question_type} --mode val
    
python preprocess/preprocess_questions.py --dataset msrvtt-qa --question_type {question_type} --mode test

Training

python train.py --cfg configs/msrvtt_qa.yml

Evaluation

To evaluate the trained model, run the following:

python validate.py --cfg configs/msrvtt_qa.yml

Citations

If you make use of this repository for your research, please cite the following paper:

@article{le2020hierarchical,
  title={Hierarchical Conditional Relation Networks for Video Question Answering},
  author={Le, Thao Minh and Le, Vuong and Venkatesh, Svetha and Tran, Truyen},
  journal={arXiv preprint arXiv:2002.10698},
  year={2020}
}

Acknowledgement

  • For motion feature extraction, we adapt the ResNeXt-101 model from this repo. Thanks to @kenshohara for releasing the code and the pretrained models.
  • We refer to this repo for preprocessing.
  • Our dataloader implementation is based on this repo.


hcrn-videoqa's Issues

Some MSRVTT-QA videos are missing

Hi, thanks for sharing the repo and your work. When I tried to use the source videos of MSRVTT-QA, I found that some of the provided URLs are now invalid. Could you share the source videos of MSRVTT-QA? Thanks a lot.

About the accuracy on TGIF-QA

Hi,
I downloaded the code, features, and pre-trained models, but I got a Count accuracy of about 4.05/4.04/4.05 on the test set. When I train the model myself, I get 4.0639/4.0802/4.0599 on the Count test and 0.7476/0.7454/0.7449 on the Action test. I wonder whether the parameters in configs/tgif_qa_xx.yml need to be adjusted, or whether I need other settings.

Another way to download the TGIF dataset

Hi,
Thank you for your work! I want to run the code, but when I download the TGIF dataset via the links in the tsv file, it always fails. Is there another way to download it, like Google Drive or another cloud drive?

What value does the hinge loss converge to?

Hi,

I was trying to train the model on another dataset and found that the hinge loss for multi-choice problems finally converged to about 1.0. I wonder what value the hinge loss converged to in your training runs.

Thanks for your reply!
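For context, a common pairwise hinge loss for multi-choice scoring (the repo's exact formulation may differ) is $\mathcal{L} = \sum_{a \neq a^{*}} \max(0,\ 1 + s_a - s_{a^{*}})$, where $s_a$ is the score of candidate $a$ and $a^{*}$ is the correct answer; with this unit margin, the loss reaches zero only when every wrong candidate scores at least 1 below the correct one.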

Low accuracy when validating on the MSRVTT-QA dataset

Sorry for disturbing you. I ran your code on MSRVTT-QA but hit a bad problem that drives the validation accuracy to 0.
When I run train.py, train_acc is fine, but val_acc is 0 in every epoch.
I then found that with model.train() the accuracy is fine, but with model.eval() the model always outputs the same tensor, no matter what the input is.
How can I fix this problem?

Pre-extracted features link not working

Hi,

I was trying to download the pre-extracted features through the link https://bit.ly/2TX9rlZ, but accessing it gives the error "We're sorry, but [email protected] can't be found in the deakin365-my.sharepoint.com directory. Please try again later, while we try to automatically fix this for you." ([email protected] is the email of my Microsoft account.) Is there anything I can do to fix this? Or could you upload the features to Google Drive?

Thanks!

Decoder problem

I'm sorry to trouble you again. While debugging, I ran into a decoding problem; I looked at some of the data but couldn't solve it. Could you give me some ideas on how to fix it?

Training on TGIF-QA / FrameQA

Hi,

Thanks for your great work. I have no problem using the code for MSVD-QA, MSRVTT-QA, and the three other TGIF-QA tasks, but when I train on the TGIF-QA FrameQA subtask, the loss quickly becomes NaN (after about 80% of the first epoch) and the accuracy is 0. Do you have an idea why this happens?

A problem with accuracy

Hi,
I re-downloaded all your files and trained the model four times, completely following your readme, but the accuracy on the action task is only about 73%. If you need the log file, I can send it to you. I tried on both a 1080 Ti and a 2080 Ti and got the same result.

MSRVTT-QA and MSVD-QA datasets

Hello, thank you for your excellent work.
Would you be willing to provide the preprocessed features for the MSRVTT-QA and MSVD-QA datasets? I want to test your model on them.

Motion model information

Hi,

Firstly, let me say I appreciate your work. Your code is elegant: unlike other releases, you provide the code to extract visual and text features, which makes it easier to apply your method to my own datasets and tasks.

Now I am changing the feature extraction method to improve performance on my tasks, but I don't know where the ResNeXt-101 motion model comes from, which dataset it was pretrained on, or what its accuracy is, so I cannot compare it with other models directly. Trying them one by one would be very time-consuming. Could you please share some information?

Thanks a lot!

A question about GloVe

Hi,
Your question-preprocessing file reads glove.6B.300d.txt, but the download link you gave points to glove.840B.300d.txt.
Last time I trained, I used glove.6B.300d.txt and did not use the link you provided.

Epoch selection

Why does your code take epoch 25 as the final result? At epoch 25 the validation accuracy clearly drops, even though the loss keeps decreasing.

About the multi-choice task in TGIF-QA

Hello, thanks a lot for sharing your impressive work.

I notice you use the candidate-answer information on MC tasks, in HCRNNetwork.forward:

out = self.output_unit(question_embedding[batch_agg], q_visual_embedding[batch_agg], ans_candidates_embedding, a_visual_embedding)

So I tried to use the candidate-answer information to guide visual_embedding. I changed the code in HCRNNetwork.forward as follows:

ans_candidates_agg = ans_candidates.view(-1, ans_candidates.size(2))
ans_candidates_len_agg = ans_candidates_len.view(-1)
batch_agg = np.reshape(
    np.tile(np.expand_dims(np.arange(batch_size), axis=1), [1, 5]), [-1])
ans_candidates_embedding = self.linguistic_input_unit(ans_candidates_agg, ans_candidates_len_agg)
# sum the five candidate embeddings per sample ...
ans_candidates_emb_mul = ans_candidates_embedding.view(batch_size, 5, -1).sum(1)
question_embedding = self.linguistic_input_unit(question, question_len)
# ... and add them to the question embedding that conditions the visual unit
visual_embedding = self.visual_input_unit(video_appearance_feat, video_motion_feat,
                                          question_embedding + ans_candidates_emb_mul)
q_visual_embedding = self.feature_aggregation(question_embedding, visual_embedding)
a_visual_embedding = self.feature_aggregation(ans_candidates_embedding, visual_embedding[batch_agg])
out = self.output_unit(question_embedding[batch_agg], q_visual_embedding[batch_agg],
                       ans_candidates_embedding, a_visual_embedding)

I get accuracies of 0.9380 and 0.9759 on the action and transition tasks. I checked the loss function and the accuracy evaluation function and did not find any bug. Can you explain this?

dataset

May I ask how the dataset is loaded in your code?

Question about the TGIF dataset

Hello,

Thank you for your excellent work!

When I downloaded the TGIF-QA dataset, which includes approximately 124 GB of GIF files and some CSV files with question-answer pairs, I found that some gif_name entries in the CSV files cannot be found among the GIFs, such as tumblr_nk172bbdPI1u1lr18o1_250 in Test_action_question.csv.

Meanwhile, some GIF files do not appear in the CSV files, such as tumblr_l5zke1pg6r1qzzqaxo1_500.gif.

Have you ever had the same experience? Is there a solution? Did I download the wrong dataset?

Data error

I'm sorry to bother you again. After I downloaded your frame_qa data and merged it into an h5 file, I got an unexpected error; the files for the other tasks are fine. I am not sure whether the file was damaged during my download or whether the original file you uploaded is broken. If possible, could you test whether your original file can be read with h5py? Re-downloading your data will take me a whole day. Thank you.

A question about dataset

I would like to ask about a note that comes with the extracted features (screenshot omitted).
Are the features you extracted only for the action task?
And if I want to extract all the features, how long will it take?
Thank you!
