cshizhe / hgr_v2t

Code accompanying the paper "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning".

License: MIT License


hgr_v2t's Introduction

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

This repository contains the PyTorch implementation of our paper Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning (CVPR 2020).

Overview of HGR Model

Prerequisites

Python 3 and PyTorch 1.3.

# clone the repository
git clone git@github.com:cshizhe/hgr_v2t.git
cd hgr_v2t
export PYTHONPATH=$(pwd):${PYTHONPATH}

Datasets

We provide annotations and pretrained features for the MSRVTT, TGIF, VATEX and Youtube2Text video captioning datasets, which can be downloaded from BaiduNetdisk (code: vxpi).

Annotations

  • groundtruth: annotation/RET directory
  1. ref_captions.json: dict, {videoname: [sent]}
  2. sent2rolegraph.augment.json: {sent: (graph_nodes, graph_edges)}
  • vocabularies: annotation/RET directory
  1. int2word.npy: [word]
  2. word2int.json: {word: int}

  • data splits: public_split directory
    trn_names.npy, val_names.npy, tst_names.npy
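A minimal sketch of loading these annotation files (the MSRVTT/ path prefix and variable names are illustrative assumptions; adapt them to wherever you unpacked the download):

import json
import numpy as np

anno_dir = 'MSRVTT/annotation/RET'    # assumed unpack location
split_dir = 'MSRVTT/public_split'

# groundtruth captions: {videoname: [sent]}
ref_captions = json.load(open('%s/ref_captions.json' % anno_dir))
# role graphs: {sent: (graph_nodes, graph_edges)}
sent2rolegraph = json.load(open('%s/sent2rolegraph.augment.json' % anno_dir))

# vocabularies
int2word = np.load('%s/int2word.npy' % anno_dir, allow_pickle=True)   # [word]
word2int = json.load(open('%s/word2int.json' % anno_dir))             # {word: int}

# data splits: arrays of video names
trn_names = np.load('%s/trn_names.npy' % split_dir, allow_pickle=True)
val_names = np.load('%s/val_names.npy' % split_dir, allow_pickle=True)
tst_names = np.load('%s/tst_names.npy' % split_dir, allow_pickle=True)

print(len(ref_captions), len(word2int), len(trn_names))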

Features

For the MSRVTT, TGIF and Youtube2Text datasets, we extract features with ResNet-152 pretrained on ImageNet. For the VATEX dataset, we use the I3D features released by the VATEX challenge organizers.

  • mean pooling features: ordered_feature/MP directory

format: numpy array, shape=(num_fts, dim_ft); rows follow the order of the names in the data split files

  • frame-level features: ordered_feature/SA directory

format: hdf5 file, {name: ft}, ft.shape=(num_frames, dim_ft)
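A minimal sketch of reading both feature formats (the exact MP file name below is an assumption; the SA path mirrors the resnet152.pth/trn_ft.hdf5 layout mentioned in the issues further down):

import h5py
import numpy as np

# mean-pooling features: one (num_fts, dim_ft) array per split,
# row order follows the corresponding *_names.npy file
mp_fts = np.load('MSRVTT/ordered_feature/MP/resnet152.pth/trn_ft.npy')

# frame-level features: hdf5 file, {name: ft}, ft.shape=(num_frames, dim_ft)
with h5py.File('MSRVTT/ordered_feature/SA/resnet152.pth/trn_ft.hdf5', 'r') as f:
    name = list(f.keys())[0]
    frame_fts = f[name][...]
    print(name, frame_fts.shape)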

Fine-grained Binary Selection Annotation

We construct the fine-grained binary selection dataset based on the test set of the Youtube2Text dataset. The annotations are in the Youtube2Text/annotation/binary_selection directory.

Training & Inference

Semantic Graph Construction

We provide the constructed role graph annotations. If you want to generate role graphs for a new dataset, please follow the instructions below.

  1. semantic role labeling:
python misc/semantic_role_labeling.py ref_caption_file out_file --cuda_device 0
  2. convert sentences into role graphs:
cd misc
jupyter notebook
# open parse_sent_to_role_graph.ipynb
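For reference, step 1 loads the AllenNLP SRL predictor; a rough sketch of that call (the model URL is the one used by misc/semantic_role_labeling.py, the other arguments are illustrative) is:

from allennlp.predictors.predictor import Predictor

# pretrained BERT SRL model used by the semantic role labeling step
predictor = Predictor.from_path(
    'https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz',
    cuda_device=0)

result = predictor.predict(sentence='a woman talks about a futuristic bicycle design')
# result['verbs'] lists one entry per predicate with BIO tags
# (B-V, B-ARG0, B-ARG1, ...) over result['words']
for verb in result['verbs']:
    print(verb['verb'], verb['tags'])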

Training and Evaluation

  1. The baseline VSE++ model:
cd t2vretrieval/driver

# setup config files
# you should modify data paths in configs/prepare_globalmatch_configs.py
python configs/prepare_globalmatch_configs.py $datadir
resdir='' # copy the output string of the previous step

# training
python global_match.py $resdir/model.json $resdir/path.json --is_train --resume_file $resdir/../../word_embeds.glove42b.th

# inference
python global_match.py $resdir/model.json $resdir/path.json --eval_set tst
  2. Our HGR model:
cd t2vretrieval/driver

# setup config files
# you should modify data paths in configs/prepare_mlmatch_configs.py
python configs/prepare_mlmatch_configs.py $datadir
resdir='' # copy the output string of the previous step

# training
python multilevel_match.py $resdir/model.json $resdir/path.json --load_video_first --is_train --resume_file $resdir/../../word_embeds.glove42b.th

# inference
python multilevel_match.py $resdir/model.json $resdir/path.json --load_video_first --eval_set tst

Citations

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@inproceedings{chen2020fine,
  title={Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning},
  author={Chen, Shizhe and Zhao, Yida and Jin, Qin and Wu, Qi},
  booktitle={CVPR},
  year={2020}
}

License

MIT License

hgr_v2t's People

Contributors

cshizhe


hgr_v2t's Issues

about visualizing examples

Hello, thanks for your great work. I'm very interested in visualizing the examples. How can I visualize the retrieved videos? Could you please upload the code?

Can you provide the split information of Vatex dataset?

I find that the VATEX dataset you used in HGR is VATEX v1.0, which does not provide annotations for the test set.
You then randomly split the validation set into two equal parts, with 1,500 videos as the validation set and the other 1,500 videos as the test set.
I want to follow your dataset partitioning, but I cannot find any split information in this repo.
Could you please provide the csv or json files of the VATEX dataset that contain the partition information?

Question about inconsistent results with the other papers

Hi cshizhe.
In your paper, the video-to-text retrieval results of all methods on TGIF are much lower than the results reported in the PVSE paper.
Because there is no description of this result, I cannot understand the discrepancy.
Can you explain it?
I could train your code on TGIF and get the result myself, but I think asking you is more reliable.

Thank you in advance.

Can not find "word_embeds.glove42b.th"

Hi Shizhe,

Thanks for your great work! I noticed that the training script needs to load a pretrained file:

--resume_file $resdir/../../word_embeds.glove42b.th

Is this used to initialize the text embedding module?

Besides, I cannot find this file in "MSRVTT/results/RET.released/"; I can only find "MSRVTT/results/RET/word_embeds.glove32b.th". Is there any difference between word_embeds.glove42b.th and word_embeds.glove32b.th? Could you please share "word_embeds.glove42b.th"?

captioning my own video

Hi~ thanks for your nice work~
I want to caption a self-captured video. Could you please give some detailed instructions on how to adapt the pretrained model provided in the code to this task, for example the feature extraction method, the feature data format, and how to visualize the final result? Thanks a lot!

Download dataset without Baidu account

Can you provide the datasets on another platform such as Google Drive or Dropbox? Downloading from Baidu requires an account, and I am not from China and do not have a Chinese phone number.
Thank you.

Questions about the MSR-VTT dataset

Hi, Shizhe, thanks for your great work. I downloaded the MSR-VTT dataset you provided and have a question: I found that not every video corresponds to 20 captions; some videos have fewer than 20 captions. Did you specifically select these captions, and if so, how did you choose them?

About training time

Thanks for your great work!

I have a question: how long does it take to train your model on each of the three datasets?

Also, the BaiduNetdisk link is empty.

About the recurrence of paper results

Thank you for your great code! After running your code on my server several times, I am surprised to find that I cannot reproduce the result in your paper. The best final recall sum I got on MSRVTT is 170.1, while the paper reports 172.4, and I did not modify anything in your code.
Could you please share the best parameters for your code, or suggest a solution to this problem?

When I run semantic_role_labeling.py, I get an error.

allennlp.common.checks.ConfigurationError: srl not in acceptable choices for dataset_reader.type

My predictor configuration code is:

archive = load_archive('bert-base-srl-2019.06.17')
predictor = Predictor.from_archive(archive, 'video-text classifier')

Different test runs give different scores

Hi, cshizhe, thanks for your great work.
When testing performance on the MSRVTT dataset, I found that the metrics in different test runs are the same, but the sent_scores, verb_scores and noun_scores are different. I don't know why.

Here are some outputs from one test run:
.......
tensor(-197.5491, device='cuda:0') tensor(4066.6943, device='cuda:0') tensor(4957.7461, device='cuda:0')
tensor(-172.1141, device='cuda:0') tensor(4193.5151, device='cuda:0') tensor(5157.7603, device='cuda:0')
tensor(-68.0737, device='cuda:0') tensor(1171.2622, device='cuda:0') tensor(1342.9297, device='cuda:0')
tensor(82.5919, device='cuda:0') tensor(4531.4185, device='cuda:0') tensor(5212.8369, device='cuda:0')
tensor(-43.9712, device='cuda:0') tensor(4319.0312, device='cuda:0') tensor(5150.5146, device='cuda:0')
tensor(1.5257, device='cuda:0') tensor(4386.4746, device='cuda:0') tensor(5333.5151, device='cuda:0')
tensor(-22.8292, device='cuda:0') tensor(1247.3308, device='cuda:0') tensor(1393.1257, device='cuda:0')
tensor(23.0804, device='cuda:0') tensor(1473.0065, device='cuda:0') tensor(1647.1292, device='cuda:0')
tensor(-31.6811, device='cuda:0') tensor(1406.5350, device='cuda:0') tensor(1616.0713, device='cuda:0')
tensor(-41.8293, device='cuda:0') tensor(1422.7487, device='cuda:0') tensor(1656.0972, device='cuda:0')
tensor(-10.5121, device='cuda:0') tensor(397.1695, device='cuda:0') tensor(444.0505, device='cuda:0')
ir1,ir5,ir10,imedr,imeanr,imAP,cr1,cr5,cr10,cmedr,cmeanr,cmAP,rsum
ir5-rsum,epoch.28.th,22.89,51.07,63.17,5.00,40.16,36.14,22.30,51.10,62.90,5.00,39.20,35.62,273.43

And from a different run:
........
tensor(-89.9776, device='cuda:0') tensor(4095.6599, device='cuda:0') tensor(5116.2510, device='cuda:0')
tensor(-145.8661, device='cuda:0') tensor(4161.9165, device='cuda:0') tensor(5351.6670, device='cuda:0')
tensor(-40.3292, device='cuda:0') tensor(1177.1305, device='cuda:0') tensor(1314.6021, device='cuda:0')
tensor(-58.3337, device='cuda:0') tensor(4536.5352, device='cuda:0') tensor(4928.3350, device='cuda:0')
tensor(35.2728, device='cuda:0') tensor(4343.3838, device='cuda:0') tensor(5280.2969, device='cuda:0')
tensor(2.8130, device='cuda:0') tensor(4361.0112, device='cuda:0') tensor(5508.0010, device='cuda:0')
tensor(37.5651, device='cuda:0') tensor(1243.3253, device='cuda:0') tensor(1373.3599, device='cuda:0')
tensor(-25.2279, device='cuda:0') tensor(1490.6547, device='cuda:0') tensor(1566.4670, device='cuda:0')
tensor(7.1009, device='cuda:0') tensor(1408.7480, device='cuda:0') tensor(1670.6154, device='cuda:0')
tensor(-34.9750, device='cuda:0') tensor(1403.9734, device='cuda:0') tensor(1701.3884, device='cuda:0')
tensor(-7.8403, device='cuda:0') tensor(396.0836, device='cuda:0') tensor(424.8773, device='cuda:0')
ir1,ir5,ir10,imedr,imeanr,imAP,cr1,cr5,cr10,cmedr,cmeanr,cmAP,rsum
ir5-rsum,epoch.28.th,22.89,51.07,63.17,5.00,40.16,36.14,22.30,51.10,62.90,5.00,39.20,35.62,273.43

About Predictor.from_path

When I run this code:
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz", cuda_device=opts.cuda_device)
It raises the following error:
Traceback (most recent call last):
File "./semantic_role_labeling.py", line 52, in
main()
File "./semantic_role_labeling.py", line 19, in main
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz", cuda_device=opts.cuda_device)
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/predictors/predictor.py", line 275, in from_path
load_archive(archive_path, cuda_device=cuda_device),
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/models/archival.py", line 192, in load_archive
model = Model.load(
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/models/model.py", line 398, in load
return model_class._load(config, serialization_dir, weights_file, cuda_device, opt_level)
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/models/model.py", line 295, in _load
model = Model.from_params(vocab=vocab, params=model_params)
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/common/from_params.py", line 576, in from_params
return retyped_subclass.from_params(
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/common/from_params.py", line 611, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp_models/structured_prediction/models/srl_bert.py", line 56, in init
self.bert_model = BertModel.from_pretrained(bert_model)
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 628, in from_pretrained
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
transformers == 2.9.1
allennlp==1.0.0

About other dataset

Hi, I'm very interested in your work, and I want to use other datasets such as Charades with your model. However, several required files, such as the annotations, are not available for other datasets. How can I obtain these annotations, and how do I build the role graphs? Could you provide the tools mentioned in your paper? Thank you very much for your reply.

A question about mpdata.py and rolesgraph.py in reader folder

I have a doubt about this __getitem__ function: why does it obtain only one caption per video?

def __getitem__(self, idx):
    out = {}
    if self.is_train:
        video_idx, cap_idx = self.pair_idxs[idx]
        video_name = self.video_names[video_idx]
        mp_feature = self.mp_features[video_idx]
        sent = self.captions[cap_idx]
        cap_ids, cap_len = self.process_sent(sent, self.max_words_embedding)
        out['captions_ids'] = cap_ids
        out['captions_lens'] = cap_len
    else:
        # at evaluation time only the video features are returned
        video_name = self.video_names[idx]
        mp_feature = self.mp_features[idx]

    out['names'] = video_name
    out['mp_fts'] = mp_feature

    return out
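One reading of this code (an assumption based on the pair_idxs lookup above, not the repository's exact implementation) is that during training each index addresses one (video, caption) pair, so a video with several captions simply appears once per caption. A toy reconstruction:

# toy data; the names and mapping below are illustrative, not from the repo
video_names = ['video0', 'video1']
captions = ['a cat plays', 'a cat jumps', 'a man cooks']
caps_of_video = {'video0': [0, 1], 'video1': [2]}

pair_idxs = [(v_idx, c_idx)
             for v_idx, name in enumerate(video_names)
             for c_idx in caps_of_video[name]]
print(pair_idxs)  # [(0, 0), (0, 1), (1, 2)]: N captions -> N training items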

about the frame rate

Hi, cshizhe

I find that number_of_features / video_duration differs across videos; can you tell me the temporal interval at which the visual features were extracted?

Thanks

Youtube2Text data set

Recently I read your paper Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning and saw that you used the Youtube2Text dataset. However, I could not find the video features and sentence features of the Youtube2Text dataset in the Baidu cloud link. Could you please provide a download link for the Youtube2Text data? Thank you very much!

About semantic_role_labeling.py

Hi, when I generate my own role graphs, something goes wrong. With the predictor model address https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz that you provided in semantic_role_labeling.py, I get predictor output like {'verbs': [{'verb': 'talks', 'description': 'a woman talks about a futuristic bicycle design', 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}], 'words': ['a', 'woman', 'talks', 'about', 'a', 'futuristic', 'bicycle', 'design']}, where all tags are O. So is there something wrong with the model?

I tried another model, https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz, which is used in the semantic role labeling demo at https://demo.allennlp.org/semantic-role-labeling/MjMyODEwNg==, and it works correctly; the output is {'verbs': [{'verb': 'is', 'description': 'someone [V: is] blowing a little boys face with a leaf blower', 'tags': ['O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}, {'verb': 'blowing', 'description': '[ARG0: someone] is [V: blowing] [ARG1: a little boys face] [ARGM-MNR: with a leaf blower]', 'tags': ['B-ARG0', 'O', 'B-V', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'B-ARGM-MNR', 'I-ARGM-MNR', 'I-ARGM-MNR', 'I-ARGM-MNR']}, {'verb': 'face', 'description': 'someone is blowing [ARG0: a little boys] [V: face] with a leaf blower', 'tags': ['O', 'O', 'O', 'B-ARG0', 'I-ARG0', 'I-ARG0', 'B-V', 'O', 'O', 'O', 'O']}], 'words': ['someone', 'is', 'blowing', 'a', 'little', 'boys', 'face', 'with', 'a', 'leaf', 'blower']}.

About data file

I found that some files are missing from the data downloaded from BaiduNetdisk. There are 6 files in MSRVTT/annotation/RET (int2word.npy, ref_captions.json, sent2rolegraph.augment.json, sent2srl.json and word2int.json), but some of them are not present for other datasets. For example, there are only 2 files in MSVD/annotation/RET (ref_captions.json and sent2rolegraph.augment.json).

Regarding MP features used for global matching

Hi,

How are the mean-pooled (MP) features used for global matching extracted? Are they obtained by spatio-temporal average pooling of the features from ResNet-152 pretrained on ImageNet?
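The README only says "mean pooling features"; one plausible interpretation (purely an assumption, not confirmed by the authors) is temporal averaging of the frame-level ResNet-152 features, e.g.:

import h5py
import numpy as np

names = np.load('MSRVTT/public_split/trn_names.npy', allow_pickle=True)
with h5py.File('MSRVTT/ordered_feature/SA/resnet152.pth/trn_ft.hdf5', 'r') as f:
    # average each video's (num_frames, dim_ft) features over time
    mp_fts = np.stack([np.asarray(f[str(n)]).mean(axis=0) for n in names])
print(mp_fts.shape)  # (num_fts, dim_ft), matching the MP format described above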

How to get the word embedding weights for a new Dataset?

Hi, Shizhe, thanks for the wonderful work!

For a new dataset, how can I get word2int.json, int2word.npy and word_embeds.glove42b.th?
I assume that you used a GloVe model for word embedding weight initialization.
Could you provide instructions for this?
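A generic recipe for building these files from a new dataset's ref_captions.json and the public glove.42B.300d.txt vectors could look like the sketch below. This is not the authors' actual preprocessing script, and the exact tensor layout that --resume_file expects is an assumption.

import json
from collections import Counter
import numpy as np
import torch

captions = json.load(open('annotation/RET/ref_captions.json'))
counter = Counter(w for sents in captions.values() for s in sents for w in s.lower().split())
words = ['<pad>', '<unk>'] + [w for w, c in counter.most_common() if c >= 2]

word2int = {w: i for i, w in enumerate(words)}
json.dump(word2int, open('word2int.json', 'w'))
np.save('int2word.npy', np.array(words))

# initialize an embedding matrix from GloVe 42B 300d vectors
dim = 300
embeds = np.random.uniform(-0.1, 0.1, (len(words), dim)).astype(np.float32)
with open('glove.42B.300d.txt') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        if parts[0] in word2int:
            embeds[word2int[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
torch.save(torch.from_numpy(embeds), 'word_embeds.glove42b.th')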

VATEX has no Resnet152 feature

OSError: Unable to open file (unable to open file: name = 'data/VATEX/ordered_feature/SA/resnet152.pth/trn_ft.hdf5'

Could you tell me how to use the I3D features instead?

About the dataset

Hi,
I clicked the BaiduNetdisk URL, but it shows the following message:

"The content shared via this link cannot be accessed because it may involve copyright infringement, pornographic, reactionary, or vulgar information!"
