cshizhe / hgr_v2t

Code accompanying the paper "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning".

License: MIT License


hgr_v2t's Introduction

Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning

This repository contains the PyTorch implementation of our paper Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning (CVPR 2020).

Overview of HGR Model

Prerequisites

Python 3 and PyTorch 1.3.

# clone the repository
git clone git@github.com:cshizhe/hgr_v2t.git
cd hgr_v2t
export PYTHONPATH=$(pwd):${PYTHONPATH}

Datasets

We provide annotations and pretrained features for the MSRVTT, TGIF, VATEX and Youtube2Text video captioning datasets, which can be downloaded from BaiduNetdisk (code: vxpi).

Annotations

  • groundtruth: annotation/RET directory
  1. ref_captions.json: dict, {videoname: [sent]}
  2. sent2rolegraph.augment.json: {sent: (graph_nodes, graph_edges)}
  • vocabularies: annotation/RET directory
  1. int2word.npy: [word]
  2. word2int.json: {word: int}

  • data splits: public_split directory
    trn_names.npy, val_names.npy, tst_names.npy
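A minimal sketch of loading these annotation files (the MSRVTT/ path prefix and variable names are illustrative assumptions; adapt them to wherever you unpacked the download):

import json
import numpy as np

anno_dir = 'MSRVTT/annotation/RET'    # assumed unpack location
split_dir = 'MSRVTT/public_split'

# groundtruth captions: {videoname: [sent]}
ref_captions = json.load(open('%s/ref_captions.json' % anno_dir))
# role graphs: {sent: (graph_nodes, graph_edges)}
sent2rolegraph = json.load(open('%s/sent2rolegraph.augment.json' % anno_dir))

# vocabularies
int2word = np.load('%s/int2word.npy' % anno_dir, allow_pickle=True)   # [word]
word2int = json.load(open('%s/word2int.json' % anno_dir))             # {word: int}

# data splits: arrays of video names
trn_names = np.load('%s/trn_names.npy' % split_dir, allow_pickle=True)
val_names = np.load('%s/val_names.npy' % split_dir, allow_pickle=True)
tst_names = np.load('%s/tst_names.npy' % split_dir, allow_pickle=True)

print(len(ref_captions), len(word2int), len(trn_names))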

Features

For the MSRVTT, TGIF and Youtube2Text datasets, we extract features with ResNet-152 pretrained on ImageNet. For the VATEX dataset, we use the I3D features released by the VATEX challenge organizers.

  • mean pooling features: ordered_feature/MP directory

format: numpy array, shape=(num_fts, dim_ft); rows follow the order of the names in the data split files

  • frame-level features: ordered_feature/SA directory

format: hdf5 file, {name: ft}, ft.shape=(num_frames, dim_ft)
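A minimal sketch of reading both feature formats (the exact MP file name below is an assumption; the SA path mirrors the resnet152.pth/trn_ft.hdf5 layout mentioned in the issues further down):

import h5py
import numpy as np

# mean-pooling features: one (num_fts, dim_ft) array per split,
# row order follows the corresponding *_names.npy file
mp_fts = np.load('MSRVTT/ordered_feature/MP/resnet152.pth/trn_ft.npy')

# frame-level features: hdf5 file, {name: ft}, ft.shape=(num_frames, dim_ft)
with h5py.File('MSRVTT/ordered_feature/SA/resnet152.pth/trn_ft.hdf5', 'r') as f:
    name = list(f.keys())[0]
    frame_fts = f[name][...]
    print(name, frame_fts.shape)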

Fine-grained Binary Selection Annotation

We construct the fine-grained binary selection dataset based on the test set of the Youtube2Text dataset. The annotations are in the Youtube2Text/annotation/binary_selection directory.

Training & Inference

Semantic Graph Construction

We provide the constructed role graph annotations. If you want to generate role graphs for a new dataset, please follow the instructions below.

  1. semantic role labeling:
python misc/semantic_role_labeling.py ref_caption_file out_file --cuda_device 0
  2. convert sentences into role graphs:
cd misc
jupyter notebook
# open parse_sent_to_role_graph.ipynb
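For reference, step 1 loads the AllenNLP SRL predictor; a rough sketch of that call (the model URL is the one used by misc/semantic_role_labeling.py, the other arguments are illustrative) is:

from allennlp.predictors.predictor import Predictor

# pretrained BERT SRL model used by the semantic role labeling step
predictor = Predictor.from_path(
    'https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz',
    cuda_device=0)

result = predictor.predict(sentence='a woman talks about a futuristic bicycle design')
# result['verbs'] lists one entry per predicate with BIO tags
# (B-V, B-ARG0, B-ARG1, ...) over result['words']
for verb in result['verbs']:
    print(verb['verb'], verb['tags'])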

Training and Evaluation

  1. The baseline VSE++ model:
cd t2vretrieval/driver

# setup config files
# you should modify data paths in configs/prepare_globalmatch_configs.py
python configs/prepare_globalmatch_configs.py $datadir
resdir='' # copy the output string of the previous step

# training
python global_match.py $resdir/model.json $resdir/path.json --is_train --resume_file $resdir/../../word_embeds.glove42b.th

# inference
python global_match.py $resdir/model.json $resdir/path.json --eval_set tst
  2. Our HGR model:
cd t2vretrieval/driver

# setup config files
# you should modify data paths in configs/prepare_mlmatch_configs.py
python configs/prepare_mlmatch_configs.py $datadir
resdir='' # copy the output string of the previous step

# training
python multilevel_match.py $resdir/model.json $resdir/path.json --load_video_first --is_train --resume_file $resdir/../../word_embeds.glove42b.th

# inference
python multilevel_match.py $resdir/model.json $resdir/path.json --load_video_first --eval_set tst

Citations

If you use this code as part of any published research, we'd really appreciate it if you could cite the following paper:

@inproceedings{chen2020fine,
  title={Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning},
  author={Chen, Shizhe and Zhao, Yida and Jin, Qin and Wu, Qi},
  booktitle={CVPR},
  year={2020}
}

License

MIT License

hgr_v2t's People

Contributors

cshizhe


hgr_v2t's Issues

about visualizing examples

Hello, thanks for your great work. I'm very interested in visualizing the examples. How can I visualize the retrieved videos? Could you please upload the code?

Can you provide the split information of Vatex dataset?

I find that the VATEX dataset you used in HGR is VATEX v1.0, which does not provide annotations for the test set.
You then randomly split the validation set into two equal parts, with 1,500 videos as the validation set and the other 1,500 videos as the test set.
I want to follow your dataset partitioning, but I cannot find any split information in this repo.
Could you please provide the csv or json files of the VATEX dataset that contain the partition information?

Question about inconsistent results with the other papers

Hi cshizhe.
In your paper, the video-to-text retrieval results of all methods on TGIF are much lower than the results reported in the PVSE paper.
Because there is no description of this result, I cannot understand the discrepancy.
Can you explain it?
I could train your code on TGIF and get the result myself, but I think asking you is more reliable.

Thank you in advance.

Can not find "word_embeds.glove42b.th"

Hi Shizhe,

Thanks for your great work! I noticed that the training script needs to load a pretrained file:

--resume_file $resdir/../../word_embeds.glove42b.th

Is this used to initialize the text embedding module?

Besides, I cannot find this file in "MSRVTT/results/RET.released/"; I can only find "MSRVTT/results/RET/word_embeds.glove32b.th". Is there any difference between word_embeds.glove42b.th and word_embeds.glove32b.th? Could you please share "word_embeds.glove42b.th"?

captioning my own video

Hi~ thanks for your nice work~
I want to caption a self-captured video. Could you please give some detailed instructions on how to adapt the pretrained model provided in the code to this task, for example the feature extraction method, the feature data format, and how to visualize the final result? Thanks a lot!

Download dataset without Baidu account

Can you provide the datasets on another platform such as Google Drive or Dropbox? Downloading from Baidu requires an account, and I am not from China and do not have a Chinese phone number.
Thank you.

Questions about the MSR-VTT dataset

Hi, Shizhe, thanks for your great work. I downloaded the MSR-VTT dataset you provided and have a question: I found that not every video corresponds to 20 captions; some videos have fewer than 20 captions. Did you specifically select these captions, and if so, how did you choose them?

About training time

Thanks for your great work!

I have a question: how long does it take to train your model on each of the three datasets?

Also, the BaiduNetdisk link is empty.

About the recurrence of paper results

Thank you for your great code! After running your code on my server several times, I am surprised to find that I cannot reproduce the result in your paper. The best final recall sum I got on MSRVTT is 170.1, while the paper reports 172.4, and I did not modify anything in your code.
Could you please share the best parameters for your code, or suggest a solution to this problem?

When I run semantic_role_labeling.py, I get an error.

allennlp.common.checks.ConfigurationError: srl not in acceptable choices for dataset_reader.type

My predictor configuration code is:

archive = load_archive('bert-base-srl-2019.06.17')
predictor = Predictor.from_archive(archive, 'video-text classifier')

Different test runs give different scores

Hi, cshizhe, thanks for your great work.
When testing performance on the MSRVTT dataset, I found that the metrics in different test runs are the same, but the sent_scores, verb_scores and noun_scores are different. I don't know why.

Here are some outputs from one test run:
.......
tensor(-197.5491, device='cuda:0') tensor(4066.6943, device='cuda:0') tensor(4957.7461, device='cuda:0')
tensor(-172.1141, device='cuda:0') tensor(4193.5151, device='cuda:0') tensor(5157.7603, device='cuda:0')
tensor(-68.0737, device='cuda:0') tensor(1171.2622, device='cuda:0') tensor(1342.9297, device='cuda:0')
tensor(82.5919, device='cuda:0') tensor(4531.4185, device='cuda:0') tensor(5212.8369, device='cuda:0')
tensor(-43.9712, device='cuda:0') tensor(4319.0312, device='cuda:0') tensor(5150.5146, device='cuda:0')
tensor(1.5257, device='cuda:0') tensor(4386.4746, device='cuda:0') tensor(5333.5151, device='cuda:0')
tensor(-22.8292, device='cuda:0') tensor(1247.3308, device='cuda:0') tensor(1393.1257, device='cuda:0')
tensor(23.0804, device='cuda:0') tensor(1473.0065, device='cuda:0') tensor(1647.1292, device='cuda:0')
tensor(-31.6811, device='cuda:0') tensor(1406.5350, device='cuda:0') tensor(1616.0713, device='cuda:0')
tensor(-41.8293, device='cuda:0') tensor(1422.7487, device='cuda:0') tensor(1656.0972, device='cuda:0')
tensor(-10.5121, device='cuda:0') tensor(397.1695, device='cuda:0') tensor(444.0505, device='cuda:0')
ir1,ir5,ir10,imedr,imeanr,imAP,cr1,cr5,cr10,cmedr,cmeanr,cmAP,rsum
ir5-rsum,epoch.28.th,22.89,51.07,63.17,5.00,40.16,36.14,22.30,51.10,62.90,5.00,39.20,35.62,273.43

And from a different run:
........
tensor(-89.9776, device='cuda:0') tensor(4095.6599, device='cuda:0') tensor(5116.2510, device='cuda:0')
tensor(-145.8661, device='cuda:0') tensor(4161.9165, device='cuda:0') tensor(5351.6670, device='cuda:0')
tensor(-40.3292, device='cuda:0') tensor(1177.1305, device='cuda:0') tensor(1314.6021, device='cuda:0')
tensor(-58.3337, device='cuda:0') tensor(4536.5352, device='cuda:0') tensor(4928.3350, device='cuda:0')
tensor(35.2728, device='cuda:0') tensor(4343.3838, device='cuda:0') tensor(5280.2969, device='cuda:0')
tensor(2.8130, device='cuda:0') tensor(4361.0112, device='cuda:0') tensor(5508.0010, device='cuda:0')
tensor(37.5651, device='cuda:0') tensor(1243.3253, device='cuda:0') tensor(1373.3599, device='cuda:0')
tensor(-25.2279, device='cuda:0') tensor(1490.6547, device='cuda:0') tensor(1566.4670, device='cuda:0')
tensor(7.1009, device='cuda:0') tensor(1408.7480, device='cuda:0') tensor(1670.6154, device='cuda:0')
tensor(-34.9750, device='cuda:0') tensor(1403.9734, device='cuda:0') tensor(1701.3884, device='cuda:0')
tensor(-7.8403, device='cuda:0') tensor(396.0836, device='cuda:0') tensor(424.8773, device='cuda:0')
ir1,ir5,ir10,imedr,imeanr,imAP,cr1,cr5,cr10,cmedr,cmeanr,cmAP,rsum
ir5-rsum,epoch.28.th,22.89,51.07,63.17,5.00,40.16,36.14,22.30,51.10,62.90,5.00,39.20,35.62,273.43

About Predictor.from_path

When I run this code:
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz", cuda_device=opts.cuda_device)
It raises the following error:
Traceback (most recent call last):
File "./semantic_role_labeling.py", line 52, in
main()
File "./semantic_role_labeling.py", line 19, in main
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz", cuda_device=opts.cuda_device)
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/predictors/predictor.py", line 275, in from_path
load_archive(archive_path, cuda_device=cuda_device),
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/models/archival.py", line 192, in load_archive
model = Model.load(
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/models/model.py", line 398, in load
return model_class._load(config, serialization_dir, weights_file, cuda_device, opt_level)
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/models/model.py", line 295, in _load
model = Model.from_params(vocab=vocab, params=model_params)
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/common/from_params.py", line 576, in from_params
return retyped_subclass.from_params(
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp/common/from_params.py", line 611, in from_params
return constructor_to_call(**kwargs) # type: ignore
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/allennlp_models/structured_prediction/models/srl_bert.py", line 56, in init
self.bert_model = BertModel.from_pretrained(bert_model)
File "/usr/local/miniconda3/envs/myenv/lib/python3.8/site-packages/transformers/modeling_utils.py", line 628, in from_pretrained
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
transformers == 2.9.1
allennlp==1.0.0

About other dataset

Hi, I'm very interested in your work, and I want to use other datasets such as Charades with your model. However, several required files, such as the annotations, are not available for other datasets. How can I obtain these annotations, and how do I build the role graphs? Could you provide the tools mentioned in your paper? Thank you very much for your reply.

A question about mpdata.py and rolesgraph.py in reader folder

I have a doubt about this __getitem__ function: why does it obtain only one caption per video?

def __getitem__(self, idx):
    out = {}
    if self.is_train:
        video_idx, cap_idx = self.pair_idxs[idx]
        video_name = self.video_names[video_idx]
        mp_feature = self.mp_features[video_idx]
        sent = self.captions[cap_idx]
        cap_ids, cap_len = self.process_sent(sent, self.max_words_embedding)
        out['captions_ids'] = cap_ids
        out['captions_lens'] = cap_len
    else:
        # at evaluation time only the video features are returned
        video_name = self.video_names[idx]
        mp_feature = self.mp_features[idx]

    out['names'] = video_name
    out['mp_fts'] = mp_feature

    return out
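One reading of this code (an assumption based on the pair_idxs lookup above, not the repository's exact implementation) is that during training each index addresses one (video, caption) pair, so a video with several captions simply appears once per caption. A toy reconstruction:

# toy data; the names and mapping below are illustrative, not from the repo
video_names = ['video0', 'video1']
captions = ['a cat plays', 'a cat jumps', 'a man cooks']
caps_of_video = {'video0': [0, 1], 'video1': [2]}

pair_idxs = [(v_idx, c_idx)
             for v_idx, name in enumerate(video_names)
             for c_idx in caps_of_video[name]]
print(pair_idxs)  # [(0, 0), (0, 1), (1, 2)]: N captions -> N training items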

about the frame rate

Hi, cshizhe

I find that number_of_features / video_duration differs across videos; can you tell me the temporal interval at which the visual features were extracted?

Thanks

Youtube2Text data set

Recently I read your paper Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning and saw that you used the Youtube2Text dataset. However, I could not find the video features and sentence features of the Youtube2Text dataset in the Baidu cloud link. Could you please provide a download link for the Youtube2Text data? Thank you very much!

About semantic_role_labeling.py

Hi, when I generate my own role graphs, something goes wrong. With the predictor model address https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz that you provided in semantic_role_labeling.py, I get predictor output like {'verbs': [{'verb': 'talks', 'description': 'a woman talks about a futuristic bicycle design', 'tags': ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}], 'words': ['a', 'woman', 'talks', 'about', 'a', 'futuristic', 'bicycle', 'design']}, where all tags are O. So is there something wrong with the model?

I tried another model, https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.03.24.tar.gz, which is used in the semantic role labeling demo at https://demo.allennlp.org/semantic-role-labeling/MjMyODEwNg==, and it works correctly; the output is {'verbs': [{'verb': 'is', 'description': 'someone [V: is] blowing a little boys face with a leaf blower', 'tags': ['O', 'B-V', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']}, {'verb': 'blowing', 'description': '[ARG0: someone] is [V: blowing] [ARG1: a little boys face] [ARGM-MNR: with a leaf blower]', 'tags': ['B-ARG0', 'O', 'B-V', 'B-ARG1', 'I-ARG1', 'I-ARG1', 'I-ARG1', 'B-ARGM-MNR', 'I-ARGM-MNR', 'I-ARGM-MNR', 'I-ARGM-MNR']}, {'verb': 'face', 'description': 'someone is blowing [ARG0: a little boys] [V: face] with a leaf blower', 'tags': ['O', 'O', 'O', 'B-ARG0', 'I-ARG0', 'I-ARG0', 'B-V', 'O', 'O', 'O', 'O']}], 'words': ['someone', 'is', 'blowing', 'a', 'little', 'boys', 'face', 'with', 'a', 'leaf', 'blower']}.

About data file

I found that some files are missing from the data downloaded from BaiduNetdisk. There are 6 files in MSRVTT/annotation/RET (int2word.npy, ref_captions.json, sent2rolegraph.augment.json, sent2srl.json and word2int.json), but some of them are not present for other datasets. For example, there are only 2 files in MSVD/annotation/RET (ref_captions.json and sent2rolegraph.augment.json).

Regarding MP features used for global matching

Hi,

How are the mean-pooled (MP) features used for global matching extracted? Are they obtained by spatio-temporal average pooling of the features from ResNet-152 pretrained on ImageNet?
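The README only says "mean pooling features"; one plausible interpretation (purely an assumption, not confirmed by the authors) is temporal averaging of the frame-level ResNet-152 features, e.g.:

import h5py
import numpy as np

names = np.load('MSRVTT/public_split/trn_names.npy', allow_pickle=True)
with h5py.File('MSRVTT/ordered_feature/SA/resnet152.pth/trn_ft.hdf5', 'r') as f:
    # average each video's (num_frames, dim_ft) features over time
    mp_fts = np.stack([np.asarray(f[str(n)]).mean(axis=0) for n in names])
print(mp_fts.shape)  # (num_fts, dim_ft), matching the MP format described above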

How to get the word embedding weights for a new Dataset?

Hi, Shizhe, thanks for the wonderful work!

For a new dataset, how can I get word2int.json, int2word.npy and word_embeds.glove42b.th?
I assume that you used a GloVe model for word embedding weight initialization.
Could you provide instructions for this?
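A generic recipe for building these files from a new dataset's ref_captions.json and the public glove.42B.300d.txt vectors could look like the sketch below. This is not the authors' actual preprocessing script, and the exact tensor layout that --resume_file expects is an assumption.

import json
from collections import Counter
import numpy as np
import torch

captions = json.load(open('annotation/RET/ref_captions.json'))
counter = Counter(w for sents in captions.values() for s in sents for w in s.lower().split())
words = ['<pad>', '<unk>'] + [w for w, c in counter.most_common() if c >= 2]

word2int = {w: i for i, w in enumerate(words)}
json.dump(word2int, open('word2int.json', 'w'))
np.save('int2word.npy', np.array(words))

# initialize an embedding matrix from GloVe 42B 300d vectors
dim = 300
embeds = np.random.uniform(-0.1, 0.1, (len(words), dim)).astype(np.float32)
with open('glove.42B.300d.txt') as f:
    for line in f:
        parts = line.rstrip().split(' ')
        if parts[0] in word2int:
            embeds[word2int[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)
torch.save(torch.from_numpy(embeds), 'word_embeds.glove42b.th')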

VATEX has no Resnet152 feature

OSError: Unable to open file (unable to open file: name = 'data/VATEX/ordered_feature/SA/resnet152.pth/trn_ft.hdf5'

Could you tell me how to use the I3D features instead?

About the dataset

Hi,
I clicked the BaiduNetdisk URL, but it shows the following message:

"The content shared via this link cannot be accessed because it may involve copyright infringement, pornographic, reactionary, or vulgar information!"
