mynlp / cst_captioning

PyTorch Implementation of Consensus-based Sequence Training for Video Captioning
cst_captioning's Introduction

Consensus-based Sequence Training for Video Captioning

Code for the video captioning methods from "Consensus-based Sequence Training for Video Captioning" (Phan, Henter, Miyao, Satoh. 2017).

Dependencies

Check out the coco-caption and cider projects into your working directory.

Data

Data can be downloaded here (643 MB). This folder contains:

  • input/msrvtt: annotated captions (note that val_videodatainfo.json is a symbolic link to train_videodatainfo.json)
  • output/feature: extracted features
  • output/model/cst_best: model file and generated captions on test videos of our best run (CIDEr 54.2)

Getting started

Extract video features

  • Extracted features of ResNet, C3D, MFCC and Category embeddings are shared in the above link

Generate metadata

make pre_process
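The pre-processing step builds the metadata (vocabulary, tokenized captions) that later stages consume. A minimal sketch of the vocabulary-building part, assuming a simple frequency threshold; the special tokens and `min_count` value are illustrative, not the repository's actual settings:

```python
from collections import Counter

def build_vocab(captions, min_count=2):
    """Count word frequencies and keep words above a threshold.
    Rare words are mapped to <UNK> at encoding time."""
    counts = Counter(w for cap in captions for w in cap.lower().split())
    vocab = ["<PAD>", "<BOS>", "<EOS>", "<UNK>"]
    vocab += sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}

captions = ["a man is cooking", "a man is singing", "dogs are running"]
vocab = build_vocab(captions, min_count=2)
# "a", "is", "man" appear twice and survive; "cooking" etc. fall back to <UNK>
```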

Pre-compute document frequency for CIDEr computation

make compute_ciderdf
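CIDEr's IDF weighting needs, for each n-gram, the number of videos whose reference captions contain it. A rough sketch of that document-frequency computation, simplified to whitespace tokenization; the actual implementation lives in the cider project and differs in detail:

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def compute_df(refs_per_video, max_n=4):
    """Document frequency: how many videos' reference sets
    contain each n-gram at least once."""
    df = defaultdict(int)
    for refs in refs_per_video:
        seen = set()
        for ref in refs:
            toks = ref.lower().split()
            for n in range(1, max_n + 1):
                seen.update(ngrams(toks, n))
        for g in seen:
            df[g] += 1
    return df

refs = [["a man is cooking", "a person cooks"],
        ["a man is singing"]]
df = compute_df(refs)
# ("a", "man", "is") occurs in both videos' references, so its df is 2
```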

Pre-compute evaluation scores (BLEU_4, CIDEr, METEOR, ROUGE_L) for each caption

make compute_evalscores

Train/Test

make train [options]
make test [options]

Please refer to the Makefile (and the opts.py file) for the set of available train/test options.

Examples

Train XE model

make train GID=0 EXP_NAME=xe FEATS="resnet c3d mfcc category" USE_RL=0 USE_CST=0 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG MAX_EPOCHS=50

Train CST_GT_None/WXE model

make train GID=0 EXP_NAME=WXE FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=1 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG MAX_EPOCHS=50

Train CST_MS_Greedy model (using greedy baseline)

make train GID=0 EXP_NAME=CST_MS_Greedy FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=0 SCB_CAPTIONS=0 USE_MIXER=1 MIXER_FROM=1 USE_EOS=1 LOGLEVEL=DEBUG MAX_EPOCHS=200 START_FROM=output/model/WXE

Train CST_MS_SCB model (using SCB baseline, where SCB is computed from GT captions)

make train GID=0 EXP_NAME=CST_MS_SCB FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=1 USE_MIXER=1 MIXER_FROM=1 SCB_BASELINE=1 SCB_CAPTIONS=20 USE_EOS=1 LOGLEVEL=DEBUG MAX_EPOCHS=200 START_FROM=output/model/WXE

Train CST_MS_SCB(*) model (using SCB baseline, where SCB is computed from model sampled captions)

make train GID=0 MODEL_TYPE=concat EXP_NAME=CST_MS_SCBSTAR FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=1 USE_MIXER=1 MIXER_FROM=1 SCB_BASELINE=2 SCB_CAPTIONS=20 USE_EOS=1 LOGLEVEL=DEBUG MAX_EPOCHS=200 START_FROM=output/model/WXE

If you want to change the input features, modify the FEATS variable in the above commands.

Reference

@article{cst_phan2017,
    author = {Sang Phan and Gustav Eje Henter and Yusuke Miyao and Shin'ichi Satoh},
    title = {Consensus-based Sequence Training for Video Captioning},
    journal = {ArXiv e-prints},
    archivePrefix = {arXiv},
    eprint = {1712.09532},
    year = {2017},
}

Todo

  • Test on Youtube2Text dataset (different number of captions per video)

Acknowledgements

  • Torch implementation of NeuralTalk2
  • PyTorch implementation of Self-critical Sequence Training for Image Captioning (SCST)
  • PyTorch Team

cst_captioning's People

Contributors: plsang

cst_captioning's Issues

Multi-GPU training support

Hi,

I am trying to use multiple GPUs on my workstation with your code, so I pass GID=0,1,2,3 in the command to start a training session. However, it seems that it is still using only one GPU.

Going through your code, I was unable to find DataParallel anywhere, so I am wondering whether the code supports multi-GPU training at all.

If not, I might be able to take a look at it.
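The observation in the issue is consistent with GID only selecting the visible device rather than parallelizing training. For reference, a minimal sketch of how multi-GPU data parallelism is typically added in PyTorch; the `nn.Linear` stand-in is illustrative, not the repository's captioning model, and the wrapper is a no-op on single-GPU or CPU machines:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 4)  # stand-in for the captioning model
if torch.cuda.device_count() > 1:
    # Splits each input batch across GPUs and gathers the outputs.
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

x = torch.randn(8, 16, device=device)
out = model(x)  # shape: (8, 4), regardless of device count
```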

WXE gets the best result

Hello, the results confused me and I cannot figure them out. CST_MS_SCB and the other full RL training methods couldn't improve on the WXE result. Has anyone else run into the same problem?

feature fusion?

Hello, I found that each video only has a single feature vector per modality (C3D, ResNet), not features for all of the chosen frames. Could you tell me how you combined them?

options

We are not able to run the code; can you help us with it?

No such file or directory: 'data/output/metadata/msrvtt_train_ciderdf.pkl.p'

Hi,

I was just trying to run with your default setting:

Train CST_GT_None/WXE model

make train GID=0 EXP_NAME=WXE FEATS="resnet c3d mfcc category" USE_RL=1 USE_CST=1 USE_MIXER=0 SCB_CAPTIONS=0 LOGLEVEL=DEBUG MAX_EPOCHS=50

Here is the error I got:

Traceback (most recent call last):
  File "train.py", line 529, in <module>
    rl_criterion=rl_criterion)
  File "train.py", line 118, in train
    'CIDEr': CiderD(df=opt.train_cached_tokens),
  File "cider/pyciderevalcap/ciderD/ciderD.py", line 25, in __init__
    self.cider_scorer = CiderScorer(n=self._n, df_mode=self._df)
  File "cider/pyciderevalcap/ciderD/ciderD_scorer.py", line 69, in __init__
    pkl_file = pickle.load(open(os.path.join('data', df_mode + '.p'),'r'))
IOError: [Errno 2] No such file or directory: 'data/output/metadata/msrvtt_train_ciderdf.pkl.p'

There seems to be an issue with the way the pickle file is loaded within CIDEr. Making the following change solved the problem.

In "cider/pyciderevalcap/ciderD/ciderD_scorer.py"

# Line #69 is wrong:
# pkl_file = pickle.load(open(os.path.join('data', df_mode + '.p'),'r'))

# It should be changed to:
pkl_file = pickle.load(open(os.path.join(df_mode),'r'))
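One caveat with the suggested fix: it still opens the file in text mode ('r'), which works for Python 2 pickles but fails under Python 3. A more portable variant of the same loading logic, sketched with a throwaway file (the helper name `load_df` is illustrative):

```python
import os
import pickle
import tempfile

def load_df(df_path):
    """Load a pre-computed CIDEr document-frequency pickle.
    Binary mode works on both Python 2 and Python 3."""
    with open(df_path, "rb") as f:
        return pickle.load(f)

# Round-trip check with a temporary file standing in for
# output/metadata/msrvtt_train_ciderdf.pkl
path = os.path.join(tempfile.mkdtemp(), "df.pkl")
with open(path, "wb") as f:
    pickle.dump({("a",): 3}, f)
df = load_df(path)  # → {("a",): 3}
```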

Can't reproduce CIDEr 54.2

I trained a WXE model and got a CIDEr score of about 50, then trained CST_MS_Greedy with your options, but the CIDEr score doesn't improve with reinforcement learning. The model you provided can't be loaded for testing either. Can you give a hint about how to use your model, or how to reproduce the CIDEr score of 54.2?

Problem with multiprocessing

Thanks for your great work, first of all.
I ran your code and found that my GPU usage is always 0% while the scores of the val captions are being calculated. I tried to use multiprocessing, which would require a num_workers argument to torch.utils.data.DataLoader, but I see that you wrote the data-loading class entirely yourself.
So is there a way to use multiprocessing?
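One route the question hints at: wrap the hand-written loader in the standard `Dataset` protocol (`__len__`/`__getitem__`), which makes it compatible with `torch.utils.data.DataLoader` and its `num_workers` option. A sketch with illustrative field names and shapes, not the repository's actual data layout:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class CaptionDataset(Dataset):
    """Adapter exposing (feature, caption) pairs through the
    standard Dataset protocol so DataLoader can batch them."""
    def __init__(self, feats, caps):
        self.feats, self.caps = feats, caps

    def __len__(self):
        return len(self.feats)

    def __getitem__(self, i):
        return self.feats[i], self.caps[i]

ds = CaptionDataset(torch.randn(10, 2048), torch.randint(0, 100, (10, 20)))
# num_workers > 0 would load batches in background processes;
# 0 keeps loading in-process for this small demo.
loader = DataLoader(ds, batch_size=4, num_workers=0, shuffle=True)
feats, caps = next(iter(loader))  # feats: (4, 2048), caps: (4, 20)
```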

Mean pooling for ResNet features?

Hi, I was wondering whether you used mean pooling to blend the per-frame ResNet features into a single 2048-D vector representing the video chunk. If not, can you describe how you merged features across the frames of each clip?
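For reference, mean pooling frame-level features into one clip-level vector is a single reduction over the frame axis. A sketch with NumPy, assuming the standard 2048-D ResNet pool5 features (the frame count here is arbitrary; how the repository actually aggregates frames is exactly what the question asks):

```python
import numpy as np

# One 2048-D ResNet vector per sampled frame of a clip
frame_feats = np.random.rand(26, 2048)

# Average across the frame axis to get a single clip descriptor
clip_feat = frame_feats.mean(axis=0)  # shape: (2048,)
```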

symbolic link in val_videodatainfo.json

Hi

Thanks for sharing this amazing git repo for your paper.

Regarding the symbolic link in val_videodatainfo.json

It seems that there is an issue when viewing/downloading this particular file from your Google Drive.

This issue is preventing me from correctly generating metadata using the command below.

make pre_process

Could you kindly double-check the file? Thanks!
