- Python 2.7
- PyTorch 0.2
- Microsoft COCO Caption Evaluation
- CIDEr
- torch, torchvision, numpy, scikit-image, nltk, h5py, pandas, future
- tensorboard_logger -- for using TensorBoard to view the training loss
(Check out the coco-caption and cider projects into your working directory.)
- VGG16 pretrained on ImageNet [PyTorch version]: https://download.pytorch.org/models/vgg16-397923af.pth
- ResNet-101 pretrained on ImageNet [PyTorch version]: https://github.com/ruotianluo/pytorch-resnet
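For reference, the VGG16 checkpoint above can be loaded with torchvision for frame-level feature extraction. Below is a minimal sketch, assuming the checkpoint sits in the working directory; it is written against a modern PyTorch API (the repo itself pins PyTorch 0.2, where inputs would be wrapped in `Variable`s instead):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Build the architecture and load the downloaded ImageNet weights.
# The path is an assumption -- adjust to wherever you saved the checkpoint.
vgg = models.vgg16()
vgg.load_state_dict(torch.load('vgg16-397923af.pth'))

# Drop the final classification layer to expose 4096-d fc7 activations,
# a common choice for frame-level video features.
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

with torch.no_grad():
    frames = torch.randn(8, 3, 224, 224)  # dummy batch of 8 frames
    feats = vgg(frames)                   # shape: (8, 4096)
```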
Obtain the dataset(s) you need:
- MSVD: https://www.microsoft.com/en-us/download/details.aspx?id=52422
- MSR-VTT: http://ms-multimedia-challenge.com/2017/dataset
- Flickr30k: flickr30k.tar.gz, flickr30k-images.tar
Generate metadata
- run `func_standalize_format`
- run `func_preprocess_datainfo`
- run `func_build_vocab`
- run `func_create_sequencelabel`
- run `func_convert_datainfo2cocofmt`
- run `func_compute_ciderdf`  # pre-compute document frequency for the CIDEr computation (see the sketch after this list)
- run `func_compute_evalscores`  # pre-compute evaluation scores (BLEU_4, CIDEr, METEOR, ROUGE_L) for each caption
- run `func_extract_video_features`  # extract video features
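For intuition, the CIDEr document-frequency step precomputes the corpus statistics CIDEr needs: CIDEr is TF-IDF-weighted n-gram matching, so the document frequency of every n-gram (n = 1..4) over the reference captions is tabulated once up front. Here is a rough, self-contained Python sketch of that computation; the function names and input format are illustrative, not the repo's actual code:

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def compute_cider_df(captions_per_video):
    """Document frequency for CIDEr: for each n-gram (n = 1..4), count the
    number of 'documents' (videos) whose reference captions contain it.

    captions_per_video: dict mapping video id -> list of tokenized captions.
    """
    df = defaultdict(float)
    for caps in captions_per_video.values():
        seen = set()
        for cap in caps:
            for n in range(1, 5):
                seen.update(ngrams(cap, n))
        for gram in seen:  # each video counts at most once per n-gram
            df[gram] += 1.0
    return df

# Illustrative usage (the real pipeline reads the preprocessed metadata):
df = compute_cider_df({
    'video1': [['a', 'man', 'is', 'cooking'], ['a', 'person', 'cooks']],
    'video2': [['a', 'dog', 'runs']],
})
```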
Please refer to the opts.py file for the full set of available train/test options.
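opts.py is the single source of truth for these flags; the sketch below only illustrates the argparse pattern such a file typically follows. All option names here are hypothetical placeholders, not the repo's actual flags:

```python
import argparse

def parse_opts():
    parser = argparse.ArgumentParser()
    # Hypothetical placeholders -- consult opts.py for the real
    # train/test options and their defaults.
    parser.add_argument('--model_file', type=str, default='model.pth',
                        help='where to save/load the model checkpoint')
    parser.add_argument('--learning_rate', type=float, default=1e-4)
    parser.add_argument('--batch_size', type=int, default=64)
    parser.add_argument('--max_epochs', type=int, default=50)
    return parser.parse_args()

if __name__ == '__main__':
    print(parse_opts())
```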
# Train XE model
./train.sh 0 [GPUIDs]
# Train CST_GT_None/WXE model
./train.sh 1 [GPUIDs]
# Train CST_MS_Greedy model (using greedy baseline)
./train.sh 2 [GPUIDs]
# Train CST_MS_SCB model (using SCB baseline, where SCB is computed from GT captions)
./train.sh 3 [GPUIDs]
# Train CST_MS_SCB(*) model (using SCB baseline, where SCB is computed from model-sampled captions)
./train.sh 4 [GPUIDs]
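All of the MS variants above optimize the same REINFORCE-style objective and differ only in the baseline subtracted from the sampled caption's CIDEr reward: the greedy decode's score for CST_MS_Greedy, or the self-consensus baseline (SCB) for the SCB variants. A minimal sketch of that loss follows, written against a modern PyTorch API for brevity; the function name and tensor interface are assumptions, not this repo's code:

```python
import torch

def self_critical_loss(log_probs, sample_reward, baseline_reward):
    """REINFORCE with a baseline, as used in SCST/CST-style training.

    log_probs:       (batch, seq_len) log-probs of the sampled words
    sample_reward:   (batch,) CIDEr of each sampled caption vs. references
    baseline_reward: (batch,) the baseline -- the greedy caption's CIDEr
                     (CST_MS_Greedy) or the consensus score SCB computed
                     from GT or model-sampled captions (SCB variants)
    """
    advantage = sample_reward - baseline_reward  # (batch,)
    # Maximize expected reward == minimize -(advantage * log-likelihood).
    # A real implementation would also mask out padded positions.
    return -(advantage.unsqueeze(1) * log_probs).sum(1).mean()

# Illustrative usage with dummy numbers:
p = torch.rand(2, 5, requires_grad=True)  # stand-in word probabilities
loss = self_critical_loss(torch.log(p),
                          torch.tensor([0.8, 0.3]),
                          torch.tensor([0.5, 0.5]))
loss.backward()
```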
# Test a trained model
./test.sh 0 [GPUIDs]
- Torch implementation of NeuralTalk2
- PyTorch implementation of Self-critical Sequence Training for Image Captioning (SCST)
- PyTorch Team
- "Consensus-based Sequence Training for Video Captioning" (Phan, Henter, Miyao, Satoh. 2017).