
License: MIT License


Attentive Visual Semantic Specialized Network for Video Captioning


This repository contains the source code for the paper Attentive Visual Semantic Specialized Network for Video Captioning. In this paper, we present a new architecture, the Attentive Visual Semantic Specialized Network (AVSSN), an encoder-decoder model based on our Adaptive Attention Gate and Specialized LSTM layers. This architecture can selectively decide when to use visual or semantic information in the text-generation process. The adaptive gate enables the decoder to automatically select the relevant information, providing a better temporal state representation than existing decoders. We evaluate the effectiveness of the proposed approach on the Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT) datasets, achieving state-of-the-art performance on several popular evaluation metrics: BLEU-4, METEOR, CIDEr, and ROUGE-L.
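The core idea of the adaptive gate can be sketched as a learned scalar that mixes visual and semantic context vectors. The snippet below is an illustrative NumPy sketch only, with made-up dimensions and random weights; the paper's exact formulation may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gate(h, visual, semantic, W, b):
    """Mix visual and semantic context vectors via a learned scalar
    gate beta in (0, 1). Illustrative sketch, not the paper's exact
    formulation."""
    z = np.concatenate([h, visual, semantic])
    beta = sigmoid(W @ z + b)  # scalar: how much visual information to use
    return beta * visual + (1.0 - beta) * semantic

# Toy dimensions: decoder state of size 4, context vectors of size 6
rng = np.random.default_rng(0)
h = rng.standard_normal(4)
v = rng.standard_normal(6)
s = rng.standard_normal(6)
W = rng.standard_normal(4 + 6 + 6)
ctx = adaptive_gate(h, v, s, W, 0.0)
print(ctx.shape)  # (6,)
```

The mixed context `ctx` would then feed the decoder's next step, so the model leans on visual or semantic cues as the gate dictates.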

Table of Contents

  1. Model
  2. Requirements
  3. Manual
  4. Qualitative Results
  5. Quantitative Results
  6. Citation

Model

Figure: the proposed Attentive Visual Semantic Specialized Network (AVSSN) and the Adaptive Attention Gate.

Requirements

  1. Python 3.6
  2. PyTorch 1.2.0
  3. NumPy

Manual

Download code

git clone --recursive https://github.com/jssprz/attentive_specialized_network_video_captioning.git

Download Data

mkdir -p data/MSVD && wget -i msvd_data.txt -P data/MSVD
mkdir -p data/MSR-VTT && wget -i msrvtt_data.txt -P data/MSR-VTT

To extract your own visual feature representations, we provide the visual-feature-extractor package.

Training

If you want to train your own models, you can reuse the datasets' information, stored and tokenized in the corpus.pkl files. To construct these files, you can use the scripts we provide in the video_captioning_dataset module. The content of these files is organized as follows:

0: train_data: captions and indexes of training videos, in the format [corpus_widxs, vidxs], where:

  • corpus_widxs is a list of lists with the indexes of words in the vocabulary
  • vidxs is a list of indexes of video features in the features file

1: val_data: same format as train_data.

2: test_data: same format as train_data.

3: vocabulary: in the format {'word': count}.

4: idx2word: the vocabulary in the format {idx: 'word'}.

5: word_embeddings: the vectors of each word. The i-th row is the word vector of the i-th word in the vocabulary.

We use the val_references.txt and test_references.txt files only for computing the evaluation metrics.
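As an illustration of the corpus.pkl layout described above, the following toy example (made-up data and a hypothetical caption_text helper, not code from this repository) shows how idx2word maps the word indexes in corpus_widxs back to plain-text captions:

```python
# Toy instance of the corpus.pkl layout (illustrative data only)
idx2word = {0: 'a', 1: 'man', 2: 'is', 3: 'cooking'}
vocabulary = {'a': 120, 'man': 80, 'is': 150, 'cooking': 12}
corpus_widxs = [[0, 1, 2, 3]]   # one caption, as vocabulary indexes
vidxs = [7]                     # caption 0 describes video-feature row 7
train_data = [corpus_widxs, vidxs]

def caption_text(widxs, idx2word):
    """Map a list of word indexes back to a plain-text caption."""
    return ' '.join(idx2word[i] for i in widxs)

print(caption_text(train_data[0][0], idx2word))  # a man is cooking
```

The real files are unpickled with the standard pickle module; the six entries appear in the order listed above (train_data, val_data, test_data, vocabulary, idx2word, word_embeddings).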

Testing

1. Download pre-trained models (at epoch 15)

wget https://s06.imfd.cl/04/github-data/AVSSN/MSVD/captioning_chkpt_15.pt -P pretrain/MSVD
wget https://s06.imfd.cl/04/github-data/AVSSN/MSR-VTT/captioning_chkpt_15.pt -P pretrain/MSR-VTT

2. Generate captions for test samples

python test.py -chckpt pretrain/MSVD/captioning_chkpt_15.pt -data data/MSVD/ -out results/MSVD/
python test.py -chckpt pretrain/MSR-VTT/captioning_chkpt_15.pt -data data/MSR-VTT/ -out results/MSR-VTT/

3. Metrics

python evaluate.py -gen results/MSVD/predictions.txt -ref data/MSVD/test_references.txt
python evaluate.py -gen results/MSR-VTT/predictions.txt -ref data/MSR-VTT/test_references.txt
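To see what evaluate.py conceptually measures, here is a toy clipped unigram precision (BLEU-1 without the brevity penalty) comparing one generated caption against one reference. This is a simplified stand-in, not the script's actual implementation, which computes BLEU-4, METEOR, CIDEr, and ROUGE-L over all references:

```python
from collections import Counter

def bleu1(candidate, reference):
    """Clipped unigram precision: each candidate word counts at most
    as many times as it appears in the reference."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    clipped = sum(min(n, ref[w]) for w, n in cand.items())
    return clipped / max(sum(cand.values()), 1)

print(bleu1('a man is cooking food', 'a man is cooking'))  # 0.8
```

Here 4 of the 5 candidate words are matched in the reference, giving 0.8; the full metrics extend this idea to longer n-grams, synonym matching, and consensus across multiple references.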

Qualitative Results

Figure: qualitative results.

Quantitative Results

| Dataset | Epoch | BLEU-4 | METEOR | CIDEr | ROUGE-L |
|---------|-------|--------|--------|-------|---------|
| MSVD    | 100   | 62.3   | 39.2   | 107.7 | 78.3    |
| MSR-VTT | 60    | 45.5   | 31.4   | 50.6  | 64.3    |

Citation

@inproceedings{PerezMartin2020AttentiveCaptioning,
  title={Attentive Visual Semantic Specialized Network for Video Captioning},
  author={Jesus Perez-Martin and Benjamin Bustos and Jorge Pérez},
  booktitle={25th International Conference on Pattern Recognition},
  year={2020}
}

attentive_specialized_network_video_captioning's People

Contributors: jssprz

attentive_specialized_network_video_captioning's Issues

Training code

Please provide the training code. You had mentioned that you would be providing the training code after Nov 16. It's been a month with no reply from you regarding the training code.

I request you to provide the training code.
Waiting for your reply.
Thank you

training

Can you tell me how to train the model for more epochs?

Results from pretrained models don't match paper

Thanks for your work on this project!

I followed the instructions in the readme to get your code running, and I wasn't able to reproduce the results from the paper:

MSVD:
RESULTS: Bleu_1: 0.858 Bleu_2: 0.756 Bleu_3: 0.665 Bleu_4: 0.573 METEOR: 0.385 ROUGE_L: 0.749 CIDEr: 0.992
Expected: Bleu_4: 62.3 METEOR: 39.2 CIDEr: 107.7 ROUGE_L: 78.3

MSR-VTT:
RESULTS: Bleu_1: 0.812 Bleu_2: 0.679 Bleu_3: 0.547 Bleu_4: 0.428 METEOR: 0.288 ROUGE_L: 0.617 CIDEr: 0.469
Expected: Bleu_4: 45.5 METEOR: 31.4 CIDEr: 50.6 ROUGE_L: 64.3

I noticed that these are epoch-15 checkpoints, but in the paper the models were trained for ~70 epochs. Are you willing to make the final models available, or the code infrastructure for training a new model?

parameters

Which are the tunable parameters in this model?

Problems with downloading files

Hello, thank you very much for the code you provided, but I found that I could not download these two files:
MSR-VTT/corpus.pkl
MSR-VTT/captioning_chkpt_15.pt
