The code for our CVPR 2019 paper,
Dense Relational Captioning: Triple-Stream Networks for Relationship-Based Captioning.
By Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, and In So Kweon.
Links: arXiv, Dataset, Pre-trained model.
We introduce “relational captioning,” a novel image captioning task that aims to generate multiple captions with respect to relational information between objects in an image. The figure below compares our framework with previous frameworks.

(28/08/2019)
- Our code has been updated from an evaluation-only to a trainable version.
- Backpropagation code has been added to several functions.

(06/09/2019)
- Fixed a bug in the UnionSlicer code.
- Added eval_utils_mAP.lua.
Part of our code builds upon DenseCap: Fully Convolutional Localization Networks for Dense Captioning [website]. We thank the authors for their great work.
Our code is implemented in Torch and depends on the following packages: torch/torch7, torch/nn, torch/nngraph, torch/image, lua-cjson, qassemoquab/stnbhwd, jcjohnson/torch-rnn. You'll also need to install torch/cutorch and torch/cunn.
After installing Torch, you can install or update these dependencies by running the following:

```
luarocks install torch
luarocks install nn
luarocks install image
luarocks install lua-cjson
luarocks install https://raw.githubusercontent.com/qassemoquab/stnbhwd/master/stnbhwd-scm-1.rockspec
luarocks install https://raw.githubusercontent.com/jcjohnson/torch-rnn/master/torch-rnn-scm-1.rockspec
luarocks install cutorch
luarocks install cunn
luarocks install cudnn
```
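After installation, a quick sanity check can confirm everything loads. This is a minimal sketch; it assumes the `th` interpreter is on your PATH and a CUDA-capable GPU is available:

```bash
# Load the core packages once; any missing rock or CUDA problem
# will surface as an error here.
th -e "require 'torch'; require 'nn'; require 'nngraph'; require 'cutorch'; require 'cunn'; print('All dependencies loaded')"
```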
You can download a pretrained relational captioning model from this link: Pre-trained model.

Download the model and place it in `./`.

This is not the exact model used in the paper; it was trained with different hyperparameters. It achieves a recall of 36.2 on the test set, which is better than the recall of 34.27 we report in the paper.
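For example, if the checkpoint was saved to your downloads folder, placing it would look like the following. The filename below is a placeholder; substitute the actual name of the downloaded checkpoint:

```bash
# Placeholder filename: replace relcap_checkpoint.t7 with the actual
# name of the checkpoint from the download link.
mv ~/Downloads/relcap_checkpoint.t7 ./
```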
To evaluate a model on our Relational Captioning Dataset, follow these steps:

- Download the raw images from the Visual Genome dataset version 1.2 website. Place the images in `./data/visual-genome/VG_100K`.
- Download our relational captioning labels from the following link: Dataset. Place the JSON file in `./data/visual-genome/1.2/`.
- Use the script `preprocess.py` to generate a single HDF5 file containing the entire dataset.
- Run `script/setup_eval.sh` to download and unpack the METEOR jar file.
- Use the script `evaluate_model.lua` to evaluate a trained model on the validation or test data (an example invocation follows this list).
- To measure the mAP metric, change line 9 of `evaluate_model.lua` from `imRecall` to `mAP` and run `evaluate_model.lua` again.
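A typical invocation might look like the sketch below. The flag names are assumptions modeled on the DenseCap codebase this repo builds upon, so verify them against the options defined in `evaluate_model.lua`:

```bash
# Hypothetical flags modeled on DenseCap's option style:
#   -checkpoint : path to the trained model (placeholder filename)
#   -gpu        : GPU id (DenseCap-style scripts use -1 for CPU mode)
th evaluate_model.lua -checkpoint ./relcap_checkpoint.t7 -gpu 0
```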
To train a model on our Relational Captioning Dataset, simply follow these steps:

- Run `script/download_models.sh` to download the VGG16 model.
- Run `train.lua` to train a relational captioner (see the example invocation below).
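A minimal example of starting training; the `-gpu` flag is an assumption based on DenseCap-style option parsing, so check the options defined at the top of `train.lua`:

```bash
# Hypothetical flag modeled on DenseCap's option style; train.lua
# defines its actual options at the top of the file.
th train.lua -gpu 0
```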
If you find our work useful in your research, please consider citing:
```
@inproceedings{kim2019dense,
  title={Dense relational captioning: Triple-stream networks for relationship-based captioning},
  author={Kim, Dong-Jin and Choi, Jinsoo and Oh, Tae-Hyun and Kweon, In So},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={6271--6280},
  year={2019}
}
```