Implementation of the paper 'Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge' by [Vinyals et al.](https://ieeexplore.ieee.org/abstract/document/7505636), published in IEEE TPAMI.
Note: This is a work in progress. I will write a detailed blog post on medium.com walking through the code and the steps.
This is not an exact implementation of the paper; it differs in the following ways:
- The authors use an ensemble of models, whereas we use a single model. They found that ensembling gives a boost of about 2 points on the BLEU metrics. However, as we shall see, certain optimizations allow us to achieve better scores on some metrics than those reported in the paper.
- The authors extract a single image feature vector from the penultimate layer of the CNN, but we extract a set of feature vectors from a lower layer (see the encoder sketch after this list). We have found that this yields better performance.
- We use a pre-trained ResNet-152 convolutional neural network instead of the Inception network used in the paper, which improves results.
- Implementation detail: we do not use batch normalization on the inputs, as it yielded no noticeable improvement.
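
To illustrate the encoder change above, here is a minimal PyTorch sketch of extracting a grid of feature vectors from a lower layer of ResNet-152 (torchvision is assumed; the class name, the layer at which we cut the network, and the freezing policy are illustrative choices, not necessarily the exact configuration used in this repo):

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encodes an image as a grid of feature vectors rather than the single
    penultimate-layer vector used in the paper. Sketch only: layer cut and
    freezing policy are illustrative."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet152(pretrained=True)
        # Keep everything up to and including layer3, dropping layer4,
        # the average pool, and the classifier head.
        self.backbone = nn.Sequential(*list(resnet.children())[:-3])
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze the CNN; only the decoder trains

    def forward(self, images):
        # images: (batch, 3, 224, 224)
        feats = self.backbone(images)              # (batch, 1024, 14, 14)
        feats = feats.flatten(2).permute(0, 2, 1)  # (batch, 196, 1024)
        return feats  # 196 feature vectors of dimension 1024 per image
```

The decoder can then attend over (or pool) these 196 vectors instead of consuming one global image embedding.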
Beyond implementing the paper, we perform additional analysis with the following changes:
- We experiment with image data augmentation and observe that it boosts performance, depending on the technique used (see the augmentation sketch below).
- We experiment with pre-trained word embeddings; some of them improve evaluation scores (see the embedding sketch below).
- We experiment with different CNNs to extract image feature vectors.
- We experiment with different recurrent network variants for the decoder: LSTM, GRU, and bidirectional versions of each (see the decoder sketch below).
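
For the augmentation experiments, a typical training-time pipeline looks like the following (a sketch assuming torchvision; the specific transforms and parameter values are illustrative, not the exact set we compared):

```python
from torchvision import transforms

# Training-time augmentation followed by the standard ImageNet normalization
# expected by a pre-trained ResNet.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),            # random spatial crop
    transforms.RandomHorizontalFlip(),     # mirror images with p=0.5
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```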
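For the word-embedding experiments, the pre-trained vectors can be copied into the decoder's embedding layer. Here is a sketch assuming GloVe-style text files; `glove_path` and `word2idx` are placeholders for your own vocabulary setup:

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove_embeddings(glove_path, word2idx, embed_dim=300):
    """Builds an nn.Embedding initialized from a GloVe text file.
    Words missing from the file keep a small random initialization."""
    weights = np.random.normal(0, 0.1, (len(word2idx), embed_dim)).astype("float32")
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            word = parts[0]
            if word in word2idx:
                weights[word2idx[word]] = np.asarray(parts[1:], dtype="float32")
    # freeze=False lets the embeddings be fine-tuned during training.
    return nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```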
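Finally, for the RNN comparison, the recurrent core of the decoder can be made configurable; this sketch shows one simple way to swap cell types (the function and argument names are illustrative):

```python
import torch.nn as nn

RNN_TYPES = {"lstm": nn.LSTM, "gru": nn.GRU, "rnn": nn.RNN}

def build_decoder_rnn(rnn_type="lstm", embed_dim=300, hidden_dim=512,
                      num_layers=1, bidirectional=False):
    """Returns the recurrent core of the caption decoder; varying rnn_type
    and bidirectional is how the comparison is run."""
    return RNN_TYPES[rnn_type](
        input_size=embed_dim,
        hidden_size=hidden_dim,
        num_layers=num_layers,
        batch_first=True,          # inputs are (batch, seq_len, embed_dim)
        bidirectional=bidirectional,
    )
```

Everything else in the training loop stays the same across variants, so the scores are directly comparable.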