- View Notebook
- Details about different runs of the project can be found on Weights & Biases
- To run in Colab, you need to add your Kaggle API token file
- Final Architecture used:
- Encoder: InceptionV3
- Attention: Bahdanau soft attention
- Decoder: LSTM
- Embeddings: GloVe embeddings (glove6b300d)
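The core of the architecture above is the attention step: the decoder scores each spatial feature from the encoder against its hidden state and attends to a weighted mixture. A minimal NumPy sketch of that math (shapes and weight names are illustrative, not the notebook's actual variables):

```python
import numpy as np

def bahdanau_attention(features, hidden, W1, W2, v):
    """Soft (Bahdanau) attention: score every spatial location of the
    encoder output against the decoder state, softmax the scores, and
    return the weighted sum of features as the context vector.

    features: (L, D)  L spatial locations from the CNN encoder
    hidden:   (H,)    current decoder LSTM hidden state
    W1: (D, A), W2: (H, A), v: (A,)  learned projections (A = attention units)
    """
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (L,) one score per location
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # attention weights, sum to 1
    context = weights @ features                        # (D,) context fed to the LSTM
    return context, weights
```

At each decoding step the context vector is concatenated with the current word embedding before the LSTM update, which is what lets the model "look at" different image regions per word.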
- Some Outputs from the final run
- Real caption: a man on snow skis who is performing a jump
  Predicted caption: a man flying through the sky
- Real caption: a couple of elephants that are by the pond
  Predicted caption: a group of elephants relax along water in a body of water
- ToDo
- Applying beam search
- Applying a LearningRateScheduler
- Making an interface
- Tuning different hyperparameters
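The beam-search item above could be sketched as follows. It assumes the decoder is wrapped in a `step_fn` that returns log-probabilities over the vocabulary for a partial caption; that interface is an assumption for this sketch, not the notebook's actual API:

```python
import numpy as np

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Keep the beam_width highest log-probability partial captions
    instead of greedily taking the argmax token at every step.

    step_fn(tokens) -> 1-D array of log-probabilities over the vocabulary
    (for the real model this would run the decoder one step; here it is
    an assumed interface for illustration).
    """
    beams = [([start_token], 0.0)]  # (token list, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end_token:        # finished captions pass through
                candidates.append((tokens, score))
                continue
            log_probs = step_fn(tokens)
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((tokens + [int(tok)], score + log_probs[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t[-1] == end_token for t, _ in beams):
            break
    return beams[0][0]
```

Note that without a length penalty, beam search tends to favor shorter captions, since every extra token only lowers the cumulative log-probability; normalizing the score by caption length is a common fix.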
- Comments:
- Code for ExponentialDecay was added but not used in the final run, since evaluation takes about 3 hours on Colab.
- Manual early stopping was added, and weights are saved after each epoch (all.zip).
- Try decreasing vocab_size and increasing the number of images used.
- I couldn't find any resources for dynamically caching images and loading them directly at run time to save storage space. NumPy's memmap seems like a good starting point.
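The memmap idea mentioned above could look like this: extract encoder features once into a single on-disk array, then reopen it read-only during training so that only the indexed slices are pulled into RAM. All sizes and file names below are illustrative, not the notebook's actual values:

```python
import os
import tempfile

import numpy as np

# Illustrative sizes: 64 spatial locations x 2048 channels matches
# InceptionV3's final feature map, but the image count and file name
# are made up for this sketch.
n_images, feat_shape = 32, (64, 2048)
path = os.path.join(tempfile.mkdtemp(), "features.dat")

# Pass 1: extract features once, writing each image's tensor straight
# to the on-disk array instead of holding the whole dataset in RAM.
cache = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_images, *feat_shape))
for i in range(n_images):
    cache[i] = np.random.rand(*feat_shape)  # stand-in for encoder(image_i)
cache.flush()
del cache  # close the write handle

# Pass 2: at training time, reopen read-only; indexing loads pages
# lazily, so memory usage stays near one batch, not the whole array.
features = np.memmap(path, dtype=np.float32, mode="r", shape=(n_images, *feat_shape))
batch = np.asarray(features[0:16])  # only these rows are read from disk
```

This keeps Colab's disk as the cache instead of RAM; a `tf.data` pipeline could wrap the read-only memmap with `from_generator` to feed batches during training.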
- References
- CS231n Winter 2016 Recurrent Neural Networks, Image Captioning - YouTube
- paperswithcode - Image Captioning
- [1502.03044] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
- Understanding LSTM Networks -- colah's blog
- Attention in Neural Networks - YouTube
- Attention? Attention!
- Multimodal Deep Learning - Towards Data Science
- Multi-Modal Methods: Image Captioning (From Translation to Attention)
- A Comprehensive Study of Deep Learning for Image Captioning
- Neural Image Caption Generation with Visual Attention (algorithm) | AISC - YouTube
- Image captioning with visual attention | TensorFlow Core