
nicolafan / image-captioning-cnn-rnn


TensorFlow/Keras implementation of an image captioning neural network, using a CNN and an RNN

License: MIT License

Languages: Python 88.21%, Makefile 11.79%
Topics: cnn, computer-vision, deep-learning, image-captioning, keras, machine-learning, nlp, rnn, tensorflow

image-captioning-cnn-rnn's Introduction

  • 👋 Hi, I'm @nicolafan
  • 👀 I'm interested in Machine Learning applications with Python, web development, and algorithms.
  • 💞️ I'm looking to collaborate on coding projects on GitHub; I think this is a good way to deepen my understanding of what I'm studying.
  • 📫 How to reach me: forums or my email address (or send me a DM on Twitter).
  • 📚 I'm currently a Computer Science student in Italy.

image-captioning-cnn-rnn's People

Contributors

israelabebe, nicolafan, vijaybirju


Forkers

israelabebe

image-captioning-cnn-rnn's Issues

Load model weights

Provide some pretrained model weights so that users can use the model directly, without training it first. A minimal loading sketch follows.
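A sketch of how the weights could be loaded with Keras; the build_model() helper and the weights path are hypothetical placeholders, since neither exists in the repo yet:

```python
# Hypothetical names: build_model() and the weights path are assumptions,
# standing in for whatever the repo ends up shipping.
model = build_model()
model.load_weights("models/weights.h5")  # standard Keras weight loading
```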

Update the Makefile

I'm creating this issue as a reminder to update the Makefile once a clear pipeline, with CLI arguments etc., is defined.

Long evaluation time

We should think of some way to make evaluation (computation of the BLEU scores over different beam widths) faster: yesterday a grid search over 3 beam widths required around 4 hours.

Implement model prediction

Implement the content of the predict_model.py file.

In this file, a stateful model should be loaded and used for prediction.

The current implementation requires that an image be given as input to the model together with a sequence of length MAX_CAPTION_LENGTH, where the first element of the sequence is 1 (corresponding to the <start> token) and all the others are zeros (for masking).

Then we take the model output at the first timestep, corresponding to the softmax probability distribution over the vocabulary, and sample one word from this distribution (remember that the 0-th output neuron corresponds to the token <start>, which has index 1 in the vocabulary, and so on). Sampling can be performed:

  • by taking a random element according to the probability distribution
  • by taking the argmax of the probability distribution
  • with a beam search.

These three methods should all be implemented.

When we sample a word, we create the input for the next timestep, made of the same image (which will be encoded by the model but not used; this should be adjusted) and a new caption sequence where the first element is the index of the sampled word and all the other elements are 0. Because the model is stateful, it keeps the state it had at the previous timestep and can continue predicting the sequence without problems. We sample until MAX_CAPTION_LENGTH is reached or until the model produces <end>. A sketch of this loop is shown below.
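A minimal sketch of the greedy variant of this loop, assuming a stateful Keras model that takes [image, sequence] and returns one softmax distribution per timestep; the constants and the input/output shapes are assumptions, not the repo's actual API:

```python
import numpy as np

MAX_CAPTION_LENGTH = 20        # assumed value
START_INDEX, END_INDEX = 1, 2  # assumed vocabulary indices of <start> and <end>

def greedy_predict(model, image):
    """Decode one caption with a stateful model: one word per call,
    letting the model carry the RNN state between calls."""
    sequence = np.zeros((1, MAX_CAPTION_LENGTH), dtype=np.int32)
    sequence[0, 0] = START_INDEX          # <start> first, zeros elsewhere for masking
    caption = []
    model.reset_states()                  # clear state left over from a previous caption
    for _ in range(MAX_CAPTION_LENGTH):
        probs = model.predict([image, sequence], verbose=0)[0, 0]  # timestep-0 distribution
        word = int(np.argmax(probs)) + 1  # neuron 0 corresponds to vocabulary index 1
        # stochastic alternative: word = np.random.choice(len(probs), p=probs) + 1
        if word == END_INDEX:
            break
        caption.append(word)
        sequence[:] = 0
        sequence[0, 0] = word             # the sampled word becomes the next input
    return caption
```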

Support mini-batches at prediction time

This issue is related to the performance that we want to improve with #1.

The current implementation of the beam search only works with a batch size of 1. Some slight changes would allow it to support mini-batches with a size greater than 1; a sketch of one batched beam step follows.
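One way to do this (a sketch under assumed shapes, not the repo's actual code) is to keep a batch dimension everywhere and let tf.math.top_k select candidates per batch element over the flattened beam × vocabulary axis:

```python
import tensorflow as tf

def beam_step(log_probs, beam_scores, beam_width):
    """One batched beam-search step.
    log_probs:   (batch, beam, vocab) log-probabilities at this timestep
    beam_scores: (batch, beam) cumulative scores of the live beams"""
    vocab = tf.shape(log_probs)[-1]
    total = beam_scores[:, :, None] + log_probs         # score every expansion
    flat = tf.reshape(total, [tf.shape(total)[0], -1])  # (batch, beam * vocab)
    # top_k works on the last axis, so the batch dimension is handled for free
    new_scores, flat_idx = tf.math.top_k(flat, k=beam_width)
    parent_beam = flat_idx // vocab                     # beam each candidate extends
    word = flat_idx % vocab                             # word it is extended with
    return new_scores, parent_beam, word
```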

Improve (?) implementation of the BLEU metric

BLEU is a numerical metric, originally from machine translation, that is commonly used to evaluate image captioning.

It needs to be implemented inside the src/models/metrics.py file (not sure if that is the correct place, by the way). I think this should not be implemented as a tensor metric usable by TensorFlow, but as a metric applied directly to strings.

Basically, we will provide the ground-truth caption string and a string predicted by the model. How the prediction string is produced depends on the implementation being tried (sampling, beam search, or max likelihood), but that is not relevant to the BLEU implementation; see the sketch below.
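A minimal string-based sketch using NLTK's BLEU implementation; the whitespace tokenization and the smoothing choice are assumptions:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu(reference: str, prediction: str) -> float:
    """BLEU computed directly on strings, as proposed above."""
    references = [reference.split()]  # sentence_bleu expects a list of tokenized references
    hypothesis = prediction.split()
    # Smoothing avoids zero scores on short captions without higher-order n-gram matches
    return sentence_bleu(references, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

print(bleu("a dog runs on the beach", "a dog is running on the beach"))
```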

"Separate" the encoder from the entire model

Use the Subclassing API to divide the ShowAndTell model into two sub-models: the encoder and the decoder. This is needed because, at prediction time, the model is applied at each step of the beam search to the same image over and over again. The image encoding is unique, so it can be computed just once, at the beginning of the process. A sketch of the split follows.
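A sketch of what the split could look like with the Keras Subclassing API; the layer choices, dimensions, and initial-state wiring are placeholders, not the repo's actual architecture:

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """CNN encoder: run once per image."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.cnn = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")
        self.cnn.trainable = False
        self.project = tf.keras.layers.Dense(embed_dim)

    def call(self, images):
        return self.project(self.cnn(images))  # (batch, embed_dim)

class Decoder(tf.keras.Model):
    """RNN decoder: run once per beam-search step."""
    def __init__(self, vocab_size, embed_dim=256):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, embed_dim, mask_zero=True)
        self.rnn = tf.keras.layers.LSTM(embed_dim, return_sequences=True)
        self.out = tf.keras.layers.Dense(vocab_size, activation="softmax")

    def call(self, sequences, initial_state=None):
        x = self.embed(sequences)
        x = self.rnn(x, initial_state=initial_state)
        return self.out(x)

# At prediction time the image is encoded once; only the decoder runs per step:
# encoding = encoder(images)
# probs = decoder(sequences, initial_state=[encoding, tf.zeros_like(encoding)])
```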
