
Image-Caption-Generator

Generating Captions for images using Deep Learning

Output Samples:

(Sample images with model-generated captions.)

Procedure:

1. Data Collection:

The data used in the project is the Flickr 8k dataset, which can be downloaded by filling out a request form provided by the University of Illinois at Urbana-Champaign. This dataset contains 8,000 images, each with 5 captions. The images are split as follows:

  • Training Set — 6000 images
  • Dev Set — 1000 images
  • Test Set — 1000 images

Training a model on a large number of images may not be feasible on a system without high-end hardware. For faster computation, I used Google Colab and ran the model on a GPU.

2. Data understanding:

Read the file “Flickr8k.token.txt”, which contains each image_id together with its 5 captions, and clean it for further use, storing it as a dict where image_id is the key mapping to the list of its 5 captions.
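
A minimal sketch of this parsing step, assuming the standard Flickr8k.token.txt layout where each line has the form “&lt;image&gt;.jpg#&lt;n&gt;&lt;TAB&gt;&lt;caption&gt;”:

```python
# Build {image_id: [caption_0, ..., caption_4]} from Flickr8k.token.txt.
def load_descriptions(path="Flickr8k.token.txt"):
    descriptions = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_part, caption = line.split("\t", 1)
            image_id = image_part.split(".")[0]  # drop ".jpg#<n>"
            descriptions.setdefault(image_id, []).append(caption)
    return descriptions

descriptions = load_descriptions()  # ~8000 ids, 5 captions each
```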

3. Data Cleaning:
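
The README does not spell out the cleaning steps; a typical pass for this dataset (lowercasing, removing punctuation, and dropping non-alphabetic or single-character tokens) might look like the following sketch:

```python
import string

# Hypothetical cleaning pass: lowercase each caption, strip punctuation,
# and keep only alphabetic tokens longer than one character.
def clean_descriptions(descriptions):
    table = str.maketrans("", "", string.punctuation)
    for image_id, captions in descriptions.items():
        for i, caption in enumerate(captions):
            tokens = caption.lower().translate(table).split()
            tokens = [w for w in tokens if len(w) > 1 and w.isalpha()]
            captions[i] = " ".join(tokens)

clean_descriptions(descriptions)
```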

4. Vocabulary:

Create a vocabulary of all the unique words present across all 8000*5 (i.e. 40,000) image captions (the corpus) in the dataset. We write all these captions along with their image names to a new file, “descriptions.txt”, and save it to disk. To make the model robust to outliers, only words that occur more than 10 times across the captions are added to the vocabulary.
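
A sketch of the thresholded vocabulary and the descriptions.txt dump described above, using the descriptions dict built in step 2:

```python
from collections import Counter

# Count word frequencies over all 40,000 captions and keep
# only the words occurring more than 10 times.
word_counts = Counter()
for captions in descriptions.values():
    for caption in captions:
        word_counts.update(caption.split())
vocab = {w for w, c in word_counts.items() if c > 10}

# Persist "<image_id> <caption>" lines to descriptions.txt.
with open("descriptions.txt", "w") as f:
    for image_id, captions in descriptions.items():
        for caption in captions:
            f.write(f"{image_id} {caption}\n")
```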

5. Load training descriptions:

We add two tokens to every caption: ‘startseq’, a start-of-sequence token added at the start of every caption, and ‘endseq’, an end-of-sequence token added at the end of every caption.
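
A sketch of wrapping the captions, assuming train_image_ids holds the ids of the 6,000-image training split (a variable name not defined in the README):

```python
# Wrap every training caption with the start/end sequence tokens.
train_descriptions = {
    image_id: ["startseq " + c + " endseq" for c in descriptions[image_id]]
    for image_id in train_image_ids  # assumed: ids of the training split
}
```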

6. Data processing - Images and Transfer Learning:

Images are input to the model in the form of vectors, so we need to convert every image into a fixed-size vector. For this purpose we opt for TRANSFER LEARNING, using the Xception model (a CNN) created by Google Research. This model was trained on the ImageNet dataset to perform image classification over 1000 different classes. However, our purpose here is not to classify the image but simply to obtain a fixed-length, informative vector for each image; this process is called automatic feature engineering. Hence, we remove the last softmax layer from the model and extract a 2048-length vector (bottleneck features) for every image.

We save all the bottleneck features for the training images in a Python dictionary and pickle it to disk as “encoded_train_images.pkl”, whose keys are image names and values are the corresponding 2048-length feature vectors. Similarly, we encode all the test images and save them in “encoded_test_images.pkl”.
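
A sketch of the feature extraction with Keras; the Flickr8k_Dataset/ image folder path is an assumption:

```python
import pickle
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Xception without its softmax head; global average pooling leaves
# a 2048-length bottleneck vector per image.
feature_extractor = Xception(weights="imagenet", include_top=False, pooling="avg")

def encode_image(image_path):
    img = load_img(image_path, target_size=(299, 299))  # Xception's input size
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return feature_extractor.predict(x).reshape(2048)

# Encode the training images and pickle the dict, as described above.
# "Flickr8k_Dataset/" is an assumed location for the image files.
encoded_train = {img_id: encode_image(f"Flickr8k_Dataset/{img_id}.jpg")
                 for img_id in train_image_ids}
with open("encoded_train_images.pkl", "wb") as f:
    pickle.dump(encoded_train, f)
```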

7. Data Preprocessing:

During training, the captions are the target variables (y) that the model learns to predict. So we represent every unique word in the vocabulary by an integer (index), and we also find the maximum length of any caption.
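
A sketch of the index mappings and maximum caption length, building on the vocab and train_descriptions from the earlier steps:

```python
# Integer-encode the vocabulary (index 0 is reserved for padding)
# and record the length of the longest caption.
ixtoword, wordtoix = {}, {}
for ix, w in enumerate(sorted(vocab), start=1):
    wordtoix[w] = ix
    ixtoword[ix] = w
vocab_size = len(vocab) + 1  # +1 for the padding index

max_length = max(len(c.split())
                 for caps in train_descriptions.values()
                 for c in caps)
```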

8. Data Preparation:

Let us first see what the input and output of our model will look like. To turn captioning into a supervised learning task, we have to provide input and output pairs to the model for training. We train on 6,000 images, where each image is represented by its 2048-length feature vector and each caption is represented as a sequence of integers. The data for all 6,000 images cannot be held in memory at once, so we use a generator that yields batches of input and output sequences. The input to our model is [x1, x2] and the output is y, where x1 is the 2048-length feature vector of the image, x2 is the partial input caption (mapped through wordtoix and padded to the maximum caption length), and y is the next word the model has to predict. A sketch of such a generator follows.
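
A sketch of the batch generator, assuming the encoded photo dict, wordtoix, max_length, and vocab_size defined in the earlier sketches:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Yields ([image_vectors, padded_partial_captions], next_words) batches
# so the full training set never has to sit in memory at once.
def data_generator(descriptions, photos, wordtoix, max_length, images_per_batch):
    X1, X2, y = [], [], []
    n = 0
    while True:
        for image_id, captions in descriptions.items():
            n += 1
            photo = photos[image_id]
            for caption in captions:
                seq = [wordtoix[w] for w in caption.split() if w in wordtoix]
                # each prefix of the caption predicts the word that follows it
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_seq = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_seq)
            if n == images_per_batch:
                yield ([np.array(X1), np.array(X2)], np.array(y))
                X1, X2, y = [], [], []
                n = 0
```

The generator can then be passed straight to model.fit with an appropriate steps_per_epoch.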

9. Model Architecture:

Since the input consists of two parts, an image vector and a partial caption, we cannot use the Sequential API provided by the Keras library. For this reason, we use the Functional API, which allows us to create Merge Models. At a high level, the architecture consists of the following sub-modules:

  • Input_1 -> the partial caption. Since we are processing sequences, we employ a Recurrent Neural Network to read these partial captions; the LSTM (Long Short Term Memory) layer is a specialized Recurrent Neural Network for processing sequence input (partial captions, in our case).
  • Input_2 -> the image feature vector.
  • Output -> an appropriate word, the next in the sequence of the partial caption provided in Input_1.
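
A sketch of this merge model in the Keras Functional API, using the vocab_size and max_length from earlier sketches; the layer widths (256 units, 200-dimensional embedding) are assumed choices, not confirmed by the README:

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Image branch: 2048-length bottleneck vector -> 256-dim representation.
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)

# Caption branch: padded word indices -> embedding -> LSTM summary.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 200, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge the two branches and predict the next word over the vocabulary.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```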

10. Evaluation:

The model generates a vector (12-long in the toy example, 1652-long in the original example) which is a probability distribution across all the words in the vocabulary. We greedily select the word with the maximum probability, given the feature vector and partial caption. If the model is trained well, we should expect, for example, the probability of the word “the” to be maximum. This is called Maximum Likelihood Estimation (MLE), i.e. we select the word that is most likely according to the model for the given input. This method is also called Greedy Search, since we greedily select the word with maximum probability. We stop when either of the two conditions below is met:

  • We encounter an ‘endseq’ token, which means the model thinks this is the end of the caption. (This is why the ‘endseq’ token matters.)
  • We reach a maximum threshold on the number of words generated by the model.

If either of the above conditions is met, we break the loop and report the generated caption as the output of the model for the given image, as sketched below.
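
A sketch of this greedy (MLE) decoding loop, using the model, wordtoix, ixtoword, and max_length from the earlier sketches:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Repeatedly feed the image vector and the caption-so-far, take the
# argmax word, and stop on 'endseq' or after max_length words.
def greedy_search(photo, model, wordtoix, ixtoword, max_length):
    in_text = "startseq"
    for _ in range(max_length):
        seq = [wordtoix[w] for w in in_text.split() if w in wordtoix]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo.reshape(1, 2048), seq], verbose=0)
        word = ixtoword.get(int(np.argmax(yhat)))
        if word is None:  # padding index predicted; treat as end of caption
            break
        in_text += " " + word
        if word == "endseq":
            break
    words = in_text.split()[1:]  # drop 'startseq'
    if words and words[-1] == "endseq":
        words = words[:-1]
    return " ".join(words)
```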
