
Image-Caption-Generator

Generating Captions for images using Deep Learning

Output Samples:

(Sample images with model-generated captions.)

Procedure:

1. Data Collection:

The data used in the project is the Flickr 8k dataset, which can be downloaded by filling out a request form provided by the University of Illinois at Urbana-Champaign. This dataset contains 8,000 images, each with 5 captions. The images are split as follows:

  • Training Set — 6000 images
  • Dev Set — 1000 images
  • Test Set — 1000 images

Training a model on a large number of images may not be feasible on a system without high-end hardware. For faster computation, I used Google Colab and ran the model on a GPU.

2. Data understanding:

Read the file “Flickr8k.token.txt”, which contains each image_id together with its 5 captions, and clean it for further use, storing it as a dict where image_id is the key mapping to the list of its 5 captions.
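
A minimal sketch of this parsing step, assuming the standard Flickr8k.token.txt layout where each line has the form “&lt;image&gt;.jpg#&lt;n&gt;&lt;TAB&gt;&lt;caption&gt;”:

```python
# Build {image_id: [caption_0, ..., caption_4]} from Flickr8k.token.txt.
def load_descriptions(path="Flickr8k.token.txt"):
    descriptions = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_part, caption = line.split("\t", 1)
            image_id = image_part.split(".")[0]  # drop ".jpg#<n>"
            descriptions.setdefault(image_id, []).append(caption)
    return descriptions

descriptions = load_descriptions()  # ~8000 ids, 5 captions each
```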

3. Data Cleaning:
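
The README does not spell out the cleaning steps; a typical pass for this dataset (lowercasing, removing punctuation, and dropping non-alphabetic or single-character tokens) might look like the following sketch:

```python
import string

# Hypothetical cleaning pass: lowercase each caption, strip punctuation,
# and keep only alphabetic tokens longer than one character.
def clean_descriptions(descriptions):
    table = str.maketrans("", "", string.punctuation)
    for image_id, captions in descriptions.items():
        for i, caption in enumerate(captions):
            tokens = caption.lower().translate(table).split()
            tokens = [w for w in tokens if len(w) > 1 and w.isalpha()]
            captions[i] = " ".join(tokens)

clean_descriptions(descriptions)
```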

4. Vocabulary:

Create a vocabulary of all the unique words present across all 8000*5 (i.e. 40,000) image captions (the corpus) in the dataset. We write all these captions along with their image names to a new file, “descriptions.txt”, and save it to disk. To make the model robust to outliers, only words that occur more than 10 times across the captions are added to the vocabulary.
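
A sketch of the thresholded vocabulary and the descriptions.txt dump described above, using the descriptions dict built in step 2:

```python
from collections import Counter

# Count word frequencies over all 40,000 captions and keep
# only the words occurring more than 10 times.
word_counts = Counter()
for captions in descriptions.values():
    for caption in captions:
        word_counts.update(caption.split())
vocab = {w for w, c in word_counts.items() if c > 10}

# Persist "<image_id> <caption>" lines to descriptions.txt.
with open("descriptions.txt", "w") as f:
    for image_id, captions in descriptions.items():
        for caption in captions:
            f.write(f"{image_id} {caption}\n")
```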

5. Load training descriptions:

We add two tokens to every caption: ‘startseq’, a start-of-sequence token added at the start of every caption, and ‘endseq’, an end-of-sequence token added at the end of every caption.
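
A sketch of wrapping the captions, assuming train_image_ids holds the ids of the 6,000-image training split (a variable name not defined in the README):

```python
# Wrap every training caption with the start/end sequence tokens.
train_descriptions = {
    image_id: ["startseq " + c + " endseq" for c in descriptions[image_id]]
    for image_id in train_image_ids  # assumed: ids of the training split
}
```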

6. Data processing - Images and Transfer Learning:

Images are input to the model in the form of vectors, so we need to convert every image into a fixed-size vector. For this purpose we opt for TRANSFER LEARNING, using the Xception model (a CNN) created by Google Research. This model was trained on the ImageNet dataset to perform image classification over 1000 different classes. However, our purpose here is not to classify the image but simply to obtain a fixed-length, informative vector for each image; this process is called automatic feature engineering. Hence, we remove the last softmax layer from the model and extract a 2048-length vector (bottleneck features) for every image.

We save all the bottleneck features for the training images in a Python dictionary and pickle it to disk as “encoded_train_images.pkl”, whose keys are image names and values are the corresponding 2048-length feature vectors. Similarly, we encode all the test images and save them in “encoded_test_images.pkl”.
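
A sketch of the feature extraction with Keras; the Flickr8k_Dataset/ image folder path is an assumption:

```python
import pickle
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Xception without its softmax head; global average pooling leaves
# a 2048-length bottleneck vector per image.
feature_extractor = Xception(weights="imagenet", include_top=False, pooling="avg")

def encode_image(image_path):
    img = load_img(image_path, target_size=(299, 299))  # Xception's input size
    x = preprocess_input(np.expand_dims(img_to_array(img), axis=0))
    return feature_extractor.predict(x).reshape(2048)

# Encode the training images and pickle the dict, as described above.
# "Flickr8k_Dataset/" is an assumed location for the image files.
encoded_train = {img_id: encode_image(f"Flickr8k_Dataset/{img_id}.jpg")
                 for img_id in train_image_ids}
with open("encoded_train_images.pkl", "wb") as f:
    pickle.dump(encoded_train, f)
```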

7. Data Preprocessing:

During training, the captions are the target variables (y) that the model learns to predict. So we represent every unique word in the vocabulary by an integer (index), and we also find the maximum length of any caption.
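
A sketch of the index mappings and maximum caption length, building on the vocab and train_descriptions from the earlier steps:

```python
# Integer-encode the vocabulary (index 0 is reserved for padding)
# and record the length of the longest caption.
ixtoword, wordtoix = {}, {}
for ix, w in enumerate(sorted(vocab), start=1):
    wordtoix[w] = ix
    ixtoword[ix] = w
vocab_size = len(vocab) + 1  # +1 for the padding index

max_length = max(len(c.split())
                 for caps in train_descriptions.values()
                 for c in caps)
```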

8. Data Preparation:

Let us first see what the input and output of our model will look like. To turn captioning into a supervised learning task, we have to provide input and output pairs to the model for training. We train on 6,000 images, where each image is represented by its 2048-length feature vector and each caption is represented as a sequence of integers. The data for all 6,000 images cannot be held in memory at once, so we use a generator that yields batches of input and output sequences. The input to our model is [x1, x2] and the output is y, where x1 is the 2048-length feature vector of the image, x2 is the partial input caption (mapped through wordtoix and padded to the maximum caption length), and y is the next word the model has to predict. A sketch of such a generator follows.
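
A sketch of the batch generator, assuming the encoded photo dict, wordtoix, max_length, and vocab_size defined in the earlier sketches:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Yields ([image_vectors, padded_partial_captions], next_words) batches
# so the full training set never has to sit in memory at once.
def data_generator(descriptions, photos, wordtoix, max_length, images_per_batch):
    X1, X2, y = [], [], []
    n = 0
    while True:
        for image_id, captions in descriptions.items():
            n += 1
            photo = photos[image_id]
            for caption in captions:
                seq = [wordtoix[w] for w in caption.split() if w in wordtoix]
                # each prefix of the caption predicts the word that follows it
                for i in range(1, len(seq)):
                    in_seq = pad_sequences([seq[:i]], maxlen=max_length)[0]
                    out_seq = to_categorical([seq[i]], num_classes=vocab_size)[0]
                    X1.append(photo)
                    X2.append(in_seq)
                    y.append(out_seq)
            if n == images_per_batch:
                yield ([np.array(X1), np.array(X2)], np.array(y))
                X1, X2, y = [], [], []
                n = 0
```

The generator can then be passed straight to model.fit with an appropriate steps_per_epoch.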

9. Model Architecture:

Since the input consists of two parts, an image vector and a partial caption, we cannot use the Sequential API provided by the Keras library. For this reason, we use the Functional API, which allows us to create Merge Models. At a high level, the architecture consists of the following sub-modules:

  • Input_1 -> the partial caption. Since we are processing sequences, we employ a Recurrent Neural Network to read these partial captions; the LSTM (Long Short Term Memory) layer is a specialized Recurrent Neural Network for processing sequence input (partial captions, in our case).
  • Input_2 -> the image feature vector.
  • Output -> an appropriate word, the next in the sequence of the partial caption provided in Input_1.
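
A sketch of this merge model in the Keras Functional API, using the vocab_size and max_length from earlier sketches; the layer widths (256 units, 200-dimensional embedding) are assumed choices, not confirmed by the README:

```python
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add
from tensorflow.keras.models import Model

# Image branch: 2048-length bottleneck vector -> 256-dim representation.
inputs1 = Input(shape=(2048,))
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation="relu")(fe1)

# Caption branch: padded word indices -> embedding -> LSTM summary.
inputs2 = Input(shape=(max_length,))
se1 = Embedding(vocab_size, 200, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

# Merge the two branches and predict the next word over the vocabulary.
decoder1 = add([fe2, se3])
decoder2 = Dense(256, activation="relu")(decoder1)
outputs = Dense(vocab_size, activation="softmax")(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam")
```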

10. Evaluation:

The model generates a vector (12-long in the toy example, 1652-long in the original example) which is a probability distribution across all the words in the vocabulary. We greedily select the word with the maximum probability, given the feature vector and partial caption. If the model is trained well, we should expect, for example, the probability of the word “the” to be maximum. This is called Maximum Likelihood Estimation (MLE), i.e. we select the word that is most likely according to the model for the given input. This method is also called Greedy Search, since we greedily select the word with maximum probability. We stop when either of the two conditions below is met:

  • We encounter an ‘endseq’ token, which means the model thinks this is the end of the caption. (This is why the ‘endseq’ token matters.)
  • We reach a maximum threshold on the number of words generated by the model.

If either of the above conditions is met, we break the loop and report the generated caption as the output of the model for the given image, as sketched below.
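
A sketch of this greedy (MLE) decoding loop, using the model, wordtoix, ixtoword, and max_length from the earlier sketches:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Repeatedly feed the image vector and the caption-so-far, take the
# argmax word, and stop on 'endseq' or after max_length words.
def greedy_search(photo, model, wordtoix, ixtoword, max_length):
    in_text = "startseq"
    for _ in range(max_length):
        seq = [wordtoix[w] for w in in_text.split() if w in wordtoix]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo.reshape(1, 2048), seq], verbose=0)
        word = ixtoword.get(int(np.argmax(yhat)))
        if word is None:  # padding index predicted; treat as end of caption
            break
        in_text += " " + word
        if word == "endseq":
            break
    words = in_text.split()[1:]  # drop 'startseq'
    if words and words[-1] == "endseq":
        words = words[:-1]
    return " ".join(words)
```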
