
Image-Captioning

A model for generating textual description of a given image based on the objects and actions in the image

Dataset

The dataset used is Flickr8k (https://www.kaggle.com/ming666/flicker8k-dataset).

Implementation

The image captioning task is divided into the following parts:

  1. Preparing Data
  2. Preprocessing Data
  3. Building Vocabulary
  4. Image Captioning

Preparing Data

Data preparation involves mapping each caption to the id of its corresponding image.
Module: Prepare Data.ipynb
Output: captions.txt
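
As a rough sketch, the mapping can be built from the Flickr8k token file, assuming the standard Flickr8k.token.txt format where each line is `image_id#caption_number<TAB>caption`; the file name, format, and helper name here are assumptions, and the notebook may differ.

```python
from collections import defaultdict

def load_captions(token_file="Flickr8k.token.txt"):
    """Map each image id to the list of captions written for that image."""
    captions = defaultdict(list)
    with open(token_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_ref, caption = line.split("\t")
            image_id = image_ref.split("#")[0]  # drop the '#0'..'#4' caption number
            captions[image_id].append(caption)
    return captions
```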

Preprocessing Data

The captions are augmented with a <start> token at the beginning and an <end> token at the end, and the images are passed through an InceptionV3 model to generate an encoding for each image.
Module: Preprocess Data.ipynb
Output: train_captions.txt, test_captions.txt, train_images.pkl, test_images.pkl
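
A minimal sketch of both preprocessing steps, assuming a Keras/TensorFlow setup; the helper names (`encode_image`, `augment_caption`) are illustrative, not the notebook's actual functions.

```python
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image as keras_image
from tensorflow.keras.models import Model

# InceptionV3 without its classification head: each image becomes a 2048-d vector.
base = InceptionV3(weights="imagenet")
encoder = Model(base.input, base.layers[-2].output)

def encode_image(img_path):
    """Resize, preprocess, and encode a single image with InceptionV3."""
    img = keras_image.load_img(img_path, target_size=(299, 299))
    x = keras_image.img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return encoder.predict(x).reshape(2048)

def augment_caption(caption):
    """Wrap a caption with the <start> and <end> tokens."""
    return "<start> " + caption.strip() + " <end>"
```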

Building Vocabulary

A vocabulary is prepared from the augmented captions (the ones including <start> and <end>). A word is added to the vocabulary if it occurs more than 10 times in the captions.
Module: Build Vocab.ipynb
Output: vocabulary.txt
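
A minimal sketch of the frequency filter, assuming the augmented captions are available as a flat list of strings; the function name is illustrative.

```python
from collections import Counter

def build_vocabulary(augmented_captions, threshold=10):
    """Keep words that occur more than `threshold` times across all captions."""
    counts = Counter(word for caption in augmented_captions for word in caption.split())
    return sorted(word for word, freq in counts.items() if freq > threshold)
```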

Image Captioning

Module: Image Captioning.ipynb
The image captioning module uses the outputs of the other modules to learn to generate captions for an input image. This step performs the following tasks:

1. Create mapping:

Each word in the vocabulary is mapped to an integer value (or index), and two mappings are created: word_to_index and index_to_word.
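
A sketch of the two mappings, assuming `vocabulary` is the word list produced by Build Vocab.ipynb; reserving index 0 for padding is an assumption.

```python
# vocabulary: word list produced by Build Vocab.ipynb (assumed already loaded)
word_to_index = {word: idx + 1 for idx, word in enumerate(vocabulary)}  # 0 kept for padding
index_to_word = {idx: word for word, idx in word_to_index.items()}
```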

2. Create embedding:

Word embeddings are created using pre-trained GloVe word representations. An embedding matrix is built in which the row at each word's index (obtained from the word_to_index mapping) stores that word's embedding.
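
A sketch of the embedding matrix construction, assuming 200-dimensional GloVe vectors (glove.6B.200d.txt) and the `word_to_index` mapping from the previous step; the GloVe file and dimensionality are assumptions.

```python
import numpy as np

embedding_dim = 200  # assumed GloVe dimensionality

# Load GloVe vectors into a dict: word -> vector.
glove = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

# Row i of the matrix holds the embedding of the word whose index is i.
vocab_size = len(word_to_index) + 1  # +1 for the padding index
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, idx in word_to_index.items():
    vector = glove.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector
```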

3. Build model:

The model takes as input an image vector from the training set (train_images.pkl) and a partial caption. The partial caption is initialised as <start> and is extended successively by a feed-forward neural network: at each step the model predicts the next word of the partial caption, which is then appended to it to form a new partial caption. The process is repeated until <end> is predicted.
The architecture is shown in the diagram included in the repository.
Loss function: Categorical Cross Entropy
Optimizer: Adam (with a learning rate of 0.0001)
Epochs: 120
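
The exact layer stack is given only by the architecture diagram, so the sketch below is just one plausible arrangement consistent with the description above: an image-vector branch and a padded partial-caption branch (with the frozen GloVe embedding) merged into feed-forward layers and a softmax over the vocabulary, compiled with the stated loss, optimizer, and learning rate. The layer sizes, dropout, and `max_length` are assumptions.

```python
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     GlobalAveragePooling1D, add)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

max_length = 34  # assumed maximum caption length in tokens

# Image branch: 2048-d InceptionV3 encoding -> dense projection.
image_input = Input(shape=(2048,))
image_branch = Dense(256, activation="relu")(Dropout(0.5)(image_input))

# Caption branch: padded partial caption -> frozen GloVe embeddings -> pooled -> dense.
caption_input = Input(shape=(max_length,))
caption_embed = Embedding(vocab_size, embedding_dim,
                          weights=[embedding_matrix], trainable=False)(caption_input)
caption_branch = Dense(256, activation="relu")(GlobalAveragePooling1D()(caption_embed))

# Merge both branches and predict the next word over the vocabulary.
merged = add([image_branch, caption_branch])
hidden = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss="categorical_crossentropy", optimizer=Adam(learning_rate=0.0001))
```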

4. Make predictions:

Beam search with a beam width of 3 is used to predict the next word of the caption.
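
A sketch of the beam-search decoding step, assuming the model and mappings from the previous steps; the helper name and scoring by cumulative log-probability are assumptions.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def beam_search_caption(model, image_vector, beam_width=3):
    """Decode a caption, keeping the `beam_width` best partial captions at each step."""
    beams = [([word_to_index["<start>"]], 0.0)]  # (token sequence, cumulative log-prob)
    for _ in range(max_length):
        candidates = []
        for seq, score in beams:
            if index_to_word.get(seq[-1]) == "<end>":
                candidates.append((seq, score))  # finished caption, carry it forward
                continue
            padded = pad_sequences([seq], maxlen=max_length)
            probs = model.predict([image_vector.reshape(1, 2048), padded], verbose=0)[0]
            for idx in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(idx)], score + np.log(probs[idx] + 1e-12)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    best_sequence = beams[0][0]
    words = [index_to_word[i] for i in best_sequence
             if i in index_to_word and index_to_word[i] not in ("<start>", "<end>")]
    return " ".join(words)
```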

Results

Sample captions generated by the model (the corresponding images are shown in the repository):

  • two dogs are playing with each other in the snow .
  • a little boy is sitting on a slide in the playground .
  • two poodles play with each other in the snow .
  • football players are tackling a football player carrying a football .
  • a snowboarder jumps over a snow covered hill .
  • a boy in a blue shirt is doing a trick on his skateboard .
  • a climber is scaling a rock face whilst attached to a rope .

Salient Features

  • Weights are saved to disk after each epoch and reloaded to resume training. This saves training time.
  • A custom data generator is used to feed input to the model, which makes it possible to train without loading the entire dataset into memory (see the sketch below).
  • Beam search is used instead of greedy search to generate better captions.
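
A minimal sketch of such a generator, assuming the mappings, `max_length`, and `vocab_size` defined earlier; the batch size and function name are illustrative.

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def data_generator(captions_by_image, image_vectors, batch_size=64):
    """Yield batches of ([image vector, partial caption], next word) pairs indefinitely."""
    X_img, X_seq, y = [], [], []
    while True:
        for image_id, captions in captions_by_image.items():
            for caption in captions:
                seq = [word_to_index[w] for w in caption.split() if w in word_to_index]
                for i in range(1, len(seq)):
                    X_img.append(image_vectors[image_id])
                    X_seq.append(pad_sequences([seq[:i]], maxlen=max_length)[0])
                    y.append(to_categorical(seq[i], num_classes=vocab_size))
                    if len(y) == batch_size:
                        yield (np.array(X_img), np.array(X_seq)), np.array(y)
                        X_img, X_seq, y = [], [], []
```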

References