A model for generating a textual description of a given image based on the objects and actions it contains.
The dataset used is Flickr8k (https://www.kaggle.com/ming666/flicker8k-dataset).
The image captioning task is divided into the following parts:
- Preparing Data
- Preprocessing Data
- Building Vocabulary
- Image Captioning
Data preparation involves mapping each caption to the id of its corresponding image.
Module: Prepare Data.ipynb
Output: captions.txt
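The mapping step can be sketched as follows. This is an illustrative sketch, not the notebook's code; it assumes the standard Flickr8k token-file line format `<image_id>#<caption_number><TAB><caption>`, and the function name `map_captions` is hypothetical:

```python
from collections import defaultdict

def map_captions(token_lines):
    """Map each image id to the list of its captions."""
    captions = defaultdict(list)
    for line in token_lines:
        # Each line looks like "img.jpg#0<TAB>A caption ."
        image_part, caption = line.strip().split("\t")
        image_id = image_part.split("#")[0]
        captions[image_id].append(caption)
    return dict(captions)

# Toy lines in the assumed token-file format
lines = [
    "img1.jpg#0\tA dog runs in the grass .",
    "img1.jpg#1\tA brown dog plays outside .",
    "img2.jpg#0\tA child on a swing .",
]
mapping = map_captions(lines)
```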
Each caption is augmented with a <start> token at the beginning and an <end> token at the end, and each image is passed through an InceptionV3 model to generate an encoding (a feature vector) for it.
Module: Preprocess Data.ipynb
Output: train_captions.txt, test_captions.txt, train_images.pkl, test_images.pkl
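The caption-augmentation half of this step can be sketched as below (the image half is omitted here, since passing images through InceptionV3 requires the pre-trained weights; typically the classification head is removed and each 299x299 image yields a pooled feature vector). The helper name `augment_caption` is illustrative, not from the notebook:

```python
def augment_caption(caption):
    """Wrap a raw caption with the sequence-boundary tokens used in training."""
    return "<start> " + caption.strip() + " <end>"

augmented = augment_caption("A dog runs in the grass .")
```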
A vocabulary is prepared from the augmented captions (the ones including <start> and <end>). A word is added to the vocabulary if it occurs more than 10 times across the captions.
Module: Build Vocab.ipynb
Output: vocabulary.txt
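The frequency-threshold rule can be sketched as follows (an illustrative sketch; the threshold is lowered in the toy example so the filtering is visible on a tiny corpus):

```python
from collections import Counter

def build_vocabulary(captions, min_count=10):
    """Keep words that occur more than min_count times across all captions."""
    counts = Counter(word for caption in captions for word in caption.split())
    return sorted(word for word, count in counts.items() if count > min_count)

caps = ["<start> a dog runs <end>", "<start> a dog sleeps <end>"]
# min_count=1 only for this toy corpus; the project uses 10
vocab = build_vocabulary(caps, min_count=1)
```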
Module: Image Captioning.ipynb
The image captioning module uses the outputs of the other modules to learn to generate captions for an input image. This step performs the following tasks:
Each word in the vocabulary is mapped to an integer value (or index), and two mappings are created: word_to_index and index_to_word.
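A minimal sketch of the two mappings (reserving index 0 for padding is an assumption on my part, a common convention when sequences are padded to equal length; the notebook may index differently):

```python
def make_mappings(vocab):
    """Build word_to_index and index_to_word; index 0 is reserved for padding."""
    word_to_index = {word: i + 1 for i, word in enumerate(vocab)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return word_to_index, index_to_word

word_to_index, index_to_word = make_mappings(["<start>", "a", "dog", "<end>"])
```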
Word embeddings are created using pre-trained GloVe word representations. An embedding matrix is created in which the row at each word's index (obtained from the word_to_index mapping) stores that word's GloVe embedding.
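Building the embedding matrix can be sketched as below. The GloVe vectors are represented here as a plain dict for illustration (real GloVe files are parsed from text); leaving out-of-vocabulary words as zero rows is an assumption, though it is the usual convention:

```python
import numpy as np

def build_embedding_matrix(word_to_index, glove, dim):
    """Row word_to_index[w] holds the GloVe vector of w; words missing
    from GloVe (and the padding row 0) remain zero vectors."""
    matrix = np.zeros((len(word_to_index) + 1, dim))
    for word, idx in word_to_index.items():
        vector = glove.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Toy 3-dimensional "GloVe" vectors for illustration only
glove = {"dog": np.array([0.1, 0.2, 0.3])}
matrix = build_embedding_matrix({"dog": 1, "runs": 2}, glove, dim=3)
```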
The model takes as input an image vector from the training set (train_images.pkl) and a partial caption. The partial caption is initialised to <start> and built up word by word by a feed-forward neural network: the model predicts the next word in the sequence, the predicted word is appended to the partial caption, and the process repeats until <end> is predicted.
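A greedy version of this decoding loop can be sketched as follows (the project itself uses beam search, described below; `predict_next` stands in for the trained model, which would also receive the image vector):

```python
def generate_caption(predict_next, word_to_index, index_to_word, max_len=30):
    """Greedy decoding: repeatedly append the most probable next word.
    predict_next(partial) is assumed to return a probability list over
    the vocabulary for the given partial caption (a list of indices)."""
    partial = [word_to_index["<start>"]]
    for _ in range(max_len):
        probs = predict_next(partial)
        next_idx = max(range(len(probs)), key=probs.__getitem__)
        partial.append(next_idx)
        if index_to_word[next_idx] == "<end>":
            break
    return [index_to_word[i] for i in partial]

# Dummy predictor that emits "a", "dog", "<end>" in turn
w2i = {"<start>": 1, "a": 2, "dog": 3, "<end>": 4}
i2w = {i: w for w, i in w2i.items()}
script = iter([2, 3, 4])

def dummy_predict(partial):
    probs = [0.0] * 5
    probs[next(script)] = 1.0
    return probs

caption = generate_caption(dummy_predict, w2i, i2w)
```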
The architecture is as shown:
Loss function: Categorical Cross Entropy
Optimizer: Adam (with a learning rate of 0.0001)
Epochs: 120
Beam search with a beam width of 3 is used to predict the next word in the caption.
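The beam search can be sketched as follows. This is an illustrative sketch, not the notebook's implementation: partial captions are scored by cumulative log-probability, and at each step only the `beam_width` best candidates survive:

```python
import math

def beam_search(predict_next, start_idx, end_idx, beam_width=3, max_len=20):
    """Keep the beam_width partial captions with the highest cumulative
    log-probability; finished captions (ending in end_idx) are carried along."""
    beams = [([start_idx], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_idx:
                candidates.append((seq, score))  # already finished
                continue
            for idx, p in enumerate(predict_next(seq)):
                if p > 0:
                    candidates.append((seq + [idx], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_idx for seq, _ in beams):
            break
    return beams[0][0]

# Toy model over a 3-word vocabulary: 0=<start>, 1="dog", 2=<end>
def toy_predict(seq):
    return [0.0, 0.6, 0.4] if seq[-1] == 0 else [0.0, 0.3, 0.7]

best = beam_search(toy_predict, start_idx=0, end_idx=2)
```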
- Weights are saved to disk after each epoch and reloaded to resume training, which saves training time.
- A custom data generator is used to feed input to the model. This enables us to train the model without loading the entire dataset into memory.
- Beam search is used instead of greedy search to generate better captions.
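The custom data generator mentioned above can be sketched as below. The real generator would also pair each prefix with its image vector and batch the pairs; this sketch shows only the core idea of lazily expanding each caption into (prefix, next-word) training pairs, and the name `caption_pairs` is hypothetical:

```python
def caption_pairs(encoded_captions):
    """Lazily yield (prefix, next_word) training pairs for each encoded
    caption, so the whole expanded dataset never sits in memory at once."""
    for seq in encoded_captions:
        for i in range(1, len(seq)):
            yield seq[:i], seq[i]

# One encoded caption expands into three training pairs
pairs = list(caption_pairs([[1, 2, 3, 4]]))
```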