A model for generating a textual description of a given image based on the objects and actions it contains.
The dataset used is Flickr8k (https://www.kaggle.com/ming666/flicker8k-dataset).
The image captioning task is divided into the following parts:
- Preparing Data
- Preprocessing Data
- Building Vocabulary
- Image Captioning
Data preparation involves mapping each caption to the id of its corresponding image.
Module: Prepare Data.ipynb
Output: captions.txt
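The mapping step can be sketched as follows. This is an illustrative sketch, not the notebook's code; it assumes the standard Flickr8k token-file line format `<image_id>#<caption_number><TAB><caption>`, and the function name `map_captions` is hypothetical:

```python
from collections import defaultdict

def map_captions(token_lines):
    """Map each image id to the list of its captions."""
    captions = defaultdict(list)
    for line in token_lines:
        # Each line looks like "img.jpg#0<TAB>A caption ."
        image_part, caption = line.strip().split("\t")
        image_id = image_part.split("#")[0]
        captions[image_id].append(caption)
    return dict(captions)

# Toy lines in the assumed token-file format
lines = [
    "img1.jpg#0\tA dog runs in the grass .",
    "img1.jpg#1\tA brown dog plays outside .",
    "img2.jpg#0\tA child on a swing .",
]
mapping = map_captions(lines)
```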
Each caption is augmented with a <start> token at the beginning and an <end> token at the end, and each image is passed through an InceptionV3 model to generate an encoding (a feature vector) for it.
Module: Preprocess Data.ipynb
Output: train_captions.txt, test_captions.txt, train_images.pkl, test_images.pkl
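The caption-augmentation half of this step can be sketched as below (the image half is omitted here, since passing images through InceptionV3 requires the pre-trained weights; typically the classification head is removed and each 299x299 image yields a pooled feature vector). The helper name `augment_caption` is illustrative, not from the notebook:

```python
def augment_caption(caption):
    """Wrap a raw caption with the sequence-boundary tokens used in training."""
    return "<start> " + caption.strip() + " <end>"

augmented = augment_caption("A dog runs in the grass .")
```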
A vocabulary is prepared from the augmented captions (the ones including <start> and <end>). A word is added to the vocabulary if it occurs more than 10 times across the captions.
Module: Build Vocab.ipynb
Output: vocabulary.txt
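The frequency-threshold rule can be sketched as follows (an illustrative sketch; the threshold is lowered in the toy example so the filtering is visible on a tiny corpus):

```python
from collections import Counter

def build_vocabulary(captions, min_count=10):
    """Keep words that occur more than min_count times across all captions."""
    counts = Counter(word for caption in captions for word in caption.split())
    return sorted(word for word, count in counts.items() if count > min_count)

caps = ["<start> a dog runs <end>", "<start> a dog sleeps <end>"]
# min_count=1 only for this toy corpus; the project uses 10
vocab = build_vocabulary(caps, min_count=1)
```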
Module: Image Captioning.ipynb
The image captioning module uses the outputs of the other modules to learn to generate captions for an input image. This step performs the following tasks:
Each word in the vocabulary is mapped to an integer value (or index), and two mappings are created: word_to_index and index_to_word.
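A minimal sketch of the two mappings (reserving index 0 for padding is an assumption on my part, a common convention when sequences are padded to equal length; the notebook may index differently):

```python
def make_mappings(vocab):
    """Build word_to_index and index_to_word; index 0 is reserved for padding."""
    word_to_index = {word: i + 1 for i, word in enumerate(vocab)}
    index_to_word = {i: word for word, i in word_to_index.items()}
    return word_to_index, index_to_word

word_to_index, index_to_word = make_mappings(["<start>", "a", "dog", "<end>"])
```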
Word embeddings are created using pre-trained GloVe word representations. An embedding matrix is created in which the row at each word's index (obtained from the word_to_index mapping) stores that word's GloVe embedding.
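Building the embedding matrix can be sketched as below. The GloVe vectors are represented here as a plain dict for illustration (real GloVe files are parsed from text); leaving out-of-vocabulary words as zero rows is an assumption, though it is the usual convention:

```python
import numpy as np

def build_embedding_matrix(word_to_index, glove, dim):
    """Row word_to_index[w] holds the GloVe vector of w; words missing
    from GloVe (and the padding row 0) remain zero vectors."""
    matrix = np.zeros((len(word_to_index) + 1, dim))
    for word, idx in word_to_index.items():
        vector = glove.get(word)
        if vector is not None:
            matrix[idx] = vector
    return matrix

# Toy 3-dimensional "GloVe" vectors for illustration only
glove = {"dog": np.array([0.1, 0.2, 0.3])}
matrix = build_embedding_matrix({"dog": 1, "runs": 2}, glove, dim=3)
```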
The model takes as input an image vector from the training set (train_images.pkl) and a partial caption. The partial caption is initialised to <start> and built up word by word by a feed-forward neural network: the model predicts the next word in the sequence, the predicted word is appended to the partial caption, and the process repeats until <end> is predicted.
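A greedy version of this decoding loop can be sketched as follows (the project itself uses beam search, described below; `predict_next` stands in for the trained model, which would also receive the image vector):

```python
def generate_caption(predict_next, word_to_index, index_to_word, max_len=30):
    """Greedy decoding: repeatedly append the most probable next word.
    predict_next(partial) is assumed to return a probability list over
    the vocabulary for the given partial caption (a list of indices)."""
    partial = [word_to_index["<start>"]]
    for _ in range(max_len):
        probs = predict_next(partial)
        next_idx = max(range(len(probs)), key=probs.__getitem__)
        partial.append(next_idx)
        if index_to_word[next_idx] == "<end>":
            break
    return [index_to_word[i] for i in partial]

# Dummy predictor that emits "a", "dog", "<end>" in turn
w2i = {"<start>": 1, "a": 2, "dog": 3, "<end>": 4}
i2w = {i: w for w, i in w2i.items()}
script = iter([2, 3, 4])

def dummy_predict(partial):
    probs = [0.0] * 5
    probs[next(script)] = 1.0
    return probs

caption = generate_caption(dummy_predict, w2i, i2w)
```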
The architecture is as shown:
Loss function: Categorical Cross Entropy
Optimizer: Adam (with a learning rate of 0.0001)
Epochs: 120
Beam search with a beam width of 3 is used to predict the next word in the caption.
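The beam search can be sketched as follows. This is an illustrative sketch, not the notebook's implementation: partial captions are scored by cumulative log-probability, and at each step only the `beam_width` best candidates survive:

```python
import math

def beam_search(predict_next, start_idx, end_idx, beam_width=3, max_len=20):
    """Keep the beam_width partial captions with the highest cumulative
    log-probability; finished captions (ending in end_idx) are carried along."""
    beams = [([start_idx], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_idx:
                candidates.append((seq, score))  # already finished
                continue
            for idx, p in enumerate(predict_next(seq)):
                if p > 0:
                    candidates.append((seq + [idx], score + math.log(p)))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_idx for seq, _ in beams):
            break
    return beams[0][0]

# Toy model over a 3-word vocabulary: 0=<start>, 1="dog", 2=<end>
def toy_predict(seq):
    return [0.0, 0.6, 0.4] if seq[-1] == 0 else [0.0, 0.3, 0.7]

best = beam_search(toy_predict, start_idx=0, end_idx=2)
```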
- Weights are saved to disk after each epoch and reloaded to resume training, which saves training time.
- A custom data generator is used to feed input to the model. This enables us to train the model without loading the entire dataset into memory.
- Beam search is used instead of greedy search to generate better captions.
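The custom data generator mentioned above can be sketched as below. The real generator would also pair each prefix with its image vector and batch the pairs; this sketch shows only the core idea of lazily expanding each caption into (prefix, next-word) training pairs, and the name `caption_pairs` is hypothetical:

```python
def caption_pairs(encoded_captions):
    """Lazily yield (prefix, next_word) training pairs for each encoded
    caption, so the whole expanded dataset never sits in memory at once."""
    for seq in encoded_captions:
        for i in range(1, len(seq)):
            yield seq[:i], seq[i]

# One encoded caption expands into three training pairs
pairs = list(caption_pairs([[1, 2, 3, 4]]))
```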