3. Business Problem

5. Exploratory Data Analysis

6. Data Preparation

8. Encoder-Decoder Model

9. Encoder-Decoder Model with Attention Mechanism

13. Future Work



Image captioning is a challenging problem in artificial intelligence: generating a textual description of an image based on its contents. For example, how would we describe the image below?

We would describe it as a person driving a car, because we can see what is in the image. But what if we were given an X-ray and asked to describe it? That is nearly impossible unless you are a doctor or an expert in the field. An expert radiologist can describe it accurately, but people with less experience may find the task difficult and time-consuming.

It would be valuable to have a system that looks at an X-ray and reports what it shows. We use a deep learning model to build one.


Because we build a deep learning model for this problem, we need an understanding of neural networks, RNNs, transfer learning, encoder-decoder models and attention mechanisms. We use Python and several of its packages, so familiarity with those is a must.


We have chest X-ray images of people and the corresponding reports. We need to train a model that automatically generates the report from the chest X-ray images.


The data for this problem was provided by the Indiana University hospital network. It consists of two parts: images and reports.

  1. Images :
  2. Reports:

We can download the data from the above links, which provide all the X-ray images and reports.

The image dataset can contain multiple chest X-rays of a single person, for instance a side view and one or more frontal views. Just as a radiologist uses all these images to write the findings, the models also use these images together to generate the corresponding findings. There are 3955 reports in the dataset, and each report has one or more images associated with it.

The reports are in XML format. Each report contains the image IDs and the findings for one person, and we need to extract both for every person.

XML file
xml file 2

From the two examples above we can see the image IDs and the findings; we need to extract them for every person in our dataset.
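The extraction step can be sketched with Python's standard `xml.etree.ElementTree` module. The tag and attribute names used here (`parentImage` with an `id` attribute, `AbstractText` with `Label="FINDINGS"`), the sample report and the helper name `extract_report` are assumptions made for illustration; check them against the actual XML files before relying on this.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened report; real files contain many more fields.
SAMPLE_REPORT = """
<eCitation>
  <MedlineCitation>
    <Article>
      <Abstract>
        <AbstractText Label="INDICATION">Chest pain.</AbstractText>
        <AbstractText Label="FINDINGS">The lungs are clear.</AbstractText>
      </Abstract>
    </Article>
  </MedlineCitation>
  <parentImage id="img_0001_frontal"/>
  <parentImage id="img_0001_lateral"/>
</eCitation>
"""


def extract_report(xml_text):
    """Return (list of image ids, findings text or None) for one report."""
    root = ET.fromstring(xml_text)
    # Collect the id of every image attached to this report.
    image_ids = [node.get("id") for node in root.iter("parentImage")]
    # Keep only the section labelled FINDINGS.
    findings = None
    for node in root.iter("AbstractText"):
        if node.get("Label") == "FINDINGS":
            findings = node.text
    return image_ids, findings


image_ids, findings = extract_report(SAMPLE_REPORT)
print(image_ids, findings)
```

Running the same function over every XML file gives one (image ids, findings) pair per person.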

5. Exploratory Data Analysis:

We now have the images and the reports separately, and we need to prepare structured data for modelling. After some preprocessing we found that some reports have multiple images while others have only one. The model has to see all the images of a report in order to generate it.

Images per person

Most reports have 2 images, and some have 3 or 4. So we chose to give the model 2 images per person as input; if a report has only one image, we use that same image as the second one.
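A minimal sketch of this pairing rule follows. The function name `pick_two_images` is hypothetical, and keeping the first two images when a report has three or four is an assumption, since the selection rule for those cases is not stated here.

```python
def pick_two_images(image_paths):
    """Return exactly two image paths for one report."""
    if not image_paths:
        raise ValueError("a report needs at least one image")
    if len(image_paths) == 1:
        # Only one image: replicate it as the second input.
        return image_paths[0], image_paths[0]
    # Assumption: with more than two images, keep the first two.
    return image_paths[0], image_paths[1]


print(pick_two_images(["p1_frontal.png"]))
print(pick_two_images(["p2_frontal.png", "p2_lateral.png", "p2_extra.png"]))
```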

After all this, our dataset has 4 columns: person id, image 1, image 2 and the corresponding findings.

The images are stored as their full file paths rather than just their names, which makes loading the data easier.

Performance Metric:

To compare the generated medical reports with the actual reports, we use the BLEU (Bilingual Evaluation Understudy) score. A BLEU score of 1 means the generated report and the actual report are the same; a BLEU score of 0 means they are a complete mismatch. Since the decoder outputs a probability distribution over the vocabulary at each step, we can use categorical cross-entropy as the loss function.
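To make the two extremes of the score concrete, here is a simplified sentence-level BLEU in plain Python: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is an illustrative sketch with no smoothing and a single reference, not the exact scorer behind the numbers reported in this article; the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=4):
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: any empty precision gives a score of 0
        log_precisions.append(math.log(matched / total))
    # Penalise candidates that are shorter than the reference.
    if len(cand) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


same = simple_bleu("no pleural effusion or pneumothorax",
                   "no pleural effusion or pneumothorax")
different = simple_bleu("lungs are clear", "heart size normal")
shorter = simple_bleu("the lungs are clear bilaterally", "the lungs are clear")
print(same, different, shorter)
```

An identical report scores 1, a report sharing no words scores 0, and a correct but truncated report lands in between because of the brevity penalty.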


After the above steps our data is in a structured format. To avoid data leakage, we split the data by data point (one row per person) rather than randomly splitting individual images.

Now we have our train, validation and test data. The findings, which contain the report text, have to be properly cleaned and processed before being fed into the model.

We clean the text in the following steps:

  1. First, we convert all characters to lowercase.
  2. Then we perform basic decontractions, i.e. words like won’t and can’t are converted to will not and cannot, respectively.
  3. Remove punctuation from the text. Note that the full stop is not removed: the findings contain multiple sentences, and we want the model to generate reports in the same way, sentence by sentence.
  4. Remove all numbers from the text.
  5. Remove all words of length less than or equal to 2. For example, ‘is’ and ‘to’ are removed, since they don’t provide much information. The word ‘no’ is kept, however, because adding ‘no’ to a sentence changes its meaning entirely. So we have to be careful with this kind of cleaning step and identify which words to keep and which to drop.
  6. Some texts contain multiple full stops or spaces, or ‘X’ repeated multiple times. Such characters are also removed.
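The cleaning steps above can be sketched with the standard `re` module. The function name `clean_report` is hypothetical, the decontraction rules cover only a few common cases, and treating any token made up solely of the letter x as a repeated ‘X’ placeholder is an assumption.

```python
import re


def clean_report(text):
    """Apply the six cleaning steps to one findings string."""
    text = text.lower()                                # step 1
    text = text.replace("\u2019", "'")                 # normalise curly apostrophes
    text = re.sub(r"won't", "will not", text)          # step 2
    text = re.sub(r"can't", "cannot", text)
    text = re.sub(r"n't", " not", text)
    # Steps 3 and 4: keep only letters, spaces and full stops.
    text = re.sub(r"[^a-z. ]", " ", text)
    # Step 6 (part): collapse runs of full stops into one separate token.
    text = re.sub(r"\.+", " . ", text)
    tokens = []
    for tok in text.split():                           # split() also collapses spaces
        if tok == ".":
            # Skip a full stop that starts the text or follows another one.
            if tokens and tokens[-1] != ".":
                tokens.append(tok)
        elif re.fullmatch(r"x+", tok):
            continue                                   # step 6: repeated 'X' placeholder
        elif len(tok) > 2 or tok == "no":              # step 5, keeping 'no'
            tokens.append(tok)
    return " ".join(tokens)


raw = "The heart is normal in size.. No pneumothorax, XXXX or 2 effusions. It can't be seen."
print(clean_report(raw))
```

For the sample sentence this prints `the heart normal size . no pneumothorax effusions . cannot seen .`, which shows both the benefit (a small, regular vocabulary) and the cost (short function words are lost) of these rules.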

After cleaning and preprocessing, we found about 1424 unique words in the train data. Another interesting point: only about 50% and 46% of the words in the validation and test data, respectively, are also present in the train data. This might affect the performance of our models.

The model generates the report from the images one word at a time. It has to know when to start and when to stop, so we add the strings ‘SOS’ and ‘EOS’ at the start and end, respectively, of every report in our dataset.

After that we have to encode the text, because the model needs a numeric representation of what we feed into it. Tokenization is a way of separating a piece of text into smaller units called tokens. Tokens can be either words or characters; in our case they are words. To encode the text, we create a consistent mapping from each token to a unique integer value. Keras provides a built-in tokenizer for this purpose.
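To show what the start and end markers and the word-to-integer mapping look like, here is a plain-Python stand-in for a word-level tokenizer. It is not the Keras tokenizer itself; the helper names, the lowercase markers, the fixed length of 8 and the choice to reserve index 0 for padding are assumptions made for this sketch.

```python
from collections import Counter


def add_markers(report):
    """Wrap a cleaned report in start and end markers."""
    return "sos " + report + " eos"


def build_word_index(reports):
    """Map each word to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(word for report in reports for word in report.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(report, word_index, max_len):
    """Turn a report into a fixed-length list of integers, padded with zeros."""
    ids = [word_index[w] for w in report.split() if w in word_index]
    ids = ids[:max_len]  # note: truncation can cut off the end marker
    return ids + [0] * (max_len - len(ids))


reports = [add_markers(r) for r in ["no pleural effusion .", "heart size normal ."]]
word_index = build_word_index(reports)
encoded = [encode(r, word_index, max_len=8) for r in reports]
print(word_index)
print(encoded)
```

Words that were never seen when the index was built are silently dropped in this sketch, which is one way the low vocabulary overlap between train and test data can hurt.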


The images are the input to the model, along with the report. We need to convert each image into a fixed-size vector before giving it to the model. We use transfer learning for this purpose, as we do not have a large dataset.

For transfer learning in computer vision tasks, models such as VGG16, VGG19 and Inception v3 are commonly used. But these models are a poor fit for our case, because they were trained on datasets that are very different from the X-ray images in our dataset. So we need another model for our purpose.

Fortunately, there is a model suited to our task: CheXNet. CheXNet is a 121-layer convolutional neural network trained on ChestX-ray14, described at the time of CheXNet's publication as the largest publicly available chest X-ray dataset, containing over 100,000 frontal-view X-ray images with 14 diseases. However, our purpose here is not to classify the images but just to get the bottleneck features for each image. Therefore the last classification layer of this network is not needed.

We can download the weights of the trained CheXNet model here.

This is the CheXNet model; we remove the last classification layer and take the features from the layer before it. Since we have 2 images as input to our model, the bottleneck features are obtained as follows:

Each image is resized to (224,224,3) and passed through CheXNet to obtain a feature vector of length 1024. The two feature vectors are then concatenated into a single feature vector of length 2048.


From what we have discussed till now, we can say that we need an encoder-decoder architecture for this problem.

  • Encoder: This model is used to encode the input into fixed-length vectors.
  • Decoder: This model maps the vector representation to a variable-length target sequence.

In our case, as in any image captioning problem, the encoder converts the images into vectors. The decoder uses recurrent neural networks such as LSTMs or GRUs to convert the encoder output into target sentences.

I built two models for this problem. Let’s look at them.


First, let us consider a simple encoder-decoder model.

Encoder model:

As discussed earlier, the encoder model encodes the images into fixed-size vectors. We use the CheXNet model to extract the image features. Since we have two images per patient, we concatenate the features of the two images as shown below.

Now, we can pass this vector through Dense layers to get a vector of reduced dimension. This vector will be the final encoder output.

Decoder Model:

Now we need to convert this encoder output into text. For that we use LSTM networks, which are well suited to text data. Here we use the LSTM as a sequence-to-sequence model: the inputs are given in time steps, and one word is obtained as output at a time. At each time step t, the encoder output and the embedding vector of the word at time t-1 are given as input, and the LSTM layer predicts a vector representation for the word at time t. This vector is then passed through a softmax layer, which converts it into a probability distribution over the vocabulary. Applying the argmax function to this distribution gives the index of the corresponding word in our vocabulary.

example for encoder decoder model

Embedding layer:

A word embedding is a class of approaches for representing words and documents using a dense vector representation. Keras offers an Embedding layer that can be used for neural networks on text data. It can also use a word embedding learned elsewhere. It is common in the field of Natural Language Processing to learn, save, and make freely available word embeddings.

In our model, the embedding layer maps each word to a 300-dimensional representation using a pre-trained fastText embedding model. When using a pre-trained embedding, keep in mind that the weights of the layer should be frozen by setting the argument ‘trainable=False’ so that they don’t get updated during training.

Model Architecture:

Model Architecture


We created a masked loss function for this problem. For example:

Suppose we have the sequence of tokens [3],[10],[7],[0],[0],[0],[0],[0]

There are only 3 words in this sequence; the zeros are padding, which is not actually part of the report. Without masking, the model treats the zeros as part of the sequence and starts learning them. When it starts to predict the zeros correctly, the loss decreases, because from the model's point of view it is learning correctly. But for us the loss should decrease only when the model predicts the actual words (the non-zeros) correctly.

Therefore we mask the zeros in the sequence, so that the model ignores them and learns only the words that belong to the report.

Loss Function
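The masking idea can be sketched in plain Python as follows. This is an illustration of the principle rather than the Keras loss used for training: the function name is hypothetical, and averaging over the non-padding positions (instead of over all positions) is an assumption.

```python
import math


def masked_cross_entropy(target_ids, predicted_probs, pad_id=0):
    """Mean negative log-likelihood over non-padding positions only."""
    total, count = 0.0, 0
    for target, probs in zip(target_ids, predicted_probs):
        if target == pad_id:
            continue  # padding contributes nothing to the loss
        total += -math.log(probs[target])
        count += 1
    return total / count if count else 0.0


# Toy vocabulary of size 4 (index 0 is padding); the sequence has 2 real words.
targets = [3, 1, 0, 0]
probs = [
    [0.1, 0.1, 0.1, 0.7],      # real word 3 predicted with probability 0.7
    [0.1, 0.6, 0.2, 0.1],      # real word 1 predicted with probability 0.6
    [0.25, 0.25, 0.25, 0.25],  # padding position
    [0.25, 0.25, 0.25, 0.25],  # padding position
]
loss = masked_cross_entropy(targets, probs)

# Predicting the padding "perfectly" must not change the loss.
probs_good_padding = probs[:2] + [[0.97, 0.01, 0.01, 0.01]] * 2
loss_good_padding = masked_cross_entropy(targets, probs_good_padding)
print(loss, loss_good_padding)
```

Both calls return the same value, which is the property we want: only the real words of the report move the loss.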

We trained the model for 20 epochs; the loss plots are shown below.


The data for training is prepared using the prep_data function.

Data prep

Testing the Model:

A greedy search algorithm builds up a solution piece by piece, always choosing the next piece that offers the most obvious and immediate benefit. Here the prediction is done through the following steps:

  1. For a given patient, get the image features using the encoder model.
  2. Pass the encoder output and the token index of the word “sos” (start of the sentence) to the decoder model, which predicts a probability distribution over the vocabulary. We select the word with the maximum probability as the next word.
  3. Append the predicted word to the decoder input; the extended sequence is the next input to the decoder.
  4. Repeat steps 2 and 3 until the word “eos” (end of the sentence) is predicted.
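The four steps above can be sketched as a short loop. The decoder is replaced here by `toy_step`, a hypothetical stand-in that returns a fixed probability distribution; in the real pipeline that call would be the trained decoder model. The vocabulary, the function names and the `max_len` safety limit are likewise assumptions.

```python
VOCAB = ["<pad>", "sos", "lungs", "clear", "eos"]


def toy_step(image_features, tokens):
    """Stand-in decoder: favours the word whose index follows the last token."""
    probs = [0.05] * len(VOCAB)
    probs[min(tokens[-1] + 1, len(VOCAB) - 1)] = 0.8
    return probs


def greedy_decode(step_fn, image_features, index_to_word, sos_id, eos_id, max_len=50):
    tokens = [sos_id]                                   # start with "sos"
    while len(tokens) < max_len:
        probs = step_fn(image_features, tokens)         # step 2: distribution over vocab
        next_id = max(range(len(probs)), key=probs.__getitem__)  # most probable word
        tokens.append(next_id)                          # step 3: extend the input
        if next_id == eos_id:                           # step 4: stop at "eos"
            break
    return " ".join(index_to_word[t] for t in tokens)


report = greedy_decode(toy_step, image_features=None, index_to_word=VOCAB,
                       sos_id=1, eos_id=4)
print(report)
```

The `max_len` limit matters in practice: a poorly trained decoder may never emit “eos”, and without a cap the loop would not terminate.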


The average BLEU score on the test data is 0.38. This score is low, which shows that the model is not working well. We need a better model for this problem, so let’s try an attention mechanism in the encoder-decoder model.

Attention Mechanism:

The problem with the previous model is that each generated word is based on the whole image, even though a given word of the report usually describes only part of the image. An attention mechanism helps the model focus on specific parts of the image while generating a particular word. Suppose the i-th word of the actual report describes one region of the X-ray rather than the whole X-ray: if we use the whole image, we might not get the required word. So adding an attention mechanism to our model should improve the results.


Here we preserve the spatial information of the images while extracting the features from the CheXNet model. This allows the model to identify spatial patterns (edges, shading changes, shapes, objects, etc). Now let us understand the output from the CheXNet model.

We can see that the final global average pooling layer is removed, so the spatial information is retained. We remove the last activation layer to obtain the bottleneck features. For every input image we get a (7,7,1024) dimensional tensor as the output. Since we have two images per patient, we concatenate these features to get a (7,14,1024) dimensional tensor as the final feature tensor. Here 7 and 14 index the locations that correspond to portions of the images, and 1024 is the depth. We can think of it as 7*14 = 98 locations, each with a 1024-dimensional representation, so we reshape the tensor into (98,1024).
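The shape arithmetic can be checked with nested Python lists standing in for tensors. With NumPy or TensorFlow this would be a concatenation along the width axis followed by a reshape; the helper names and the constant fill values below are purely illustrative.

```python
def make_grid(value, rows=7, cols=7, depth=1024):
    """Fake feature map: grid[row][col] is a feature vector of length `depth`."""
    return [[[value] * depth for _ in range(cols)] for _ in range(rows)]


def concat_and_flatten(features_a, features_b):
    """(7,7,D) + (7,7,D) -> (7,14,D) along the width, then flatten to (98,D)."""
    wide = [row_a + row_b for row_a, row_b in zip(features_a, features_b)]
    return [vector for row in wide for vector in row]


image_1 = make_grid(1.0)   # stands in for the CheXNet output of the first image
image_2 = make_grid(2.0)   # stands in for the CheXNet output of the second image
locations = concat_and_flatten(image_1, image_2)
print(len(locations), len(locations[0]))
```

Each of the 98 rows is one spatial location, which is what the attention unit later scores against the decoder state.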


This model also has an encoder-decoder architecture. The encoder is the same as in the previous model, but the decoder has an extra attention unit in it.

We use the subclassing API of Keras, which gives us more customisability and control over our architecture. You can read more about the subclassing API here in the documentation.

Let us see how this attention-based model is working.

  1. Extract image features from CheXNet model
  2. Pass these features to the encoder model which gives the encoder output
  3. The encoder output and the previous decoder hidden state are passed to the attention model, which calculates the attention weights
  4. The attention weights and encoder output are used to calculate the context vector
  5. The context vector and embedding vector of the previous decoder input are concatenated and passed to the GRU unit
  6. The GRU output is passed to the final dense layer

Now let us try to understand each of these units.

Encoder class:


The encoder layer just performs some operations on our image and outputs the features which will be fed as input to the decoder.

Attention Class:

The attention class will use the previous hidden state of the decoder model and the encoder output to calculate the attention weights and context vector.
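A minimal sketch of that calculation in plain Python is shown below. The scoring function is not specified in this section, so dot-product scoring is used here as an assumption; a trained attention layer would typically use learned weights to compute the scores. The function names and the tiny 2-dimensional vectors are illustrative only.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(encoder_output, decoder_hidden):
    """Return (attention weights, context vector) for one decoding step."""
    # One score per image location: dot product with the decoder hidden state.
    scores = [sum(e * h for e, h in zip(location, decoder_hidden))
              for location in encoder_output]
    weights = softmax(scores)
    depth = len(encoder_output[0])
    # Context vector: attention-weighted sum of the location vectors.
    context = [sum(w * location[d] for w, location in zip(weights, encoder_output))
               for d in range(depth)]
    return weights, context


encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 locations, depth 2
decoder_hidden = [2.0, 0.0]
weights, context = attend(encoder_output, decoder_hidden)
print(weights, context)
```

Locations whose features align with the decoder state get larger weights, so they dominate the context vector that is passed on to the GRU.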

One step decoder and decoder classes:

The one-step decoder class performs the decoding for a single time step, and the decoder class calls it at every time step. The one-step decoder, in turn, calls the attention model and returns the output for that time step.

The decoder model stores each output word predicted by the one-step decoder and returns the final output sentence.

Encoder-Decoder Model:


After training for 50 epochs, we saved the model weights for future use and evaluated the model with the beam search algorithm. The results are better than those of the first model. The loss plot and a few examples are shown below:


ACTUAL REPORT:  <sos> lungs are clear without focal consolidation effusion pneumothora .  normal heart size .  bony thora and soft tissues grossly unremarkable .  <eos>
GENERATED REPORT: <sos> the mediastinal contours . no pleural effusion . degenerative changes . the right humerus .
ACTUAL REPORT: <sos> the cardiomediastinal silhouette normal size and contour . no focal consolidation pneumothora large pleural effusion . tspine osteophytes . <eos>
GENERATED REPORT: <sos> the mediastinal contours . no pleural effusion . no displaced rib .

Evaluating the model using Beam Search Algorithm:

Let us predict the sentences using the beam search algorithm. Instead of greedily choosing the single most likely output at each step, beam search expands all possible next words for each candidate sequence and keeps the k most probable sequences. Here k, the beam width, is a user-specified parameter that controls the number of beams, or parallel searches, through the sequence of probabilities. With k=1, beam search is the same as greedy search. Increasing k can improve the quality of the output, but it also increases the computation time, so there is a trade-off between the two.

In this case study, we consider k=3.
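A compact sketch of beam search follows, using a hypothetical `toy_step` function in place of the trained decoder. The toy probabilities are chosen so that the greedy choice (k=1) ends in a less probable sentence than a wider beam (k=3). The sketch ranks sequences by their summed log probability without length normalisation, which is an assumption, as are all names used here.

```python
import math


def toy_step(image_features, tokens):
    """Stand-in decoder over the ids 0=pad, 1=sos, 2=word A, 3=word B, 4=eos."""
    last = tokens[-1]
    if last == 1:                          # after "sos": A looks best at first
        return [0.0, 0.0, 0.6, 0.4, 0.0]
    if last == 2:                          # after A: probability is spread out
        return [0.0, 0.0, 0.3, 0.3, 0.4]
    return [0.0, 0.0, 0.05, 0.05, 0.9]     # after B: "eos" is very likely


def beam_search(step_fn, image_features, sos_id, eos_id, beam_width=3, max_len=50):
    beams = [([sos_id], 0.0)]              # (tokens, summed log probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == eos_id:       # finished sequences are carried over
                candidates.append((tokens, score))
                continue
            probs = step_fn(image_features, tokens)
            for word_id, p in enumerate(probs):
                if p > 0:
                    candidates.append((tokens + [word_id], score + math.log(p)))
        # Keep only the k most probable sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == eos_id for tokens, _ in beams):
            break
    return beams[0][0]


greedy_result = beam_search(toy_step, None, sos_id=1, eos_id=4, beam_width=1)
beam_result = beam_search(toy_step, None, sos_id=1, eos_id=4, beam_width=3)
print(greedy_result, beam_result)
```

With k=1 the search commits to word A and ends with a total probability of 0.6 × 0.4 = 0.24, while k=3 keeps word B alive long enough to find the sequence with probability 0.4 × 0.9 = 0.36.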


This model achieved an average BLEU score of 0.66, a clear improvement over the previous model's 0.38.

Even our attention model cannot generate an accurate report for every image. One likely reason is the size of the dataset: keep in mind that we are training this model on just 2758 data points. To learn more complex features, the model will need more data.

You can find the whole code here and reach me here


I deployed the best-performing model using a Flask API. Below is a preview of it.


You can find code for deploying here


Let’s summarize what we have done from the start:

  • We just saw an application of image captioning in the medical field. We understood the problem and the need for such an application.
  • Created an Encoder-Decoder model which gave us decent results.
  • Improved the base results by building an Attention model.


  • We had a smaller dataset for this problem. A larger dataset might produce better results.
  • Using more advanced techniques like transformers or BERT might yield better results.




