Parallel Image Captioning Using 2D Masked Convolution

Featured Application: This work can be used in applications that require image-context analysis, such as image/video indexing, content description for the visually impaired, and automatic image/video annotation.

Abstract: Automatically generating a novel description of an image is a challenging and important problem that brings together advanced research in both computer vision and natural language processing. In recent years, image captioning has significantly improved its performance by using long short-term memory (LSTM) as a decoder for the language model. However, despite this improvement, LSTM itself has its own shortcomings as a model because the structure is complicated and its nature is inherently sequential. This paper proposes a model using a simple convolutional network for both encoder and decoder functions of image captioning, instead of the current state-of-the-art approach. Our experiment with this model on the Microsoft Common Objects in Context (MSCOCO) captioning dataset yielded results that are competitive with the state-of-the-art image captioning model across different evaluation metrics, while having a much simpler model and enabling parallel graphics processing unit (GPU) computation during training, resulting in a faster training time.


Introduction
Image captioning (a task that allows a computer to automatically describe the content of an image in natural language) is a very challenging problem. To do this, the machine is required to not only recognize the objects contained in the image but to also understand the relationship between each object in the scene, along with certain attributes such as color, quantity, size, etc. As challenging as it seems, image captioning also holds great interest in the research community because it has many benefits, ranging from assisting the visually impaired in understanding image content to improving the image indexing and annotation tasks [1].
For many years, the main approach to image captioning has always been a combination of a convolutional neural network (CNN) as an encoder and a recurrent neural network (RNN) as a decoder. This approach has shown great results in the image captioning task, and it has seen many different adaptations and improvements from the research community. However, RNNs have known limitations, such as the parallel training problem and vanishing gradients, which result from the sequential processing nature of the model [2]. Even though long short-term memory (LSTM) [3] was later used in place of the RNN to solve the vanishing gradient problem, LSTM itself requires a large amount of memory, since it has to compute multiple linear layers per cell during its sequential learning. For this reason, many studies have been done to find a way to solve this problem [4,5].
Recent research in neural machine translation has seen many successful attempts to replace LSTM with a convolutional-based architecture [6,7]. Inspired by these studies, we also propose an end-to-end convolutional-based image captioning approach that has a result comparable to the traditional CNN-LSTM model while also enabling parallel graphics processing unit (GPU) computing, which reduces the model training time on a Microsoft Common Objects in Context (MSCOCO) dataset from two days to less than 20 h. On the MSCOCO test set, we achieved evaluation scores of 73.6 with the BLEU-1 metric, 25.9 with METEOR, and 99.4 with CIDEr, which outperforms the conventional LSTM and non-LSTM approaches by a considerable margin, and is comparable to the state-of-the-art image captioning model [8].
This paper makes two main contributions to image captioning research. First, we created a convolutional-based image captioning model that gives state-of-the-art performance over non-LSTM based models in some evaluation metrics. Second, we created a model that requires fewer parameters and less training time, compared to the state-of-the-art LSTM model, while also enabling parallel GPU computation.

Image Captioning
The task of describing visual data in natural language form has been studied since the early days of deep learning. The earliest approaches took the form of an image classification task [9], where a model tries to classify an image into different categories. After that, other authors [10,11] proposed models that make use of region proposals to generate bounding boxes on different areas of the image.
Recently, interest in the field of image captioning has grown tremendously, with researchers using recurrent connections such as LSTM for their caption generation models. One of the earlier adoptions of this approach was proposed by Vinyals et al. [12]. This method is a simple multimodal architecture that uses a CNN to extract features of the image, which serve as the input for the initial state of the LSTM model. During each sequence, the input at time t largely depends on the output of the LSTM at time t − 1, and the training process happens sequentially, one word at a time, until it reaches a special end-of-sentence token. This method effectively generates a meaningful caption but lacks an understanding of the spatial relationship between the image and the current word token. To solve this problem, Xu et al. [13] extended the work of Vinyals et al. [12] by proposing two attention methods that make use of the features from the last convolutional layer of the CNN to calculate a weighted context vector of the image, allowing the model to focus on the salient regions of the image when generating the caption. Besides improving the architecture of the model, Rennie et al. [14] proposed an optimization approach that improves the results of the image captioning model by using reinforcement learning to normalize the reward during test-time inference. Using this method, the model can identify and give positive weight to the best samples and reduce the possibility of generating an inferior sample. Many attention methods in image captioning use top-down approaches, which consider the whole context of the image when generating the salient image regions, while giving little focus to the objects that appear in the image. Therefore, Anderson et al. [8] proposed a method combining both top-down and bottom-up attention mechanisms into the decoder layer of image captioning.
For the bottom-up attention, the authors used the Faster R-CNN [15] model to detect the regions of all objects present in the image and used mean-pooled convolutional features to convert these regions into image features. This method established state-of-the-art results, achieving first place on the MSCOCO test server.
Despite making different contributions to image captioning research, all the models mentioned above share the same sequential training structure, where a token at time t can be processed only after the process at time t − 1 is completed. Due to this sequential computation of the LSTM, the model cannot utilize the parallelism of multiple GPUs, which can be an issue when performing longer-sequence tasks that require larger memory. To solve this problem, there has also been considerable research on using convolution and self-attention models for this task. Aneja et al. [16] first proposed a convolutional approach based on earlier work [7], replacing LSTM cells with a CNN and an attention mechanism, while Zhu et al. [17] used the self-attention approach of Vaswani et al. [18] for the decoder part of the image captioning model.

Sequence-to-Sequence Model
Sequence-to-sequence tasks, such as neural machine translation, are dominated by the RNN-based encoder-decoder model. However, since recurrent connections in sequential learning have many issues, most of them related to the sequential processing of the network, new research has focused on replacing the RNN with different approaches, such as the convolutional network and attention. Successful non-RNN work is presented in [19], where the authors used the concept of the masked convolutional network to restrict information about future tokens from the current state that is being predicted. By doing so, the model only uses information from the current and past time steps to learn the word sequence. Later, the authors of [6,7] adopted this method and further improved the model by using gated linear units (GLUs) [20], allowing the network to learn the long-range dependencies of the input words so that it performs even better than most RNN approaches on the machine-translation task. Besides the convolutional approach, Vaswani et al. [18] proposed an attention-based model that uses positional encoding and multi-headed self-attention to learn the sequence order of the input tokens and generate targeted sentences without using any sequence-aligned recurrent or convolutional neural network.

Materials and Methods
The main goal of our model is to automatically generate a contextual description of the input, using a convolutional approach for both encoder and decoder functions of the model. To do so, we replace the LSTM in the decoder with the masked convolutional approach used in [6]. Unlike the LSTM approach, where the caption model processes one word token at a time, our model can process all tokens in the sentence at once in a feed-forward manner. By doing so, the model can also take advantage of multi-GPU computing power, allowing a large batch of input for each training iteration and a faster convergence speed. Figure 1 shows the detailed implementation of the model.
The overall process of our image caption model is as follows: during training, our CNN encoder receives input image and learns its feature representation, while our decoder takes the output feature from the encoder and the corresponding captions to establish a relationship, and then outputs a text description of the image. We discuss each part of the model and their implementation in the next sections.

Image Encoder
Our encoder plays the role of a feature extractor that learns the representative feature of the input image. In this work, we have chosen to use a variant of the CNN architecture called Resnet-101 [21], which makes use of a deep residual connection to learn the high-level information of the image. To get the feature of the image, we use the output from the last convolutional layer of the Resnet-101 model and apply adaptive average pooling to get a 512-dimensional feature map, Q, sized at 14×14, which we use for self-attention. In addition, we apply a spatial average across all pixels to get a 2048-dimensional feature representation and a linear layer to map our image feature to a 512-dimensional vector, which is used as input for the decoder.
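To illustrate the pooling steps above, the following is a minimal pure-Python sketch of adaptive average pooling over a 2D feature grid (a single channel, represented as a list of lists). The function name and representation are ours for illustration; in practice a framework operator such as PyTorch's adaptive pooling would be used.

```python
def adaptive_avg_pool2d(fmap, out_h, out_w):
    """Average-pool a 2D grid (list of lists) down to a fixed out_h x out_w size.
    Each output cell averages the input cells that fall into its bin."""
    in_h, in_w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(out_h):
        # Row range covered by output row i (ceiling division for the end).
        r0, r1 = (i * in_h) // out_h, ((i + 1) * in_h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, ((j + 1) * in_w + out_w - 1) // out_w
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled
```

Pooling to a 1×1 output corresponds to the spatial average mentioned above, performed per channel.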

Decoder
The decoder in our model consists of three main components: a word-embedding layer, a 2D masked convolutional layer, and the prediction layer. First, the word-embedding layer encodes the caption of the image into an n-dimensional vector space. Then, this output from the word-embedding layer is concatenated with the image feature from our CNN encoder, resulting in a tensor that is further processed through the 2D masked convolutional layer in order to generate a meaningful output caption. Each of these layers is thoroughly described below.

Word Embedding
Before the sentence from the dataset can be input into the word-embedding layer, it needs to be converted into the |V| × 1-dimensional space using a 1-hot encoder, where |V| is the size of the vocabulary dictionary that we generate using all the words existing in the dataset. Then, each word vector is input into this embedding layer, which has a size of 512 × |V|, outputting a 512-dimensional vector that is concatenated with the image features from the CNN encoder.
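As a sketch of this lookup, multiplying a 512 × |V| weight matrix by a 1-hot vector simply selects one column, which is why embedding layers are implemented as table lookups. The tiny helpers below (names ours, toy dimensions) show the equivalence:

```python
def one_hot(index, size):
    """Encode a word index as a |V|-dimensional 1-hot vector."""
    v = [0] * size
    v[index] = 1
    return v

def embed(one_hot_vec, weight):
    """weight is a d x |V| matrix (list of d rows). Multiplying by a
    1-hot vector is equivalent to selecting the column at the hot index."""
    idx = one_hot_vec.index(1)
    return [row[idx] for row in weight]
```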

Positional Encoding
Besides using word embedding to embed a sentence into a word vector, we also apply positional encoding to the 1-hot encoded vector of the sentence, enabling the model to understand the positional information of the input tokens and to make sense of the parts of the sentence passed to the model. In this work, we use the sinusoidal version of the positional encoding from [18], because it allows the model to generalize to longer sequences at inference time. Sinusoidal positional encoding makes use of the sine and cosine functions to generate a wavelength that represents the frequency for different word tokens. The positional encoding can be calculated with the equations below:

PE(pos, 2j) = sin(pos/10000^(2j/d_model)) (1)

PE(pos, 2j+1) = cos(pos/10000^(2j/d_model)) (2)

where pos is the position, j is the dimension of our input, and d_model represents the embedding dimension of our word vector. After applying positional encoding to our word vector, we sum its output with the output from word embedding, resulting in a 512-dimensional vector, emb_i.
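Equations (1) and (2) can be computed directly; a minimal sketch for a single position (function name ours) is:

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding from Vaswani et al.:
    even dimensions use sine, odd dimensions use cosine,
    with wavelengths forming a geometric progression."""
    pe = [0.0] * d_model
    for j in range(0, d_model, 2):
        angle = pos / (10000 ** (j / d_model))
        pe[j] = math.sin(angle)
        if j + 1 < d_model:
            pe[j + 1] = math.cos(angle)
    return pe
```

The resulting vector is summed element-wise with the 512-dimensional word embedding to produce emb_i.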

2D Masked Convolutional Layer
Unlike the traditional image captioning approach, where the input is processed through the LSTM model, we decided to use this convolutional approach as a replacement. This approach was inspired by the method of Elbayad et al. [6], which uses this masked convolutional layer for both encoder and decoder in the sequence-to-sequence problem of neural machine translation. Consistent with Elbayad et al. [6], our convolutional layer is adopted from the DenseNet architecture [22], in which each layer's input is taken from the preceding layer to produce a long-distance connection that is useful for solving the vanishing gradient problem and that improves the gradient flow. Figure 2 shows the use of this architecture. In this architecture, we first apply batch normalization and rectified linear units (ReLUs) to our input from the encoded word-embedding vector. Then, we apply a 1D convolution layer to downsample this output to reduce the computational cost. This layer is followed by more batch normalization and ReLUs, and we apply the masked convolution layer, which has a kernel size of k = 3. This convolution-layer mask allows us to mimic the LSTM approach by restricting the flow of future information from our input, and, hence, only making the prediction based on past data. In addition to this, we apply the tanh activation function between the convolutional layers of our model, which gives the output of a 512-dimensional vector for each word in the sentence. For our model, we stack this 2D masked convolutional layer l = 4 times, which increases h_l_i, the state size (receptive field) of the input elements, to 9. Between each layer, we also apply GLU nonlinearity so that the model focuses only on the most important elements.
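The masking idea can be sketched in pure Python for a single channel: the output at position i only ever reads positions i − k + 1 through i, never future tokens, and stacking l layers of kernel k grows the visible context to 1 + l·(k − 1), which for k = 3 and l = 4 gives the state size of 9 mentioned above. (Function names and the scalar-sequence simplification are ours for illustration.)

```python
def masked_conv1d(seq, kernel, pad=0.0):
    """Causal ('masked') 1D convolution: the output at position i is a
    weighted sum over positions i-k+1 .. i only, with zero-padding on the left."""
    k = len(kernel)
    out = []
    for i in range(len(seq)):
        window = [(seq[i - k + 1 + t] if i - k + 1 + t >= 0 else pad)
                  for t in range(k)]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

def receptive_field(kernel_size, layers):
    """Stacking `layers` masked convolutions of size k gives a
    receptive field of 1 + layers * (k - 1) past tokens."""
    return 1 + layers * (kernel_size - 1)
```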
Aside from the GLU unit, a residual connection was also added to each layer of this 2D masked convolutional layer to allow the model to have a deep residual connection [21] between each of the output states. The residual connection can be expressed in the equation below:

h^l_i = v(W^l [h^(l−1)_(i−k+1), ..., h^(l−1)_i] + b^l) + h^(l−1)_i

where l is the layer of our 2D masked convolution, i represents the input token, W^l is the convolution weight, v is the nonlinearity, k is the kernel size, and b^l is the bias.
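A minimal sketch of the two mechanisms named here, assuming the simplified scalar-vector representation of the earlier sketches (function names ours): the GLU splits its input vector in half and gates the first half with the sigmoid of the second, and the residual connection adds the layer input back onto the layer output.

```python
import math

def glu(x):
    """Gated linear unit (Dauphin et al.): split the vector in half and
    gate the first half element-wise with the sigmoid of the second half."""
    h = len(x) // 2
    a, b = x[:h], x[h:]
    return [ai * (1.0 / (1.0 + math.exp(-bi))) for ai, bi in zip(a, b)]

def residual_layer(h_prev, layer_fn):
    """h^l = layer(h^{l-1}) + h^{l-1}: the input is added back to the
    layer output so gradients can flow through the identity path."""
    out = layer_fn(h_prev)
    return [o + x for o, x in zip(out, h_prev)]
```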

Multi-Step Attention
Attention plays a very important role in the language generation task, since it allows the model to understand which part of the sentence it should focus on during each sequence. Besides the language task, the attention mechanism has also proved to be effective in image captioning models [13]. In this method, we implement the multi-step attention layer from Gehring et al. [7], which uses emb_i to help provide word context, together with the image feature Q and residual connections, allowing the model to attend to all parts of the image across different layers. The implementation of multi-step attention is shown in the equations below:

A^l_i = W^l h^l_i + b^l + emb_i

v^l_ij = A^l_i · z^u_j

a^l_ij = exp(v^l_ij) / Σ_{t=1}^{m} exp(v^l_it)

c^l_i = Σ_{j=1}^{m} a^l_ij (z^u_j + Q_j)

where A^l_i represents the decoder state summary of state i for layer l, emb_i is the output word-embedding vector, v^l_ij is the dot product between the state summary and the convolutional output, z^u_j represents the output of each 2D masked convolution, u, for sequence j, a^l_ij is the attention element, m is the length of the sentence, Q_j is the image feature attended at position j, and c^l_i is the attended vector for word i. Figure 3 shows the detailed implementation of the 2D convolution architecture of the decoder model.

Figure 3. The detailed implementation of our 2D convolution. After receiving the output from word embedding, we apply a 2D masked convolutional layer followed by a gated linear unit and an attention layer. We stack this 2D convolution four times before applying output embedding to get the output word sequences.
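The core of this mechanism is dot-product attention: score a query state against every key, normalize the scores with a softmax, and return the weighted sum of the values. A minimal pure-Python sketch (names and the single-vector simplification are ours):

```python
import math

def softmax(xs):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(state_summary, keys, values):
    """Dot-product attention: score the decoder state summary against every
    key, softmax-normalize, and return the weighted sum of the values."""
    scores = [sum(s * k for s, k in zip(state_summary, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

In multi-step attention, this computation is repeated at every one of the four stacked layers, so later layers can refine where earlier layers attended.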

Prediction Layer
For each word, we apply max pooling to the output vector that we get from the convolutional layer. Then, we apply linear embedding with a size of |V| × 512 to our pooled feature to map this output to the |V|-dimensional vector. Finally, we apply Softmax to this output in order to obtain the probability distribution of the word in the dictionary.
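The three steps of the prediction layer (max pooling, linear projection to |V| logits, softmax) can be sketched as follows, using toy dimensions and a function name of our own:

```python
import math

def predict_word(conv_outputs, proj):
    """Max-pool per-position convolutional output vectors across positions,
    project the pooled feature to vocabulary-size logits with `proj`
    (a |V| x d weight matrix), and softmax into a probability distribution."""
    dim = len(conv_outputs[0])
    pooled = [max(vec[d] for vec in conv_outputs) for d in range(dim)]
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in proj]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]
```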
To evaluate and learn the parameters at each iteration of our model, we use cross-entropy loss as our loss function. Let us denote θ as all the parameters of our 2D convolutional model; then, our objective is to minimize the loss, as follows:

L(S|I, θ) = − Σ_{i=1}^{m} log p(s_i | s_j<i, I, θ)

where s_i is the output token for sequence i, m is the total length of the sequence, s_j<i denotes the word sequence before i, and L(S|I, θ) is the cross-entropy loss for our sentence.
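Given the per-step probabilities the model assigns to the ground-truth tokens, the loss is just the sum of negative log-likelihoods. A sketch (function name ours):

```python
import math

def caption_loss(token_probs):
    """Cross-entropy of a caption: token_probs[i] is the probability the
    model assigned to the ground-truth token s_i given s_<i and the image.
    Returns -sum_i log p(s_i | s_<i, I, theta)."""
    return -sum(math.log(p) for p in token_probs)
```

A perfectly confident model (all probabilities 1.0) gets zero loss; any uncertainty about a ground-truth token increases the loss.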

Dataset
We used the MSCOCO 2014 dataset [23] as the experimental dataset for our model. MSCOCO is a large-scale image dataset that consists of 123,287 images, each of which has five corresponding captions. We split our dataset into training/validation/testing splits using the method from [24]. The splitting resulted in 113,287 training images, 5000 validation images, and 5000 testing images. Before feeding the images into our model, we scale and normalize them to match the input size expected by the CNN model. In the training set, we also preprocessed all the captions of each image by replacing all words with fewer than five occurrences with the <UNK> token and by appending <START> and <END> tokens to the vocabulary set, resulting in a vocabulary size of 8856 words.
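The caption preprocessing described above can be sketched as follows (function names ours; the count threshold and special tokens are those stated in the text):

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Keep words with at least min_count occurrences; everything else
    will map to <UNK>. <START>/<END> sentence markers are always included."""
    counts = Counter(w for cap in captions for w in cap.split())
    vocab = {"<START>", "<END>", "<UNK>"}
    vocab.update(w for w, c in counts.items() if c >= min_count)
    return vocab

def encode_caption(caption, vocab):
    """Wrap a caption with sentence markers and replace rare words."""
    words = ["<START>"] + caption.split() + ["<END>"]
    return [w if w in vocab else "<UNK>" for w in words]
```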

Training and Evaluation
For training, we used stochastic gradient descent with an adaptive learning rate as our optimization algorithm, in which the learning rate starts at 4 × 10^−4 and is scaled by a factor of 0.1 every 10 epochs. We also used the method from Vinyals et al. [25] on our image encoder model, fine-tuning it only after 15 epochs in order to reduce the noise coming from the initial gradient of the decoder model. In addition, to avoid overfitting, we applied dropout to each layer of our convolution, as well as the prediction layer, with a dropout rate of 0.1. For this model, we used the same embedding dimension, d = 512, for all the word embedding, the attention, and the convolutional model. Using a batch size of 64, our model was trained for 30 epochs on two Nvidia Titan X GPUs in parallel for less than 20 h.
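The step-decay schedule described above can be written as a one-line function of the epoch (function and parameter names ours; the constants are those stated in the text):

```python
def learning_rate(epoch, base_lr=4e-4, decay=0.1, step=10):
    """Step decay: scale base_lr by `decay` once every `step` epochs."""
    return base_lr * (decay ** (epoch // step))
```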
To evaluate the performance of our model, we used multiple conventional language evaluation metrics: BLEU [26], METEOR [27], and CIDEr [28]. Each metric was evaluated using the caption generated by our model and the ground truth caption in the validation and testing datasets.
During each epoch, we evaluated our model based on the BLEU-1 metric, and saved the model with the highest score.

Inference
To generate a caption sentence for a new image, we first encoded the image into the feature space using our Resnet-101 CNN encoder. After getting the feature map, we fed it along with the <START> token to our model in a simple feed-forward manner. Similar to the RNN/LSTM approach, the inference for our CNN approach was sequential, meaning that the output from the first token was used as feed-forward input for the next token. We performed this operation sequentially until the model reached the <END> token or the maximum sentence length.
To improve the performance of our language model generation, we also implemented beam search, which was used during the inference/testing time of our model. In each sequence generation, beam search maintained a list of the top-k sequence tokens generated by the model, rather than selecting the highest probable token. After all the top-k sentences were generated, we chose the sentence with the best overall score that was calculated by the beam search algorithm. In this work, we experimented with different sizes, k, for the beam, ranging from k = 1 (greedy decoding) to k = 5. After comparing the result with the different beam sizes, we found that k = 3 gives the optimal performance, with the best overall result across all the evaluation metrics on the MSCOCO test dataset. Figure 4 shows the top three caption results generated by beam search.
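The beam search procedure described above can be sketched in pure Python. Here `step_fn` stands in for the decoder: given a partial sequence, it returns candidate next tokens with their probabilities (all names are ours; scores are summed log-probabilities, and k = 1 reduces to greedy decoding):

```python
import math

def beam_search(step_fn, start, end, beam_size=3, max_len=20):
    """Maintain the top-k partial sentences by summed log-probability.
    step_fn(seq) -> list of (token, prob) candidates for the next position."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:           # completed sentence: set it aside
                finished.append((seq, score))
                continue
            for tok, p in step_fn(seq):  # expand each live beam
                candidates.append((seq + [tok], score + math.log(p)))
        if not candidates:
            break
        # Keep only the top-k expansions.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(b for b in beams if b[0][-1] == end)
    if not finished:                     # no beam reached <END> within max_len
        finished = beams
    return max(finished, key=lambda c: c[1])[0]
```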

Result and Discussion
To show how well our model performs, we compared the results on the MSCOCO testing dataset against some of the popular traditional image captioning models with LSTM, as well as other non-LSTM methods, including those in [16,17], using the above evaluation metrics. As shown in Table 1, our model obtained better results than most of the LSTM models across all the evaluation metrics, while it still fell behind the state-of-the-art results of the LSTM approaches in [8,14]. However, from our understanding, the significant performance increase of the methods in [8,14] results from a scheduled sampling method as well as from training with the REINFORCE algorithm [29], which reduces the variance of the gradient and increases the probability of sampling the correct caption. Thus, we believe that, by carefully fine-tuning the model, our results could be improved to reach those of the state-of-the-art LSTM approaches. Figure 5 shows a qualitative comparison between our generated captions and the ground truth.

We also compared the results of our model to the non-LSTM approaches in [16,17]. Aneja et al. [16] implemented their image captioning model with a technique similar to our approach, which replaces the LSTM cells with a masked convolution network. While the architecture in [16] is conceptually simpler and uses fewer parameters, the results with our model are much better, outperforming that approach across all evaluation metrics. As for Zhu et al. [17], their approach to the image captioning model is much different from ours, making use of the multi-headed self-attention module in [18] to generate the word sequence without the presence of LSTM. With this architecture, their best-performing model generated caption results that are comparable to our model across most of the evaluation metrics, but performed better based on CIDEr and worse based on BLEU-1.
The reason for the better performance in CIDEr may be that their model was trained and optimized directly on CIDEr. Figure 6 shows the training progression of our model. Another thing to notice is the increase in speed during the training process. Since our model uses a convolutional approach, we can feed all the words as input to the model at once, unlike the LSTM approach, which works in a sequential process, one word at a time. By feeding all the words at once, our model also benefits from parallel GPU computing, allowing us to use a large batch size for training and to reduce the training time per epoch by 35%. Table 2 shows a performance comparison between our model and other image captioning models. Compared to other models, our model may have more parameters than the method in [12], but the training for each epoch can be completed in a comparable time while achieving significantly better results across all metrics. This performance improvement becomes even more apparent when comparing with the method in [13], since our model can be trained twice as fast with fewer parameters and better results. The method in [8] may have a better result, but its training speed is slower than that of our proposed method. This also does not consider that the method in [8] uses Faster R-CNN to pre-process all the bottom-up attention features before inputting them to the captioning model. This pre-processing can significantly speed up the training, since the model does not need to extract the image features during training, but the model no longer performs in an end-to-end manner. If we consider the time spent processing the bottom-up attention features, training that model would take much longer than stated in the table. In our study, we ran up to four GPUs in parallel; however, we found that the speed-to-accuracy trade-off was best with two GPUs. Using more than that reduced the accuracy of our predicted sentences.
Table 3 shows the performance of our image captioning model when trained on different multi-GPU setups. We hypothesize that this performance loss comes from the information lost during data communication across GPUs in the feed-forward and back-propagation steps of the training process [30,31].

Table 2. Comparison of the number of parameters and the training speeds per epoch of different image captioning models. The results for [12,13] come from our implementation of those papers using PyTorch, whereas the result for [8] uses the open-source implementation in Caffe.

Conclusions
In this study, we successfully implemented an image captioning model using a convolutional neural network that enables parallel GPU processing for this task. Our model achieves a substantial improvement in training speed while obtaining results comparable to those of the traditional approaches.
From this study, we believe there is still room for improvement of our model. One of the first things we want to do is to fine-tune the hyper-parameters to find the optimal performance. Another improvement would be to use the reinforcement learning approach of Rennie et al. [14] during training, which we believe will improve the results. We also want to explore the differences in this reinforcement technique between the LSTM and convolutional approaches and whether there would be any changes or improvements to the speed and results of the model. In addition, we want to try longer-sequence captioning tasks. All of this will be kept in mind for future work.
Funding: This research received no external funding.