Structure Preserving Convolutional Attention for Image Captioning

Abstract: In the task of image captioning, learning attentive image regions is necessary to adaptively and precisely focus on the object semantics relevant to each decoded word. In this paper, we propose a convolutional attention module that preserves the spatial structure of the image by performing convolution operations directly on the 2D feature maps. The proposed attention mechanism contains two components, convolutional spatial attention and cross-channel attention, which determine the intended regions for describing the image along the spatial and channel dimensions, respectively. Both attentions are computed at every decoding step. To preserve the spatial structure, instead of operating on the vector representation of each image grid, both attention components are computed directly on the entire feature maps with convolution operations. Experiments on two large-scale datasets (MSCOCO and Flickr30K) demonstrate the outstanding performance of our proposed method.


Introduction
Image captioning aims to automatically generate a natural language sentence for a given image [1][2][3][4][5][6], and encoder-decoder frameworks with attention mechanisms have achieved great progress on this task in recent years. Usually, a Convolutional Neural Network (CNN) encodes the visual features and a Recurrent Neural Network (RNN) generates the caption [7,8]. The attention mechanism [5,6,9] produces a dynamic, spatially localized image representation focusing on certain parts of the input image. In most existing work, the image parts are encoded as a set of vector representations corresponding to different grids on the feature maps, which are treated as independent from each other [5,10]; we call this grid-based attention. Grid-based attention, realized by a fully connected layer, treats the image features as a set of independent vectors, each corresponding to a region in the image grid, computes an attention weight for each vector, and aggregates them with a weighted sum. However, this operation completely breaks the spatial structure between grids, which can prevent the model from fully understanding the scene.
This motivates us to explore an alternative to grid-based attention in image captioning. Instead of operating on the vector representation of each image grid, our attention is computed directly on the entire 2D feature map with convolution operations. As opposed to the standard formulation, this alternative preserves spatial locality and therefore strengthens the role of visual structures in the process of caption generation. In this paper, we propose a convolutional attention module, called Structure Preserving Convolutional Attention (SPCA), that preserves the spatial structure of the image by applying convolution operations directly on the 2D feature maps. SPCA has two submodules, convolutional spatial attention and cross channel attention, which adaptively determine the intended regions for describing the image according to the current decoding state. As shown in Figure 1, the top row visualizes the results of grid-based attention. The attentive regions are inaccurate because the spatial structure of the image features is not preserved when computing the attention, resulting in partial deviation. In contrast, with our SPCA (bottom row), the resulting attention area is precise. To verify the effectiveness of the proposed attention module for image captioning, we apply it to two distinctive models: a standard 1D Long Short-Term Memory (LSTM) model [5] and a recently proposed model that represents the latent states of the LSTM with 2D maps [19]. Experiments on two large-scale datasets (MSCOCO and Flickr30K) show that our attention module performs well in both models.
The contributions of this paper are as follows:
• We propose a convolutional spatial attention for preserving spatial structures in the attention map.
• Two attention components, namely cross-channel attention and convolutional spatial attention, are designed to adaptively determine 'what' and 'where' to focus on when predicting each word.
• Extensive experiments on Flickr30K [11] and MSCOCO [12] show the effectiveness of the spatial and channel attention mechanisms. In addition, our approach demonstrates strong performance and generalization ability when applied to two distinctive models with 1D and 2D LSTM latent states.

Image Captioning
Image captioning is the task of generating short descriptions for given images, and it has been an active research topic in computer vision. Early techniques mainly rely on detection results, first extracting a set of attributes related to elements within an image and then generating the language description. In recent years, in view of the great successes of Deep Neural Networks (DNNs) in computer vision, a number of works [2,5,6,[13][14][15][16] have developed neural-network-based methods to generate image captions. Specifically, these methods all use the encoder-decoder paradigm [17], which uses Convolutional Neural Networks (CNNs) to encode the images as features and then generates captions with Recurrent Neural Networks (RNNs) or one of their variants, e.g., the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM).

Attention Mechanism in Captioning
Visual attention has been widely used in various image captioning models in order to allow models to selectively concentrate on objects of interest. Xu et al. [5] combine the memory vector of the LSTM with visual features from a CNN and feed the fused features to an attention network to compute the weights for features at different spatial locations. Yang et al. [18] propose a reviewer module that applies the visual attention mechanism multiple times while generating the next word. In [10], an adaptive attention mechanism is proposed to determine when to look and where to look, so that words carrying no visual information, such as 'a' and 'the', do not attend to the visual features. Chen et al. [13] introduce channel-wise attention, which operates on different filters within a convolutional layer. Most of these models generate visual attention as a vector and pay little attention to temporal information. Without spatial structure and temporal information, the computed attention may fail to locate objects accurately and may attend to irrelevant content at the next step.

2D-Latent-State LSTM
A 2D-latent-state LSTM is proposed in [19]. For the image captioning task, it is important to capture and preserve properties of the visual content in the latent states; this variant represents the latent states with 2D maps and connects them via convolutions. As opposed to the standard formulation, it preserves spatial locality and may therefore strengthen the role of visual structures in the process of caption generation. This motivates us to rethink the attention mechanism, and we propose a convolutional spatial attention as follows.

Overview
We start by briefly describing the encoder-decoder image captioning framework [5,6], and then we describe our SPCA modules.
Encoder-decoder framework: Given an image and the corresponding caption, the encoder-decoder model is directly optimized by the following objective:

θ* = arg max_θ Σ_{(I,y)} log p(y | I; θ),

where θ denotes the parameters of the model, I is the given image, and y = (y_1, . . . , y_n) is the corresponding caption. We adopt an LSTM for decoding image features into a sequence of words. The updates for the hidden units and cells of the LSTM are defined as:

h_t, m_t = LSTM([x_t; c_t], h_{t−1}, m_{t−1}),

where x_t is the embedded word representation, c_t is the context representation, and [ ; ] denotes concatenation. Furthermore, h_t and m_t are the hidden state and memory cell at time t.
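The decoding step above can be sketched as follows; the dimensions are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# One LSTM decoding step: the embedded word x_t and the attended visual
# context c_t are concatenated and fed to an LSTMCell, which updates the
# hidden state h_t and memory cell m_t. Dimensions are hypothetical.
embed_dim, ctx_dim, hidden_dim = 512, 2048, 512
lstm = nn.LSTMCell(embed_dim + ctx_dim, hidden_dim)

x_t = torch.randn(1, embed_dim)      # embedded word at step t
c_t = torch.randn(1, ctx_dim)        # attended visual context at step t
h_prev = torch.zeros(1, hidden_dim)  # h_{t-1}
m_prev = torch.zeros(1, hidden_dim)  # m_{t-1}

h_t, m_t = lstm(torch.cat([x_t, c_t], dim=1), (h_prev, m_prev))
# h_t and m_t both have shape (1, hidden_dim)
```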
Commonly, the context representation c_t is an important factor in the neural encoder-decoder framework, as it provides the visual evidence for generating the caption. The attention mechanism [5] has been proven crucial in producing c_t:

c_t = ATT(V, h_{t−1}),

where ATT is the attention function and V ∈ R^{C×W×H} (C, W and H represent the channel, width and height, respectively) is the image feature map output by the CNN image encoder.
As for conventional spatial attention models, ATT is a grid-based attention, and the context representation c_t is a vector:

e_{ti} = μ^T tanh(W_v V_i + W_h h_{t−1}),
α_{ti} = softmax(e_{ti}),
c_t = Σ_i α_{ti} V_i,

where V_i ∈ R^C is the vector representation corresponding to the i-th grid of the image features, W_v, W_h and μ are parameters to be learnt, and α_{ti} is the attention weight for V_i.
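For reference, this conventional grid-based attention can be sketched as follows; the parameter names mirror the text, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Grid-based attention sketch: each of the W*H grid vectors V_i is scored
# independently through fully connected layers, then aggregated by a
# weighted sum -- the operation our SPCA replaces.
C, W, H, hidden_dim, att_dim = 2048, 14, 14, 512, 512
W_v = nn.Linear(C, att_dim, bias=False)
W_h = nn.Linear(hidden_dim, att_dim, bias=False)
mu = nn.Linear(att_dim, 1, bias=False)

V = torch.randn(1, C, W, H).flatten(2).transpose(1, 2)  # (1, W*H, C) grid vectors
h_prev = torch.randn(1, hidden_dim)                     # h_{t-1}

scores = mu(torch.tanh(W_v(V) + W_h(h_prev).unsqueeze(1)))  # e_{ti}: (1, W*H, 1)
alpha = F.softmax(scores, dim=1)                            # attention weights
c_t = (alpha * V).sum(dim=1)                                # context vector, (1, C)
```

Note that the softmax normalizes over all grids jointly, but each score e_{ti} depends only on the single vector V_i, which is exactly why the spatial structure between grids is lost.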

Structure Preserving Convolutional Attention
As defined above, the attention weight α_{ti} is calculated on the vector representation of each grid independently by a fully connected layer. However, this operation completely discards the spatial structure of the image regions, which is known to be significant in image captioning. To overcome this problem, instead of operating at the vector level, our SPCA attention is computed with convolution operations on the 2D feature maps, and thus preserves the spatial structure of the image regions. Our SPCA module is composed of two attention components, namely convolutional spatial attention and cross channel attention. Figure 2 depicts the framework of our attention module.

Note that in our SPCA, the input latent state h, the input image feature V and the output feature C_t are all represented as 3D tensors of size C × W × H. Such a tensor can be viewed as a multi-channel map comprising C channels, each of size W × H. For a general 1D-LSTM with latent state h ∈ R^{1×C}, we use a tile operation to copy h into a tensor of size C × W × H; conversely, a pooling layer reduces the output C_t of our SPCA back to a vector in R^{1×C}. For the 2D-LSTM, these operations can be omitted.
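The dimension adaptation for the 1D-LSTM case can be sketched as follows (sizes are illustrative):

```python
import torch

# Adapting between 1D and 3D representations: the 1D latent state
# h in R^{1xC} is tiled to a CxWxH tensor so it can be concatenated with
# the feature maps; the 3D context C_t is average-pooled back to a vector.
# Both steps are skipped when the decoder is the 2D-LSTM.
Cdim, Wdim, Hdim = 512, 14, 14

h_1d = torch.randn(1, Cdim)                                  # 1D latent state
h_2d = h_1d.view(1, Cdim, 1, 1).expand(1, Cdim, Wdim, Hdim)  # tiled to CxWxH

ctx_3d = torch.randn(1, Cdim, Wdim, Hdim)                    # SPCA output C_t
ctx_1d = ctx_3d.mean(dim=(2, 3))                             # pooled back to 1xC
```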

Structure Preserving Convolutional Attention
At the t-th step, we are given the image feature V ∈ R^{C×W×H}. After the dimension adaptation described above, we concatenate V along the channel dimension with the 2D latent state h_{t−1} ∈ R^{C×W×H} to form new feature maps F_t ∈ R^{2C×W×H}. Our attention process can then be summarized as:

M_t^C = SPCA_C(F_t),
M_t^S = SPCA_S(F_t),
C_t = V ⊗ M_t^C ⊗ M_t^S,

where ⊗ denotes (broadcasted) element-wise multiplication, SPCA_C and SPCA_S are the attention functions of the channel and spatial attention modules, M_t^C ∈ R^{C×1×1} and M_t^S ∈ R^{1×W×H} are the resulting channel and spatial attention weights, respectively, and C_t ∈ R^{C×W×H} is the resulting context representation.
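The overall flow can be sketched as follows, with stand-in tensors in place of the two attention submodules (which are detailed in the next subsections); dimensions are illustrative:

```python
import torch

# Overall SPCA flow: V and the tiled latent state h_{t-1} are concatenated
# into F_t (2C x W x H). The channel attention M_C (C x 1 x 1) and spatial
# attention M_S (1 x W x H) then re-weight V via broadcasted element-wise
# multiplication. Random sigmoids stand in for SPCA_C and SPCA_S here.
C, W, H = 512, 14, 14
V = torch.randn(1, C, W, H)       # image features
h_prev = torch.randn(1, C, W, H)  # 2D latent state h_{t-1}

F_t = torch.cat([V, h_prev], dim=1)            # (1, 2C, W, H)
M_C = torch.sigmoid(torch.randn(1, C, 1, 1))   # stand-in for SPCA_C(F_t)
M_S = torch.sigmoid(torch.randn(1, 1, W, H))   # stand-in for SPCA_S(F_t)
C_t = V * M_C * M_S                            # context, (1, C, W, H)
```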

Convolutional Spatial Attention
This submodule generates M_t^S to tell 'where' in the image should be focused on. While grid-based attention has been proven effective for image captioning, it does not take the spatial structure of image regions into consideration and treats them as independent vectors. In fact, this lack of spatial structure leads to inaccurate localization, which affects the quality of the generated captions.
To enable preservation of the spatial structure of image regions, a convolutional spatial attention is proposed.
We use a convolution operation instead of the original fully connected layer, and 2D latent states and 2D image features instead of their vector counterparts. On the one hand, our convolutional spatial attention preserves the structure of the image; on the other hand, the convolution operation with a 3 × 3 kernel provides a larger receptive field to accurately determine 'where' to focus at step t.
In short, our convolutional spatial attention is computed as:

M_t^S = σ(Conv2(ReLU(Conv1(F_t)))),

where M_t^S ∈ R^{1×W×H} are the convolutional attention weights, Conv1 and Conv2 are both convolution operations with 3 × 3 kernels, ReLU denotes the ReLU activation function and σ denotes the sigmoid function.
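A minimal sketch of this submodule follows; the intermediate channel width (C) is an assumption, as the text does not specify it, and padding 1 is used so the W × H resolution is preserved:

```python
import torch
import torch.nn as nn

# Convolutional spatial attention sketch: two 3x3 convolutions (padding 1
# keeps the spatial resolution) with a ReLU in between, then a sigmoid,
# mapping F_t (2C channels) to a single-channel spatial weight map.
# The intermediate width C is a hypothetical choice.
C, W, H = 512, 14, 14
spatial_att = nn.Sequential(
    nn.Conv2d(2 * C, C, kernel_size=3, padding=1),  # Conv1
    nn.ReLU(inplace=True),
    nn.Conv2d(C, 1, kernel_size=3, padding=1),      # Conv2
    nn.Sigmoid(),
)

F_t = torch.randn(1, 2 * C, W, H)
M_S = spatial_att(F_t)  # spatial attention weights, (1, 1, W, H)
```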

Cross Channel Attention
This submodule generates M_t^C to tell 'what' content in the image to describe when generating the current word. While spatial attention has been widely used in previous work, cross channel attention has received much less attention. As described in [19], different channels have different activated regions, which means only a few channels are activated when predicting a word. Thus, we add the information in h_{t−1} to decide 'what' should be attended to at the next step. The cross channel attention is also computed from F_t ∈ R^{2C×W×H}. First we apply average pooling to each channel to obtain the channel feature F_c ∈ R^{2C×1×1}, and then obtain the cross channel attention map M_t^C ∈ R^{C×1×1} with convolution layers. In short, the cross channel attention is computed as:

F_c = AvgPool(F_t),
M_t^C = σ(Conv2(ReLU(Conv1(F_c)))),

where r denotes the reduction ratio, Conv1 maps 2C channels to 2C/r, Conv2 maps 2C/r channels to C, both with 1 × 1 kernels, ReLU denotes the ReLU activation function and σ denotes the sigmoid function.
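A sketch of this submodule under the stated channel dimensions follows; the concrete reduction ratio r = 16 is an assumption, as the text does not give its value:

```python
import torch
import torch.nn as nn

# Cross channel attention sketch: global average pooling over each of the
# 2C channels, then two 1x1 convolutions with reduction ratio r
# (2C -> 2C/r -> C) separated by a ReLU, and a sigmoid producing
# M_C in R^{C x 1 x 1}. The value r = 16 is hypothetical.
C, W, H, r = 512, 14, 14, 16
channel_att = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),                      # F_c: (1, 2C, 1, 1)
    nn.Conv2d(2 * C, 2 * C // r, kernel_size=1),  # Conv1
    nn.ReLU(inplace=True),
    nn.Conv2d(2 * C // r, C, kernel_size=1),      # Conv2
    nn.Sigmoid(),
)

F_t = torch.randn(1, 2 * C, W, H)
M_C = channel_att(F_t)  # channel attention weights, (1, C, 1, 1)
```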

Dataset and Evaluation
We evaluate our method on two well-known datasets: (1) MSCOCO [12], where most images contain multiple objects in complex natural scenes with abundant context information, and each image has five corresponding captions. We use the widely-used split of [2], which takes all 113,287 training images for training and selects 5000 images for validation and 5000 images for testing from the official validation set. (2) Flickr30K [11], which also provides five captions per image; we follow the corresponding split of [2], with 1000 images for validation and 1000 for testing.

Implementation Details
In our captioning model, for the encoding part, given the powerful capabilities of ResNet-101 and the convenience of controlling variables when comparing with other methods, we adopt the widely used ResNet-101 [24] to extract image features as input for our SPCA module. When extracting the features, no cropping or re-scaling is applied to the original images; instead, an adaptive spatial average pooling layer produces features with a fixed size of 2048 × 14 × 14. For the decoding part, we use an LSTM [25] to generate caption words. For both the 1D-LSTM [5] and the 2D-LSTM [19], the word embedding and attention dimensions are set to 512. We use Adam [26] to optimize our network with the learning rate set to 4 × 10^{−4}. During supervised learning, the learning rate is decayed by a factor of 0.5 every five epochs. Each mini-batch contains 20 images. At test time we adopt the BeamSearch [6] strategy with a beam size of 2, which selects the best caption among the candidates. We show the parameter counts and speed results in Table 1.
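The fixed-size feature extraction described above can be sketched as follows; the input spatial size is illustrative:

```python
import torch
import torch.nn as nn

# The last convolutional feature map of ResNet-101 has 2048 channels and a
# spatial size that depends on the input image. Adaptive average pooling
# maps it to a fixed 2048 x 14 x 14 tensor, so no cropping or re-scaling of
# the original image is needed.
pool = nn.AdaptiveAvgPool2d((14, 14))

feat = torch.randn(1, 2048, 20, 27)  # e.g. features of a large input image
V = pool(feat)                       # fixed-size features, (1, 2048, 14, 14)
```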

Attention Structure Selection
We explore the respective roles of cross channel attention and convolutional spatial attention, and the performance of their various combinations, as shown in Figure 3. We choose the 2D-LSTM model as our baseline because its 2D latent state is suitable for our purposes, and it has no attention module of its own, so we can observe the role of different attention components in the captioning system. We compute channel attention and spatial attention in parallel, and concatenate the hidden state at each step to obtain more temporally contextual information. Based on the results listed in Table 2, we have the following observations: (1) Compared with S, the performance of C shows that additional channel information helps image captioning. Furthermore, S-C, C-S and the parallel combination S C all achieve better performance than S or C alone, which proves that spatial attention and channel attention are complementary to each other: one for 'where', the other for 'what'. (2) On both datasets, the parallel combination S C obtains better scores than the sequential combinations.

Convolution Kernel Size
In our SPCA, the kernel size of the convolution operations affects the receptive field of the attention and thus how complete the captured semantics are, so in this experiment we explore the effect of different kernel sizes on model performance. We test three kernel sizes: 1 × 1, 3 × 3 and 5 × 5. From the results in Table 3, we observe that 1 × 1 performs worst, 5 × 5 second, and 3 × 3 best. A 1 × 1 kernel is equivalent to grid-based attention, which ignores the spatial structure when computing the attention. The 5 × 5 result indicates that a larger receptive field does not necessarily bring better performance: it provides more global information, but also includes much background information that confuses the attention mechanisms. The 3 × 3 kernel therefore performs best: it preserves the spatial structure of the features while excluding interfering information.

Performance Comparisons
We compare our proposed SPCA with state-of-the-art image captioning methods, including Google NIC [6], Soft-Attention [5], SCA-CNN [13], ATT2in [27] and 2D-latent state [19]. The first four models are based on the 1D-LSTM and the last one on the 2D-LSTM. As listed in Table 4, our SPCA outperforms the other models. The '*' marks models we reproduced ourselves with the settings described above. We apply our SPCA module to the 1D-LSTM Soft-Attention model and enhance its performance by replacing the VGG [28] based visual encoder with a more powerful ResNet-101 [24] based one. For the 2D-LSTM (2D-latent state), we train the model solely with cross-entropy loss, without reinforcement learning or CNN fine-tuning. (1) Compared with Soft-Attention, we obtain higher scores, reflecting the fact that SPCA preserves the spatial structures when calculating spatial attention with convolution operations. (2) Compared with 2D-latent state, the scores increase significantly because our SPCA exploits both spatial and cross-channel attentions. (3) Compared with the other models, our SPCA also clearly exceeds their performance. We provide some qualitative examples in Figure 4 for a better understanding of our model, visualizing the results at one word-prediction step. For example, in the first sample, when the SPCA module predicts the word 'throwing', our attention accurately focuses on the player's hand and the upper part of his body. In contrast, when soft-attention predicts the word 'bat', it regards him as a batter rather than a pitcher. This indicates that our SPCA preserves the image structure with the 2D latent state and learns a more accurate attention region.
Example 1. GTs: A baseball player preparing to throw a pitch during a game. / A baseball player lunges and reaches back with the ball. / A baseball player pitching a baseball on top of a field. Soft-attention: A swinging baseball player is bat at ball. Our: A man in a baseball uniform throwing a ball.

Example 2. GTs: A gray cat sitting on a bed and staring at the camera. / A gray tiger cat sitting at a wooden table on a chair. / A grey cat sitting in chair next to a table. Soft-attention: A cat sitting on top of a wooden table. Our: A cat sitting on the chair next to table.

Example 3. GTs: A woolly sheep stands in the grass looking at the camera. / A sheep looks at the camera, by the side of the road. / A very large sheep is standing in the grass. Soft-attention: A large brown cow standing in a field. Our: A sheep standing in a field of grass.

Example 4. GTs: A young boy holding a snow board and a pair of shoes. / A boy is holding his shoes and a snowboard. / A guy smiling, wearing a blue jacket, holding a snowboard. Soft-attention: A man holding a pair of shoes with a mirror. Our: A man holding a snowboard in a room.

Conclusions
In this paper, we proposed a structure preserving convolutional attention (SPCA) module for image captioning. It contains two submodules, convolutional spatial attention and cross channel attention, which adaptively determine 'where' and 'what' should be attended to, respectively. Different from grid-based attention, our SPCA preserves the spatial structure and fuses channel information at each step when calculating the attention. Thus, we obtain more accurate attention and achieve better performance on popular benchmarks.

Conflicts of Interest:
The authors declare no conflict of interest.