2.1. Attention Model
At present, the basic architecture of attention-based models is the encoder–decoder architecture [9,10,13,18,19]. In this framework, the encoder is responsible for extracting image information, while the decoder is responsible for decoding the extracted information and generating the caption. Usually, the encoder is a convolutional neural network (CNN), such as VGG [25] or ResNet [26]. The decoder is a recurrent neural network (RNN), such as the GRU (gated recurrent unit) [27] or LSTM (long short-term memory) [28]. LSTM is one of the most commonly used decoders in image captioning [1,2,10,11,21,22,29,30,31]. In this paper, we also use LSTM as the decoder; its working principle is as follows:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \quad (1)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \quad (2)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \quad (3)$$
$$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g) \quad (4)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \quad (5)$$
$$h_t = o_t \odot \tanh(c_t) \quad (6)$$
where $i_t$, $f_t$, $o_t$, $g_t$, $c_t$ and $h_t$ are the update gate, forget gate, output gate, candidate memory cell, memory cell and hidden state of the LSTM, respectively. $W_\ast$, $U_\ast$ and $b_\ast$ are parameters to be learned. $\sigma(\cdot)$ is a sigmoid activation function. $x_t$ is the network input at time $t$. The operator $\odot$ denotes the Hadamard product (pointwise product).
The key to the LSTM is the memory cell $c_t$, which can easily control the flow of information. Through structures called gates, the LSTM can add information to or remove information from $c_t$. There are three gates in the LSTM, namely the forget gate, the update gate and the output gate. As shown in Equations (1)–(3), each gate is composed of a simple sigmoid neural network layer. The output of the sigmoid function lies between 0 and 1: 0 means that no information can pass the gate, and 1 means that all information can pass. The structure of the LSTM is also shown in Figure 1.
The first step in the LSTM is to use the forget gate to decide how much information in $c_{t-1}$ is thrown away. The second step is to use the update gate to decide how much of the new information in $g_t$ is added to $c_t$. Through these two steps, $c_t$ is obtained. Finally, by putting $c_t$ through the tanh function and applying the output gate, we get the output $h_t$.
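For illustration, the following minimal PyTorch sketch implements one LSTM step following Equations (1)–(6); the layer sizes, tensor names and toy input are illustrative choices, not part of the original model.

```python
import torch
import torch.nn as nn

class LSTMStep(nn.Module):
    """One step of a plain LSTM cell, following Equations (1)-(6)."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # Each gate has its own affine map of the concatenated [x_t, h_{t-1}]
        self.i_gate = nn.Linear(input_dim + hidden_dim, hidden_dim)  # update gate
        self.f_gate = nn.Linear(input_dim + hidden_dim, hidden_dim)  # forget gate
        self.o_gate = nn.Linear(input_dim + hidden_dim, hidden_dim)  # output gate
        self.g_cand = nn.Linear(input_dim + hidden_dim, hidden_dim)  # candidate memory

    def forward(self, x_t, h_prev, c_prev):
        z = torch.cat([x_t, h_prev], dim=-1)
        i_t = torch.sigmoid(self.i_gate(z))          # Eq. (1)
        f_t = torch.sigmoid(self.f_gate(z))          # Eq. (2)
        o_t = torch.sigmoid(self.o_gate(z))          # Eq. (3)
        g_t = torch.tanh(self.g_cand(z))             # Eq. (4)
        c_t = f_t * c_prev + i_t * g_t               # Eq. (5), Hadamard products
        h_t = o_t * torch.tanh(c_t)                  # Eq. (6)
        return h_t, c_t

# Toy usage: batch of 2, word-embedding size 300, hidden size 512
cell = LSTMStep(300, 512)
x = torch.randn(2, 300)
h, c = torch.zeros(2, 512), torch.zeros(2, 512)
h, c = cell(x, h, c)
```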
In a translation problem, a sentence is treated as a time series. A word is represented as a vector, first passed through a linear layer (embedding layer) and then input to the LSTM for subsequent translation. Essentially, the encoder–decoder model for image captioning is a translation model: the model translates an image into a sentence. If the fully connected layer extracted from the CNN is used as the input of the LSTM at time $t = -1$, the NIC model [9] is obtained, which is as follows:
$$x_{-1} = \mathrm{CNN}(I)$$
$$x_t = W_e S_t, \quad t \in \{0, \ldots, N-1\}$$
$$p_{t+1} = \mathrm{LSTM}(x_t), \quad t \in \{0, \ldots, N-1\}$$
where $W_e$ and the LSTM weights are the learning parameters, $S_t$ is the representation of the $t$-th word, and $\mathrm{CNN}(I)$ encodes the image as a vector input to the LSTM.
It should be noted that in the NIC model, the image information is input only once, and the subsequent sentence generation is carried out by the LSTM alone. Obviously, this does not let the network make full use of the extracted image information, because the model only “looks” at the image once. In the NIC model, the fully connected layer of the CNN is used as the image encoding. Assuming that the size of the final convolutional layer of the CNN is 7 × 7 × 512, the fully connected layer is an average pooling of the feature maps; that is, the values of the 7 × 7 regions on each feature map are averaged. The NIC model therefore treats these 49 regions equally, without any differentiation, which is different from the way people describe an image.
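The following sketch illustrates, under assumed dimensions (a 7 × 7 × 512 feature map and a toy vocabulary), how an NIC-style decoder consumes the image: the average-pooled CNN vector is fed to the LSTM only once, at time t = −1, and all later steps see only word embeddings. It is a minimal PyTorch sketch, not the exact configuration of [9].

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 512, 512

embed = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTMCell(embed_dim, hidden_dim)
classifier = nn.Linear(hidden_dim, vocab_size)

# Average-pool the final 7x7x512 convolutional layer into a single 512-d vector,
# i.e. all 49 regions are weighted equally.
conv_features = torch.randn(1, 512, 7, 7)
image_vec = conv_features.mean(dim=(2, 3))            # shape (1, 512)

h, c = torch.zeros(1, hidden_dim), torch.zeros(1, hidden_dim)
h, c = lstm(image_vec, (h, c))                        # t = -1: the image is seen only once

words = torch.tensor([[1, 5, 7]])                     # toy caption token ids
for t in range(words.size(1)):                        # afterwards only words are fed in
    h, c = lstm(embed(words[:, t]), (h, c))
    logits = classifier(h)                            # scores for the next word
```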
The attention model improves on the NIC model by including the input of image information at every time step of the decoding process. The encoder extracts the convolutional layer of the CNN (usually the last convolutional layer of the network) rather than the fully connected layer. Again assuming that the size of the final convolutional layer is 7 × 7 × 512, the attention model first learns 7 × 7 weight values, indicating different levels of attention to different regions, and then performs pooling according to these weights. The attention model therefore treats these 49 regions with different levels of attention.
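The contrast with the NIC pooling can be shown in a few lines, again assuming the 7 × 7 × 512 feature map is flattened into 49 region vectors; the attention weights here are random stand-ins for the learned ones.

```python
import torch
import torch.nn.functional as F

regions = torch.randn(49, 512)                  # 7x7x512 feature map: one 512-d vector per region

nic_vec = regions.mean(dim=0)                   # NIC: all 49 regions weighted equally

weights = F.softmax(torch.randn(49), dim=0)     # attention: 49 weights (learned in practice)
att_vec = (weights.unsqueeze(-1) * regions).sum(dim=0)  # regions pooled according to attention
```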
The core of the attention model is that, when using the extracted convolutional layer, the model does not necessarily pay attention to all areas of an image; more likely, it uses the information of only some areas. For example, when generating the word “airplane”, the model only needs to focus on the area of the image containing the airplane, instead of observing the entire image. This mechanism is consistent with the human visual mechanism: when a human observes an image, the focus of vision stays on some areas only, not all areas.
Since the LSTM receives additional image information at each time step, corresponding changes need to be made to Equations (1)–(4). Taking Equation (1) as an example, its new expression is as follows:
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + Z_i z_t + b_i)$$
where $z_t$ is the encoding vector with attention and $Z_i$ is an additional parameter to be learned. The value of $z_t$ is determined by the hidden layer of the LSTM and the whole image, which can be simply expressed as:
$$z_t = f_{att}(h_{t-1}, I)$$
where $f_{att}$ is an attention function, $I = \{a_1, \ldots, a_k\}$ is the convolutional layer extracted by the CNN, $k$ is the size of the feature maps extracted by the CNN (equal to the height of the feature maps multiplied by their width), and $a_i \in \mathbb{R}^C$ ($C$ is the number of feature maps). According to the choice of $f_{att}$, attention mechanisms can be divided into two kinds: stochastic “hard” attention and deterministic “soft” attention [10].
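A minimal sketch of this modification, with illustrative names and sizes: each gate receives an extra term driven by the attended image vector z_t, which is assumed to have been produced by the attention function f_att.

```python
import torch
import torch.nn as nn

hidden_dim, ctx_dim, input_dim = 512, 512, 300

W_i = nn.Linear(input_dim, hidden_dim)   # acts on x_t
U_i = nn.Linear(hidden_dim, hidden_dim)  # acts on h_{t-1}
Z_i = nn.Linear(ctx_dim, hidden_dim)     # acts on the attention context z_t

def update_gate(x_t, h_prev, z_t):
    # Equation (1) extended with the attended image vector z_t
    return torch.sigmoid(W_i(x_t) + U_i(h_prev) + Z_i(z_t))

x_t = torch.randn(1, input_dim)
h_prev = torch.randn(1, hidden_dim)
z_t = torch.randn(1, ctx_dim)            # stand-in for z_t = f_att(h_{t-1}, I)
i_t = update_gate(x_t, h_prev, z_t)
```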
Because the implementation of soft attention is relatively simple and its effect is roughly equivalent to that of hard attention, only soft attention is introduced here. Soft attention learns $k$ weights at each time $t$, and these weights represent the level of attention paid to the $k$ areas of $I$, which can be expressed as follows:
$$\alpha_t = \mathrm{softmax}\left(W_a \tanh\left(W_I I + W_h h_{t-1} \mathbf{1}^{\top} + b_a\right)\right)$$
$$z_t = \sum_{i=1}^{k} \alpha_{t,i} \, a_i$$
where $\alpha_t$ is the weight to be learned, $W_I$, $W_h$, $W_a$ and $b_a$ are parameters to be learned, and $\mathbf{1}$ is a matrix with all elements set to 1, which is used to adjust the dimensions of the matrices. Since the sum of the $k$ weights is 1, the softmax function is used. It can be seen that, in the attention model, the representation of an image is actually a region weighting of $I$. The model learns $k$ weights at each time $t$, thus achieving an image representation with attention.
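Assuming the generic soft-attention form sketched above (scores computed from h_{t−1} and the k region vectors by a small network, normalized with softmax, followed by a weighted sum), the computation of z_t can be written as follows; the layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, k, hidden_dim, att_dim = 512, 49, 512, 256

proj_region = nn.Linear(C, att_dim)            # acts on each region vector a_i
proj_hidden = nn.Linear(hidden_dim, att_dim)   # acts on h_{t-1}, broadcast over the k regions
score = nn.Linear(att_dim, 1)

def soft_attention(regions, h_prev):
    # regions: (k, C) region vectors a_1..a_k; h_prev: (hidden_dim,)
    e = score(torch.tanh(proj_region(regions) + proj_hidden(h_prev))).squeeze(-1)  # (k,)
    alpha = F.softmax(e, dim=0)                               # k weights summing to 1
    z_t = (alpha.unsqueeze(-1) * regions).sum(dim=0)          # weighted sum over the k regions
    return z_t, alpha

regions = torch.randn(k, C)                    # 7x7x512 feature map flattened to 49 x 512
h_prev = torch.randn(hidden_dim)
z_t, alpha = soft_attention(regions, h_prev)
```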
2.2. The Multi-Level Attention Model for Remote Sensing Image Caption
A single attention structure is insufficient to express both visual and semantic features. Therefore, we propose a model with three attention structures, which represent the attention to different areas of the image, the attention to different words, and the attention to vision versus semantics, as shown in Figure 2.
The attention1 structure is similar to the general attention structure in Section 2.1, as shown in Figure 3, but the difference between our attention1 structure and the general attention model lies in the change of the attention function $f_{att}$: its input parameter $h_{t-1}$ is replaced by $h_t$. This means that the model first calculates $h_t$ at time $t$ and then learns the visual expression according to it.
In the attention2 structure, we mainly consider the guidance of the already generated words for subsequent word generation. This is similar to a language model, but here the next word is not predicted directly from the vector obtained by attention2; instead, that vector is selected again through attention3. The schematic diagram of attention2 is shown in Figure 4.
In attention2, we use the hidden states $h$ of the LSTM to represent word information and add the current hidden state $h_t$ at each moment to learn a set of weights. Then, the weights act on the hidden states to get the final expression of the semantic features, which is expressed by the vector $s_t$. This process can be expressed by the following equations:
$$\beta_t = \mathrm{softmax}\left(W_b \tanh\left(W_s H_t + W_q h_t \mathbf{1}^{\top} + b_s\right)\right)$$
$$s_t = \sum_{j=1}^{t} \beta_{t,j} \, h_j$$
where $H_t = [h_1, \ldots, h_t]$, $\beta_t$ is the weight used to express the attention to words, and $W_s$, $W_q$, $W_b$ and $b_s$ are the parameters to be learned. These weights act on the hidden states to get the expression $s_t$ for the semantic vector.
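A minimal sketch of one plausible realization of attention2 under the reconstruction above: the hidden states produced so far act as word information, the current hidden state h_t queries them, and the resulting weights give the semantic vector s_t. The names and sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, att_dim = 512, 256

proj_words = nn.Linear(hidden_dim, att_dim)   # acts on each past hidden state (word information)
proj_query = nn.Linear(hidden_dim, att_dim)   # acts on the current hidden state h_t
score = nn.Linear(att_dim, 1)

def attention2(past_hidden, h_t):
    # past_hidden: (t, hidden_dim) hidden states of the words generated so far
    e = score(torch.tanh(proj_words(past_hidden) + proj_query(h_t))).squeeze(-1)  # (t,)
    beta = F.softmax(e, dim=0)                            # attention to words
    s_t = (beta.unsqueeze(-1) * past_hidden).sum(dim=0)   # semantic vector s_t
    return s_t, beta

past_hidden = torch.randn(4, hidden_dim)   # four words generated so far
h_t = torch.randn(hidden_dim)
s_t, beta = attention2(past_hidden, h_t)
```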
We achieve the attention representation for different regions of the image with attention1 and the attention representation for different words with attention2. Next, we need to add the attention3 structure. This structure mainly accounts for fixed sentence expressions, such as “be able to”; that is to say, after “be able” appears, the attention paid to the image when predicting the next word can be small. On the contrary, when predicting words like “airplane”, attention3 pays more attention to the image than to the words. This attention structure can guide the model to automatically choose whether to focus on image information or on sentence-structure information when generating a caption, as shown in Figure 5.
The structure of attention3 can be expressed by the following equations:
$$\gamma_t = \sigma\left(W_\gamma h_t + b_\gamma\right)$$
$$m_t = \gamma_t \, z_t + (1 - \gamma_t) \, s_t$$
where $W_\gamma$ and $b_\gamma$ are the parameters to be learned. The range of $\gamma_t$ is [0, 1], and $m_t$ is the fused vector. If $\gamma_t$ is 1, it means that the model depends completely on the image information, and if its value is 0, it means that the model depends completely on the sentence information. The vector $m_t$ obtained by the multi-level attention structure is used to predict the next word at time $t$.
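A sketch of attention3 under the reconstruction above: a scalar gate in [0, 1] decides how much the prediction relies on the attended image vector z_t versus the semantic vector s_t. Computing the gate from the current hidden state is an assumption made for illustration.

```python
import torch
import torch.nn as nn

hidden_dim, ctx_dim = 512, 512

gate = nn.Linear(hidden_dim, 1)   # scalar gate from the current hidden state (assumed input)

def attention3(h_t, z_t, s_t):
    gamma = torch.sigmoid(gate(h_t))          # in [0, 1]
    m_t = gamma * z_t + (1.0 - gamma) * s_t   # 1 -> rely on the image, 0 -> rely on the sentence
    return m_t, gamma

h_t = torch.randn(1, hidden_dim)
z_t = torch.randn(1, ctx_dim)    # visual vector from attention1
s_t = torch.randn(1, ctx_dim)    # semantic vector from attention2
m_t, gamma = attention3(h_t, z_t, s_t)   # m_t is used to predict the next word at time t
```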
In fact, the main difference between our model and the NIC model is that our model adds three attention structures. The equations involved in the three attention structures mainly describe how the attention weights are obtained, which is achieved by a combination of linear layers and a softmax function. Assume that the size of the feature map extracted by the CNN is 7 × 7 × 512, the hidden state dimension of the LSTM is 512, and the number of neurons in the attention network is 256. Taking Equations (10)–(12) as an example, the process of obtaining the attention weights based on these equations is shown in Figure 6; all three attention structures can be built in this way.
As can be seen from Figure 6, the attention structure can be realized using only linear layers and softmax functions. The linear layers handle the transformation of dimensions, and the softmax function produces the attention weights. The $W$ and $b$ terms in the equations are the parameters of the linear layers, which need to be learned. Thanks to the back-propagation (BP) algorithm [32], we only need to build a forward network including these three attention structures to learn these parameters.
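To illustrate this last point: once a forward network containing an attention block is defined (here a toy block with the dimensions above, i.e., 49 region vectors of size 512, a 512-d hidden state and 256 attention neurons), back-propagation learns the linear-layer parameters W and b automatically; the loss and data below are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAttention(nn.Module):
    # Linear layers (256 attention neurons) followed by softmax, as in Figure 6
    def __init__(self, feat_dim=512, hidden_dim=512, att_dim=256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, att_dim)
        self.proj_hidden = nn.Linear(hidden_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, feats, h):
        # feats: (batch, 49, 512) flattened 7x7x512 feature map; h: (batch, 512)
        e = self.score(torch.tanh(self.proj_feat(feats) + self.proj_hidden(h).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)          # attention weights over the 49 regions
        return (alpha.unsqueeze(-1) * feats).sum(dim=1)  # attended image vector

att = ToyAttention()
feats = torch.randn(2, 49, 512)
h = torch.randn(2, 512)
target = torch.randn(2, 512)

loss = F.mse_loss(att(feats, h), target)   # placeholder downstream loss
loss.backward()                            # BP computes gradients for all W and b in the linear layers
```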