With the development of computers and sensors, modern remote sensing technologies have made rapid progress, and remote sensing images have become an important means of accessing geospatial information. In recent years, remote sensing images have played a significant role not only in military applications but also in commercial applications, covering a wide range of vital areas such as national census, geological survey, water conservancy construction, oil exploration, map making, environmental monitoring, earthquake prediction, railway and highway route selection, and archaeological research [1].
Nowadays, remote sensing images are of high resolution. However, their interpretation and understanding are still limited to the feature level, such as scene classification [4] and object detection [7], with little reasoning about or understanding of the scene. Such processing cannot solve the “semantic gap” problem between low-level features and high-level abstraction or summarization. Therefore, correctly interpreting high-resolution remote sensing images at different levels in a large dataset has become one of the most challenging scientific problems in the field.
Furthermore, due to the high complexity of the scenes and the difficulty of sample labeling, there are very few studies on the semantic description of remote sensing images, and the existing results concentrate on image semantic extraction and image retrieval. Liu et al. [9] proposed a semantic-mining-based remote sensing image retrieval model. Zhu et al. [10] proposed a new strategy, SAL-LDA (Semantic Allocation Level-Latent Dirichlet Allocation), at the semantic allocation level. Yang et al. [11] used the Conditional Random Field (CRF) framework to model the low-level features and context information of remote sensing images. Wang et al. [12] proposed a semantic-based remote sensing image data retrieval solution. Chen et al. [13] proposed constructing a semantic relation network of typical objects based on graph model theory. Li et al. [14] proposed a target semantic model based on target detection, which better reflects the essential semantic differences among target categories. All of these methods are obtained through statistical learning.
In order to solve the “semantic gap” problem and make better use of remote sensing images (for example, remote sensing image understanding and description can help analyze battlefield images for real-time interpretation of the geographical environment), we first need to understand the image content and generate a natural language description of it, namely image captioning. Approaches operating in a multimodal space have achieved significant progress [15]. These methods combine image features with text description data and generate the description of a new image by learning the correspondence between them.
For remote sensing image captioning, very few works have been published. Among this modest literature, Qu et al. [19] proposed a multimodal neural network model for semantic understanding of high-resolution remote sensing images. Shi et al. [20] presented a remote sensing image captioning framework using a convolutional neural network (CNN). Lu et al. [21] explored several models with multimodal and attention mechanisms and released a dataset, the Remote Sensing Image Captioning Dataset (RSICD).
Visual attention [22] originates from the study of human vision. In cognitive science, due to the bottleneck of information processing, humans selectively focus on some parts of a scene while ignoring other information, which is referred to as the attention mechanism. In image captioning, attention mechanisms can be introduced to process visual information selectively, allowing the computer to concentrate computational resources on one object at a time, guided by image attributes. Attention mechanisms have been widely used [21]. For example, before an attention mechanism is introduced into an encoder-decoder framework, the encoder can only convert the image features into a single intermediate vector, which is then decoded to generate all the words in the target sentence. When there are many or complex objects in an image, one intermediate vector is not sufficient to represent the corresponding image features. The attention mechanism enables the encoder to look at different regions in the image and generate multiple intermediate vectors, thus improving the quality of the generated sentences. This is especially important when the image content is confusable. For remote sensing images, the large number of ground objects at multiple scales, as well as the mutual occlusion between objects, leads to confused information in the image. Therefore, the attention mechanism is well suited to the dynamic representation of such images. Furthermore, as high-level concepts, attributes are able to represent global features [25]. Unlike natural images, in which specific objects attract the attention, remote sensing images contain many items that need to be taken into account. Due to the complex presentation of remote sensing images, the original attention mechanism lacks attention to some inconspicuous objects when processing them. Therefore, we need to integrate global information into the model in order to improve the attention accuracy.
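To make the mechanism concrete, the following is a minimal NumPy sketch of soft attention over region features. It is an illustration only, not the authors' implementation: the region count, feature size, and random inputs are hypothetical, and the alignment score is a simple dot product between each region feature and the decoder hidden state.

```python
import numpy as np

# Hypothetical sizes: 4 image regions, feature/hidden dimension 8.
rng = np.random.default_rng(0)
regions = rng.standard_normal((4, 8))   # region features from the CNN feature maps
hidden = rng.standard_normal(8)         # decoder hidden state at the current time step

def soft_attention(regions, hidden):
    """Return attention weights over regions and the intermediate (context) vector."""
    scores = regions @ hidden                  # alignment score per region
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()                   # weights over regions sum to 1
    context = weights @ regions                # weighted sum of region features
    return weights, context

weights, context = soft_attention(regions, hidden)
```

Because the hidden state changes at every decoding step, the weights, and hence the intermediate vector, change too, which is the dynamic representation described above.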
In this paper, we propose a new model with an attribute attention mechanism for remote sensing image captioning. Compared with the attention mechanism used in Lu et al. [21], the proposed model can: (1) focus on the whole image information while paying attention to the relationship between the input image and the word that the decoder generates, (2) weight the relative strength of the attention paid to different attributes, and (3) use features that correspond to the detected visual attributes rather than pre-trained features at a particular spatial location [21]. Our model, built on an encoder-decoder framework, is shown in Figure 1. First, the input remote sensing image is mapped onto feature maps using a CNN. Then, the feature maps are passed to the fully connected layers of the CNN, and we take the output of the last fully connected layer (or softmax layer) as attributes. By introducing the attributes, the attention mechanism perceives the whole image while knowing the correspondence between regions and words. Thanks to the attention mechanism, a different intermediate vector can be generated at each time step. Finally, the intermediate vectors and the embedded texts are input into a long short-term memory network (LSTM) for training. We chose an LSTM rather than a standard recurrent neural network (RNN) as the decoder because the repeating module of an LSTM contains four interacting layers, whereas that of a standard RNN has a very simple structure, e.g., a single tanh layer. Furthermore, the LSTM applies gate structures to discard or store information and to determine the cell state. The experiments reported in [15, 21] also showed that an LSTM performs better than a standard RNN.
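The gate structure mentioned above can be sketched as a single LSTM cell step in NumPy. This is a generic illustration of the standard LSTM equations under assumed toy dimensions, not the trained decoder itself; the input would be the concatenation of the intermediate vector and the embedded word.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: the repeating module contains four interacting layers."""
    z = W @ np.concatenate([x, h_prev]) + b       # pre-activations for all four layers
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    g = np.tanh(g)                                # candidate cell update
    c = f * c_prev + i * g                        # gates discard/store information in the cell
    h = o * np.tanh(c)                            # new hidden state
    return h, c

# Toy sizes: input dimension 6, hidden dimension 5; weights are random placeholders.
rng = np.random.default_rng(1)
W = rng.standard_normal((4 * 5, 6 + 5)) * 0.1
b = np.zeros(4 * 5)
h, c = lstm_step(rng.standard_normal(6), np.zeros(5), np.zeros(5), W, b)
```

In contrast, a standard RNN step would be a single `tanh(W @ [x, h_prev] + b)`, with no cell state and no gates to control what is kept or discarded.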
This paper presented a novel framework for generating descriptions of remote sensing images based on the attribute attention model. We use a CNN to produce image features, namely the feature maps of the last convolutional layer, which serve as region proposals for the input remote sensing image. In the encoder-decoder framework with the attention mechanism, the attention mechanism is able to influence the intermediate vector by assigning different weights to different areas of the image, which compensates for the deficiency of the original attention mechanism on complex remote sensing images. To exploit the high-level features of remote sensing images, we use suitable attributes to embed the global information so that it can affect the parameters of the attention mechanism, thereby focusing on more objects and scenes. We feed the intermediate vector into the decoder and adopt the LSTM to generate the description of the given remote sensing image. We observed that better results can be achieved if the attribute size is appropriate.
From the experiments, we observe that our model reaches high scores on the different assessment criteria and obtains satisfactory results when the remote sensing image is less complicated. In the future, we will focus on understanding semantic information and use it to better train our models.