Boosting Memory with a Persistent Memory Mechanism for Remote Sensing Image Captioning

Abstract: The encoder-decoder framework has been widely used in the remote sensing image captioning task. When we need to extract remote sensing images containing specific characteristics from the described sentences for research, rich sentences can improve the final extraction results. However, the Long Short-Term Memory (LSTM) network used in decoders still loses some information in the picture over time when the generated caption is long. In this paper, we present a new model component named the Persistent Memory Mechanism (PMM), which can expand the information storage capacity of LSTM with an external memory. The external memory is a memory matrix with a predetermined size. It can store all the hidden layer vectors of LSTM before the current time step. Thus, our method can effectively solve the above problem. At each time step, the PMM searches the external memory for previous information related to the input information at the current time. Then the PMM processes the captured long-term information and predicts the next word together with the current information. In addition, it updates its memory with the input information. This method can pick up long-term information that the LSTM has lost but that is useful for caption generation. By applying this method to image captioning, our CIDEr scores on the UCM-Captions, Sydney-Captions, and RSICD datasets increased by 3%, 5%, and 7%, respectively.


Introduction
Different from other remote sensing tasks in the vision field, such as object detection [1] or semantic segmentation [2], the remote sensing image caption task [3] involves generating a sentence that describes the image accurately and comprehensively. In addition, remote sensing images increase the difficulty of image captioning due to their wide coverage and long distance from shooting sites.
Many models are based on an end-to-end encoder-decoder framework [4,5] where a convolutional neural network (CNN) [6] extracts the image features and a Recurrent Neural Network (RNN) generates a caption from the features. However, in some cases, the RNN struggles with long-term dependencies and remembers early information poorly. For the vanishing gradient problem of the RNN, the earliest solution [7] is to replace the RNN with Long Short-Term Memory (LSTM). The LSTM network can add and remove information from the cell through gated units. However, it is hard for LSTM to store specific facts accurately in the image caption task. Because of the forget gate layer, the LSTM will overwrite information that is irrelevant to the current time step. If the model later needs the overwritten information to predict the next word, it cannot obtain it, even though information unrelated to the current time step can still help to predict a later word. Therefore, it is necessary to retrieve and utilize this useful memory information when needed to produce a comprehensive and accurate sentence.
In order to overcome this limitation, adding an external storage structure to the LSTM network offers a new perspective. Graves et al. [8] first propose the Neural Turing Machine and apply it to a copy task and an associative recall task. They extend the capabilities of neural networks by coupling them to external memory resources, with which the networks interact through attentional processes. Preliminary results demonstrate that Neural Turing Machines can infer simple algorithms such as copying, sorting, and associative recall from input and output examples. Another work [9] takes this idea, implements it as a differentiable function, and applies it to a complex text reasoning task. Chunseong et al. [10] use the memory as a context repository of prior knowledge for personalized image captioning. Taking inspiration from [8] and [11], we can generate more comprehensive sentences by giving the network more information with external storage in a remote sensing image caption task.
In this paper, we propose a new model component named the Persistent Memory Mechanism (PMM), as shown in Figure 1. The baseline model architecture is illustrated on the left of Figure 1, and the improved framework we propose is shown on the right. Our new method is based on the encoder-decoder framework, and the entire component can be trained end-to-end. The advantage of the PMM is that the model can extract, from the information stored in the external memory, useful information related to the prediction at the current time step. In addition, the external memory is a memory matrix that uses a combined system of content-based and location-based addressing. It can store all of the hidden layer vectors of the LSTM before the current time step. In conclusion, our main contributions of this paper are as follows:
Figure 1. On the left is the framework of the baseline model [11] and on the right is the framework with the Persistent Memory Mechanism (PMM). LSTM stands for Long Short-Term Memory.

Development of Natural Image Captioning
Many different models have been proposed for image captioning with the rapid development of computer vision and natural language processing. They can be divided into two categories in general: template-based methods [12][13][14][15][16] and neural-based methods [17][18][19].
The template-based method is the earliest method for image captioning. This method [12][13][14] needs to generate different caption templates for images. Then it fills in the blanks with the outputs of object detection, attribute classification, and scene recognition. However, these methods use fixed templates to generate captions, and such monotonous captions cannot meet people's needs. Some works [15,16] have improved the templates with techniques such as adopting more powerful language templates. However, these methods still have limitations: the types of sentence templates are limited, and the lengths of sentence templates are not variable. What is more, template-based methods cannot be trained end-to-end.
The neural-based method adopts encoder-decoder frameworks, which are widely used in image captioning tasks. This framework [17][18][19] is introduced from machine translation. Generally, the encoder uses a CNN such as VGG [6] or ResNet [20] to extract image features. The decoder uses a network such as an RNN or an LSTM to generate captions. The earliest image caption models [21,22] use a feed-forward neural network in the decoder. Then, some other methods [23,24] use a recurrent neural network (RNN) instead of the feed-forward neural network. However, this brings the problem of vanishing gradients. To avoid this problem, Vinyals et al. [7] use an LSTM instead of the RNN. They propose a neural image captioning (NIC) model to generate sentences for describing natural images. This work addresses the problem by learning to control the input and hidden state. Recently, more works [25,26] have introduced attention mechanisms. These methods can adaptively choose image features or word features to help the LSTM predict the next word. At each time step, these attention models focus on different regions in images and give different weights to image features. Attention mechanisms are widely used in encoders to extract image features or in decoders to generate sentences.

Development of Remote Sensing Image Captioning
Remote sensing image captioning has gradually attracted attention with the development of natural image captioning. However, there are few studies on remote sensing image captioning due to the lack of sufficient datasets and the special characteristics of remote-sensing images.
Qu et al. [27] propose a multimodal neural network model with an encoder-decoder framework for semantic understanding of high resolution remote-sensing images. This method adopts a CNN to extract image features, and the image features are then fused with the hidden state to predict the words step by step. Finally, the model connects all of the predicted words to obtain the final sentence. This is the earliest work in which a neural network model is applied to a remote sensing image caption task. The authors of [28] find that the image-caption methods for natural images can be transferred to remote-sensing images. They adopt an encoder-decoder model based on the attention mechanism for remote sensing image captioning. What is more, they propose a new, large remote-sensing dataset named RSICD to advance the task of remote-sensing captioning. Since then, more people have begun to study other methods that can be applied to remote sensing image captioning. For example, Zhang et al. [29] propose a new label-attention mechanism in the encoder part. The image features are attended by attention masks to improve the salience of key regions in images, and sentences are then generated by decoders with the help of context vectors. These works all improve the image feature extraction part, but they ignore the importance of the decoder, which is used for sentence generation. As far as we know, no one has improved the remote sensing image-captioning task in the decoder section by adding an external memory database.

Method
The overall framework still adopts the mainstream encoder-decoder architecture, as shown in Figure 2. The RNN first processes the global image features and the word predicted at the previous moment. The processed result is then taken as the input vector of the PMM (step 1 in Figure 2). Based on this input information, the PMM searches its own memory database for the relevant stored information (step 2 in Figure 2). The search result and the input information at the current time are sent to a softmax layer to form the output vector of the PMM (steps 3 and 4 in Figure 2). At the same time, the PMM updates its own memory with the input information (step 5 in Figure 2). The memory of the PMM is a vector matrix that is updated at each time step, shown as the orange rectangle in Figure 2. Finally, the RNN generates the next word according to the local image features and the output of the PMM. A model applying the PMM can be trained end-to-end. More details are elaborated on in the following sections.
Figure 2. Diagram of the proposed framework. The data processed by the recurrent neural network (RNN) is fed into the PMM. Then the PMM outputs the corresponding information found in its external memory in numerical order. Meanwhile, the memory database is updated to facilitate the search at the next time step. Example captions shown in the figure: Ours: many buildings and green trees are in a school with a playground and a tennis court. Ground Truth: many green trees and buildings are in a school.
In Section 3.1, we firstly describe the Neural Image Caption model proposed in [11], which uses an encoder-decoder architecture with region-based attention. Then we apply the Persistent Memory Mechanism (PMM) to it in Section 3.2.

Encoder-Decoder Model for Image Captioning
Given an image $I$ and the ground truth sequence $y = \{y_1, \cdots, y_T\}$, the purpose of the encoder-decoder model is to maximize the following objective:
$$\theta^{*} = \arg\max_{\theta} \sum_{(I, y)} \log p(y \mid I; \theta)$$
where $\theta$ are the model parameters. Using the chain rule, we can expand the log likelihood of the joint probability distribution. Given the attended image features $i_{1:T}$ extracted from $I$ and the sequence $y_{1:T}$, we obtain:
$$\log p(y \mid I) = \sum_{t=1}^{T} \log p(y_t \mid y_{1:t-1}, I)$$
In this framework, each conditional probability is modeled with the recurrent neural network as:
$$p(y_t \mid y_{1:t-1}, I) = f(h_t, v_t)$$
where $f$ is a nonlinear function that outputs the probability of $y_t$, $v_t$ are the visual context vectors at time $t$ extracted from image $I$, and $h_t$ are the output vectors of the RNN at time $t$. In this paper, we adopt a captioning model that is composed of two LSTM layers using a standard implementation. Neglecting the propagation of memory cells for notational convenience, $h_t$ is modeled at every single time step as:
$$h_t = \mathrm{LSTM}(x_t, h_{t-1})$$
where $x_t$ are the input vectors of the LSTM. In our model, the attention LSTM understands the general content of the image. The language LSTM uses the image features and information from the PMM to generate the final description. In our experiments, we first use a convolutional neural network to extract global image features and attended image features. These features are then fed into the decoder for training. In order to persistently store the memory information of previous moments and guide the LSTM in predicting some words, we introduce the persistent memory mechanism.
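As a small illustration of the chain-rule factorization, the following NumPy sketch sums the per-step log probabilities of the ground-truth words. The function name `sequence_log_likelihood` and the toy probabilities are illustrative assumptions, not part of the original model.

```python
import numpy as np

def sequence_log_likelihood(step_probs, target_ids):
    """Return sum_t log p(y_t | y_{1:t-1}, I).

    step_probs: (T, vocab) array; row t is the predicted word distribution at step t.
    target_ids: length-T sequence of ground-truth word indices.
    """
    step_probs = np.asarray(step_probs, dtype=float)
    target_ids = np.asarray(target_ids)
    # Pick the probability assigned to the ground-truth word at each step.
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return float(np.sum(np.log(picked)))

# Toy example (hypothetical numbers): a 3-word caption over a 3-word vocabulary.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.25, 0.25, 0.5]])
caption = [0, 1, 2]
log_lik = sequence_log_likelihood(probs, caption)
```

Training with the standard cross-entropy loss corresponds to minimizing the negative of this quantity over the training set.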

Model with Persistent Memory Mechanism
In order to make full use of context information, the input vector of the first LSTM (the attention LSTM) consists of three parts: the output of the second LSTM at the previous moment, the global image features of the input image, and the encoding of the word generated at the previous moment. Writing these as $h^2_{t-1}$, $\bar{v}$, and $e_{t-1}$, it can be expressed as follows:
$$x^1_t = [h^2_{t-1}, \bar{v}, e_{t-1}]$$
In order to generate a detailed and accurate description at the current moment, the input vector of the second LSTM (the language LSTM) also includes three parts: the output of the first LSTM, the attended features of the input image, and the extracted vector that best matches the stored memory information at the current moment. Writing these as $h^1_t$, $\delta_v$, and $M_t$, it can be expressed as follows:
$$x^2_t = [h^1_t, \delta_v, M_t]$$
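The two three-part inputs described above can be sketched as plain concatenations. This is a minimal sketch: the sizes below reuse the value 512 reported in the experiments, and the assumption that every part has this size, as well as the function names, are illustrative only.

```python
import numpy as np

HIDDEN = 512  # LSTM hidden size reported in the experiments
FEAT = 512    # encoded image feature size (assumed equal to the encoding size)
EMBED = 512   # word encoding size

def attention_lstm_input(h2_prev, v_global, word_enc):
    """First LSTM input: [previous language-LSTM output, global feature, previous word]."""
    return np.concatenate([h2_prev, v_global, word_enc])

def language_lstm_input(h1, v_attended, m_t):
    """Second LSTM input: [attention-LSTM output, attended feature, PMM output]."""
    return np.concatenate([h1, v_attended, m_t])

# Placeholder vectors stand in for real features and states.
x1 = attention_lstm_input(np.zeros(HIDDEN), np.ones(FEAT), np.zeros(EMBED))
x2 = language_lstm_input(np.zeros(HIDDEN), np.ones(FEAT), np.full(HIDDEN, 2.0))
```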

Generation of δ v
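Given the hidden state $h^1_t$ of the first LSTM, we can generate the attention distribution $\varepsilon_t$ over the attended image features $v_i$ at each time step $t$ as follows:
$$z_{i,t} = w_d^{\top} \tanh(W_v v_i + W_h h^1_t), \qquad \varepsilon_t = \mathrm{softmax}(z_t)$$
where $W_v, W_h \in \mathbb{R}^{k \times d}$ and $w_d \in \mathbb{R}^{k}$ are learnable model parameters. The attended image feature used as input to the language LSTM is finally represented as a convex combination of all input features:
$$\delta_v = \sum_{i} \varepsilon_{i,t} v_i$$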
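A minimal NumPy sketch of this soft attention step is given below. It assumes additive (tanh) scoring followed by a softmax, that the image features and the hidden state share the dimension d implied by the shapes of W_v and W_h, and it uses n for the number of regions; the random parameters are placeholders for learned weights.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attend(V, h1, W_v, W_h, w_d):
    """Soft attention over region features.

    V: (n, d) region features; h1: (d,) attention-LSTM hidden state.
    W_v, W_h: (k, d) projections; w_d: (k,) scoring vector.
    Returns the attention distribution (n,) and the attended feature (d,).
    """
    scores = np.tanh(V @ W_v.T + h1 @ W_h.T) @ w_d  # one score per region
    eps = softmax(scores)
    return eps, eps @ V  # convex combination of the rows of V

rng = np.random.default_rng(0)
n, d, k = 49, 8, 4  # toy sizes (hypothetical)
V = rng.normal(size=(n, d))
h1 = rng.normal(size=d)
W_v, W_h = rng.normal(size=(k, d)), rng.normal(size=(k, d))
w_d = rng.normal(size=k)
eps, v_att = attend(V, h1, W_v, W_h, w_d)
```

Because the weights are non-negative and sum to one, each coordinate of the attended feature lies between the smallest and largest value of that coordinate across the regions.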

Generation of M t
In this section, we describe the PMM module in detail. The component block diagram is shown in Figure 3. The part in the blue box is a separate component. The controller, which can be either a recurrent controller or a feed-forward controller, interacts with the outside via an input vector and an output vector. Here, we choose an LSTM as the controller, because an LSTM deals better with long-term dependencies in the sequence and thus learns better how to interact with the external memory.
Firstly, the external input vector, namely the output h 1 t of the attention LSTM, passes through another LSTM that serves as the control information interaction. Then the information related to the current moment content is obtained from the external memory under the direction of the controller. The searching result and the input information at the current time are sent into a softmax layer as the output vector of the PMM. Finally, the PMM will update its memory through the input information of the current moment. The process of searching the memory database can be understood as using a soft-attention mechanism to obtain the stored memory information according to the learned searching parameters. The process of updating the memory database can be understood as adding or deleting some information from the memory database according to the information at the current moment. We will look at this component in more detail next.
The external input vector $h^1_t$ and the previous searching state $S_{t-1}$ of the component are taken as the input vector $x^3_t$ of the current controller:
$$x^3_t = [h^1_t, S_{t-1}]$$
Taking inspiration from [8], we use a combined system of content-based and location-based addressing to get the important searching parameter $\omega_t$. We then use this parameter to search for the best-matching information in the memory database and to update the memory database. At this moment, the output vector $h^3_t$ of the controller contains the relationship between the current moment information and the memory database, so the way $\omega_t$ is obtained is very important. We take part of $h^3_t$ (denoted as $K_t$) as the reference vector of the current moment. Then we use it to search the memory database (denoted as $D_t$) for the memory information that matches $K_t$ and obtain the weight $\theta_t$. Following the content-based addressing of [8], the weight for the $i$-th memory row can be written as:
$$\theta_t(i) = \frac{\exp(\mu \, \phi(K_t, D_t(i)))}{\sum_j \exp(\mu \, \phi(K_t, D_t(j)))}$$
where $\mu$ is a model parameter that can amplify or attenuate the precision of $\phi$, and $\phi$ is a similarity function, here the cosine similarity. The function $\phi$ scores how well the reference vector $K_t$ taken from $h^3_t$ matches the memory database $D_t$. At this point in the process, we can roughly retrieve the desired stored information, which also helps the PMM to know the vector to be stored later. Then, to make full use of the relationship between the current time information and the previous time information, we introduce a gate parameter $g_t$ to guide the next update of $\theta_t$. The advantage of this gating is that the new weight can be generated according to the degree of correlation between the information at the current time, the parameter $\omega_{t-1}$ (the important parameter at the previous time step), and $\theta_t$. Then, we obtain the final weight parameter $\omega_t$. With this parameter $\omega_t$, we can obtain the matching memory vector $S_t$ from the memory database.
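The addressing step can be sketched in NumPy as follows. This is a sketch under stated assumptions rather than the exact formulation of the PMM: the content weights are a softmax over cosine similarities scaled by mu, and the gate is assumed to interpolate between the content weights and the previous weights, as in the Neural Turing Machine of [8]. The function names and the toy memory are hypothetical.

```python
import numpy as np

def content_weights(K, D, mu):
    """Softmax over mu-scaled cosine similarity between key K (m,) and rows of D (N, m)."""
    tiny = 1e-8  # avoids division by zero for all-zero rows
    cos = (D @ K) / (np.linalg.norm(D, axis=1) * np.linalg.norm(K) + tiny)
    z = mu * cos
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def gated_weights(theta, omega_prev, g):
    """Assumed NTM-style interpolation: g near 1 trusts the content match, g near 0 keeps the old focus."""
    return g * theta + (1.0 - g) * omega_prev

# Toy memory with three slots of size two (hypothetical values).
D = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = np.array([0.0, 1.0])
theta = content_weights(K, D, mu=5.0)
omega_prev = np.array([1.0, 0.0, 0.0])
omega = gated_weights(theta, omega_prev, g=0.3)
```

A larger mu sharpens the distribution around the best-matching slot, which is one way to read the remark that mu amplifies or attenuates the precision of the similarity.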
Then we take this memory vector $S_t$ together with the input vector $h^1_t$ to form the output $M_t$ of the memory component. At the same time, with the parameter $\omega_t$ and the input vector $h^1_t$ at the current time, we update the data $D_t$ in the memory database using $D_{del}$ and $D_{add}$, two different parts of the input vector $h^1_t$ that determine what is deleted from and what is added to the memory, respectively. In addition, all of the above operations are differentiable, so our method can be trained end-to-end.
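Reading from and updating the memory can be sketched as below. The erase-then-add form is an assumption borrowed from the write operation of the Neural Turing Machine [8], chosen because it matches the description of a deletion part and an addition part; the function names and toy values are hypothetical.

```python
import numpy as np

def read_memory(D, omega):
    """Weighted read: returns sum_i omega[i] * D[i], a vector of size m."""
    return omega @ D

def update_memory(D, omega, d_del, d_add):
    """Assumed NTM-style write on an (N, m) memory.

    d_del: (m,) values in [0, 1] controlling how much of each slot is erased.
    d_add: (m,) content to write. Both are scaled per slot by omega.
    """
    erased = D * (1.0 - np.outer(omega, d_del))
    return erased + np.outer(omega, d_add)

D = np.arange(6, dtype=float).reshape(3, 2)  # three slots of size two
omega = np.array([1.0, 0.0, 0.0])            # fully focused on slot 0
S = read_memory(D, omega)
D_new = update_memory(D, omega, d_del=np.ones(2), d_add=np.array([9.0, 9.0]))
```

With a one-hot weight, the read returns exactly one slot and the write replaces only that slot; softer weights spread both operations over several slots while keeping every step differentiable.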

Experiments
Based on the baseline model [11], our experiment is implemented by adding the PMM component and adopting a specific feature extraction method for remote-sensing images. In this section, we will briefly introduce the datasets and evaluation metrics. Then we will show our experimental details. Finally, we will give the experimental comparison results and analysis.

Dataset
In our experiment, we used three public remote-sensing datasets, namely the UCM-Captions dataset [30], the Sydney-Captions dataset [31], and the RSICD dataset [28]. Each image in these datasets is manually annotated with five descriptive sentences.

Evaluation Metrics
In this paper, we report all of the results of our experiments using the Microsoft COCO caption evaluation tools, including BLEU-n [32], Meteor [33], Rouge-L [34], CIDEr [35], and SPICE [36]. All metrics are computed with the publicly released code (https://github.com/ruotianluo/cococaption/tree/ea20010419a955fed9882f9dcc53f2dc1ac65092/pycocoevalcap). BLEU and Meteor are commonly used in machine translation of short sentences. For BLEU-n, we take the value of n to be 4, as usual. Rouge-L is a measure based on the precision and recall of the longest common subsequence. CIDEr and SPICE are important indicators for image description. CIDEr measures the consistency between the description generated by the model and the ground truth. SPICE calculates the F-score of matching tuples between the scene graphs parsed from the predicted and reference captions, and this newer metric is found to correlate better with human judgments.
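To make the Rouge-L description concrete, the following pure-Python sketch computes an F-score from the longest common subsequence (LCS) of two tokenized captions. It is an illustration for a single reference, not the evaluation code used in the experiments; the whitespace tokenization and the default `beta` value are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-score; beta > 1 weights recall more than precision."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

candidate = "some green trees are around a stadium"
reference = "some green trees are around a large stadium"
shared = lcs_length(candidate.split(), reference.split())
score = rouge_l(candidate, reference)
```

Here the candidate is a subsequence of the reference, so precision is 1 and recall is 7/8; the score falls between the two.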

Encoder
In the encoder, we directly adopt ResNet-101, pre-trained on the ImageNet dataset [37], for image feature extraction. In order to stay as close to the original image information as possible, we directly use the features after ResNet-101 as the input image features. That is, we use the feature extracted from the last convolutional layer as the attended feature $v_t = \{v_1, \cdots, v_k\}$, $v_i \in \mathbb{R}^{2048}$. For the UCM-Captions, Sydney-Captions, and RSICD remote-sensing datasets, the dimensions of the attended feature are 2048 × 8 × 8, 2048 × 16 × 16, and 2048 × 7 × 7, respectively. We then reshape them to 64 × 2048, 256 × 2048, and 49 × 2048, respectively (so k is 64, 256, or 49 in the formula above).
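The reshaping step can be sketched in NumPy as follows. The channel-first layout (channels, height, width) is an assumption, and zero arrays stand in for real ResNet-101 outputs.

```python
import numpy as np

def to_region_features(fmap):
    """Flatten a (C, H, W) feature map into (H*W, C): one C-dimensional vector per location."""
    c, h, w = fmap.shape
    return fmap.reshape(c, h * w).T

# Shapes reported for the three datasets; zeros stand in for real features.
ucm = to_region_features(np.zeros((2048, 8, 8)))
sydney = to_region_features(np.zeros((2048, 16, 16)))
rsicd = to_region_features(np.zeros((2048, 7, 7)))

# Small check that row r = i * W + j holds the channel vector at location (i, j).
small = np.arange(2 * 3 * 3).reshape(2, 3, 3)
regions = to_region_features(small)
```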
For text descriptions, we remove words that occur fewer than five times in the text vocabulary of each dataset. The words of each of the three vocabularies are then mapped to a dimension of 512 as the text input of the decoder.
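The vocabulary filtering can be sketched with the standard library as below. Mapping removed words to an `<unk>` token, the lowercasing, and the whitespace tokenization are assumptions added for illustration.

```python
from collections import Counter

def build_vocab(captions, min_count=5, unk="<unk>"):
    """Keep words seen at least min_count times; index 0 is reserved for unknown words."""
    counts = Counter(w for c in captions for w in c.lower().split())
    kept = sorted(w for w, n in counts.items() if n >= min_count)
    return [unk] + kept

def encode(caption, vocab):
    """Map each word to its vocabulary index, falling back to 0 (unknown)."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(w, 0) for w in caption.lower().split()]

# Toy corpus (hypothetical): one caption repeated five times plus a rare one.
captions = ["many green trees"] * 5 + ["a rare pond"]
vocab = build_vocab(captions)
ids = encode("many rare trees", vocab)
```

Each index would then be looked up in a learned embedding table of width 512 before being fed to the decoder.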

Decoder
In the decoder, we set the size of the RNN, i.e., the number of hidden nodes per layer, to 512, and we use an RNN with a single layer of LSTM units. During model training, both the vocabulary and the image features fed into the decoder have an encoding size of 512. In addition, we use the Adam optimization algorithm with exponential decay rates of 0.9 and 0.999 (the alpha and beta settings, respectively). The initial learning rate is 5e-4 for the decoder model.
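For reference, one Adam update with these settings can be sketched in NumPy as follows. The decay rates and learning rate are the reported values; the `eps` constant and the function name are assumptions, and a real training run would use the optimizer of a deep learning framework instead.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count. Returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2  # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

p0 = np.array([1.0])
grad = np.array([2.0])
p1, m1, v1 = adam_step(p0, grad, m=np.zeros(1), v=np.zeros(1), t=1)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient scale.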

Results of Experiments
We divide each dataset into three parts: 80% training set, 10% validation set, and 10% testing set. In our experiments, we process the remote-sensing images with ResNet-101. For our baseline model (hereinafter referred to as UpDown), we take the features after the final convolutional layer and use an adaptive pooling method to extract attended features, as in [11]. We also directly use the features after ResNet-101 as the attended local features (hereinafter referred to as DF) for comparison. Our proposed Persistent Memory Mechanism is listed separately as a model component (hereinafter referred to as PMM). The symbol "+" indicates which components a model combines.
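An 80/10/10 split can be sketched with the standard library as below. Whether the split is random, and the seed used, are assumptions made here for reproducibility of the example.

```python
import random

def split_dataset(items, seed=0):
    """Shuffle and split into 80% train, 10% validation, 10% test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Hypothetical image identifiers.
image_ids = [f"img_{i:03d}" for i in range(100)]
train, val, test = split_dataset(image_ids)
```

Splitting by image rather than by sentence keeps all five captions of an image in the same subset.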
To make a fair comparison, we adopt the results for models trained with standard cross-entropy loss. In addition, we also use several models [26,38,39] (hereinafter referred to as SAT, Att2in, SM-Att) for comparison. In Tables 1-3, we present our experimental results on the above three remote-sensing datasets. The bold numbers stand for the best scores.

Quantitative Analysis
The experimental results of the different methods are shown in Tables 1-3. The CIDEr and SPICE scores are all improved. The reason is that our method can store the information of previous moments more persistently, whereas other methods gradually overwrite the previous information over time. Although most of the overwritten information is useless, it is sometimes unavailable exactly when it is needed. In this situation, our method can describe the picture more accurately, and the scores of the other evaluation metrics are therefore also improved slightly. What is more, both the "UpDown+PMM" model and the "UpDown+DF+PMM" model can obtain higher scores than the baseline model "UpDown", even though the baseline model itself scores much higher than most of the existing methods on each evaluation metric. This shows that the PMM generates more comprehensive sentences and improves the descriptive effect by addressing the problem of incomplete information memory in the decoder. The "UpDown+DF" model also obtains a higher score than the baseline model because we send the features into the model for training in two parts while minimizing the loss in feature extraction, so as to make full use of the image information when the CNN extracts features. There is one exception in Table 1: the model "UpDown+PMM" is slightly inferior to the baseline model "UpDown". The reason is that the average sentence length in the UCM-Captions dataset is short, and the number of images in each category is small. Our PMM improves the description effect more on datasets with long sentences or a large number of pictures in each category.
The differences in scores among the three datasets stem from the size and the classes of the datasets. The larger a dataset is, the more diverse its sentences are and the more difficult it is to generate good descriptive sentences, so the scores are lower. It is therefore not difficult to understand that the RSICD dataset has the lowest scores. However, although the UCM-Captions dataset has more pictures than the Sydney-Captions dataset, the average sentence length of the Sydney-Captions dataset is longer than that of the UCM-Captions dataset, so the scores of the Sydney-Captions dataset are lower than those of the UCM-Captions dataset. In particular, the scores on the CIDEr and SPICE evaluation metrics are clearly improved. Figure 4 shows some of the captions generated by UpDown and by our model, together with the ground truth, on the three remote-sensing datasets. From the comparison of the captions at the bottom of each set of pictures, we can see that the captions generated by our model are more comprehensive than those of the baseline model. Even some objects that are not in the ground-truth captions but appear in the image (such as the red parts) can be described. This shows that our model can translate image features into text more effectively and present more details. This also suggests that our model can capture useful memories that are accidentally forgotten. In addition, sometimes our captions have the same meaning as the ground-truth captions but a different expression. This is because pictures of the same type may be described differently by human annotators. However, although the results described by our model are comprehensive, there are still some problems (such as the green part of the first picture on the third row). The number given for the object "basketball" is wrong, which indicates that our model still needs to improve the feature extraction of small objects in remote-sensing images. This is also a problem to be solved for remote-sensing images that are shot from outer space.

Qualitative Analysis
In summary, our model can produce captions that are closer to the ground-truth captions. Sometimes our captions are more comprehensive than the ground-truth captions. Moreover, the PMM component can also be applied to existing models to improve their performance.
Baseline: Some green trees are around a stadium. Ours: Many green trees are around a stadium with a football field in it. GT: Some green trees are around a large stadium.
Baseline: Two storage tanks are near several buildings and green trees. Ours: Two storage tanks are near several buildings and a piece of bare land. GT: Two storage tanks are surrounded by bare land.
Baseline: A meadows with some green bushes and white bunkers on the meadow. Ours: A meadow with some green bushes and white bunkers on it while a highway passed by. GT: There are many green plants surrounded the water while a highway bridge across the waters.
Baseline: An industrial area with many white buildings and some roads go through this area. Ours: An industrial area with many white buildings and some roads go through this area. GT: An industrial area with some white buildings and some roads go through this area.
Baseline: A residential area with houses arranged neatly and some roads go through this area. Ours: A residential area with many houses arranged neatly and some roads go through this area. GT: A residential area with many houses arranged neatly and a crossroad in the middle.
Baseline: Two straight freeways closed together with some plants beside them. Ours: Two straight freeways with some plants beside them and some cars on the roads. GT: There are two curved freeways closed together with some cars on them.
Baseline: Lots of boats docked neatly at the harbor. Ours: Many boats docked in lines at the harbor and the water is deep blue. GT: Many boats docked neatly at the harbor and the water is deep blue.
Baseline: An overpass with a road go across another roads diagonally. Ours: An overpass with a road go across another roads diagonally with some cars on the roads. GT: An overpass go across the roads with some cars on the roads.

Baseline: A playground with a playground is surrounded by many green trees and many buildings.

Parameter of Memory Database Analysis
In this section, we analyze the parameters used in the PMM, mainly the size of the memory. The matrix inside the memory stores the previously memorized information. The size of the matrix affects the degree of information loss when the current information is written to the database. We set the size of $D_t$ to 20 × 512, 20 × 256, and 20 × 128, respectively, for comparison. Here 20 is the length of each sentence, and 512, 256, and 128 are the dimensions of the database. Because the RSICD dataset has more and richer content, we carry out this part of the experiment on this dataset. The experiment is based on the model "UpDown + DF + PMM", and the results are shown in Table 4, where the bold numbers stand for the best scores. Table 4 shows that the PMM works better when the dimension is 512. Since the output dimension of the controller is 512, the transformation loses the least information when the dimension of the database is also set to 512, so the effect is the best.

Conclusions
In this paper, we propose a novel component named the Persistent Memory Mechanism and combine it with an advanced model for a remote sensing image-captioning task. This new model can retain and output the memory information for a longer time without being affected by the vanishing gradient problem. In addition, we use a simple but effective feature extraction method for remote-sensing images in the encoder. As a result, the model generates more comprehensive and accurate sentences. More importantly, the baseline model using our method can obtain higher scores on all evaluation metrics. However, the search for information by our model slightly increases the learning time. In the next step, we will use a multiple attention mechanism to shorten the search time for information. In conclusion, the PMM provides a new way to improve the semantic captioning of remote-sensing images.