A Discriminative Feature Learning Approach for Remote Sensing Image Retrieval

Effective feature representations play a decisive role in content-based remote sensing image retrieval (CBRSIR). Recently, learning-based features have been widely used in CBRSIR, and they show a powerful ability of feature representation. In addition, a significant effort has been made to improve learning-based features from the perspective of the network structure. However, these learning-based features are not sufficiently discriminative for CBRSIR. In this paper, we propose two effective schemes for generating discriminative features for CBRSIR. In the first scheme, the attention mechanism and a new attention module are introduced to the Convolutional Neural Networks (CNNs) structure, directing more attention towards salient features and suppressing the others. In the second scheme, a multi-task learning network structure is proposed to force learning-based features to be more discriminative, with inter-class dispersion and intra-class compactness, by penalizing the distances between the feature representations and their corresponding class centers. Then, a new method for constructing more challenging datasets is first applied to remote sensing image retrieval, to better validate our schemes. Extensive experiments on the challenging datasets are conducted to evaluate the effectiveness of our two schemes, and the comparison of the results demonstrates that our proposed schemes, especially the fusion of the two schemes, can improve on the baseline methods by a significant margin.


Introduction
As a result of the rapid development of remote sensing technology, the amount of remote sensing images with higher resolution has dramatically increased. How to effectively manage and analyze remote sensing images has become a pressing issue. Among the related problems, content-based remote sensing image retrieval (CBRSIR) [1,2] is key to the effective use of remote sensing big data. It includes two main components: feature extraction and similarity measurement. CBRSIR automatically processes the representations of image features, and it measures the similarity between images. The performance of CBRSIR mainly depends on the representation power of the feature embedding. Therefore, research on CBRSIR mainly focuses on feature extraction [3][4][5][6].
With regard to feature extraction, the existing methods can mainly be divided into methods based on handcrafted features and methods based on learning-based features [7]. Handcrafted features are usually used to extract global features such as color, texture, and shape, as well as local features based on SIFT [8] and SURF [9], which belong to the low-level features. The Bag of Words model (BOW) [10][11][12] and the Vector of Locally Aggregated Descriptors (VLAD) [13] were proposed to encode local features, further improving the feature representation power; they belong to the middle-level features. Whether global or local, these handcrafted features have difficulty expressing image semantics precisely. That is, there is a "Semantic Gap" between low-level features and high-level semantics. With the progress of deep learning, especially the excellent performance of Convolutional Neural Networks (CNNs) in computer vision tasks such as classification [14][15][16], detection [17][18][19], and segmentation [20,21], CNNs are widely applied to automatic image feature extraction. The reason why CNNs can achieve better performance than handcrafted features is that CNNs can extract high-level semantic features through the stacking of a large number of convolutional layers with non-linearities. Ge et al. [22] transfer CNNs pre-trained on ImageNet to remote sensing image data sets and compare the features extracted by the pre-trained CNNs with handcrafted features for CBRSIR, which leads to the conclusion that the CNN features outperform the traditional features by a large margin.
However, it is difficult to achieve a satisfactory retrieval result with only pre-trained or fine-tuned CNNs when facing large-scale, high-resolution remote sensing image datasets. We think that this is mainly caused by the following two reasons. The first is the insufficient representation power of the feature embedding. In Figure 1, some examples of the Aerial Image Data (AID) dataset [23] that are used for remote sensing image retrieval are shown. It can be inferred from Figure 1a,b that these two types of remote sensing images are characterized by color and texture features. From Figure 1c,d, we can see that they are characterized by the local shape of the aircraft and bridge, instead of the land and water occupying most areas of the image. Therefore, the features used for CBRSIR should be able to take into account both the global and local salient features of the image. However, pre-trained CNN features may have difficulty meeting this requirement, due to the big difference between the ImageNet data set and remote sensing image data sets, and the difficulty of covering the images' global and local salient features simultaneously with a convolutional layer or a fully connected layer. Some works [24][25][26] try to solve the above problems from the aspect of improving the data set, by introducing multi-label and densely labeled remote sensing datasets for training. To a certain extent, this problem can be alleviated, but the disadvantage is also obvious: multi-label annotation is time-consuming and costly. The second reason is the inconsistency in purpose between the training process and the retrieval process. The feature representation used in CBRSIR is the result of training for classification. The accuracy of classification can even reach 99% on the validation set and the test set, while when the features are used for image retrieval, the accuracy is far less than this. This is mainly due to the difference between classification and retrieval. In classification, the softmax loss is usually
used for training to encourage features to be separable, which makes the classes disperse from each other. However, in CBRSIR, the similarity between images is measured by the Euclidean distance or the cosine distance, which requires the feature representation to be not only separable but also discriminative. Discriminative features mean that the inter-class distances disperse while the intra-class distances are as compact as possible.
In this paper, we propose two schemes in response to the above problems. We first propose a new attention module for feature extraction for CBRSIR, which can pay more attention to the salient features and suppress the less useful ones. We then provide a center loss-based multi-task learning network structure to further boost the discriminative power of the features. The framework of our proposed method is shown in Figure 2.
The main contributions of this paper can be summarized as follows:

• To obtain salient and effective features, we propose a new attention module, which can be easily connected with the last convolutional layer of any pre-trained CNNs, and can be applied along two dimensions, channel and spatial, to emphasize the meaningful features along these two axes.

• We propose a multi-task learning network structure, introducing center loss as a network branch in the training phase, to penalize the intra-class distances of features and to improve the discriminative ability of the deep features.

• The two schemes that we propose can be combined and integrated into the same training network to further improve performance.
The rest of this paper is organized as follows. Section 2 presents some published work related to feature extraction for CBRSIR. Our proposed two schemes to generate discriminative feature representations are discussed in Section 3. Section 4 displays the experimental results and analysis. Section 5 includes a discussion, and Section 6 draws some conclusions.

Related Work
In the following section, we present the related work on feature extraction, the attention mechanism, and loss functions.

Learning-Based Feature Representation for CBRSIR
CNNs have been dominant in feature extraction, and have gradually replaced traditional methods in the field of computer vision. The achievement of CNNs is mainly due to the fact that deep network structures bring a large number of nonlinear functions, and the weight parameters can be automatically learned from the training data. However, remote sensing image datasets cannot provide a large amount of data for training CNNs from scratch. CNNs pre-trained on massive datasets have been used to extract feature embeddings, which has been proven to be effective and efficient, even when the training data set differs greatly from the remote sensing imagery. There are mainly two ways to exploit pre-trained CNNs: regarding the fully-connected layers or the convolutional layers as the feature representation. Many works [3,22,27,28] have compared the performance of different feature representations extracted from different networks and different layers. Ge et al. [22] exploit representations from pre-trained CNNs, and feature combination and compression are adopted to improve the feature representation. The experimental results demonstrate that the pre-trained features and aggregated features are simple, and are able to improve retrieval performance. Zhou et al. [28] propose to fine-tune the pre-trained CNNs on a remote sensing dataset, and they propose a novel CNN architecture based on a three-layer perceptron that has fewer parameters and that can learn low-dimensional features. The results show that the fine-tuned CNNs and the novel CNN are effective. Li et al. [29] propose a novel approach based on deep hashing neural networks for large-scale RSIR. Deep feature learning networks and hashing learning networks are combined in an end-to-end network. Zhou et al. [26] propose a novel multi-label RSIR method using fully convolutional networks (FCNs). A pixel-wise labeled dataset is used for training the FCN. The segmentation maps of each remote sensing image are predicted, and region convolutional features are extracted based on the segmentation maps. The experimental results show that the method achieves state-of-the-art performance. While these methods mainly focus on the depth and the width of the network architecture, we pay more attention to "attention".

Attention Mechanism
The attention mechanism is an important part of human perception. It focuses on a specific area of the image in "high resolution", perceives the surrounding area of the image in "low resolution", and then continuously adjusts its focal point. In essence, the attention mechanism involves learning the weight distribution over different parts, so that different parts receive different degrees of concentration. The benefits of this property have been proven in many tasks, ranging from machine translation and text summarization in sequence-based tasks, to classification and segmentation in computer vision.
References [30][31][32] apply the learned weights to the original image, and Wang et al. [33] apply the learned weights to feature maps. In Hu et al. [34], the weights are applied at the channel scale, to weight the different channel features. Closest to our work, Woo et al. [35] exploit both channel-wise and spatial-wise attention, through which the network can learn "what" and "where" to focus. All of these works were proposed for natural image processing, and they have shown excellent performance in classification, detection, and so on. However, there is no attention model designed for processing remote sensing images.

The Loss Function
The performance of CNNs has been continuously improved, owing not only to improvements in the network structure, but also to the development of loss functions.
The softmax function is the most commonly used loss function to supervise the learning process for classification. Taking one image as input, and outputting the image's identity, this kind of model (with the softmax loss function) is called an identification model. Siamese networks, proposed in [36], take a pair of images as input and are called verification models. Such a model can drive the distance to be closer for positive pairs and further for negative pairs. After that, a model combining identification and verification was adopted in References [37,38], which makes the features more discriminative. Besides, Schroff et al. [39] propose the triplet loss, which has proven its effectiveness on many datasets. A model with triplet loss takes three images as input, an anchor, a positive, and a negative, to minimize the distance between the anchor and the positive, and to maximize the distance between the anchor and the negative.
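As a minimal illustration of the triplet objective described above, the following sketch computes the common hinge form of the loss on raw feature vectors; the margin value is our assumption for illustration, not one taken from Reference [39]:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: push d(anchor, negative) beyond d(anchor, positive) + margin."""
    d_ap = np.sum((anchor - positive) ** 2)  # squared Euclidean distance to the positive
    d_an = np.sum((anchor - negative) ** 2)  # squared Euclidean distance to the negative
    return float(max(d_ap - d_an + margin, 0.0))
```

Once the anchor-negative distance exceeds the anchor-positive distance by the margin, the loss vanishes and the triplet no longer contributes to training.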

Scheme 1 - The Attention Module
Our attention module can be connected with the last convolutional layer of the pre-trained CNNs. As it is applied along two dimensions, channel and spatial, the attention module can be divided into the channel attention module and the spatial attention module. The attention module's framework is shown in Figure 3.

The Channel Attention Module
A typical average pooling method, global average pooling, is adopted to squeeze the spatial dimension of the feature map and acquire the channel descriptor Favg (1 × 1 × C). Average pooling has been commonly used in some works [34,35], while Reference [35] suggests exploiting both average pooling and max pooling simultaneously, and proves that this strategy is more effective than using either strategy independently. Different from References [34,35], the attention module in this paper is connected to the last convolutional layer of the CNNs, not to each block. This is mainly because there is insufficient remote sensing data to enable the network to be trained from scratch. We apply average pooling in the attention module, which is connected to the pre-trained CNNs. The design choice among the different pooling methods and the effectiveness of our attention module are shown in Section 4.2.1. Then, Favg is fed into a multi-layer perceptron (MLP) with one hidden layer. The size of the hidden layer is set to C/r, where r is set to 2 in this study. The MLP is followed by a sigmoid function. Then, the input F and the channel attention map Mc are multiplied element-wise to acquire the output channel-refined feature map. In conclusion, the channel attention module is computed as:

Mc = σ(W0(W1 Favg)),  F′ = Mc ⊗ F  (1)

where W1 ∈ R^(C/r×C) and W0 ∈ R^(C×C/r) denote the weights of the MLP, σ denotes the sigmoid function, ⊗ denotes element-wise multiplication, and F′ is the final channel-refined feature map.
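To make the computation concrete, here is a small NumPy sketch of the channel attention path; the ReLU on the hidden layer is our assumption, since the text only specifies the MLP sizes and the sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W1, W0):
    """Channel attention sketch: global average pooling -> MLP (hidden size C/r) -> sigmoid.

    F  : feature map (C, H, W) from the last convolutional layer.
    W1 : (C//r, C) and W0 : (C, C//r) MLP weights, with dimensions as stated in the text
         (biases are omitted for brevity).
    """
    F_avg = F.mean(axis=(1, 2))              # squeeze the spatial dims -> (C,)
    hidden = np.maximum(W1 @ F_avg, 0.0)     # hidden layer of size C/r (ReLU assumed)
    Mc = sigmoid(W0 @ hidden)                # channel attention map, one weight per channel
    F_refined = Mc[:, None, None] * F        # element-wise multiplication of Mc and F
    return F_refined, Mc

# Toy usage: C = 4 channels, r = 2, on an 8x8 feature map.
rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8, 8))
W1 = rng.standard_normal((2, 4))
W0 = rng.standard_normal((4, 2))
F_refined, Mc = channel_attention(F, W1, W0)
```

The sigmoid keeps every channel weight in (0, 1), so the module can only rescale channels, never invert them.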


The Spatial Attention Module
The channel attention module addresses the differences between channels, while the spatial attention module is aimed at the different spatial locations. Human beings can easily capture informative features in an image by comparing the difference between the salient targets and the background. The differences between the salient targets and the background in the images reflect the differing importance of the different spatial locations in the feature maps.
As a complement to the channel attention module, the spatial attention module takes the output of the channel attention module as input to yield the spatial attention map Ms, which focuses on weighing the importance of each spatial location in the feature map. Similar to the channel attention module, average pooling is implemented first to acquire the map Fs (H × W × 1), but the difference is that the average pooling is applied along the channel axis. The choice among the different pooling methods is also verified in Section 4.2.1. The map Fs is regarded as an initial value of every pixel in the feature map. Then, a convolutional layer and a sigmoid function are used to generate the final weights of every location in the feature map. Regarding the filter size used in the convolutional layer, the size adopted in Reference [35] is 7 × 7, but we have verified in Section 4.2.1 that a filter of size 3 × 3 leads to a better performance than filters of other sizes in our attention module. At last, element-wise multiplication is applied to acquire the final refined output F″. The detailed operation is computed as:

Ms = σ(f3×3(Fs)),  F″ = Ms ⊗ F′  (2)

where σ denotes the sigmoid function, and f3×3 denotes the convolutional layer with a filter size of 3 × 3. The specific process of the spatial attention module can be seen in Figure 5.
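Continuing the NumPy sketch from the channel module, the spatial path can be written as follows; the single 3 × 3 filter with zero ("same") padding is our reading of the text:

```python
import numpy as np

def spatial_attention(F_refined, kernel):
    """Spatial attention sketch: channel-wise average pooling -> 3x3 conv -> sigmoid.

    F_refined : channel-refined feature map (C, H, W) output by the channel module.
    kernel    : a single (3, 3) convolution filter, producing one spatial map.
    """
    C, H, W = F_refined.shape
    Fs = F_refined.mean(axis=0)                  # average pooling along the channel axis -> (H, W)
    padded = np.pad(Fs, 1)                       # zero padding so the output stays (H, W)
    conv = np.zeros((H, W))
    for i in range(H):                           # plain 2-D convolution (correlation form)
        for j in range(W):
            conv[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    Ms = 1.0 / (1.0 + np.exp(-conv))             # sigmoid -> spatial attention map
    return Ms[None, :, :] * F_refined            # element-wise multiplication of Ms and F'

rng = np.random.default_rng(1)
F_refined = rng.standard_normal((4, 8, 8))
kernel = rng.standard_normal((3, 3))
F_out = spatial_attention(F_refined, kernel)
```

Because Ms has a single channel, broadcasting applies the same spatial weight to every channel of F′, which is what "weighing the importance of each spatial location" amounts to.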

Scheme 2 - The Center Loss-Based Multi-Task Learning Network
In most CNNs, the softmax function is usually used as the loss function to supervise the training of the model. The softmax loss function is shown in Equation (3), and it can efficiently supervise the network trained for the classification task. The center loss function was proposed in Reference [40] to minimize the intra-class variations for the face recognition task, as formulated in Equation (4):

Ls = −(1/m) Σ_{i=1}^{m} log( e^{W_{yi}^T x_i + b_{yi}} / Σ_{j=1}^{n} e^{W_j^T x_i + b_j} )  (3)

Lc = (1/2) Σ_{i=1}^{m} ||x_i − C_{yi}||²  (4)

where x_i denotes the feature of the i-th sample with label y_i, C_{yi} denotes the class center of the features, n is the number of classes, and m denotes the size of the mini-batch. During training, the update of C_{yi} should consider all of the training samples in each iteration, which is impractical. In [40], the class center C_{yi} is updated based on the mini-batch instead of all of the training samples. A scalar λ is used to balance the two loss terms. The formulation of the improved loss function is given as Equation (5). Stochastic Gradient Descent (SGD) can be used to optimize the parameters in the loss function, but the back-propagation process is complicated; the specific algorithm can be found in Reference [40].

L = Ls + λLc  (5)
In this paper, a new network structure is proposed to leverage the center loss. The update of the center C_{yi} is not decided by the average of the features of each class, which is complicated and time-consuming. Instead, we treat the class centers C_{yi} as parameters to be learned. The class centers are initialized as a matrix K of size (n, k), where n is the number of classes and k denotes the dimension of the feature representation. The input of this branch is the label of the training image, which is the same as the output of the softmax branch. Through the input label, we can obtain the corresponding class center in the class centers matrix K. Thus, the center loss is calculated by minimizing the mean square error between the input image's feature vector and its class center, as formulated in Equation (6):

Lc = (1/m) Σ_{i=1}^{m} ||x_i − K_{yi}||²  (6)

From Reference [40], we can find that a proper value of λ can improve the performance of the features. The hyper-parameter λ in Equation (5), which decides the intra-class variations, is set to 0.0001 in our experiment, which is verified in Section 4.2.2. As shown in Figure 6, the center loss is merged as a branch of the network; the class centers are learned, and the distances between the features and their corresponding centers are minimized simultaneously. Our scheme is simpler, and its effectiveness is verified experimentally in Section 4.2.2.
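Under the assumption that the centers are simply rows of a learnable matrix K looked up by label, the forward computation of the combined objective L = Ls + λLc can be sketched as follows (the exact normalization of the center term is our choice, consistent with a mean-squared-error reading of the text):

```python
import numpy as np

def combined_loss(features, labels, W, b, K, lam=1e-4):
    """Forward pass of the multi-task objective L = Ls + lambda * Lc.

    features : (m, k) mini-batch of feature vectors.
    labels   : (m,) integer class labels y_i.
    W, b     : (n, k) and (n,) softmax classifier parameters.
    K        : (n, k) matrix of class centers, treated as learnable parameters.
    """
    m = features.shape[0]
    logits = features @ W.T + b
    logits = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    Ls = -log_prob[np.arange(m), labels].mean()                  # softmax loss
    diff = features - K[labels]                                  # look up each center by label
    Lc = (diff ** 2).sum(axis=1).mean()                          # mean squared error to centers
    return Ls + lam * Lc

rng = np.random.default_rng(2)
m, k, n = 8, 16, 5
features = rng.standard_normal((m, k))
labels = rng.integers(0, n, size=m)
W, b, K = rng.standard_normal((n, k)), rng.standard_normal(n), rng.standard_normal((n, k))
loss = combined_loss(features, labels, W, b, K)
```

In a framework such as PyTorch, K would typically be an `nn.Embedding` indexed by the label, so that SGD updates the centers alongside the network weights, exactly as a second network branch.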

Experimental Setup
The datasets used in our experiments are mainly AID [23] and PatternNet [41], both of which contain a large number of remote sensing images.
The Aerial Image Data (AID) dataset is composed of 30 categories of typical aerial scene images with a size of 600 × 600 pixels, collected from Google Earth. The number of images varies considerably across classes, from 220 up to 420, for a total of 10,000 images. Examples from every category are shown in Figure 7a.
PatternNet comprises 38 categories of high-resolution remote sensing images with a size of 256 × 256 pixels, selected from Google Earth. Each category contains 800 images, for a total of 30,400 images. Examples from every category are shown in Figure 7b.

Experimental Setup
The datasets used in our experiments are mainly AID [23] and PatternNet [41], both of which contain a large number of remote sensing images.
The Aerial Image Dataset (AID) is composed of 30 categories of typical aerial scene images with a size of 600 × 600 pixels, collected from Google Earth. The number of images per class varies considerably, from 220 up to 420, for a total of 10,000 images. Examples from every category are shown in Figure 7a.
PatternNet comprises 38 categories of high-resolution remote sensing images with a size of 256 × 256 pixels, selected from Google Earth. Each category contains 800 images, for a total of 30,400 images. Examples from every category are shown in Figure 7b.
Currently, the datasets used for CBRSIR are mostly datasets built for scene classification. However, there is an obvious difference between classification and retrieval: in retrieval, category labels are not available during search and are only used for accuracy evaluation. Nevertheless, some work conducts experiments by dividing a single dataset into a training set and a test set, so that the category information is utilized in the training process, which is contrary to this precondition of image retrieval. Different from this, in this paper a challenging dataset that is more in line with the preconditions of image retrieval is constructed, to better verify the effectiveness of our method. The dataset used for retrieval contains three subsets, the training set, probe set and test set, where the training set is disjoint from the probe set and the test set. The label information in the training set is applied to fine-tune the pre-trained CNNs, but labels are not available in the probe set
and test set during the process of retrieval. Specifically, the dataset AID is chosen as the training set to fine-tune the network pre-trained on ImageNet, for its relatively large amount of data and diverse categories. The probe and test sets are selected from PatternNet. Twenty images are picked from each of the 38 categories in PatternNet, a total of 760 images, forming the probe set. There are a total of 8162 images in the test set, of which 7600 are from PatternNet as ground truth, 200 per category, and the remaining 562 images, acting as distractors, are collected from other remote sensing image datasets and are unrelated to the images of PatternNet. These datasets include RSSCN7 [42], UC Merced Land Use [43] and WHU-RS19 [44].
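The probe/test construction above can be sketched as follows. The function name, the synthetic file names, and the random shuffling are our own illustrative assumptions; only the split sizes (20 probe images and 200 gallery images per class, plus unrelated distractors) come from the text.

```python
import random

def build_retrieval_splits(images_by_class, distractors,
                           probe_per_class=20, gallery_per_class=200, seed=0):
    """Split each class into probe queries and gallery images; distractor
    images from unrelated datasets are appended to the gallery to form
    the final test set. Distractors carry no ground-truth label."""
    rng = random.Random(seed)
    probe, test = [], []
    for label, images in images_by_class.items():
        images = images[:]
        rng.shuffle(images)
        probe += [(img, label) for img in images[:probe_per_class]]
        test += [(img, label)
                 for img in images[probe_per_class:probe_per_class + gallery_per_class]]
    test += [(img, None) for img in distractors]
    return probe, test
```

With 38 PatternNet classes and 562 distractors, this yields the 760-image probe set and 8162-image test set described above.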
In the experiments, the Euclidean distance is used as the similarity measure. VGG16, VGG19 and ResNet50 are chosen as the baseline networks. The average normalized modified retrieval rank (ANMRR), the precision at k (P@k, where k is the number of returned images), and the mean average precision (mAP) are used to assess the performance of CBRSIR. ANMRR and mAP can comprehensively evaluate the retrieval performance, since they consider the order in which all ground truths appear among the retrieved images.
Besides, the class-level precision is adopted as another evaluation criterion. The precision of the i-th class can be expressed as n/ε, where n is the number of correctly retrieved images of class i among the top ε retrieved images, and ε is set to 20. Although the class-level precision cannot measure retrieval performance as comprehensively as mAP and ANMRR, since it only considers the top 20 retrieved images, it can depict the precision of each class and thus reflect the differences among the classes. It is worth noting that lower values of ANMRR reflect better performance, while for mAP, P@k and class-level precision, larger values are better.
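Two of these measures are simple enough to sketch directly. The snippet below is an illustrative NumPy implementation of P@k and of per-query average precision (assuming every ground truth appears somewhere in the ranked list); the function names are ours, and ANMRR is omitted here since its normalization involves additional bookkeeping.

```python
import numpy as np

def precision_at_k(relevant, k=20):
    """P@k: fraction of the top-k retrieved images that are relevant.
    `relevant` is a boolean array ordered by retrieval rank; with k=20
    and a per-class query this equals the class-level precision n/epsilon."""
    return float(np.mean(relevant[:k]))

def average_precision(relevant):
    """AP for one query: mean of precision-at-hit over the ranks where a
    ground truth appears; mAP is the mean of AP over all queries."""
    relevant = np.asarray(relevant, dtype=bool)
    ranks = np.flatnonzero(relevant) + 1      # 1-based ranks of the hits
    hits = np.arange(1, len(ranks) + 1)       # number of hits so far
    return float(np.mean(hits / ranks)) if len(ranks) else 0.0
```

For a ranked list where the 1st and 3rd results are relevant, P@2 is 0.5 and AP is (1/1 + 2/3)/2 = 5/6.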

Design Choice and Effects of Scheme 1
In this section, the design process and the effectiveness of our attention module are shown. The design process of the module mainly consists of three parts. We first compare three pooling strategies adopted in the channel attention module: max pooling, average pooling, and the joint use of both, as in Reference [35]. The experimental results with the different pooling strategies are shown in Table 1. On the one hand, we can observe that CNN features combined with the attention module outperform the baselines, especially for VGG16 and VGG19. The improvement on ResNet50 is not obvious; the main reason is that the performance of ResNet50 is already relatively good, so the room for improvement is not as large as for VGG. We observe that the attention module with any of the pooling methods is beneficial for improving the performance, compared with the baselines. On the other hand, the results imply the advantage of average pooling over the other two methods. Average pooling achieves better performance in both mAP and ANMRR, improving the mAP from 0.5641 to 0.6000 in VGG16, from 0.5518 to 0.5858 in VGG19, and from 0.7080 to 0.7187 in ResNet50. There are improvements of almost 4%, 2.5%, and 0.7% for VGG16, VGG19, and ResNet50, respectively, in the value of ANMRR.

Table 3. Comparisons of different attention modules on VGG16, where AM denotes our attention module.

Description        ANMRR    mAP
Vgg16 (baseline)   0.3691   0.5641
Vgg16_AM           0.3283   0.6097
Vgg16_SE [34]      0.4911   0.4296
Vgg16_CBAM [35]    0.4910   0.4316

The Results of Scheme 2
In this part, we experimentally verify the effect of our scheme 2, by evaluating the performance of different CNNs trained under the supervision of different loss functions, and making a comparison between our method and the baselines. A model combining identification loss and verification loss [37,38], and a model proposing the triplet loss [39], are two popular and effective loss formulations in natural image processing. The comparison between our method and these two methods is shown in Table 4. We observe that the models in References [37-39] are not as effective as when dealing with natural image problems, such as face recognition and person re-identification. We believe that this is mainly because the remote sensing datasets are not as complex as pedestrian and face datasets, so the more complex models [37-39] do not achieve better results on relatively simple remote sensing datasets. Our method, with a simpler loss function, is meaningful in improving the performance compared to References [37-39]. As depicted in Figure 10, we visualize the deep features of five classes obtained through the two training modes, to compare their differences intuitively; Principal Component Analysis (PCA) is adopted to compress the 512-dimensional features obtained by VGG16 to two dimensions. Figure 10a,b exhibit the distributions of the softmax loss without center loss, and of the softmax loss with center loss; Figure 10c is the distribution of References [37,38], and Figure 10d that of Reference [39]. We can observe that the distribution of the same class in Figure 10b is more compact, and that the features are relatively separable, compared with Figure 10a,c,d. As a brief conclusion, the features trained under the softmax loss combined with the center loss are more discriminative, owing to their more dispersed inter-class and more compact intra-class distributions. Figure 11 shows the mAP of different models with different hyperparameters λ. It is clear that the softmax loss alone (λ is 0) is not the best choice. The best performance is acquired when λ is set to 0.0001, and the performance decreases sharply as λ continues to increase. Figure 12 demonstrates the results of class-level precision, from which we can observe that ResNet50 outperforms the other two baselines, and that the center loss can help the CNNs to achieve better performance for the majority of the classes. The performance of our method and the original CNN models is summarized in Table 5. There are improvements of almost 2%, 1%, and 0.7% for the baselines VGG16, VGG19, and ResNet50, respectively, in the values of mAP and ANMRR. We conclude that the center loss is meaningful for boosting the discriminative power of deep features, comparing the ANMRR and mAP with those of the baseline networks.
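The 2-D projection used for the feature visualization can be sketched with a plain NumPy PCA. This is a generic sketch, not the paper's plotting code: the function name is ours, and it simply keeps the top-2 principal directions of the centered feature matrix.

```python
import numpy as np

def pca_2d(features):
    """Project features (n_samples x n_dims, e.g. 100 x 512) to 2-D:
    center the data, take its SVD, and keep the two leading right
    singular vectors as the principal directions."""
    centered = features - features.mean(axis=0)
    # rows of vt are the principal directions, ordered by singular value
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T
```

The two output columns are the first and second principal components, so the first column always carries at least as much variance as the second.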

Discussion
Through the above experiments and comparisons, the two schemes that we proposed can be proven to be effective.Based on the experimental results, we make some further discussion as follows:

•
In the first scheme, the pre-trained CNNs connected with our simple attention module are regarded as the feature extractors. To give an extensive evaluation of our scheme, we conduct four comparative experiments: comparisons with fine-tuned VGG16, VGG19, and ResNet50; comparisons among different pooling methods; comparisons among different filter sizes; and comparisons with different attention modules. The results of experiment 1 in Table 1 show that our scheme is beneficial for improving the CNNs' ability of feature representation. The results of experiments 2 and 3 in Tables 1 and 2 show that the design choices of the attention module are appropriate and effective. The results of experiment 4 in Table 3 indicate that our attention module is more suitable for remote sensing image processing than SE [34] and CBAM [35], which were designed for processing natural images. The attention module can further weight the features extracted by the CNNs, so that the more meaningful features are emphasized, which is a possible explanation for the effect of our attention module.

•
In the second scheme, a novel multi-task learning network structure that can further boost the discriminative power of the features is proposed, based on the center loss. By reducing the intra-class distance, the center loss adopted in our network compensates for the shortcoming of the softmax loss. From the comparisons in Table 4 and the distributions in Figure 10, the center loss, integrated with the softmax loss, achieves better performance than the other loss functions [37-39], and more discriminative features, rather than merely separable features, can be learned under its supervision. Better performance can also be found in Table 5, compared with the fine-tuned CNNs, which indicates that the center loss is meaningful for boosting the discriminative power of deep features. Compared to the fine-tuned CNNs, a more compact intra-class distance is the key to the better performance of our scheme.

•
The validity of the combination of scheme 1 and scheme 2 is verified in Sections 4.2.3 and 4.2.4. The re-weighted feature maps produced by the attention module and the more discriminative feature representation produced by the center loss are combined and compared with the other schemes and baselines. The results in Tables 6 and 7 show the remarkable performance of our combined schemes, which further validates the effectiveness of our two schemes and the feasibility of combining them.

•
Learning-based features have attracted increasing interest not only in remote sensing image retrieval but also in the computer vision community. For example, most baselines in instance retrieval and person re-identification are learning-based features. Although CNN features are commonly adopted by remote sensing image retrieval and other retrieval tasks, differences still exist. Specifically, person re-identification aims at retrieving a person of interest across cameras and is based on person detection: the target of interest has already been detected, so there is almost no interfering information such as background, whereas such interference is present in remote sensing image retrieval. Our scheme 1 is proposed for this point, to focus on the salient target. Compared to instance retrieval, which aims to retrieve images containing the same object or architecture possibly captured under different views, remote sensing image retrieval belongs to class retrieval, which aims at retrieving images of the same class as the query; this is why the datasets used for CBRSIR are mostly datasets for remote sensing scene classification. Based on this, our scheme 2 is designed to penalize the intra-class distances of the features. In short, our method is designed for the characteristics of remote sensing image retrieval and the particularity of its datasets.

Conclusions
In this paper, we proposed two schemes to acquire discriminative features for remote sensing image retrieval. Our first scheme, an attention module, is a simple module with a small computational cost, applied to capture the salient local features and to suppress less useful ones. By operating along both the channel and spatial dimensions, our attention module can emphasize the important features along those two axes. Our second scheme adopts the center loss to improve the network structure of the original classification training. The advantage of the center loss is to make the deep features inter-class dispersed and intra-class as compact as possible, which is well suited to remote sensing image retrieval. To verify the validity of the approach, a more challenging dataset is built, which consists of multiple published datasets for remote sensing image retrieval and scene classification. Finally, extensive experiments on the challenging dataset and comparisons with the baselines demonstrate the effectiveness and superiority of our two schemes, especially their combination, which achieves the best performance.
Though our proposed feature learning approach achieves better performance, there are still some shortcomings that we cannot neglect. As described in Section 3.1, our attention module can only be connected to the convolutional layers of CNNs. However, both the fully connected layers and the convolutional layers can be used as feature representations. In Reference [27], the fully connected layer of some CNNs can obtain better retrieval performance than the convolutional layer under certain conditions. So, how to overcome this limitation on the use of the attention module is one of our future focuses. In addition, although the attention module is proposed for remote sensing image retrieval, it can also be used for other tasks in remote sensing image processing, such as object detection and scene classification.

Figure 2. The framework of our proposed discriminative feature learning approach.
The attention module is divided into the channel attention module and the spatial attention module. The attention module's framework is shown in Figure 3.

Figure 3. Diagram of the attention module.

Figure 4. Diagram of the channel attention module.

3.1.1. The Channel Attention Module
Given the last convolutional layer F (H × W × C) of any CNN as input, the channel attention module learns the channel attention map Mc (1 × 1 × C). As is well known, the last convolutional layer of a CNN contains the richest high-level semantic information, and its different channels can be regarded as different features. For example, there are 2048 channels in the last convolutional layer of ResNet, and 512 in VGG. Not all of these 2048 or 512 features contribute equally to the feature representation. Thus, the vector Mc is learned in order to weigh the importance of the different channels. The process of the channel attention module is illustrated in Figure 4.
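The channel re-weighting idea can be sketched in NumPy as below. This is a hedged illustration, not our exact module: the SE-style two-layer bottleneck with weights w1 and w2 is an assumption made for brevity, while the average pooling and the sigmoid-gated per-channel weights Mc follow the description above.

```python
import numpy as np

def channel_attention(feature_map, w1, w2):
    """Re-weight the channels of a feature map F (H x W x C):
    global average pooling yields one descriptor per channel, a small
    ReLU bottleneck scores each channel, and a sigmoid maps the scores
    to (0, 1) attention weights Mc (1 x 1 x C)."""
    descriptor = feature_map.mean(axis=(0, 1))    # (C,) average-pooled descriptor
    hidden = np.maximum(descriptor @ w1, 0.0)     # ReLU bottleneck
    mc = 1.0 / (1.0 + np.exp(-(hidden @ w2)))     # sigmoid -> per-channel weights
    return feature_map * mc                       # broadcast over H and W
```

Because each weight in Mc lies strictly between 0 and 1, salient channels are preserved while less useful channels are attenuated, never amplified.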


Figure 5. Diagram of the spatial attention module.


3.2. Scheme 2: The Center Loss-Based Multi-Task Learning Network
In most CNNs, the softmax function is usually used as the loss function to supervise the training of the model. The softmax loss function is shown in Equation (3), and it can efficiently supervise a network trained for the classification task:

L_S = -\sum_{i=1}^{m} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}}    (3)

The center loss function was proposed in Reference [40] to minimize the intra-class variations for the face recognition task, as formulated in Equation (4):

L_C = \frac{1}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2    (4)
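For completeness, the combined objective and the center update rule, reconstructed following the formulation of Reference [40] (so the exact notation is an assumption, though consistent with the symbols above), can be written as:

```latex
% Equation (5): joint objective, softmax loss plus lambda-weighted center loss
L = L_S + \lambda L_C
  = -\sum_{i=1}^{m} \log
      \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_j^{T} x_i + b_j}}
    + \frac{\lambda}{2} \sum_{i=1}^{m} \left\| x_i - c_{y_i} \right\|_2^2

% Equation (6): per-iteration update of class center c_j, averaging the
% residuals of the mini-batch samples assigned to class j
\Delta c_j = \frac{\sum_{i=1}^{m} \delta(y_i = j)\,(c_j - x_i)}
                  {1 + \sum_{i=1}^{m} \delta(y_i = j)}
```

Here m is the mini-batch size, n the number of classes, x_i the deep feature of the i-th sample, c_{y_i} its class center, and δ(·) the indicator function; λ balances the two losses.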

Figure 6. Diagram of the Center Loss-based multi-task learning network structure.

Figure 7. Sample images from the datasets AID and PatternNet: (a) examples from dataset AID, (b) examples from dataset PatternNet.

Figure 8. Class-level precisions of different baseline networks and the corresponding average pooled attention modules, where vgg16_am means the vgg16 combined with the average pooled attention module.

Figure 12. Class-level precision of different baselines trained under different loss functions, where vgg16_cl means the vgg16 trained under the joint supervision of the softmax loss function and the center loss function.

Table 5. Performance comparison between our method and the original CNN models.

Table 1. Comparisons of different attention modules using different pooling methods. The best result for each baseline CNN (Convolutional Neural Network) is reported in bold.

Table 4. Comparisons of the different loss functions. The best result is reported in bold.