Distribution Consistency Loss for Large-Scale Remote Sensing Image Retrieval

Abstract: Remote sensing images are characterized by massive volume, diversity and complexity. These characteristics place higher demands on the speed and accuracy of remote sensing image retrieval, in which feature extraction plays a key role. Deep metric learning (DML) captures the semantic similarity between data points by learning embeddings in a vector space. However, due to the uneven distribution of samples in remote sensing image datasets, the pair-based losses currently used in DML are not well suited to this task. To address this, we propose a novel distribution consistency loss. First, we define a new way to mine samples, selecting five intra-class hard samples and five inter-class hard samples to form an informative set. This method lets the network extract more useful information in a short time. Secondly, to avoid inaccurate feature extraction caused by sample imbalance, we assign a dynamic weight to each positive sample according to the ratio of hard samples to easy samples in its class, and name the resulting loss term the sample balance loss. We combine the sample balance loss on the positive samples with the ranking consistency loss on the negative samples to form our distribution consistency loss. Finally, we built an end-to-end fine-tuning network suitable for remote sensing image retrieval. Comprehensive experiments on three publicly available remote sensing image datasets show that our method achieves state-of-the-art performance.


Introduction
With the rapid advancement of aerospace and remote sensing technology, the capability for comprehensive observation of the earth has been greatly improved. The available remote sensing images have undergone tremendous changes in terms of improved spatial resolution and improved acquisition rates, which has had a profound impact on the way we process and manage remotely sensed images. The increased spatial resolution provides a new opportunity to advance the analysis and understanding of remotely sensed images. The ever-increasing data collection speed allows us to collect large amounts of remote sensing data every day, but this poses a huge challenge for managing large datasets, especially for quickly accessing the data of interest.
The early remote sensing image retrieval system only provides a text-based retrieval interface, and the image is described by related text information, such as image name, geographic region and acquisition time. However, the information does not have direct relation to the visual content of the image. To solve this problem, people are working on content-based image retrieval (CBIR). CBIR is a branch of image retrieval and is a useful technique for quickly retrieving data of interest from a massive database, by extracting features (such as colors and textures) in the visual content and identifying similar or matching images in the database. In recent years, based on the advantages of CBIR, the remote sensing community has invested a lot of energy to make CBIR suitable for remote sensing image retrieval. Henceforth, image retrieval based on remote sensing images has attracted numerous scholarly studies and has achieved huge progress [1,2].
In particular, remote sensing research has centered on developing effective methods to extract features, because retrieval performance largely depends on feature effectiveness. The key issue in CBIR is to find the underlying distinctive characteristics of an image. Traditional feature extraction techniques rely primarily on manually designed features. Their design is subject to human intervention, is subjective, and makes it difficult to express high-level semantic information. Handcrafted features are also commonly used as image representations in remote sensing image retrieval (RSIR) work [3,4], including spectral features [5], shape features [6] and texture features [7]. Compared with low-level features, middle-level features embed low-level feature descriptors into an encoded feature space and use more compact feature vectors to represent complex image textures and structures. Typical methods are BoW (Bag of Words) [4], VLAD (Vector of Locally Aggregated Descriptors) [1] and FV (Fisher Vector) [2].
However, the abovementioned low-level and middle-level features still have a "semantic gap" with respect to high-level features. These feature extraction methods are based on artificial design and limit the ability to express remote sensing image content. The advancement of deep learning has driven content-based image retrieval forward: it discovers the complex structure of a dataset by training on large amounts of data and abstracts the rich information contained in the data into feature vectors, so that handcrafted features are no longer needed for remote sensing images. It has been proven that deep learning performs better than traditional manual features in remote sensing image retrieval [2,8,9]. Moreover, deep learning solves a variety of computer vision barriers as well as remote sensing problems, such as simultaneous extraction of roads and buildings, ultra-high-resolution optical image analysis and hyperspectral image classification. In the context of remote sensing big data, deep learning technology is of great value for image retrieval over massive remote sensing data.
Deep metric learning represents a newly emerging technology that combines metric learning with deep learning. A deep neural network deploys its discriminative power to embed the image into metric space. Simple metrics, including cosine similarity and Euclidean distance, can be directly used to measure the similarity between images [10]. In recent years, deep learning has achieved great success in application areas, which include target recognition, target detection, image segmentation and natural language understanding [11][12][13], and has been gradually applied in the field of image retrieval, such as landmark image retrieval [14], natural image retrieval [14] and face recognition [10]. Despite apparent differences between remote sensing images and ordinary natural images, DML presents huge potential in content-based image retrieval of remote sensing images [15].
The loss function is crucial to the success of DML, and various loss functions have been proposed in the literature. Contrastive loss [16] models the relationship between pairs of data points by pulling similar samples together and pushing dissimilar samples apart. The triplet loss [17] consists of an anchor point, a similar (positive) data point and a dissimilar (negative) data point. It learns a distance metric under which anchor points are closer to positive points than to negative points. Because it exploits the relationship between positive and negative pairs, triplet loss is usually better than contrastive loss [10,18]; inspired by this, recent works [18][19][20][21][22] consider the relationships among multiple data points and achieve good performance in applications such as retrieval and classification.
However, there are still some limitations in the current state of DML for remote sensing image retrieval. First, we noticed that sample mining only uses part of the positive sample information, and the differences between sample categories are ignored. Secondly, we observe that previous losses treat each positive sample equally, thus neglecting the sample differences within a category in the loss calculation; that is, the effect of the quantitative relationship between easy samples and hard samples on the loss optimization. This deficiency affects the quality of image retrieval, especially remote sensing image retrieval. When magnifying the pictures in the remote sensing dataset, we found that the intra-class sample differences vary from category to category. The specific differences are shown in Figure 1. The differences within the categories in Figure 1a,b are small, and the texture and color characteristics are similar, but the differences within the categories in Figure 1c,d are relatively large. After comparison, we find that the selected hard positive samples deserve a larger weight when the intra-class differences are larger, namely when the proportion of hard samples is larger, because they then contribute more to the loss. Therefore, different categories should be assigned different weights when performing positive sample mining; ideally, hard samples with a large percentage should be given greater weights.
Our major contributions in this study are listed as follows:
• For the remote sensing image retrieval task, we propose a novel distribution consistency loss (DCL) to learn discriminative embeddings. Different from previous pair-based losses, it performs loss optimization based on the difference in the number of samples within each class and the sample distribution structure between classes. It includes the sample balance loss, obtained by assigning dynamic weights to selected hard samples based on the ratio of easy to hard samples in the class, and the ranking consistency loss, weighted [23] according to the class distribution of the negative samples.
• A sample mining method suitable for remote sensing images is proposed. The intra-class hard sample mining method selects five positive samples, and each positive sample is given a dynamic weight. Hard "class" mining is used instead of hard "content" mining. This method selects representative samples, which yields richer information while increasing the speed of convergence.


• We built an end-to-end fine-tuning network architecture for remote sensing image retrieval, which applies a convolutional neural network and the methods best suited to remote sensing image retrieval. In DCL, the loss and gradients are computed on sum-pooling (SPoC) [24] features. The loss function influences the activation distribution of the feature response map, which enhances saliency accuracy and extracts more discriminative features. In addition, we compared different combinations of multi-scale image processing, whitening and query expansion, and finally selected the most suitable multi-scale cropping method to process the input data.

• We conducted comprehensive experiments on the large-scale PatternNet dataset [9], the popular UCMD dataset [24] and the NWPU-RESISC45 dataset [25]. The experimental results demonstrate that our method significantly outperforms the most advanced available techniques.

Related Work
In this section, we will first give the formulation of the remote sensing image retrieval method, and then introduce the existing general pair-based weighting loss. Finally, we introduce the sample mining method in metric learning.

Remote Sensing Image Retrieval Methods
The performance of remote sensing image retrieval mainly depends on the expressive power of image features. In the past ten years, people have made great efforts to extract effective features and construct remote sensing image scene datasets [9]. Compared with the traditional feature extraction method, the development of deep learning has greatly improved the quality of image feature extraction. In the context of remote sensing image big data, it is possible to extract remote sensing image features by learning from massive remote sensing image data. Therefore, the use of deep learning methods to improve the accuracy of remote sensing image retrieval tasks has a very broad prospect.
At present, remote sensing image retrieval methods based on deep learning generally use convolutional neural networks (CNNs) to extract the features of remote sensing images, train the network by means of classification, and finally perform retrieval using the features extracted by the network [26]. To obtain more discriminative features, the channel and spatial dimensions can be weighted to obtain salient features [27]. Pre-trained RSIR methods use a trained OverFeat network to extract RS features; the outputs of the seventh layer (DN7) [28] and the eighth layer (DN8) [29] are used as deep features. Yang uses the S-BOW [4] feature to represent the RS image and selects the L1-norm distance to measure image similarity. The similarity between RS images can be converted into the similarity between blur vectors by region-based fuzzy matching (RFM) [30]. Xu discovered useful information in RS images through similarity; with the help of deep learning technology, he designed a feature learning method named Deep BOW under the bag-of-words model [31]. In addition to feature weighting, a saliency module can be added to extract multi-scale convolutional-layer aggregation features [31], or attention mechanisms and local convolutional features [30] can be added to achieve more accurate remote sensing image retrieval. However, the above methods require a large amount of training data. For some remote sensing targets, the number of available training images is small and cannot meet the data requirements of deep learning. At the same time, the rotation of remote sensing targets is not taken into consideration, which makes current deep learning models produce inconsistent or even different features for the same target at different rotation angles, resulting in low retrieval accuracy.
Due to the large number of remote sensing images, the general linear search method is far from meeting the time performance requirements of large-scale remote sensing image retrieval and is replaced by approximate nearest neighbor (ANN) search. The basic idea of ANN is to replace the exact match with an approximate optimum, which greatly improves retrieval efficiency while preserving accuracy. Among ANN techniques, hash learning [32] is commonly used; it is widely applied in large-scale image retrieval due to its advantages in speed and storage. For example, a kernel-based non-linear hashing method for RS images achieves real-time search and fast detection by mapping high-dimensional image feature vectors into compact hash codes [32]. A hashing-based approach introduces a hashing algorithm to encode RS images [31]. However, hash learning methods typically require longer hash codes to achieve satisfactory accuracy, which results in larger storage requirements and retrieval efficiency issues, while shorter hash codes suffer from low retrieval recall.
In recent years, deep learning methods have also been applied to hash coding to obtain better codes [33]. A partial random hash based on random projections can be generated in an unsupervised manner, mapping the image to Hamming space to obtain a low-dimensional representation before training the model [32]. Unsupervised strategies [15], metric-learning-based hash networks [34] and deep hash neural networks [33] are also methods for retrieving large-scale remote sensing images. Through these deep learning-based hashing methods, an image can directly obtain its corresponding hash code. However, to minimize the loss of image feature information, the hash code dimension is often very high, and the entire dataset must be traversed during retrieval, resulting in low retrieval efficiency.
Based on their excellent performance on ImageNet [35] and other problems, convolutional neural networks (CNNs) combined with metric learning are the most effective deep learning methods for image retrieval. However, training a valid CNN from scratch requires many labeled images. Using a CNN pre-trained on ImageNet as a feature extractor, we can learn task-specific features by fine-tuning on the target dataset, so transfer learning is often used to solve the problem of insufficient labeled images. This is very helpful in areas where large-scale publicly available datasets are scarce, such as remote sensing. In [36], Penatti studied the generalization ability of deep features extracted by CNNs by transferring deep features from everyday objects to remote sensing. Experimental results indicate that transfer learning is an effective method for cross-domain tasks. Many pre-trained CNNs are available for transfer learning, such as the Caffe Reference model (CaffeRef) [37], the baseline model AlexNet [38], the VGG network [39], the deeper GoogLeNet [40] and the Residual Network (ResNet) [41].
Recently, these pre-trained CNNs and their modified versions have been widely applied in different image retrieval tasks, ranging from computer vision [42][43][44] to remote sensing [2,45]. Chaudhuri et al. [46] put forward the SGCN architecture for evaluating the similarity between paired graphs, trained for CBIR with a contrastive loss function. Famao et al. [47] use two image-to-class distances, defined as the similarity between an image and an image class, to re-rank the initial retrieval result. For different tasks, improvements have been made at various stages of the retrieval pipeline. Bindita et al. [48] address multilabel RS image retrieval with a semi-supervised graph-theoretic method, avoiding the expensive and time-consuming multi-label annotation of images. Babenko and Lempitsky [24] form image signatures by aggregating deep convolutional descriptors with sum-pooling of convolutional features (SPoC). Tolias et al. [49] used many multi-scale overlapping regions of the last convolutional feature map to extract the maximum activations of convolutions (MAC). In [50], a trainable generalized mean (GeM) pooling layer replaces the MAC layer, which greatly improves retrieval accuracy. A new CNN architecture was recently proposed in [51], which can learn and extract multi-scale local descriptors in the salient regions of images. RSIR can be viewed as a branch of image retrieval: similar images are still identified by visual content. However, due to the particularity of RS images, some commonly used techniques may not be suitable for direct deployment. In this paper, we compare the commonly used techniques and build the retrieval method most suitable for remote sensing images.

Contrastive Loss
Based on selected paired positive samples (samples belonging to the same class as the query sample) and negative samples (samples not belonging to the same class as the query sample), the contrastive loss [16] is designed to minimize the distance between the query sample and a positive sample, while pushing the distance between the query sample and a negative sample beyond a predefined margin α:

L_con = y_ab · D_ab² + (1 − y_ab) · ([α − D_ab]_+)²,

where {D_ab} is the set of all pairwise distances, with D_ab = D(I_a, I_b) as in Equation (1) (I_a and I_b are the L2-normalized SPoC vectors of images a and b, respectively). y_ab ∈ {0, 1} indicates whether a pair (I_a, I_b) is from the same class (y_ab = 1) or not (y_ab = 0); {D_ab | y_ab = 1} is the positive paired distance set, and {D_ab | y_ab = 0} is the negative paired distance set. Besides, [·]_+ is the hinge function, and α is a threshold parameter chosen according to actual needs.
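As a minimal sketch of this formulation (the margin value here is illustrative, not one prescribed by the paper), the per-pair contrastive loss can be computed as:

```python
def contrastive_loss(d, y, alpha=0.7):
    """Per-pair contrastive loss: d is the Euclidean distance D_ab between
    two L2-normalized feature vectors, y is 1 for a same-class (positive)
    pair and 0 for a negative pair, and alpha is the margin."""
    return y * d ** 2 + (1 - y) * max(0.0, alpha - d) ** 2
```

A positive pair is penalized by its squared distance, while a negative pair contributes only when it falls inside the margin.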

Triplet Loss
Given triplet data (I_a, I_b, I_k) with y_ab = 1 and y_ak = 0, the triplet loss [17] is designed to learn a deep embedding that widens the distance of the negative pair until it exceeds that of the positive pair by at least a margin α:

L_tri = [D_ab − D_ak + α]_+.

Concretely, the triplet loss assigns the equal weight w_ijk = 1 to all selected pairs.
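A corresponding sketch of the per-triplet hinge (again with an illustrative margin value):

```python
def triplet_loss(d_pos, d_neg, alpha=0.3):
    """Triplet loss for one (anchor, positive, negative) triplet:
    hinge on (positive distance - negative distance + margin)."""
    return max(0.0, d_pos - d_neg + alpha)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, which is why most randomly sampled triplets stop contributing gradients late in training.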

N-Pair Loss
Triplet loss pulls one positive sample close while pushing one negative sample away, so only three samples in a batch participate in training. N-pair loss [55] increases the number of negative samples that interact with the query sample to improve on triplet loss. It takes advantage of all sample pairs in the mini-batch and learns more discriminative representations from the structural information among the data. In detail, each tuple includes one positive sample and N − 1 negative samples selected from the other categories (one negative sample per category), and the loss function is

L_N-pair = log(1 + Σ_{k≠a} exp(D_a − D_ak)),

where D_ak, a, k = 1, …, N; y_aa = 1; y_ak = 0 if k ≠ a; and D_a is the distance of the positive pair (x_a, x_a⁺). However, N-pair loss assigns the same weight to both the positive and the negative pairs in a tuple. For an anchor, P_a and P_b indicate the numbers of positive and negative sample pairs, and α, ε and the remaining hyper-parameters are fixed.
We differentiate the loss in Equation (5) with respect to the positive and negative sample pairs to obtain the corresponding weights.
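A sketch of the distance-based N-pair term for a single tuple (one positive distance, one negative distance per other class), following the log-sum-exp form given above:

```python
import numpy as np

def n_pair_loss(d_pos, d_negs):
    """N-pair loss for one tuple: d_pos is the positive-pair distance,
    d_negs holds one negative-pair distance per other class."""
    d_negs = np.asarray(d_negs, dtype=float)
    return float(np.log1p(np.sum(np.exp(d_pos - d_negs))))
```

The loss grows whenever any negative-class sample is closer to the anchor than the positive sample, pushing all negative classes away at once.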

Lifted Structured Loss
Different from using merely one negative sample per class, lifted structured loss [18] exploits the training batches of mini-batch SGD, constructing training batches from randomly sampled image pairs or triplets and computing the loss over every pair or triplet within them. The loss is given as a log-sum-exp formulation:

L_lifted = (1 / (2|P|)) · Σ_{(a,b)∈P} ([ log( Σ_{(a,k)∈N} exp(α − D_ak) + Σ_{(b,l)∈N} exp(α − D_bl) ) + D_ab ]_+)²,

where P and N are the sets of positive and negative pairs in the mini-batch. For a query sample, lifted structured loss explores structural relationships by weighing a positive sample against all negative samples of the mini-batch. The weights of the positive sample pairs and of the negative sample pairs are given in Equations (9) and (10), respectively.

Multi-Similarity Loss
Building on binomial deviance loss and lifted structured loss, multi-similarity loss [33] defines self-similarity and relative similarity, and proposes a general weighting strategy that takes advantage of both positive and negative sample pairs. For an anchor a with mined positive pair set P_a and negative pair set N_a, the loss is calculated as

L_MS = (1/α) · log(1 + Σ_{k∈P_a} exp(−α(S_ak − λ))) + (1/β) · log(1 + Σ_{k∈N_a} exp(β(S_ak − λ))),

where S_ak is the pairwise similarity and α, ε and the remaining hyper-parameters control the weights of different pairs. The aforementioned contrastive loss, triplet loss and N-pair loss give the same weight to the positive and negative sample pairs. Unlike them, Equations (6) and (7) show that binomial deviance loss considers self-similarity, and lifted structured loss sets weights for positive and negative sample pairs according to relative similarity, as in Equations (9) and (10). Multi-similarity loss combines the distribution of the sample itself and the surrounding samples, taking into account both the self-similarity and the relative similarity of the sample pairs. However, this approach ignores the distribution of samples within a class and the differences between classes.
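A sketch following the published multi-similarity formulation; the hyper-parameter values here are commonly used defaults, not values taken from this paper:

```python
import numpy as np

def multi_similarity_loss(s_pos, s_neg, alpha=2.0, beta=50.0, lam=0.5):
    """Multi-similarity loss for one anchor: s_pos / s_neg are the cosine
    similarities of its mined positive / negative pairs, lam is the
    similarity threshold, alpha and beta the positive/negative scales."""
    s_pos = np.asarray(s_pos, dtype=float)
    s_neg = np.asarray(s_neg, dtype=float)
    pos_term = np.log1p(np.sum(np.exp(-alpha * (s_pos - lam)))) / alpha
    neg_term = np.log1p(np.sum(np.exp(beta * (s_neg - lam)))) / beta
    return float(pos_term + neg_term)
```

Well-separated anchors (positives above the threshold, negatives below) incur a small loss; overlapping similarities are penalized steeply by the exponential terms.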

Sample Mining
Many metric learning loss functions are built on sample pairs or triplets, so the sample space is of very large magnitude. In general, it is difficult for the model to exhaustively learn all sample pairs during training, and most sample pairs or triplets carry little information. Especially in the later stages of training, the gradients on these pairs or triplets are almost zero. Without targeted optimization, the learning algorithm converges slowly and easily falls into a local optimum. This is not conducive to learning better representations, so sample mining plays a key role in metric learning.
Hard sample mining is an important means of speeding up the convergence of learning algorithms and improving the generalization ability and learning effect of the network [10,56,57]. TriHard loss [17] is an online hard sample mining method based on a training batch, improved on the basis of triplet loss. For each anchor in a batch, it selects the hardest positive sample and the hardest negative sample to form a triplet, although this produces only a small number of triplets; when enough triplets are needed, a larger batch is usually required [10]. MAML [58] selects only the hardest positive pair and the hardest negative pair for each picture in the batch, a harder sampling scheme than TriHard. In addition, it also considers relative and absolute distances, and it performs better than TriHard loss. N-pair loss considers the query sample and negative samples of several other classes in each parameter update, which speeds up model convergence. Lifted structured loss computes the loss over all positive and negative sample pairs in a mini-batch. The triplets of triplet loss are determined in advance, and all negative samples are considered during construction. Proxy-NCA loss [22] selects, as a proxy, the sample closest to a small portion of the training data when sampling. For ranked list loss [59], the sampling strategy is to select samples whose loss is non-zero. Although all samples within the threshold are mined, the differences between negative sample classes and the effects of surrounding samples are not considered.
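The TriHard-style batch-hard mining described above can be sketched as follows, given a batch's pairwise distance matrix and labels (function and variable names are illustrative):

```python
import numpy as np

def batch_hard_triplets(dist, labels):
    """For each anchor in the batch, pick the hardest (farthest) positive
    and the hardest (closest) negative, forming one triplet per anchor."""
    labels = np.asarray(labels)
    n = len(labels)
    triplets = []
    for a in range(n):
        pos_mask = (labels == labels[a]) & (np.arange(n) != a)
        neg_mask = labels != labels[a]
        if not pos_mask.any() or not neg_mask.any():
            continue  # anchor has no positive or no negative in this batch
        hardest_pos = np.where(pos_mask)[0][np.argmax(dist[a][pos_mask])]
        hardest_neg = np.where(neg_mask)[0][np.argmin(dist[a][neg_mask])]
        triplets.append((a, int(hardest_pos), int(hardest_neg)))
    return triplets
```

Only one triplet per anchor survives, which is why batches must be large enough to contain informative positives and negatives for every class present.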
We fully consider the diversity and difference of samples, select multiple positive samples and negative samples of different types, as well as set the distance from the sample according to the distribution of the neighbor samples around the negative samples. Figure 2 shows a comparison of our method with other different methods.
In the triplet loss [17], the anchor (the query picture, in brown) is compared to only one positive sample and one negative sample. N-pair-mc [19], Proxy-NCA [22] and Lifted Struct [18] introduce one positive sample and multiple negative sample classes. N-pair-mc randomly selects one negative sample from each negative class. Proxy-NCA pushes the proxy of the negative class, rather than the negative sample itself, away from the anchor. Lifted Struct uses all negative samples. In contrast, we select multiple positive samples and negative samples from different classes, and pull different classes apart by different distances.

Methodology
In this section, we describe how to develop a positive and negative sample mining strategy to make full use of the effective training of sample information, and then design a weighted approach of positive (negative) sample pairs based on positive (negative) sample characteristics. Finally, we propose a novel and effective metric learning loss function.
We set X = {(x_i, y_i)}_{i=1}^N as the input data, where x_i represents the i-th image and y_i is the label of the corresponding class. The total number of classes is C, where y_i ∈ {1, 2, 3, …, C}. An instance x_i is projected onto the unit sphere in an l-dimensional space by f(·; θ). Let {x_j^c} be the images in the c-th class, where the total number of such images is N_c. Our purpose is to learn a discriminative function that yields higher similarity between positive sample pairs and lower similarity between negative sample pairs. Therefore, each category must contain at least two images in order to evaluate all categories; in this case, for each sample we aim to find a paired sample among the other samples of the same category.

Distribution Consistency Loss
As shown in Figure 3, our distribution consistency loss consists of two parts. The first part is the sample balance loss on the positive samples (the so-called sample balance refers to the ratio of the number of hard positive samples to easy positive samples), and the second part is the ranking consistency loss on the negative samples. Ranking consistency here means maintaining the true distribution between classes by learning the ranking of the class of each negative sample. The specific details are as follows.
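As an illustrative sketch only: the exact dynamic-weight formula is defined by the method itself, and the function below merely encodes the monotone idea described in the text, that classes with a larger share of hard positives receive larger weights.

```python
def sample_balance_weight(n_hard, n_easy):
    """Hypothetical dynamic weight for a class: the larger the proportion
    of hard positive samples, the larger the weight. This is an
    illustrative stand-in, not the paper's exact formula."""
    return n_hard / (n_hard + n_easy)
```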
In the test phase, we apply multi-scale processing to the query image, feed it into the network fine-tuned in the training phase, and then return the top K images associated with the query image.
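The test-phase retrieval step described above (rank database features by Euclidean distance to the query and return the top K) can be sketched as follows; feature extraction itself is assumed to have been done already:

```python
import numpy as np

def retrieve_top_k(query_feat, db_feats, k=5):
    """Return the indices of the K database images closest to the query,
    using Euclidean distance between L2-normalized features."""
    q = np.asarray(query_feat, dtype=float)
    q = q / np.linalg.norm(q)
    db = np.asarray(db_feats, dtype=float)
    db = db / np.linalg.norm(db, axis=1, keepdims=True)
    d = np.linalg.norm(db - q, axis=1)
    return np.argsort(d)[:k].tolist()
```

On unit-normalized features, ranking by Euclidean distance is equivalent to ranking by cosine similarity.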

Sample Mining
Sample mining can achieve rapid convergence of networks and improve network performance, and it is widely used in metric learning [10,17,19,60-63]. Given a query sample x_i^c, in order to mine the positive and negative samples, we first use the currently effective network [50] to extract image features, calculate the Euclidean distance between all samples and the query sample x_i^c, and rank the samples by this distance. We define P_{c,i} as the set of images in the same category as the query image, P_{c,i} = {x_j^c : j ≠ i}, with |P_{c,i}| = N_c − 1. N_{c,i} is the set of all other images, N_{c,i} = {x_j^k : k ≠ c}, with |N_{c,i}| = Σ_{k≠c} N_k. We create a dataset of tuples (x_i^c, P*_{c,i}, N*_{c,i}), where x_i^c represents the query image, P*_{c,i} is the positive set selected from P_{c,i}, and N*_{c,i} is the negative set selected from N_{c,i}. The training image pairs consist of these tuples, where each tuple corresponds to |P*_{c,i}| positive sample pairs and |N*_{c,i}| negative sample pairs. Positive set P*_{c,i}: for the query sample x_i^c, we select the |P*_{c,i}| in-class samples x_j^c that are furthest from it (hard samples) as positive samples. Negative set N*_{c,i}: for negative samples, in order to learn the differences between classes, we propose negative "class" mining as opposed to negative "instance" mining, which greedily selects a negative class in a relatively efficient manner. In particular, we select the nearest negative sample based on the distance between the query sample and all samples; that is, the sample that has the highest similarity to the query sample but belongs to a different category. Next, we look for the second closest sample: when it belongs to the same class as a previously found negative, it is discarded and the search continues; otherwise it becomes the second negative sample, and so on, until we have chosen |N*_{c,i}| negative samples, each from a distinct class.
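The mining procedure described above (furthest in-class samples as hard positives, greedy nearest negatives drawn from distinct classes) can be sketched as follows; the function name and toy data are illustrative, not the authors' code:

```python
import numpy as np

def mine_informative_set(features, labels, query_idx, n_pos=5, n_neg=5):
    """Select the n_pos furthest same-class samples (hard positives) and the
    n_neg nearest other-class samples, greedily taken from distinct classes."""
    q = features[query_idx]
    dists = np.linalg.norm(features - q, axis=1)
    order = np.argsort(dists)                      # ascending distance from the query

    same = [i for i in order if labels[i] == labels[query_idx] and i != query_idx]
    positives = same[-n_pos:]                      # furthest in-class samples

    negatives, seen_classes = [], set()
    for i in order:                                # greedy negative "class" mining
        if labels[i] == labels[query_idx] or labels[i] in seen_classes:
            continue
        negatives.append(i)
        seen_classes.add(labels[i])
        if len(negatives) == n_neg:
            break
    return positives, negatives

# Toy example: 1-D features, 4 classes.
feats = np.array([[0.0], [1.0], [5.0], [0.5], [6.0], [0.9], [7.0], [1.1]])
labs = [0, 0, 0, 1, 1, 2, 2, 3]
pos, neg = mine_informative_set(feats, labs, query_idx=0, n_pos=2, n_neg=3)
```

Each mined negative comes from a different class, which is what lets the loss learn a ranking over classes rather than over individual instances.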
Hard samples within the positive set: in the weighting process of positive samples, we delineate a hard-sample boundary within the positive set to reduce the influence of the numerical relationship between easy and hard positive samples. Assume x_i^c is the query sample. We define the similarity of two samples as S_ij := ⟨f(x_i; θ), f(x_j; θ)⟩, where ⟨·,·⟩ denotes the inner product; this yields an n × n similarity matrix S whose element at (i, j) is S_ij, and µ is a hyperparameter that controls the hard-sample boundary. The number of hard positive samples satisfying this constraint is denoted n_hard.
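The exact boundary inequality was lost in extraction; the sketch below assumes a positive pair counts as hard when its similarity to the anchor falls below µ, merely to illustrate how n_hard could be obtained from the similarity matrix (names and data are hypothetical):

```python
import numpy as np

def count_hard_positives(unit_feats, labels, anchor, mu=0.1):
    """S_ij = <f(x_i), f(x_j)>; count the anchor's in-class pairs whose
    similarity falls below the assumed hard boundary mu."""
    S = unit_feats @ unit_feats.T                      # n x n similarity matrix
    labels = np.asarray(labels)
    in_class = (labels == labels[anchor]) & (np.arange(len(labels)) != anchor)
    return int(np.sum(S[anchor, in_class] < mu))
```

For unit-norm features, S_ij is the cosine similarity, so a small S_ij means the in-class sample sits far from the anchor on the sphere, i.e. a hard positive.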

Distribution Consistency Loss Weighting
Through the sample mining strategy, we can select samples with representative information and discard samples with less information. We developed different soft weighting schemes for positive and negative sample pairs.
For positive sample pairs, our weighting mechanism relies on the number and distribution of easy and hard samples within the class. For an anchor, the more hard samples there are in the class, the more information is contained in the selected positive sample pairs, so during training we give those pairs a large weight. When the number of hard samples in the class is small, the selected hard samples may be noise, or the information they carry may not be representative; assigning a large weight in this case could deviate the overall learning direction of the model and lead to invalid learning. Therefore, for classes with a small number of hard samples, we assign less weight to the selected sample pairs. Specifically, given a selected positive pair (x_i, x_j), its weight w_ij^+ is computed from the proportion of hard samples in the class, where ϑ is a hyperparameter.
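The paper's exact expression for w_ij^+ was lost in extraction; the sketch below assumes a simple monotone form, (n_hard / (n_hard + n_easy))^(1/ϑ), purely to illustrate the stated behaviour (more hard samples in a class gives a larger weight). This formula is our own placeholder, not the authors':

```python
def positive_pair_weight(n_hard, n_easy, vartheta=1.0):
    """Hypothetical weight: grows with the share of hard positives in the class,
    scaled by the hyperparameter vartheta (not the paper's exact formula)."""
    ratio = n_hard / (n_hard + n_easy)
    return ratio ** (1.0 / vartheta)
```

With ϑ = 1 the weight is simply the hard-sample fraction, so a class that is mostly easy samples contributes little to the positive loss.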
For negative samples, we use the weight given by the distribution entropy [23] to maintain the similarity ranking consistency of the classes. The distribution entropy defines the weight value w_ij^−.

Optimization Objective
For each query x_i^c, we assign the weight w_ij^+ defined by the quantitative relationship between the hard and easy samples in the class of the selected positive sample, and we use a margin m to make the query closer to its positive set P_{c,i} than to its negative set N_{c,i}. Moreover, we compel all negative samples to be farther away than a dynamic boundary w_ij^− · τ. This threshold is determined by the similarity between the negative samples selected from the different classes and the query image. Thus, all samples from the same class are pulled into a hypersphere.
We attempt to pull all non-trivial positive points in P_{c,i} together and learn a class hypersphere by minimizing a positive-pair loss, where f(x_i^c) and f(x_j^c) represent the feature vectors of images x_i^c and x_j^c, respectively, and ||f(x_i^c) − f(x_j^c)|| represents the Euclidean distance between them. Similarly, we intend to push all non-trivial negative points in N_{c,i} beyond the boundary w_ij^− · τ by minimizing a corresponding negative-pair loss.
Remote Sens. 2020, 12, 175
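The two loss equations themselves were dropped in extraction. Under the mining constraints quoted in the experiments (positives with distance beyond τ − m, negatives with distance inside τ), one plausible hinge-style reading of the two terms is sketched below; the function name and exact form are our assumption, not the paper's equations:

```python
import numpy as np

def dcl_terms(d_pos, w_pos, d_neg, w_neg, tau=1.05, m=1.0):
    """Sketch: weighted positives are penalized beyond the boundary tau - m;
    negatives are penalized inside the dynamic boundary w_neg * tau."""
    l_pos = float(np.sum(np.asarray(w_pos) * np.maximum(0.0, np.asarray(d_pos) - (tau - m))))
    l_neg = float(np.sum(np.maximum(0.0, np.asarray(w_neg) * tau - np.asarray(d_neg))))
    return l_pos, l_neg
```

The hinge keeps already-satisfied pairs (easy positives inside τ − m, easy negatives beyond w^−·τ) from contributing any gradient.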
In DCL, we treat the two minimization objectives equally and optimize them jointly: we update f(x_i^c) according to a weighted combination of the other elements. We also discuss the process of updating data points in Section 3.2.
For the learning of deep models, we optimize DCL with stochastic gradient descent over mini-batches. Each mini-batch is a randomly sampled subset of the whole set of training classes. Every image x_i^c in the mini-batch serves as the query (anchor) iteratively, and the other images act as the gallery; the DCL of each mini-batch aggregates the per-anchor losses, where the batch size is represented by N. We clarify the DCL-based learning of the deep embedding function f in Figure 4.
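The anchor-iteration scheme above can be sketched as a small loop; `per_anchor_loss` stands in for the per-query DCL computation and is a placeholder of ours:

```python
def minibatch_dcl(features, labels, per_anchor_loss):
    """Every image in the mini-batch serves once as the query (anchor), with the
    remaining images as the gallery; the batch loss averages the N anchor losses."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        gallery = [j for j in range(n) if j != i]
        total += per_anchor_loss(i, gallery, features, labels)
    return total / n
```

Because every image plays the anchor role exactly once, a batch of N images yields N ranked galleries per optimization step.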


Experiments and Discussion
In this section, we discuss in detail the datasets used, the retrieval performance metrics for quantitative evaluation and the experimental hardware conditions. Then, the relevant parameters of the model are explored. Finally, we assess different combinations of network settings and compare them with current methods. Our proposed method is applicable to any convolutional neural network. For model training, we used the PyTorch deep learning framework to train the DCL-based deep network model. We initialized the parameters of the network with the weights of the corresponding networks pre-trained on ImageNet [35]. Because network pre-training parameters were used [64], the momentum was 0.9 for the VGG16 and ResNet50 networks during training. The experimental environment was an Intel Xeon(R) CPU E5-2620 V3, a GPU with 12 GB of memory (NVIDIA(R) Titan X graphics card, driver version 419.**), operating system Ubuntu 18.04 LTS, PyTorch version v1.0.0, CUDA version 10.0 and cuDNN version 7.5.

Datasets
To test the proposed method, we use three publicly available RSIR datasets: the UC Merced dataset (UCMD) [24], PatternNet [9] and the NWPU-RESISC45 dataset [25]. To avoid over-fitting of the feature extraction network, we conducted the image retrieval task under zero-shot settings, in which the training and testing datasets contain image classes with no intersection.
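The zero-shot setting above splits at the class level rather than the image level; a minimal sketch (our own helper, with an assumed 50/50 class split) looks like this:

```python
import random

def zero_shot_split(class_ids, train_fraction=0.5, seed=0):
    """Class-level split: training and testing classes share no intersection."""
    rng = random.Random(seed)
    ids = sorted(class_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return set(ids[:cut]), set(ids[cut:])

# e.g. the 21 UCMD classes split into disjoint train/test class sets
train_classes, test_classes = zero_shot_split(range(21))
```

Since no test class is ever seen during training, retrieval quality measures how well the embedding generalizes to unseen categories.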
UCMD [24] is a classification dataset for land use and cover. It contains 21 classes, each with 100 images. Examples from every class are shown as follows: building, agricultural, golf course, baseball diamond, medium density residential, parking lot, beach, freeway, chaparral, intersection, mobile home park, river, overpass, airplane, storage tanks, dense residential, harbor, tennis courts, sparse residential, forest and runway. The resolution of each image is 256 × 256 pixels. The images were obtained from large aerial images downloaded from the United States Geological Survey (USGS), and the spatial resolution is around 0.3 m. Several highly overlapping classes exist in the UCMD dataset (i.e., sparse residential, medium residential and dense residential), which makes the image retrieval task on this dataset challenging. As the first publicly available remote sensing evaluation dataset, it has been widely applied to evaluate RSIR methods. For UCMD, we follow the data splitting that yields the best performance in [8], which trains on a randomly selected 50% of the images of each class and evaluates performance on the remaining 50%. Figure 5 presents sample images in the dataset.
PatternNet [9] is a large-scale high-resolution remote sensing dataset collected for the purpose of RSIR. It contains 38 classes: tennis court, beach, solar panel, runway, parking space, storage tank, forest, sparse residential, football field, bridge, chaparral, coastal mansion, river, runway marking, transformer station, swimming pool, oil gas field, Christmas tree farm, oil well, airplane, wastewater treatment plant, overpass, dense residential, parking lot, harbor, freeway, baseball field, railway, golf course, basketball court, shipping yard, intersection, closed road, cemetery, mobile home park, crosswalk, ferry terminal and nursing home. Each class contains 800 images which measure 256 × 256 pixels. The images in PatternNet are collected from Google Earth imagery or the Google Maps API for US cities.
For PatternNet, we follow the data splitting strategy of 80% training and 20% testing as per [9]. Figure 6 presents sample images in the dataset. The NWPU-RESISC45 dataset [27] consists of 31,500 images and is a large-scale RS image archive. It contains 45 classes: airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snowberg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station and wetland. Each class contains 700 images which measure 256 × 256 pixels, and their spatial resolution varies from 30 m to 0.2 m. All of the images were collected from Google Earth, covering more than 100 countries.
For the NWPU-RESISC45 dataset, we follow the data splitting strategy of 80% training and 20% testing as per [25]. Figure 7 presents sample images in the dataset.

Performance Evaluation Criteria
To evaluate image retrieval performance, we use precision at k (P@k, the precision of the top-k retrieval results) and mean average precision (mAP); the higher the values of mAP and P@k, the better the retrieval performance.
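The two metrics can be computed directly from a ranked result list; the helper names below are ours:

```python
def precision_at_k(relevant, ranked, k):
    """P@k: fraction of the top-k retrieved images that are relevant."""
    return sum(1 for r in ranked[:k] if r in relevant) / k

def average_precision(relevant, ranked):
    """AP: mean of P@k over each rank k at which a relevant image appears;
    mAP is this value averaged over all queries."""
    hits, score = 0, 0.0
    for k, r in enumerate(ranked, start=1):
        if r in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0
```

For example, with relevant images {1, 2} and ranking [1, 3, 2, 4], P@2 is 0.5 and AP is (1/1 + 2/3)/2 = 5/6.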


Non-Trivial Examples Mining
For each query mentioned in Section 3.1.1, DCL mines samples that violate the pairwise constraint with regard to the query. Specifically, we mined negative samples for which the distance between the hardest sample and the query sample is less than τ in Equation (19). Meanwhile, we mined positive samples whose distance is larger than τ − m as per Equation (18). As a result, in each ranked list, a margin m is built between negative and positive samples. Since the constraint parameters τ and m determine the sample mining range, we implemented experiments on the large dataset PatternNet to evaluate their influence with the hyperparameters λ = 1, β = 50, ϑ = 1 and µ = 0.1.
Impact of parameter τ: To test the threshold τ and its fitness for different networks, we set the margin m = 1.0 for the VGG16 network and m = 1.2 for the ResNet50 network, and report results for τ = (0.85, 1.05, 1.25, 1.45). The learning rate is 1 × 10^−7. The results are presented by mAP in Figure 8 and by Precision@K (%) in Table 1. It can be seen from Figure 8 that τ = 1.05 is best when training with the VGG16 network (a), while performance is optimal at τ = 1.25 with the ResNet50 network (b). The quantitative comparison of the experimental results, obtained at epoch = 100, is shown in Table 2: the best result (P@5 = 98.42, P@10 = 98.16) is achieved when we set τ = 1.05.
Impact of parameter m: The threshold m determines the distance between the positive sample and the hardest negative sample. In order to provide a suitable hyperplane for the positive samples, we need to choose a suitable value of m. In this experiment, we set τ = 1.05 for the VGG16 (a) network and τ = 1.25 for the ResNet50 (b) network. To reduce the number of iterations, we set the learning rate to 1 × 10^−5. The results are presented by mAP in Figure 9 and by Precision@K (%) in Table 2. It can be seen from Figure 9 that m = 1.0 works best when training with the VGG16 network (a), while performance is optimal at m = 1.2 with the ResNet50 network (b). The quantitative comparison of the obtained experimental results is shown in Table 2.

Pooling Methods
In order to evaluate the impact of different pooling methods on the search results in the CNN fine-tuning network, we used global max pooling (MAC vector [49,65]), sum-pooling (SPoC vector [24]) and generalized-mean (GeM [50]) pooling to experiment. We present the results in Figure 10. From Figure 10 we can see that the sum-pooling is always higher than max pooling and generalizedmean pooling. This is because the information contained in the remote sensing image is scattered, so that each part has the same contribution to feature extraction. When the network goes deeper, the height and width of the feature map are smaller and contain more semantic information. In addition, remote sensing images contain a large amount of background information, while sum-pooling preserves and highlights background information. In the experiments in this article we used sumpooling.

Pooling Methods
In order to evaluate the impact of different pooling methods on the retrieval results in the CNN fine-tuning network, we experimented with global max pooling (the MAC vector [49,65]), sum-pooling (the SPoC vector [24]) and generalized-mean pooling (GeM [50]). We present the results in Figure 10, from which we can see that sum-pooling consistently outperforms max pooling and generalized-mean pooling. This is because the information contained in a remote sensing image is scattered, so that each part contributes equally to feature extraction. As the network goes deeper, the height and width of the feature map shrink while it carries more semantic information. In addition, remote sensing images contain a large amount of background information, and sum-pooling preserves and highlights this background information. In the experiments in this article we therefore used sum-pooling.
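The three pooling operators compared above reduce to simple per-channel reductions over the spatial dimensions of a feature map; a minimal NumPy sketch (feature map shaped channels × height × width, names our own):

```python
import numpy as np

def mac_pool(fmap):
    """Global max pooling over H x W -> MAC vector (one value per channel)."""
    return fmap.max(axis=(1, 2))

def spoc_pool(fmap):
    """Sum-pooling over H x W -> SPoC-style vector."""
    return fmap.sum(axis=(1, 2))

def gem_pool(fmap, p=3.0, eps=1e-6):
    """Generalized-mean pooling; p = 1 is average pooling, large p approaches MAC."""
    return np.mean(np.clip(fmap, eps, None) ** p, axis=(1, 2)) ** (1.0 / p)

# One-channel 2x2 feature map for illustration.
fmap = np.array([[[1.0, 2.0], [3.0, 4.0]]])
```

GeM interpolates between average pooling and max pooling through its exponent p, which is why it is often treated as a generalization of the other two.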


Multi-Scale Representation
We assess multi-scale representation at test time, in the absence of any additional learning, to obtain the best combination of scale representations. We used the average of the descriptors at multiple image scales [14]. Results are presented in Table 3, from which we can easily see that the combination of scales 1 and 1/√2 works best. These experimental results demonstrate the effectiveness of the proposed multi-scale representation for remote sensing image retrieval.
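The descriptor averaging above can be sketched as follows; `extract` stands in for any image-to-descriptor function (e.g. the fine-tuned network), and the final L2 re-normalization is our assumption:

```python
import numpy as np

def multi_scale_descriptor(extract, image, scales=(1.0, 1.0 / np.sqrt(2))):
    """Average the descriptors extracted at several image scales, then
    L2-normalize the result."""
    descs = [extract(image, s) for s in scales]
    avg = np.mean(descs, axis=0)
    return avg / np.linalg.norm(avg)
```

Because no retraining is involved, the only test-time cost is one extra forward pass per additional scale.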

Figure 10. Performance (mAP) comparison of different pooling layers: max pooling, sum-pooling and generalized-mean pooling with the fine-tuned VGG16 (a) and the fine-tuned ResNet50 (b) on PatternNet.


Comparison of Sample Mining Methods
In order to demonstrate the advantages of our sample selection, we compare our proposed algorithm with several widely used sampling strategies. Table 4 summarizes its performance in terms of mAP, P@5, P@10, P@50, P@100 and P@1000 accuracy on the test set in comparison with five sampling strategies, including Triplet Loss [17], N-pair-mc Loss [19], Proxy NCA [22] and Lifted Struct.

Multi-Scale Representation
We assess multi-scale representation established at test time in the absence of any additional learning to obtain the best scale representation combination. We used the average of the descriptors at multiple image scales [14]. Results are presented in Table 3. From the observation of Table 3, we can easily find that the combination of scales 1, 1 √2 ⁄ works best. These promotion experimental results demonstrate the effectiveness of the proposed multi-scale representation for the remote sensing image retrieval.

Comparison of Sample Mining Methods
In order to demonstrate the advantages of our sample selection, we thus compare our proposed algorithm with many widely used sampling strategies. Table 4 summarizes its performance in terms of mAP, P@5, P@10, P@50, P@100 and P@1000 accuracy on the test set in comparison with the five sampling strategies, including Triplet Loss [17], N-pair-mc Loss [19], Proxy NCA [22], Lifted Struct (a) (b) Figure 10. Performance (mAP) comparison of different pooling layers: max pooling, sum-pooling and generalized-mean pooling with the fine-tune VGG16 (a) and the fine-tune ResNet50 (b) on PatternNet.

Multi-Scale Representation
We assess multi-scale representation established at test time in the absence of any additional learning to obtain the best scale representation combination. We used the average of the descriptors at multiple image scales [14]. Results are presented in Table 3. From the observation of Table 3, we can easily find that the combination of scales 1, 1 √2 ⁄ works best. These promotion experimental results demonstrate the effectiveness of the proposed multi-scale representation for the remote sensing image retrieval.

Comparison of Sample Mining Methods
In order to demonstrate the advantages of our sample selection, we thus compare our proposed algorithm with many widely used sampling strategies. Table 4 summarizes its performance in terms of mAP, P@5, P@10, P@50, P@100 and P@1000 accuracy on the test set in comparison with the five sampling strategies, including Triplet Loss [17], N-pair-mc Loss [19]

Multi-Scale Representation
We assess multi-scale representation established at test time in the absence of any additional learning to obtain the best scale representation combination. We used the average of the descriptors at multiple image scales [14]. Results are presented in Table 3. From the observation of Table 3, we can easily find that the combination of scales 1, 1 √2 ⁄ works best. These promotion experimental results demonstrate the effectiveness of the proposed multi-scale representation for the remote sensing image retrieval.

Comparison of Sample Mining Methods
In order to demonstrate the advantages of our sample selection, we thus compare our proposed algorithm with many widely used sampling strategies. Table 4 summarizes its performance in terms of mAP, P@5, P@10, P@50, P@100 and P@1000 accuracy on the test set in comparison with the five sampling strategies, including Triplet Loss [17], N-pair-mc Loss [19], Proxy NCA [22], Lifted Struct (a) (b)

Multi-Scale Representation
We assess multi-scale representation established at test time in the absence of any additional learning to obtain the best scale representation combination. We used the average of the descriptors at multiple image scales [14]. Results are presented in Table 3. From the observation of Table 3, we can easily find that the combination of scales 1, 1 √2 ⁄ works best. These promotion experimental results demonstrate the effectiveness of the proposed multi-scale representation for the remote sensing image retrieval.

Comparison of Sample Mining Methods
In order to demonstrate the advantages of our sample selection, we thus compare our proposed algorithm with many widely used sampling strategies. Table 4 summarizes its performance in terms of mAP, P@5, P@10, P@50, P@100 and P@1000 accuracy on the test set in comparison with the five sampling strategies, including Triplet Loss [17], N-pair-mc Loss [19], Proxy NCA [22], Lifted Struct (a) (b)

Multi-Scale Representation
We assess multi-scale representation established at test time in the absence of any additional learning to obtain the best scale representation combination. We used the average of the descriptors at multiple image scales [14]. Results are presented in Table 3. From the observation of Table 3, we can easily find that the combination of scales 1, 1 √2 ⁄ works best. These promotion experimental results demonstrate the effectiveness of the proposed multi-scale representation for the remote sensing image retrieval.

Comparison of Sample Mining Methods
In order to demonstrate the advantages of our sample selection, we thus compare our proposed algorithm with many widely used sampling strategies. Table 4 summarizes its performance in terms of mAP, P@5, P@10, P@50, P@100 and P@1000 accuracy on the test set in comparison with the five sampling strategies, including Triplet Loss [17], N-pair-mc Loss [19]

Multi-Scale Representation
We assess multi-scale representation established at test time in the absence of any additional learning to obtain the best scale representation combination. We used the average of the descriptors at multiple image scales [14]. Results are presented in Table 3. From the observation of Table 3, we can easily find that the combination of scales 1, 1 √2 ⁄ works best. These promotion experimental results demonstrate the effectiveness of the proposed multi-scale representation for the remote sensing image retrieval.

Comparison of Sample Mining Methods
In order to demonstrate the advantages of our sample selection, we thus compare our proposed algorithm with many widely used sampling strategies. Table 4 summarizes its performance in terms of mAP, P@5, P@10, P@50, P@100 and P@1000 accuracy on the test set in comparison with the five sampling strategies, including Triplet Loss [17], N-pair-mc Loss [19], Proxy NCA [22], Lifted Struct

Multi-Scale Representation
We assess multi-scale representation established at test time in the absence of any additional learning to obtain the best scale representation combination. We used the average of the descriptors at multiple image scales [14]. Results are presented in Table 3. From the observation of Table 3, we can easily find that the combination of scales 1, 1 √2 ⁄ works best. These promotion experimental results demonstrate the effectiveness of the proposed multi-scale representation for the remote sensing image retrieval.

Comparison of Sample Mining Methods
In order to demonstrate the advantages of our sample selection, we compare our proposed algorithm with several widely used sampling strategies. Table 4 summarizes its performance in terms of mAP, P@5, P@10, P@50, P@100 and P@1000 on the test set, in comparison with five baselines: Triplet Loss [17], N-pair-mc Loss [19], Proxy NCA [22], Lifted Struct [18] and DSLL [23]. As can be seen from Table 4, our proposed sample selection strategy outperforms all baseline algorithms, which validates its effectiveness. The DSLL algorithm was proposed in our previous paper as a method for landmark image retrieval. It uses the same negative-sample selection strategy as DCL, assigning weights to negative samples according to the distribution of the samples around each negative; sample features are thus extracted accurately by enforcing the consistency of the negative samples. The advantage of DCL is that it uses more positive samples than DSLL, selected according to the proportion of hard and easy samples among the positives, and assigns the selected positives dynamic weights. The cost is that during the weighting process the hard samples must be identified and counted, which can incur additional memory occupation and time consumption.
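The selection rule for the informative set (five in-class hard samples and five inter-class hard samples per anchor, as described in the abstract) can be sketched as follows, assuming Euclidean distances between embeddings; `mine_informative_set` is an illustrative name, not the paper's code:

```python
import numpy as np

def mine_informative_set(embeddings, labels, anchor_idx, k=5):
    """For one anchor, pick the k hardest positives (same class, farthest
    from the anchor) and the k hardest negatives (other classes, closest
    to the anchor). A sketch of the selection rule, not the exact code."""
    anchor = embeddings[anchor_idx]
    dists = np.linalg.norm(embeddings - anchor, axis=1)

    same = labels == labels[anchor_idx]
    pos_mask = same.copy()
    pos_mask[anchor_idx] = False          # exclude the anchor itself
    pos_idx = np.where(pos_mask)[0]
    neg_idx = np.where(~same)[0]

    # hard positives: largest distance; hard negatives: smallest distance
    hard_pos = pos_idx[np.argsort(dists[pos_idx])[::-1][:k]]
    hard_neg = neg_idx[np.argsort(dists[neg_idx])[:k]]
    return hard_pos, hard_neg
```

In practice this would run per anchor within a mini-batch, so only informative pairs contribute to the loss.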

Comparison of Effects of Different Sample Numbers
In order to determine the number of samples most suitable for remote sensing image retrieval, for positive samples we combined different numbers of samples with different weighting schemes: the number of samples was set to 1, 5 and 10, and the weights were 1, 1/n and the dynamic weights described earlier. The retrieval results are shown in Table 5. For negative samples, we combined different numbers of samples with sample deduplication (choosing only one per category); the sample sizes were set to 5, 10 and 15, and Table 6 shows the retrieval results. All experimental results were obtained after the 30th epoch. The positive-sample selection experiment was performed with five deduplicated negative samples; the negative-sample selection experiment was performed with five positive samples carrying dynamic weights.
By observing Table 5, we find that if each positive sample is given a weight of one, the accuracy decreases as the number of samples increases. With a weight of 1/n, the accuracy is higher than with a weight of one, because a large number of positive samples introduces noise that hinders accurate feature extraction. With dynamic weights, retrieval accuracy is better than with either fixed weight, and the effect is best when the number of samples is five. Table 6 shows the results of the negative-sample sampling method. Retrieval accuracy decreases as the number of samples increases: we give negative samples weights determined by their ranking order, which separates samples by a certain distance, but too many selected negatives are likely to include low-hardness samples with small inter-sample differences that do not combine well with the weights. In addition, we found that sample deduplication performs better than selecting all suitable samples in a category, because after category deduplication the network can better extract features by learning the intra-class differences.
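One plausible sketch of such a ratio-based dynamic weighting — the exact formula used in the paper may differ — treats positives farther than a margin as hard and scales their uniform 1/n baseline weight by the hard-sample fraction:

```python
import numpy as np

def dynamic_positive_weights(pos_dists, margin):
    """Weight positives by the ratio of hard samples (distance above
    `margin`) to all positives in the class. Hypothetical formula
    illustrating the idea of class-ratio-driven dynamic weights."""
    pos_dists = np.asarray(pos_dists, dtype=float)
    hard = pos_dists > margin
    n = len(pos_dists)
    ratio = hard.sum() / n                 # fraction of hard positives
    base = np.full(n, 1.0 / n)             # uniform 1/n baseline
    w = base * np.where(hard, 1.0 + ratio, 1.0 - ratio)
    return w / w.sum()                     # keep the weights normalized
```

Under this sketch, hard positives receive more than the 1/n baseline and easy positives less, while the total contribution of the positive set stays fixed, mirroring the behavior observed in Table 5.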

Per-Class Results
In this section, we analyze the retrieval behavior of the different methods for each individual category. Tables 7-9 provide the detailed per-category precision results for the UCMD dataset (Table 7), PatternNet dataset (Table 8) and NWPU-RESISC45 dataset (Table 9) after training the VGG16 and ResNet50 networks. Figure 11 shows an intuitive comparison of results under the VGG16 and ResNet50 networks on the three datasets. The results are computed over all retrieved images. From Tables 7-9, it is obvious that DCL-based features perform better than pretrained features. In addition, an encouraging observation is that our DCL method greatly enhances retrieval performance for many categories on which the other approaches are not satisfactory. For example, pretrained VGG16-based features are particularly weak in retrieving images of buildings, intersections and sparse residential areas, with an average mAP of 0.25, much lower than the 0.87 of the DCL-based features on the UCMD dataset in Table 7. Simultaneously, pretrained ResNet50-based features perform poorly on classes such as dense residential, intersection and parking lot, with an average mAP of 0.29, while this value for DCL-based features is 0.93. On the PatternNet dataset in Table 8, pretrained features do not perform well on bridge, nursing home and swimming pool, with an average mAP of 0.22, compared with 0.87 for the DCL-based features using VGG16. The most significant improvement appears with the ResNet50 network: pretrained features are particularly weak on bridges, runways and tennis courts, with an average mAP of 0.26, which reaches 0.98 for DCL-based features. The same improvement can also be seen in Table 9.
Pretrained ResNet50-based features are particularly weak in retrieving images of churches, palaces and ships, with mAPs of 0.56, 0.40 and 0.60, while the corresponding values for DCL-based features are 0.97, 0.98 and 0.98. Figure 11 compares the per-class results of Tables 7-9: panels (a) and (b) correspond to Table 7, (c) and (d) to Table 8, and (e) and (f) to Table 9.
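The per-class figures in Tables 7-9 follow from the standard definitions of average precision and P@k; the helpers below are a generic sketch of those metrics, not the paper's evaluation code:

```python
import numpy as np

def average_precision(relevant):
    """AP for one ranked result list; `relevant` holds, in rank order,
    whether each retrieved image shares the query's class."""
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)                 # relevant count so far
    ranks = np.arange(1, len(relevant) + 1)
    return float((hits[relevant] / ranks[relevant]).mean())

def precision_at_k(relevant, k):
    """P@k: fraction of relevant items among the top-k results."""
    return float(np.mean(np.asarray(relevant[:k], dtype=float)))

def per_class_map(ranked_rel_lists, query_labels, cls):
    """Mean AP over all queries belonging to class `cls`."""
    aps = [average_precision(r)
           for r, lab in zip(ranked_rel_lists, query_labels) if lab == cls]
    return float(np.mean(aps))
```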

Comparison with the State of the Art
This section compares the performance of our proposed DCL method with representative state-of-the-art methods. Table 10 lists the performance comparisons on the UCMD dataset, Table 11 on the PatternNet dataset, and Table 12 on the NWPU-RESISC45 dataset. We divide the networks into two categories: (1) those using the VGG16 framework and (2) those using ResNet50. We can observe that our proposed DCL is superior to all previous methods. Using the VGG16 framework, DCL provides a significant improvement of +6.94% in mAP over MiLaN [34] on the UCMD dataset. The evaluation standard used by MiLaN [34] is mAP@20 with hash bits k = 32, whereas we compute the mAP over all retrieval results, which shows that the performance of our method is far superior to that of MiLaN. Furthermore, the DCL signatures achieve gains of +6.21% in P@5, +8.51% in P@10, +19.29% in P@50, +28.82% in P@100 and +1.11% in P@1000 on the PatternNet dataset, surpassing the recently published VGGS Fc1 [9]. The best performance in Reference [47] is FC7 (VGG16) [47], which achieves an mAP of 96.48%; our method achieves a better mAP of 98.05%. Compared with RSIR-DBOW [31], our method achieves a 16.45% higher mAP on the NWPU-RESISC45 dataset. Using the ResNet50 framework, on the UCMD dataset our results improve mAP by more than 3% over Pool5 (ResNet50) [47], reaching an mAP of 98.76%. At the same time, our method achieves 100% in P@5, 100% in P@10, 99.33% in P@50, 49.82% in P@100 and 3.98% in P@1000, surpassing the recently published ResNet50 [27].
As we can easily see from Figure 11, the DCL loss we have proposed is stable and can reach 95% in almost all categories. When using the ResNet50 network, the mAP stabilizes at around 98% and, in some categories, even reaches 100%. Furthermore, whether on the UCMD, PatternNet or NWPU-RESISC45 dataset, our DML-based features achieve the best performance for content-based remote sensing image retrieval, which proves that our method is useful for RSIR.

To summarize, using the three remote sensing datasets, namely the UCMD dataset, PatternNet dataset and NWPU-RESISC45 dataset, our method achieves new state-of-the-art or comparable performance.

Visualization Result
In order to visualize the search results, as shown in Figure 12, we display qualitative results for several query samples. In Figure 12, the top panel shows results on the UCMD dataset, with query images from the medium residential, beach, golf course and dense residential categories; the middle panel shows results on the PatternNet dataset, with query images from the baseball field, bridge, airplane and basketball court categories; and the bottom panel shows results on the NWPU-RESISC45 dataset, with query images from the cloud, island, airport and thermal power station categories.

Conclusions
In this paper, we have proposed a distribution consistency loss (DCL) for remote sensing image retrieval. Hard sample mining is also used to make better use of all informative data points for positive-sample weighting and negative-sample ranking weighting.
In addition, we have presented an RSIR network that achieves state-of-the-art retrieval precision. To the best of our knowledge, this is the first RSIR network to deploy features extracted in an end-to-end fashion. We have shown that the distribution consistency loss, together with the fine-tuning network, yields significantly better performance than existing proposals. We also evaluated different pooling methods for feature extraction and conclude that sum-pooling is the best for RSIR. Furthermore, we studied multi-scale processing of the input image and conclude that it can significantly improve retrieval accuracy.
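Sum-pooling a convolutional feature map into a global descriptor, as evaluated above, can be sketched as follows (the trailing L2 normalization is a common convention, assumed here rather than taken from the paper):

```python
import numpy as np

def sum_pool_descriptor(feature_map):
    """Sum-pool a C x H x W convolutional feature map into a C-dimensional
    global descriptor (SPoC-style), then L2-normalize it."""
    d = feature_map.sum(axis=(1, 2))       # sum over spatial dimensions
    return d / (np.linalg.norm(d) + 1e-12)
```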
In the future, we plan to study query expansion and whitening methods for remote sensing images, as these post-processing steps may yield better search results without requiring changes to the feature architecture.