Global Optimal Structured Embedding Learning for Remote Sensing Image Retrieval

A rich line of work focuses on designing elegant loss functions under the deep metric learning (DML) paradigm to learn a discriminative embedding space for remote sensing image retrieval (RSIR). Essentially, such an embedding space can efficiently distinguish deep feature descriptors. So far, most existing losses used in RSIR are based on triplets, which suffer from local optimization, slow convergence and insufficient use of the similarity structure in a mini-batch. In this paper, we present a novel DML method, named global optimal structured loss, to address these limitations of triplet loss. Specifically, we use a softmax function rather than a hinge function in our loss to realize global optimization. In addition, we present a novel optimal structured loss, which globally learns an efficient deep embedding space with mined informative sample pairs, constraining the positive pairs within a given limit and pushing the negative ones far away from a given boundary. We have conducted extensive experiments on four public remote sensing datasets, and the results show that the proposed global optimal structured loss with the pairs mining scheme achieves state-of-the-art performance compared with the baselines.


Introduction
The rapid development of remote sensing technology in recent years has created urgent demands for processing, analyzing and understanding high-resolution remote sensing images. The most fundamental tasks in remote sensing image analysis (RSIA) are to recognize, detect, classify and retrieve images belonging to multiple remote sensing categories such as agricultural, airplane and forest [1-5]. Among these tasks, remote sensing image retrieval (RSIR) [2,6-8] is the most challenging for analyzing remote sensing data effectively. The main goal of RSIR is to search a given remote sensing dataset for a query image and return the images with similar visual information. RSIR has attracted increasing attention due to the explosive growth in the volume of high-quality remote sensing images in recent decades [2,5,8].
Compared with content-based image retrieval (CBIR), RSIR is more challenging, as remote sensing images cover vast geographic areas containing wide-ranging semantic instances with subtle differences that are difficult to distinguish. Moreover, the images which belong to the same visual category might vary in positions,

Figure 1. The optimization process under the proposed global optimal structured loss. Circles with different colors denote samples with different labels. The left part is the original distribution of sample pairs. The blue circle with a small white circle in the center is the anchor; the green circle with a small black circle in the center is the hardest negative sample to the anchor, and their similarity is S[−1]; the blue circle with a small purple circle in the center is the hardest positive sample to the anchor, and their similarity is S[0]. We use a pairs mining strategy to sample more informative pairs for optimization. The black solid line is the negative border for negative pairs mining, and the black dotted line is the positive border for positive pairs mining. The circles with arrows denote the mined informative samples, and the arrows indicate the gradient direction. The right part is the distribution after optimization. The blue solid line is the positive boundary used to limit positive pairs within a hypersphere. The blue dotted line is the negative boundary used to push negative pairs far away from the anchor.
As illustrated above, we make the following contributions to improve the performance of the RSIR task: (1) We propose to use a softmax function in our novel loss to address the key challenge of local optima in most methods. This realizes global optimization, which can significantly enhance the performance of RSIR. (2) We present a novel optimal structured loss to globally learn an efficient deep embedding space with mined informative sample pairs, constraining the positive pairs within a given limit and pushing the negative ones far away from a given boundary. During the training stage, we take the information of all selected sample pairs and the difference between positive and negative pairs into consideration, making the intraclass samples more compact and the interclass ones more separated while preserving the similarity structure of the samples. (3) To further examine the effectiveness of RSIR under the DML paradigm, we perform the RSIR task with various commonly used metric loss functions on public remote sensing datasets. These loss functions aim at fine-tuning the pre-trained network to make it more adaptive to a given task. The results, reported in the experiments section, show that the proposed method achieves outstanding performance. (4) To verify the superiority of our proposed optimal structured loss, we conduct experiments on multiple remote sensing datasets. The retrieval performance is improved by approximately 5% on these public remote sensing datasets compared with existing methods [28,49-51], which demonstrates that our proposed method achieves state-of-the-art results in the task of RSIR.
The remainder of this paper is organized as follows. Section 2 describes related work on metric learning and on the methods used in RSIR. Section 3 gives a detailed interpretation of our proposed method and of the RSIR framework built on it. Section 4 gives the details of our experiments and presents their results and analysis. Lastly, we present the conclusions of our paper.

Related Work
In this section, we summarize various works related to DML and the task of RSIR. First, we introduce work on clustering-based losses, pair-based structured losses and informative pairs mining strategies. Then, we provide an overview of the development of RSIR based on handcrafted and deep CNN features.

Deep Metric Learning
DML has been a long-standing research hotspot for improving the performance of image retrieval [42-46,52]. There are two research directions in DML: clustering-based and pair-based structured losses. We introduce them in detail as follows.

Clustering-Based Structured Loss
Clustering-based structured losses aim to learn a discriminative embedding space by optimizing a clustering metric and are applied in many fields of computer vision, such as face recognition [53,54] and fine-grained image retrieval (FGIR) [55,56]. Clustering loss [57] utilizes the structured prediction framework to realize clustering in which the ground truth receives a higher score than other clusterings. The quality of clustering is measured by normalized mutual information (NMI) [58]. Center loss [54] learns a center for each category as a complement to the softmax loss and obtains appreciable performance in face recognition. The triple-center loss (TCL) [59] was proposed to learn a center for each category and to separate the cluster centers and their relevant samples of different categories. To enhance the performance of FGIR, centralized ranking loss (CRL) [55] was proposed to optimize the centers and to increase the compactness of intraclass samples and the separability of interclass samples. Later, decorrelated global-aware centralized loss (DGCRL) [56] was proposed to optimize the center space by utilizing the Gram-Schmidt orthogonalization operation and to enhance the clustering result by combining it with softmax loss. However, all these clustering-based structured losses are computationally expensive and hard to optimize. Moreover, they fail to make full use of the sample relationships, which might contain meaningful information for learning a discriminative space.

Pair-Based Structured Loss
As many structured losses [41-47] have proven effective in training networks to learn discriminative embedding features, we briefly review the development of pair-based structured losses.
Contrastive loss [41] builds positive and negative sample pairs $\{(x_a, x_k), y_{ak}\}$ according to their labels and exploits these pairs to learn a discriminative embedding space by minimizing the distance of positive pairs and pushing the distance of negative pairs beyond a given threshold $m$. The loss function is defined in Equation (1), where $Q$ is the number of samples in the training set, $y_{ak} = 1$ when the pair $(x_a, x_k)$ shares the same label, and $y_{ak} = 0$ when the pair $(x_a, x_k)$ has different labels. The parameter $m$ is a margin used to constrain the distance of negative sample pairs, $D_{ak} = \|f(x_a) - f(x_k)\|_2$ is the Euclidean distance of the pair $(x_a, x_k)$, and $f(\cdot)$ denotes the deep feature extracted from the network. $[\cdot]_+$ denotes the hinge function, which clips negative values to zero.
From Equation (1), we can see that this loss function treats positive and negative pairs equally and fails to take into account the difference between positive and negative sample pairs. As it constructs pairs locally from the samples in the training set, it might fall into a local optimum and converge slowly.
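To make the behaviour described above concrete, the following is a minimal sketch of a contrastive loss in plain Python. It assumes the common squared-distance form averaged over all unordered pairs; the normalization and exponents are assumptions and may differ from Equation (1). The function name `contrastive_loss` and the toy data are illustrative only.

```python
import math

def contrastive_loss(feats, labels, margin=1.0):
    """Mean contrastive loss over all unordered pairs in `feats`.

    Assumption: squared-distance form averaged over pairs; the exact
    normalisation of Equation (1) may differ.
    """
    total, count = 0.0, 0
    for a in range(len(feats)):
        for k in range(a + 1, len(feats)):
            d_ak = math.dist(feats[a], feats[k])  # Euclidean distance D_ak
            if labels[a] == labels[k]:
                total += d_ak ** 2                      # pull positives together
            else:
                total += max(0.0, margin - d_ak) ** 2   # push negatives beyond m
            count += 1
    return total / count

# Toy batch: two samples of class 0 and one sample of class 1.
feats = [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]]
labels = [0, 0, 1]
loss = contrastive_loss(feats, labels, margin=1.0)
```

Each pair contributes independently of every other pair, which is the sense in which the loss treats pairs locally.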
Triplet loss [42] utilizes a large number of triplets to learn a discriminative embedding space in which positive sample pairs are closer than negative ones by a given margin $m$. Each triplet is made up of an anchor sample, a positive sample with the same label as the anchor and a negative sample with a different label from the anchor. Specifically, we denote a triplet as $(x_a, x_p, x_n)$, where $x_a$, $x_p$ and $x_n$ indicate the anchor, positive and negative samples, respectively. The loss is defined in Equation (2), where $T$ is the collection of triplets and $|T|$ is the number of triplets in the set. $D_{ap} = \|f(x_a) - f(x_p)\|_2$ and $D_{an} = \|f(x_a) - f(x_n)\|_2$ denote the Euclidean distances of the positive and negative pairs, respectively, and $f(\cdot)$ denotes the deep feature extracted from the network. $[\cdot]_+$ denotes the hinge function, which clips negative values to zero.
We can learn from Equation (2) that triplet loss does not consider the difference between positive and negative sample pairs, which is important for identifying the pairs with more information. Although it takes the relationship between positive and negative pairs into consideration, its convergence is still slow and it might get stuck in a local optimum, as it encodes the samples of a training set into a set of triplets and thus fails to make full use of the sample pairs inside the training set globally.
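The following sketch illustrates the hinge-based triplet objective discussed above. It assumes unsquared Euclidean distances and a plain average over a given list of triplets; both choices are assumptions, and the function name `triplet_loss` is illustrative.

```python
import math

def triplet_loss(feats, triplets, margin=0.2):
    """Average hinge loss over (anchor, positive, negative) index triplets.

    Assumption: unsquared Euclidean distances; some formulations square them.
    """
    total = 0.0
    for a, p, n in triplets:
        d_ap = math.dist(feats[a], feats[p])  # anchor-positive distance
        d_an = math.dist(feats[a], feats[n])  # anchor-negative distance
        total += max(0.0, d_ap - d_an + margin)
    return total / len(triplets)

feats = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
# The negative is already far enough away: the hinge is inactive.
easy = triplet_loss(feats, [(0, 1, 2)], margin=0.5)
# The "negative" is closer than the "positive": the hinge is active.
hard = triplet_loss(feats, [(0, 2, 1)], margin=0.5)
```

A triplet that already satisfies the margin contributes nothing, so only a subset of the batch drives each update.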
N-pairs loss [43] takes advantage of the structured information between the positive pair and multiple negative sample pairs in the training mini-batch to learn an effective embedding space. This loss function enhances the triplet loss by training the network with more negative sample pairs, which are selected from the negative pairs of all other categories, i.e., one sample pair is selected randomly per category. The N-pairs loss is defined in Equation (3), where $Q$ is the number of categories in the training set, and $\{(x_a, x_p)\}_{a=1}^{N}$ denotes $N$ sample pairs selected from $N$ different categories, i.e., $x_a$ and $x_p$ are the anchor and its positive sample for a certain category, respectively; $\{x_n \mid y_n \neq y_a\}$ denotes the negative samples for the current anchor; $y_n$ and $y_a$ denote the labels of $x_n$ and $x_a$. $S_{ap} = \langle f(x_a), f(x_p) \rangle$ and $S_{an} = \langle f(x_a), f(x_n) \rangle$ are the dot products of the positive and negative pairs, respectively, and $f(\cdot)$ is the feature representation of an instance.
However, this loss fails to take the difference between negative and positive pairs into account and neglects some structured information inside the training set. Furthermore, it selects only one positive pair randomly per class, which could lose significant information during training.
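As an illustration of the N-pairs idea described above, the sketch below uses the widely cited log-sum-exp form, in which each anchor is compared with its own positive and with the positives of all other classes. The averaging over anchors and the function names are assumptions made for the example.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def n_pairs_loss(anchors, positives):
    """N-pairs style loss: anchors[i] and positives[i] share class i,
    and positives[k] with k != i serve as negatives for anchors[i]."""
    n = len(anchors)
    total = 0.0
    for a in range(n):
        s_ap = dot(anchors[a], positives[a])
        neg = sum(math.exp(dot(anchors[a], positives[k]) - s_ap)
                  for k in range(n) if k != a)
        total += math.log1p(neg)  # log(1 + sum_n exp(S_an - S_ap))
    return total / n

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
aligned = n_pairs_loss(anchors, positives)          # positives match anchors
swapped = n_pairs_loss(anchors, positives[::-1])    # positives mismatched
```

Note that only one positive per class enters the computation, which is the limitation pointed out in the text.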
Sensors 2020, 20, 291

Lifted structured loss [44] was proposed to address the challenge of local encoding by making full use of the information among all samples in a training batch. It aims to learn an effective embedding space by considering all negative sample pairs of an anchor, encouraging the distance of the positive pair to be as small as possible and forcing the distances of all negative pairs to be larger than a threshold $m$. The lifted structured loss is defined in Equation (4), where $x_a$ and $x_p$ are the anchor and positive samples, respectively, $x_n$ and $x_k$ are both negative samples, $P$ and $N$ indicate the sets of positive and negative pairs, respectively, and $|P|$ is the number of pairs in $P$. $D_{ap}$ is the Euclidean distance of the positive pair, and $D_{an}$ and $D_{pk}$ are the Euclidean distances of the negative pairs. We can learn from Equation (4) that the lifted structured loss makes full use of the relationship between positive and negative sample pairs by constructing the hardest triplet while taking all negative pairs into consideration. However, it fails to preserve the structured distribution inside the training set and still fails to realize global optimization, as it is a form of hinge loss.
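A compact sketch of the lifted structured objective, written from the commonly published smooth form (log-sum-exp over all negatives of both ends of a positive pair, followed by a squared hinge), is given below. The exact normalization is an assumption, the function name is illustrative, and the sketch assumes every positive pair has at least one negative in the batch.

```python
import math

def lifted_structured_loss(feats, labels, margin=1.0):
    """Smooth lifted structured loss over all positive pairs of a batch.

    Assumes each class in a positive pair has at least one negative sample.
    """
    n = len(feats)
    d = [[math.dist(feats[i], feats[j]) for j in range(n)] for i in range(n)]
    pos_pairs = [(a, p) for a in range(n) for p in range(a + 1, n)
                 if labels[a] == labels[p]]
    total = 0.0
    for a, p in pos_pairs:
        # All negatives of the anchor and all negatives of the positive.
        neg = sum(math.exp(margin - d[a][k]) for k in range(n)
                  if labels[k] != labels[a])
        neg += sum(math.exp(margin - d[p][k]) for k in range(n)
                   if labels[k] != labels[p])
        j_ap = math.log(neg) + d[a][p]
        total += max(0.0, j_ap) ** 2   # the outer hinge is still present
    return total / (2 * len(pos_pairs))

separated = lifted_structured_loss([[0.0, 0.0], [0.0, 0.0], [10.0, 0.0]], [0, 0, 1])
collapsed = lifted_structured_loss([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0, 0, 1])
```

When the classes are well separated the outer hinge switches the pair off entirely, which illustrates why this loss is still described as a hinge-type objective.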
Ranked list loss [46] was proposed to restrict all positive samples to a given hypersphere with diameter $\alpha - m$ and to force the distance of negative sample pairs to be larger than a fixed threshold $\alpha$. Specifically, this loss aims at learning a more discriminative embedding space in which the positive and negative sample sets are separated by a margin $m$, and it utilizes a weighting strategy to account for the differences among negative sample pairs. The loss is defined in Equation (5), where $x_a$, $x_p$ and $x_n$ denote the anchor, positive and negative samples, respectively, and $Q$ is the number of samples in the training set. $P_a$ and $N_a$ are the sets of positive and negative pairs for an anchor $x_a$. $D_{ap}$ and $D_{an}$ are the Euclidean distances of positive and negative pairs, respectively, as described above. $\beta$ is a parameter that controls the degree to which negative samples are weighted.
The ranked list loss has obtained appreciable performance in multiple image retrieval tasks. However, it does not take into account the relationship between positive and negative sample pairs, which is important for enhancing the robustness and distinctiveness of the network. Moreover, as it is optimized with a hinge function, which can easily lead to a local optimum, its performance still does not meet our demands in RSIR.
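The two distance boundaries described above can be sketched for a single anchor as follows. The exponential weighting of violating negatives and the unweighted average over violating positives are assumptions modelled on the general description in the text, not a transcription of Equation (5); all names and default values are illustrative.

```python
import math

def ranked_list_loss(feats, labels, anchor, alpha=1.2, margin=0.4, beta=10.0):
    """Per-anchor sketch: positives should lie within (alpha - margin),
    negatives should lie beyond alpha. Only violating pairs contribute."""
    pos_viol, neg_viol = [], []
    for k, (f, y) in enumerate(zip(feats, labels)):
        if k == anchor:
            continue
        d = math.dist(feats[anchor], f)
        if y == labels[anchor]:
            v = d - (alpha - margin)     # positive outside the hypersphere
            if v > 0:
                pos_viol.append(v)
        else:
            v = alpha - d                # negative inside the boundary
            if v > 0:
                neg_viol.append(v)
    loss_pos = sum(pos_viol) / len(pos_viol) if pos_viol else 0.0
    # Harder negatives (larger violation) receive exponentially larger weight.
    weights = [math.exp(beta * v) for v in neg_viol]
    loss_neg = (sum(w * v for w, v in zip(weights, neg_viol)) / sum(weights)
                if neg_viol else 0.0)
    return loss_pos + loss_neg

feats = [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [3.0, 0.0]]
labels = [0, 0, 1, 1]
loss = ranked_list_loss(feats, labels, anchor=0)
```

In this toy batch the far negative at distance 3 satisfies its constraint and is ignored, while the positive at distance 1 and the negative at distance 0.5 both violate their boundaries.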
To address the limitations of existing DML methods, we propose to exploit the softmax function instead of the commonly used hinge function in our loss function to realize global optimization. Furthermore, we make full use of the structured information and maintain the inner similarity structure by setting positive and negative boundaries for sample pairs during the training stage.
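One way to see the difference between a hinge and a softmax-style objective is to compare the gradient weight each assigns to the individual pair terms. The sketch below is a generic illustration, not the proposed loss: the gradient of a sum of hinges is 1 for violating terms and 0 otherwise, whereas the gradient of a log-sum-exp is the softmax of its inputs, so every term receives a non-zero weight.

```python
import math

def hinge_grad_weights(values):
    """d/dv_i of sum_i max(0, v_i): 1 for active terms, 0 otherwise."""
    return [1.0 if v > 0 else 0.0 for v in values]

def logsumexp_grad_weights(values):
    """d/dv_i of log(sum_i exp(v_i)): the softmax of the values."""
    m = max(values)                       # subtract max for numerical stability
    exps = [math.exp(v - m) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

violations = [0.5, -0.1, -2.0]            # hypothetical per-pair terms
hinge_w = hinge_grad_weights(violations)
soft_w = logsumexp_grad_weights(violations)
```

With the hinge, the second and third pairs are ignored; with the softmax weighting, all three pairs influence the update, with harder pairs weighted more.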

Informative Pairs Mining
During the training stage, there are vast numbers of less informative sample pairs, which might slow down convergence and result in a local optimum. It is therefore important to design a good pairs mining scheme for training efficiency. There are many excellent studies on the design of informative pairs mining schemes [43-46,53,60]. In FaceNet [53], a semi-hard mining strategy was proposed to sample a handful of triplets that contain a negative pair farther than the positive one. A more effective pairs mining framework was proposed to select hard samples from the database for training [60]. Sohn et al. proposed hard negative category mining to collect more informative samples for training the network globally [43]. Song et al. proposed to select harder negative samples to optimize the lifted structured loss [44]. Wang et al. provided a simple pairs mining strategy which selects the sample pairs that violate the distance restriction [46]. Wang et al. designed a more effective pairs mining scheme, which takes the relationship between positive and negative sample pairs into consideration, to obtain better performance [45]. In this paper, we utilize the pairs mining scheme proposed in [45] to mine more informative sample pairs and improve the performance of RSIR.

The Development of RSIR Task
In the last few decades, the task of RSIR has received extensive attention from researchers, and these wide-ranging studies have produced many elegant methods. We introduce the methods for RSIR in terms of traditional handcrafted representations and deep representations. Moreover, we introduce some works related to RSIR under DML.
In the early stage, researchers tended to extract textural features for remote sensing image classification [11,61]. Datcu et al. presented a dedicated pipeline for RSIR and proposed to utilize a Bayesian inference model to capture spatial information for feature extraction [62]. At the same time, Schroder et al. proposed to exploit Gibbs-Markov random fields (GMRF), which can capture spatial information, to extract features [63]. Daschiel et al. suggested utilizing a hierarchical Bayesian model to extract feature descriptors, which are then clustered by the dyadic k-means method [64]. With the development of general image retrieval, Shyu et al. proposed a comprehensive CBIR-based framework for RSIR, called the geospatial information retrieval and indexing system (GeoIRIS) [65]. This system can automatically extract features, mine the visual content of remote sensing images and realize fast retrieval by indexing the database. The features are mainly patch-based, which helps to maintain some local information. To enhance the retrieval precision, they extract various visual features, including general features such as spectral and texture features and anthropogenic features such as linear and object features. However, the methods based on global visual features mentioned above struggle to maintain invariance to translation and occlusion. With the introduction of SIFT descriptors [15], Yang et al. proposed to utilize BoW to encode SIFT features extracted from remote sensing images, and their experiments demonstrated that methods based on local features can be superior to those based on global visual features [66]. Later, more works used local features to realize efficient retrieval [16,67]. More recently, some studies have utilized features extracted from remote sensing images to retrieve local climate zones [68,69].
However, these handcrafted features fail to extract richer information from remote sensing images because of their limited descriptive ability.
With the successful application of deep learning in general image retrieval, deep features extracted from CNNs have gradually been exploited to achieve better performance in RSIR [10,70,71]. Bai et al. proposed to map deep features into a BoW space [70]. Li et al. proposed to combine handcrafted features with deep features to produce more effective features for RSIR [71]. Ge et al. combined and compressed deep features extracted from pre-trained CNNs to enhance the descriptive power of the features [10]. All the methods mentioned above have made great contributions to improving the performance of RSIR. However, they are mainly based on pre-trained networks, which might not be suitable for the task of RSIR. To further improve the performance, recent works concentrate on fine-tuning the pre-trained network for RSIR [32,49,50,72,73]. Li et al. proposed to fine-tune a pre-trained CNN on remote sensing datasets to learn more effective feature descriptors [73]. Li et al. combined a deep feature learning network and a deep hashing network to develop a novel deep hash neural network, which is trained in an end-to-end manner for RSIR [72]. Tang et al. proposed to utilize deep BoW (DBOW) to learn deep features based on multiple patches in an unsupervised way [50]. Wei et al. presented a multi-task learning network connected with a novel attention model and proposed to utilize center loss for network training [32]. Raffaele et al. proposed to apply VLAD aggregation to the local deep features extracted from fine-tuned CNNs with two different attention mechanisms to eliminate the influence of irrelevant background [49].
A growing number of works apply DML to remote sensing images to enhance the effectiveness of RSIR [30,33-37]. Roy et al. proposed a metric and hash-code learning network (MHCLN), which learns a semantic embedding space and produces hash codes at the same time [33]. It aims to realize accurate and fast retrieval in the task of RSIR. Cao et al. presented a novel triplet deep metric learning network for RSIR, in which the remote sensing images are embedded into a learned embedding space where positive sample pairs are close and negative ones are far away from each other [34]. Subhanker et al. presented a novel hashing framework based on metric learning [35]. Most existing DML methods for RSIR are based on triplet loss, which is limited by local optimization and inadequate use of sample pairs. In this paper, we investigate the effectiveness of RSIR when applying more advanced DML methods. Furthermore, we propose a more efficient loss function to learn a discriminative embedding space for remote sensing images and achieve better performance in the task of RSIR.

The Proposed Approach
In this section, we describe our proposed method in detail. First, we give the problem definition for the task of RSIR. Then, in Sections 3.2-3.4, we describe our proposed loss function and the optimization process in detail.

Problem Definition
We denote the input images of a training set as $x = \{x_1, \ldots, x_a, \ldots, x_Q\}$. There are $C$ classes in the training set, and we denote the labels of the $n$ input images as $y = \{y_1, \ldots, y_a, \ldots, y_n\}$, where $y_a \in \{1, \ldots, c, \ldots, C\}$. Each input image $x_a$ has exactly one label $y_a$. The input images $x$ are projected onto a $d$-dimensional embedding space by a deep neural network with batch normalization, denoted as $f(x, \theta)$. Specifically, $f$ is the deep mapping function of the network and $\theta$ is the set of parameters of $f$ that need to be optimized. In this paper, we use the inner product to measure the similarity of any two images $(x_a, x_k)$ during the training and testing phases, and we denote this similarity as $S_{ak} = \langle f(x_a; \theta), f(x_k; \theta) \rangle$. As we use every sample in a training batch as an anchor and compute the similarity of all samples with that anchor, we denote the similarities of a training batch as an $n \times n$ matrix $S$, whose element at $(a, k)$ is $S_{ak}$.
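The batch similarity matrix defined above can be sketched in a few lines. The example assumes the embeddings are L2-normalized before the inner product is taken, which the text does not state explicitly; the helper names are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean norm (assumed preprocessing step)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def similarity_matrix(feats):
    """n x n matrix S with S[a][k] = <f_a, f_k> for all pairs in the batch."""
    return [[sum(a * b for a, b in zip(u, v)) for v in feats] for u in feats]

batch = [l2_normalize(v) for v in [[3.0, 4.0], [4.0, 3.0], [-3.0, -4.0]]]
S = similarity_matrix(batch)
```

With unit-norm embeddings, every diagonal entry equals 1 and every entry lies in [-1, 1], and each row of `S` holds the similarities of one anchor to the whole batch.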

Global Lifted Structured Loss
As described in Section 2.1.2, the lifted structured loss utilizes a set of triplets for training, which is dynamically constructed by considering all sample pairs except the positive pair as negatives. For each triplet, it takes all negative pairs but only one positive pair into consideration. To address this limitation, a more general loss function was proposed for person re-ID [74], which learns a more discriminative embedding space by considering all positive pairs in a training batch. This generative lifted structured loss is defined in Equation (6) and consists of two parts. The distance of a sample pair, positive or negative, is denoted as $D_{ak} = \|f(x_a, \theta) - f(x_k, \theta)\|_2$, and $m$ is a margin. In our paper, we utilize the inner product to measure similarity. Note that the Euclidean distance can be converted to the inner product as in Equation (7), where $A$ is a constant. We can learn from Equation (7) that the Euclidean distance decreases as the inner product increases. We therefore rewrite the generative lifted structured loss in terms of the inner product, as given in Equation (8), where $\mu$ is a given margin. However, the generative lifted structured loss still fails to overcome the limitation of encoding pairs locally, which might result in a local optimum. To break through this limitation, we use the softmax loss to realize global optimization. As the softmax loss is designed for classification, we here treat our task as a classification of positive and negative similarities, with the formula defined in Equation (9). As our target is to increase the similarities of positive pairs (i.e., reduce the distance of positive pairs) and reduce the similarities of negative pairs (i.e., enlarge the distance of negative pairs), we can take the limit of the similarities of positive and negative pairs.
Specifically, we assume the positive and negative similarities (measured by inner product) are infinitely close to +1 and −1, respectively (i.e., the positive and negative distances, measured by Euclidean distance, are 0 and +∞, respectively), which means that the numerator in Equation (9) is a constant. We define the probabilities of positive and negative similarities to an anchor as $R_{y_k = y_a} = A_1 / \sum_{y_k = y_a} e^{-S_{ak}}$ and $R_{y_k \neq y_a} = A_2 / \sum_{y_k \neq y_a} e^{\mu + S_{ak}}$, where $A_1$ and $A_2$ are both constants. We combine the softmax loss with the generative lifted structured loss as in Equation (10). This global lifted structured loss is likely to learn a discriminative embedding space globally. However, it still fails to eliminate the impact of less informative sample pairs and to preserve the distribution of sample pairs inside the training batch. To achieve better performance in RSIR, we propose to use an efficient pairs mining strategy to select sample pairs with richer information, and we propose a global optimal structured loss which increases the intraclass compactness while maintaining the distribution of the selected sample pairs during network training. We describe our mining scheme and the global optimal structured loss in detail below.
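The link between the Euclidean distance and the inner product used in this section follows from expanding the squared norm. The short derivation below assumes L2-normalized embeddings, which is an assumption not stated explicitly in the text; under it, the constant $A$ would equal 2 and the similarities would lie in $[-1, 1]$.

```latex
\begin{aligned}
D_{ak}^2 &= \lVert f(x_a;\theta) - f(x_k;\theta) \rVert_2^2 \\
         &= \lVert f(x_a;\theta) \rVert_2^2 + \lVert f(x_k;\theta) \rVert_2^2
            - 2\,\langle f(x_a;\theta),\, f(x_k;\theta) \rangle \\
         &= 2 - 2\,S_{ak} \qquad \text{(assuming unit-norm embeddings)}
\end{aligned}
```

Under this assumption, minimizing the distance of a pair is equivalent to maximizing its similarity, which is why a distance-based loss can be rewritten in terms of $S_{ak}$.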

Global Optimal Structured Loss
For the task of RSIR, our target is to increase intraclass compactness and interclass sparsity. However, the global lifted structured loss described in Section 3.2 fails to preserve the distribution of sample pairs inside the selected sample pairs set. In this paper, we propose a novel global optimal structured loss to learn an efficient and discriminative embedding space. It aims to limit sample pairs with the same class label (positive sample pairs) within a hypersphere with a diameter of $(\alpha - m)$. This fixed boundary is important for maintaining the similarity distribution of the selected positive pairs of each category. Simultaneously, all negative sample pairs are pushed away beyond a fixed boundary $\alpha$, so that the positive and negative sample pairs are separated by a margin $m$.
We adopt the pairs mining strategy described in [45], which exploits the hardest negative pair (the one with the largest similarity among all negative pairs) to mine informative positive pairs and, similarly, samples negative pairs with richer information by considering the hardest positive pair (the one with the smallest similarity among all positive pairs). In other words, for an anchor $x_a$, we sample the informative positive and negative pairs according to Equations (11) and (12), and denote the resulting sets as $P_a$ and $N_a$, respectively, where $\epsilon = 0.1$. From Equation (11), the positive pair $(x_a, x_p)$ is selected as an element of $P_a$ by comparing its similarity with the hardest negative similarity. From Equation (12), the negative pair $(x_a, x_n)$ is selected as an element of $N_a$ by comparing its similarity with the hardest positive similarity. Here, $\epsilon$ is a hyper-parameter used to control the scope of informative sample pairs.
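The mining rule described above can be sketched as follows. The inequalities are written from the verbal description (a positive is kept if its similarity is below the hardest negative similarity plus a tolerance, a negative is kept if its similarity is above the hardest positive similarity minus the tolerance); their exact form is an assumption, and `eps` stands for the 0.1 hyper-parameter mentioned in the text.

```python
def mine_pairs(S, labels, anchor, eps=0.1):
    """Return (P_a, N_a): indices of informative positives and negatives
    for one anchor, given the similarity matrix S (list of rows)."""
    n = len(labels)
    pos = [k for k in range(n) if k != anchor and labels[k] == labels[anchor]]
    neg = [k for k in range(n) if labels[k] != labels[anchor]]
    if not pos or not neg:
        return [], []
    hardest_neg = max(S[anchor][k] for k in neg)   # most similar negative
    hardest_pos = min(S[anchor][k] for k in pos)   # least similar positive
    P_a = [k for k in pos if S[anchor][k] < hardest_neg + eps]
    N_a = [k for k in neg if S[anchor][k] > hardest_pos - eps]
    return P_a, N_a

# Similarities of anchor 0 to a toy batch (only row 0 is needed here).
S = [[1.0, 0.9, 0.4, 0.5, -0.2]]
labels = [0, 0, 0, 1, 1]
P_a, N_a = mine_pairs(S, labels, anchor=0)
```

In this toy row, the positive with similarity 0.9 and the negative with similarity -0.2 are already well placed and are discarded, leaving only the pairs that still carry useful gradient information.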
To realize the target of pulling the mined positive pairs as close as possible and keeping the similarity distribution of each class sample pairs (positive pairs) simultaneously, we increase their similarities and force them to be larger than the positive boundary (α − m) by minimizing the positive part of our proposed loss function. It is defined as: Similarly, to achieve the goal of pushing the mined negative sample pairs far away from positive ones and realize the separation of positive and negative sample pairs, we propose to decrease the negative similarities and impel them to be smaller than the negative boundary α by minimizing the negative part of our proposed loss function. We define this as: For our proposed global optimal structured loss, we integrate the two part of minimization objectives and optimize them jointly. And as there is difference between positive and negative sample pairs, we utilize two different hyper-parameters β 1 and β 2 . Our proposed loss is represented as: where β 1 = 2, β 2 = 50. This global optimal lifted structured loss could be likely to pay more attention on the positive and negative pairs with more information, which would be helpful to further improve the performance and effectiveness of RSIR task.
To make full use of the sample pairs in a mini-batch, we iteratively treat each image in the mini-batch as an anchor and the remaining images as the gallery. The loss function for a mini-batch, L_GOS(x), is defined in Equation (16). Once the loss function has been defined, the network parameters are learned by back-propagation. We minimize L_GOS with gradient descent, conducting online iterative pairs mining and loss calculation in matrix form. The loss of the deep features f(x, θ) of the training set is computed with Equation (16), and its gradient with respect to f(x, θ) is given in Equation (17). In Equation (17), w^+_aj and w^−_aj can be regarded as the weights of the positive and negative similarities, respectively. The update of the network parameters is determined by both the positive and the negative similarities: the loss on the positive similarities reflects intraclass compactness, and the loss on the negative similarities reflects interclass sparsity. The optimization process is given in Algorithm 1.
Algorithm 1.
8: for a = 1, . . . , Q do
9: Construct the informative positive pairs set P_a for anchor x_a as in Equation (11)
10: Construct the informative negative pairs set N_a for anchor x_a as in Equation (12)
11: Calculate L_P as in Equation (13) for the sampled positive pairs
12: Calculate L_N as in Equation (14) for the sampled negative pairs
13: Calculate L_GOS(x_a) as in Equation (15) for the anchor x_a
14: end for
15: Calculate L_GOS(x) as in Equation (16) for the mini-batch
16: Back-propagate the gradient and update the network parameters θ of f(x, θ)

RSIR Framework Based on Global Optimal Structured Loss
In this section, we illustrate the RSIR framework based on our proposed global optimal structured loss which contains the stages of training and testing. We present this framework in Figure 2.

Figure 2. The RSIR framework based on the global optimal structured loss. The upper part denotes the training stage, in which we fine-tune the pre-trained network with our global optimal structured loss. We use the fine-tuned network to extract more discriminative feature representations. The bottom part is the testing stage. The query image and the testing set are fed into the fine-tuned network, and the top K similar images are returned.
During the training stage, we use our proposed method to fine-tune the pre-trained network; the optimization process is described in detail in Section 3.4. We use the pre-trained network to extract deep features and generate a feature matrix for a training mini-batch. We compute the inner product on the feature matrix to obtain a similarity matrix of size Q × Q. We then use our proposed global optimal structured loss to optimize the embedding space by increasing the similarity of positive sample pairs and reducing the similarity of negative ones, where the pairs are selected by the pairs mining scheme. The resulting embedding space keeps positive pairs compact within a fixed hypersphere and pushes pairs from different classes apart by a given margin. At the testing stage, we use the fine-tuned network to extract more discriminative deep features. We apply the same similarity computation (inner product) to the feature matrix to obtain a similarity matrix for the test set. Finally, the top K most similar remote sensing images are returned for each query according to the similarity values.
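The testing stage described above amounts to an inner-product search over extracted features. The sketch below is a minimal NumPy illustration; the L2 normalization of the features is an assumption on our part (the text only states that the inner product is used), and `retrieve_top_k` is a hypothetical helper name.

```python
import numpy as np

def retrieve_top_k(query_feats, gallery_feats, k=20):
    """Return indices and similarities of the k most similar gallery items.

    query_feats   : (n_queries, d) feature matrix
    gallery_feats : (n_gallery, d) feature matrix
    """
    q = np.asarray(query_feats, dtype=float)
    g = np.asarray(gallery_feats, dtype=float)
    # Assumption: features are L2-normalized, so the inner product is a cosine.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    sim = q @ g.T                                # (n_queries, n_gallery)
    order = np.argsort(-sim, axis=1)[:, :k]      # most similar first
    return order, np.take_along_axis(sim, order, axis=1)
```

When the queries are themselves members of the gallery, the query image would normally be excluded from its own result list before evaluation.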

Experiments and Discussion
In this section, we present the implementation details of our experiments and verify the effectiveness of our proposed method on different remote sensing datasets.

Experimental Implementation
We perform the experiments on Ubuntu 16.04 with a single GTX 1080 Ti GPU and 64 GB RAM. We implement our method in PyTorch. The Inception network with batch normalization [75], pre-trained on ILSVRC 2012-CLS [76], serves as our initial network. During training, an FC layer is added on top of the initial network, after the global pooling layer. We use Adam as the optimizer. The learning rate is set to 1e−5 for all experiments, and training converges at 600 epochs. We use retrieval precision [50] to report the experimental results. The retrieval precision is defined as TP/R, where TP is the number of returned images that belong to the same category as the query and R is the number of returned images (candidates) for a query q. We use all images in the test set as query images and report the final result as AveP:
$$\mathrm{AveP} = \frac{1}{|Q|} \sum_{q \in Q} \frac{TP_q}{R}$$
where |Q| is the number of query images in the test set, R is the number of returned images for a query q, and TP_q is the number of true positive images for the query q. In this paper, we return only the top 20 retrieved images (candidates), following the setting in DBOW [50].
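The evaluation metric described above, the precision TP/R averaged over all queries, can be written as a short sketch. The function name `average_precision_at_r` and its list-based inputs are illustrative assumptions, not part of the original evaluation code.

```python
def average_precision_at_r(retrieved_labels, query_labels):
    """Average of TP / R over all queries.

    retrieved_labels : list of lists; retrieved_labels[i] holds the class
                       labels of the R images returned for query i
    query_labels     : list with the class label of each query
    """
    per_query = []
    for returned, q_label in zip(retrieved_labels, query_labels):
        tp = sum(1 for label in returned if label == q_label)  # true positives
        per_query.append(tp / len(returned))                   # precision TP / R
    return sum(per_query) / len(per_query)
```

For example, two queries with precisions 3/4 and 2/4 over R = 4 returned images give an AveP of 0.625.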

Datasets and Training
Datasets. We perform our experiments on four remote sensing databases: UCMerced Land Use [16,66], Satellite Remote Sensing Image Database [77], Google Image Dataset of SIRI-WHU [17,19,78] and NWPU-RESISC45 [1]. These benchmark databases are introduced as follows.
UCMerced Land Use [16,66] was collected by a team at the University of California, Merced, from a large number of images downloaded from the United States Geological Survey (USGS). This dataset is commonly used for retrieval and classification tasks in the field of RSIA. It includes 21 geographic categories with 100 remote sensing images per category; each image is 256 × 256 pixels with a spatial resolution of 0.3 m. We denote this dataset as UCMD in the remainder of this section.
Satellite Remote Sensing Image Database [77] contains 3000 remote sensing images of 256 × 256 pixels with a spatial resolution of 0.5 m. There are 20 manually labeled geographic categories, and each category includes 150 images. We denote this dataset as SATREM in the remainder of this section.
Google Image Dataset of SIRI-WHU [17,19,78] contains 2400 remote sensing images of 200 × 200 pixels with a spatial resolution of 2 m. This dataset contains 12 geographic categories with 200 images per category. We denote this dataset as SIRI in the experiments and discussion.
NWPU-RESISC45 [1] is a large-scale remote sensing dataset collected from Google Earth. It contains 31,500 remote sensing images of 256 × 256 pixels, whose spatial resolution varies from 30 m to 0.2 m. This dataset contains 45 geographic categories, and each category has 700 remote sensing images. We denote this dataset as NWPU in the remainder of this section.
Training setting. Following the data split protocol used in DBOW [50], we divide each dataset into training and testing sets at a ratio of 4:1. We crop all input images to 224 × 224. To avoid overfitting during training, we apply data augmentation in the form of random cropping with random horizontal mirroring. At the testing stage, we use a single center crop. During training, we set the size of every mini-batch to B.
A mini-batch consists of a number of randomly chosen geographic categories, and we sample M random images from each of these categories for training. We set M = 5 in all experiments, following the work of Wang et al. [45]. According to the analysis in the ablation study, we set the hyper-parameters mentioned in Section 3 to β_1 = 2, β_2 = 50, ε = 0.1, α = 0.8 and m = 0.5 in the following experiments.
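The batch construction described above (random categories, M images from each) can be sketched as a small sampler. The assumption that a mini-batch of size B holds exactly B/M categories is ours, and the helper name `sample_batch` is hypothetical.

```python
import random
from collections import defaultdict

def sample_batch(labels, batch_size=40, m=5, rng=None):
    """Sample a mini-batch of dataset indices: batch_size // m random
    categories, with m random images drawn from each of them."""
    rng = rng or random.Random()
    if batch_size % m != 0:
        raise ValueError("batch_size must be a multiple of m")
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Only categories with at least m images can contribute a full group.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= m]
    n_classes = batch_size // m
    if n_classes > len(eligible):
        raise ValueError("batch_size exceeds (number of categories) * m")
    batch = []
    for c in rng.sample(eligible, n_classes):
        batch.extend(rng.sample(by_class[c], m))
    return batch
```

Under this scheme every anchor in the batch has M − 1 positives, and the largest valid batch size for a dataset is its number of categories multiplied by M.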

Comparison with the Baselines
Baselines. Tang et al. [50] and Raffaele et al. [49] successively performed comprehensive comparisons of multiple retrieval systems. For convenience, we refer to the method proposed by Tang et al. as DBOW [50] and to the method proposed by Raffaele et al. as ADLF [49]. Besides DBOW and ADLF, we also select three other strong methods reported in DBOW and ADLF as baselines. The baselines are summarized in Table 1. For DN7 [28] and DN8 [28], the results are obtained with the DN features extracted from the 7th and 8th fully connected layers, as reported in DBOW. For ResNet50, the result is obtained with the VLAD encodings built on ResNet50 [51]. We directly use the results reported in those works as references for comparison. To verify the superiority of our proposed global optimal structured loss, we conduct a set of experiments on four remote sensing datasets and compare our method with the baselines on the RSIR task.
Table 1. Baseline entries: Convolutional + VLAD, 1500; DBOW [50], Convolutional + BoW, 16,384; ADLF [49], Convolutional + VLAD, 16,384.
As mentioned in Section 3, we fine-tune the network with our proposed global optimal structured loss. We use the features extracted from the fine-tuned network on the four remote sensing datasets to perform RSIR and compare the results with the baselines mentioned above. We set the embedding size to 512 and the batch size to 40. We denote our proposed global optimal structured loss with the pairs mining strategy as GOSL_m. The results are presented in Table 2. Table 2 shows that our global optimal structured loss with the pairs mining strategy obtains state-of-the-art results on SIRI and NWPU. Our AveP exceeds that of DBOW by 4.0 percentage points (from 92.6% to 96.6%) on SIRI and that of ADLF by 4.6 percentage points (from 85.7% to 90.3%) on NWPU.
On UCMD and SATREM, we achieve the second-best performance, with an AveP of 85.8% and 91.1%, respectively. The best result on UCMD is obtained by ADLF, which applies query expansion (QE) as post-processing; on the remaining three datasets, our method performs better than ADLF. DBOW obtains the best performance on SATREM, but our method outperforms DBOW on the remaining three datasets. Furthermore, it is worth noting that we conduct our experiments with raw feature representations, without any post-processing operations such as whitening, re-ranking or QE. These results indicate that our proposed method is effective for RSIR and obtains state-of-the-art results on commonly used remote sensing datasets.
To further investigate the effectiveness of our proposed method, we report the precision for each geographic category of the four remote sensing datasets in Tables 3-6, with the best results highlighted in bold. The per-category precision is computed from the top 20 retrieved images. Table 3 shows that our method achieves a marked improvement in nearly half of the categories. Specifically, our method yields the largest gains on "Golf" and "Sparse", with increases of 7% (from 85% to 92%) and 12% (from 79% to 91%). We also obtain smaller gains on several categories: the proposed method increases the precision by 1% (from 94% to 95%) over DN7 on "Agriculture", 3% (from 87% to 90%) over DBOW on "Baseball", 2% (from 93% to 95%) over DBOW on "Storage" and 1% (from 94% to 95%) over DBOW and ADLF on "Tennis". However, weaker performance is obtained on the other categories, as reported below.
The precisions are 82%, 92%, 78%, 95%, 95%, 83%, 95%, 80%, 78% and 91% on the categories "Airplane", "Beach", "Buildings", "Chaparral", "Forest", "Freeway", "Harbor", "Intersection", "Overpass" and "Runway", respectively, which is about the average level. We also rank second on "Mobile", "Parking" and "River", with precisions of 80%, 95% and 86%, respectively. Our method obtains its worst results on "Dense" and "Medium-density", with precisions of 55% and 59%, respectively. A closer look at the retrieval results shows that our method confuses images of "Dense" with those of "Medium-density", "Mobile" and "Buildings". The average of all precisions on UCMD with our proposed method ranks second, at 85.8%.
Table 4 shows that our method outperforms the state-of-the-art methods on half of the categories in SATREM. In particular, our method brings a large improvement on "Airplane", "Beach", "Chaparral" and "Ocean". The precisions on these categories are 100%, 98%, 100% and 100%, respectively, an increase of nearly 4% compared with the existing best results. We also obtain small improvements on some categories: the precisions increase by 1% (from 97% to 98%) on "Artificial" and by 2% (from 96% to 98%) on "Forest". Moreover, we match the existing best results on "Cloud", "Harbor" and "Runway", with precisions of 100%, 98% and 97%, respectively. However, our method obtains weaker results on some other categories. We achieve the second-best results on "Agriculture", "Buildings", "Road" and "Storage", with precisions of 92%, 94%, 90% and 99%, respectively.
The results on "Container", "Dense", "Factory", "Parking" and "Sparse" are unremarkable and mostly at the average level, with precisions of 92%, 92%, 72%, 88% and 78%. The worst result is obtained on "Medium-density", with a precision of 53%. Further analysis of the retrieval results shows that many images of "Building", "Dense Residential" and "Factory" are incorrectly retrieved for "Medium-density" queries. For the average precision over all categories in SATREM, we achieve a result that is competitive with the state of the art: our method obtains the second-best result, 91.1%.
The results in Table 5 show that our proposed method achieves state-of-the-art performance in almost all categories. Specifically, we achieve significant improvements over the existing best results on "Harbor", "Overpass" and "Park", with gains of 9% (from 89% to 98%), 6% (from 94% to 100%) and 10% (from 90% to 100%), respectively. We slightly increase the precision by 1% (from 99% to 100%) over DBOW on "Commercial", 2% (from 97% to 99%) over DBOW on "Idle", 2% (from 96% to 98%) over ADLF on "Industrial", 2% (from 93% to 95%) over DBOW on "Meadow", 1% (from 97% to 98%) over DBOW on "Residential" and 1% (from 99% to 100%) over ADLF on "Residential". However, we obtain weaker results on "Pond" and "River", with precisions of 96% and 77%, which are at the average level. The final AveP over all images in SIRI increases by approximately 4% (from 92.6% to 96.6%). The improvement achieved on SIRI demonstrates that our method can be more effective than the state-of-the-art methods on the RSIR task.

Comparison with Multiple DML Methods in the Field of RSIR
As described in Section 2.1.2, many elegant DML methods have been proposed, and they have achieved appreciable performance in general and fine-grained image retrieval. To verify the generalization ability of DML in the task of RSIR, we perform a set of experiments on the four datasets with the common DML methods N-pairs loss [43] and global lifted structured loss [74], our proposed global optimal structured loss, and the latter two methods combined with the pairs mining scheme. For convenience, we denote the global lifted structured loss, the N-pairs loss and our global optimal structured loss as GLSL, N-pairs and GOSL, respectively. We use the subscript m to indicate that our mining scheme is employed. For all these DML methods, we set the embedding size to 512 and the batch size to B = 40 unless otherwise stated. For GLSL, we follow the experimental implementation and training settings of our proposed global optimal structured loss with the pairs mining scheme, and the hyper-parameter is set to µ = 0.5. GLSL_m follows the same settings as GLSL, and the hyper-parameter of the mining scheme is set to ε = 0.1. For N-pairs, we follow the experimental implementation and training settings of our proposed global optimal structured loss with the pairs mining scheme, except that the batch size and the number of images sampled from each category are set to B = 20 and M = 2. The AveP (%) results are presented in Table 7. Table 7 shows that RSIR achieves appreciable performance on the public remote sensing datasets with common DML methods. We first analyze the performance of the methods on UCMD. Our GOSL_m achieves the best performance, with AveP = 85.5%, and outperforms GOSL, GLSL_m, GLSL and N-pairs by 0.7%, 1.5%, 3.2% and 3.6%, respectively.
Moreover, the pairs mining scheme increases the AveP of GLSL and of our GOSL by 1.7% and 0.7%, respectively, over their counterparts without mining. Secondly, we consider the SATREM results reported in Table 7. We achieve the best performance (AveP = 91.1%) with our GOSL_m, which outperforms GLSL_m and N-pairs by 3.9% and 5.8%, respectively. With the pairs mining scheme, the performance of GLSL and GOSL improves by a wide margin: GOSL_m improves the AveP from 86.8% to 91.1% over GOSL, and GLSL_m improves the AveP from 85.1% to 87.2% over GLSL. Thirdly, we analyze the results on SIRI with the different DML methods. With the pairs mining scheme, our GOSL_m obtains the best performance, with AveP = 96.6%, and outperforms GOSL by 1.3%. The pairs mining scheme also improves the performance of GLSL from 94.9% to 95.2%. Moreover, the AveP of our GOSL_m is better than that of GLSL_m and N-pairs. Finally, we analyze the results on NWPU in Table 7. We achieve the best performance with our proposed GOSL_m, which is higher than GLSL_m and N-pairs by 1.7% and 6.0%, respectively. Furthermore, GLSL_m increases the AveP by 3.1% over GLSL, and the proposed GOSL_m increases the AveP by 4.5% over GOSL. In brief, our proposed global optimal structured loss with the pairs mining scheme achieves the best performance on the four popular remote sensing datasets. The proposed loss is more effective than the common DML methods, and the pairs mining scheme helps to further boost the performance of DML methods.
To further study the effectiveness of our proposed method, we use Recall@K [44] (K = 1, 2, 4, 8, 16, 32) to evaluate the RSIR performance of these common DML methods and of our proposed method. Recall@K is a common retrieval metric, computed as the average recall score over all query images in a test set. We perform the experiments on the four remote sensing datasets with the same settings as in the first part of this section. The results are reported in Tables 8-11.
Table 8 shows that our proposed GOSL_m achieves the best performance at Recall@K (K = 1, 2, 4, 8, 16, 32), with Recall@1 = 98.5%, Recall@2 = 98.8%, Recall@4 = 99.0%, Recall@8 = 99.0%, Recall@16 = 99.2% and Recall@32 = 99.7%. It is worth noting that Recall@1 is the most important index for analyzing the effectiveness of the methods. The proposed GOSL_m outperforms GOSL, GLSL_m, GLSL and N-pairs by 2.9%, 3.8%, 4.3% and 3.2%, respectively, at Recall@1. Thus GOSL_m improves Recall@1 by 2.9% over GOSL, and GLSL_m improves Recall@1 by 0.5% over GLSL. We conclude that the global optimal structured loss with the pairs mining scheme is superior to the other DML methods and that the pairs mining scheme significantly improves the retrieval performance on UCMD.
According to the results in Table 9, our proposed GOSL_m achieves the best performance at Recall@K (K = 1, 2, 4, 8, 16, 32), with Recall@1 = 94.8%, Recall@2 = 97.0%, Recall@4 = 98.5%, Recall@8 = 99.3%, Recall@16 = 100% and Recall@32 = 100%. The Recall@1 of GOSL_m exceeds that of GOSL, GLSL_m, GLSL and N-pairs by 1.5%, 0.3%, 2.0% and 1.2%, respectively. Moreover, at Recall@1, GOSL_m improves by 1.5% over GOSL and GLSL_m improves by 1.7% over GLSL.
These analyses show that our proposed GOSL_m is effective for RSIR on SATREM.
Table 10 leads to the following conclusions. We achieve the best results with our proposed GOSL_m at Recall@K (K = 1, 2, 4, 8, 16, 32), with Recall@1 = 97.2%, Recall@2 = 97.5%, Recall@4 = 98.1%, Recall@8 = 98.7%, Recall@16 = 99.1% and Recall@32 = 99.5%. The proposed GOSL_m outperforms GOSL, GLSL_m, GLSL and N-pairs by 1.2%, 1.4%, 1.8% and 2.2%, respectively, at Recall@1. The mining scheme helps to improve the RSIR performance: the Recall@1 of GOSL_m and GLSL_m improves by 1.2% and 0.4% over GOSL and GLSL, respectively. We conclude that our proposed global optimal structured loss with the pairs mining scheme is superior to the other DML methods and that the pairs mining scheme helps to improve the retrieval performance on SIRI.
Table 11 shows that the proposed GOSL_m obtains the best results at Recall@K (K = 1, 2, 4, 8, 16, 32), with Recall@1 = 91.1%, Recall@2 = 94.3%, Recall@4 = 96.3%, Recall@8 = 97.6%, Recall@16 = 98.3% and Recall@32 = 98.7%. The proposed GOSL_m outperforms GOSL, GLSL_m, GLSL and N-pairs by 3.7%, 0.8%, 3.9% and 3.8%, respectively, at Recall@1. With the pairs mining scheme, our GOSL and GLSL improve by 3.7% (from 87.4% to 91.1%) and 3.1% (from 87.2% to 90.3%), respectively, at Recall@1. These analyses further demonstrate that our proposed global optimal structured loss with the pairs mining scheme is more effective than the other DML methods and that the pairs mining scheme significantly improves the retrieval performance on NWPU.
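As an illustration of the Recall@K metric used above, the following sketch assumes the usual DML convention, in which a query counts as a hit at K if at least one of its K nearest neighbours (excluding the query itself) shares its label. This convention and the helper name `recall_at_k` are assumptions; the original evaluation code is not shown here.

```python
import numpy as np

def recall_at_k(features, labels, ks=(1, 2, 4, 8, 16, 32)):
    """Recall@K with every sample of the test set used as a query.

    features : (n, d) embedding matrix of the test set
    labels   : (n,) class labels
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    sim = features @ features.T            # inner-product similarity
    np.fill_diagonal(sim, -np.inf)         # a query must not retrieve itself
    # Sort neighbours by decreasing similarity and drop the query (last column).
    order = np.argsort(-sim, axis=1)[:, :-1]
    hits = labels[order] == labels[:, None]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

Because a hit at K implies a hit at every larger K, the reported values are non-decreasing in K, which matches the pattern of the results quoted above.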
We report the errors of omission and commission for several easy and hard retrieval cases on UCMD to further validate the effectiveness of our proposed method. Figure 3 shows the top-10 similar images returned by N-pairs, GLSL_m and our proposed GOSL_m. For each retrieval case, the top, middle and bottom rows show the results obtained with our GOSL_m, GLSL_m and N-pairs, respectively. Returned images with green and red borders denote true and false retrieval results, respectively. Figure 3 shows that there are no errors of omission or commission in the three easy retrieval cases with any of the three methods, which means that all three methods achieve excellent retrieval performance on the three easy categories (i.e., agricultural, storage tanks and tennis court). However, in the three hard cases, GOSL_m, GLSL_m and N-pairs all perform worse, because the categories buildings, dense residential and medium residential have very low interclass variability. In case 4, GOSL_m makes fewer errors than GLSL_m and N-pairs. In case 5, GOSL_m, GLSL_m and N-pairs make three, five and five errors, respectively, which shows that our proposed GOSL_m outperforms GLSL_m and N-pairs on the category of dense residential. In case 6, GOSL_m, GLSL_m and N-pairs make two, four and five errors, respectively, which demonstrates that our proposed GOSL_m is more effective than the other two DML methods.
In summary, compared with the other DML methods, our GOSL_m achieves the best performance on the easy retrieval cases and copes better with the low interclass variability present in most categories of remote sensing images.
Figure 3. Six retrieval cases with top-10 returned results on UCMD. The left part represents three easy retrieval cases and the right part represents three hard retrieval cases. For each retrieval case, the top, middle and bottom rows denote the results obtained by using our GOSL_m, GLSL_m and N-pairs, respectively. The green and red borders denote true and false retrieval results, respectively.

Ablation Study
In this section, we perform an ablation study on the remote sensing datasets. We analyze the hyper-parameters of our global optimal structured loss and the performance of our method with different embedding sizes. We also study the impact of the batch size on the performance of our proposed method. The details are given below.

Hyper-Parameter Analysis
We analyze the main parameters mentioned in Section 3 on the Google Image Dataset of SIRI-WHU [17,19,78], using the Inception network with batch normalization [75]. We set the embedding size to 512 and the batch size to 40. We set ε = 0.1, which is defined in Equations (11) and (12), and β_1 = 2 and β_2 = 50, which are parameters in Equation (16), following the setting of [45]. As in DBOW, we use the average value of precision (AveP) to measure the RSIR performance.
The effectiveness of the fine-tuned network is crucial for extracting more discriminative features, which in turn is important for obtaining better performance in the RSIR task. In our proposed method, we use a fixed positive boundary (α − m) to keep the positive pairs within this boundary and a given negative boundary α to push the negative pairs farther than this boundary. Therefore, m is a fixed margin that separates the two boundaries. Different values of α and m affect the retrieval result. To achieve the best performance in the RSIR task, we present our hyper-parameter analysis of α and m as follows.
As described in Section 3.4, α is a hyper-parameter used to keep the negative pairs far away from the positive pairs. We evaluate α ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 1.0} with m fixed at 0.5 and present the results in Table 12. Table 12 shows that the AveP increases while α is smaller than 0.6 and decreases once α is larger than 0.6. We achieve the best result, 96.6%, when α is 0.6. We therefore set α = 0.6 in the section on experiments and discussion.
The factor m is used to pull the positive sample pairs apart from the negative ones. We study the impact of the hyper-parameter m by setting its value to {0.1, 0.2, 0.3, 0.4, 0.5, 0.6} with α fixed at 0.6. The results are shown in Table 13. Table 13 shows that the performance gradually increases while m is smaller than 0.5 and degrades once m is larger than 0.5. The best result, 96.6%, is achieved when m = 0.5. We therefore select m = 0.5 for the following experiments according to the results in Table 13.

Impact of Embedding Size
According to the work of Wang et al. [45], the embedding size used during training has an important impact on the retrieval performance. We compare the effectiveness of our proposed loss function on the UCMD, SATREM, SIRI and NWPU datasets with embedding sizes of {64, 128, 256, 512, 1024}. We set the batch size to B = 40. The results are reported in Table 14, with the best result highlighted in bold. Table 14 shows that the performance on UCMD, SATREM, SIRI and NWPU improves steadily as the embedding size grows up to 512 and drops at an embedding size of 1024. The best results on all four datasets are obtained with an embedding size of 512.

Impact of Batch Size
The batch size plays an important role in DML methods because it determines the size of the problem that must be processed in each iteration of the training phase. We perform a set of experiments on the UCMD, SATREM, SIRI and NWPU datasets with an embedding size of 512, and we compare batch sizes of {10, 20, 40, 60, 100, 160}. We report the results in Table 15. Because the number of categories in each dataset is limited, the batch size is bounded by the number of categories multiplied by M = 5, that is, by 105, 100, 60 and 225 on UCMD, SATREM, SIRI and NWPU, respectively. A batch size larger than this upper limit yields an invalid result. Table 15 shows that the batch size influences the four datasets to different degrees. The performance varies by only about 1% on UCMD and SIRI, whereas SATREM and NWPU are the most sensitive to the batch size, with performance ranging from 86.5% to 91.1% and from 83.9% to 90.3%, respectively. We obtain the best performance on all four datasets with a batch size of 40.

The Retrieval Execution Complexity
In this section, we analyze the retrieval execution complexity of the retrieval system built on our proposed method. We measure the time (in milliseconds) required for the retrieval process, which consists of deep feature extraction and similarity matching. Deep feature extraction takes about 10 milliseconds per image of size 224 × 224, which is faster than the fastest existing RSIR method [49]. We report the results in Table 16 and compare our retrieval time (similarity matching) with that reported for ADLF [49]. Table 16 shows that the retrieval time grows with the size of the test database, and likewise with the embedding size. Concretely, with an embedding size of 256, our retrieval execution time is lower than that of ADLF, the fastest existing method, by 1.36, 2.42, 9.64, 25.9, 45.64 and 73.63 milliseconds for DB sizes of 50, 100, 200, 300, 400 and 500, respectively. With an embedding size of 512, it is lower than that of ADLF by 0.68, 2.97, 10.41, 15.24, 28.12 and 42.55 milliseconds for the same DB sizes. We achieve the lowest retrieval execution time with an embedding size of 256, where the best results are 0.28, 0.40, 0.66, 1.03, 1.49 and 2.31 milliseconds for DB sizes of 50, 100, 200, 300, 400 and 500, respectively. Compared with DN7, DN8, DBOW and ADLF, the embedding size has a smaller effect on our retrieval time, changing it by less than 2 milliseconds. Based on the discussion above, our proposed method achieves state-of-the-art performance with a lower retrieval time.
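The similarity-matching step timed above can be sketched as follows. This is a minimal illustration in pure Python, assuming cosine similarity between L2-normalized embeddings and random vectors in place of real deep features. The function names are hypothetical, and the absolute timings it prints depend on the machine and are not comparable to those in Table 16.

```python
import math
import random
import time


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def rank_database(query, database):
    """Return database indices sorted from most to least similar to the query."""
    q = l2_normalize(query)
    scores = []
    for idx, emb in enumerate(database):
        e = l2_normalize(emb)
        scores.append((sum(a * b for a, b in zip(q, e)), idx))
    # Highest similarity first; ties are broken by the smaller index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores]


def time_matching_ms(db_size, embedding_size, seed=0):
    """Time one similarity-matching pass (milliseconds) on random embeddings."""
    rng = random.Random(seed)
    database = [[rng.gauss(0, 1) for _ in range(embedding_size)]
                for _ in range(db_size)]
    query = [rng.gauss(0, 1) for _ in range(embedding_size)]
    start = time.perf_counter()
    ranking = rank_database(query, database)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return elapsed_ms, ranking


if __name__ == "__main__":
    # Same DB sizes and two of the embedding sizes discussed in the text.
    for dim in (256, 512):
        for n in (50, 100, 200, 300, 400, 500):
            ms, _ = time_matching_ms(n, dim)
            print(f"embedding={dim} db={n} time={ms:.2f} ms")
```

The cost of one pass is proportional to the DB size times the embedding size, which is consistent with the trend that retrieval time grows with both quantities.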

Conclusions
In this paper, we propose a novel global optimal structured loss under the DML paradigm for more effective remote sensing image retrieval. The proposed loss aims to learn an effective embedding space in which the positive pairs are kept within a given positive boundary, the negative pairs are pushed beyond a fixed negative boundary, and the positive and negative pairs are separated by a fixed margin. To deal with the key issue of local optimization in most DML methods, we use a softmax function rather than a hinge function in our loss function to realize global optimization. To make full use of the sample pairs and to take the difference and relationship between positive and negative sample pairs into consideration, we adopt a pairs mining strategy that mines the more informative sample pairs in the confusion scope. This strategy eliminates the influence of less informative sample pairs and uses the mined pairs to establish a similarity structure for positive and negative sample pairs, and this structure is preserved during embedding space optimization. Furthermore, compared with the baselines, our proposed global optimal structured loss achieves state-of-the-art performance with the lowest retrieval time on four popular remote sensing datasets.
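The difference between a hinge function and a softmax-style function can be illustrated with a small sketch. It is not our exact loss: it assumes a generic pairwise form over similarity scores, with hypothetical function names, and contrasts a sum of hinge terms with the smooth log-sum-exp counterpart log(1 + Σ exp(s_n − s_p + m)).

```python
import math


def hinge_pair_loss(pos_sims, neg_sims, margin):
    """Sum of per-pair hinge terms; pairs that satisfy the margin contribute 0."""
    total = 0.0
    for sp in pos_sims:
        for sn in neg_sims:
            total += max(0.0, sn - sp + margin)
    return total


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def softmax_pair_loss(pos_sims, neg_sims, margin):
    """Smooth counterpart: log(1 + sum over pairs of exp(sn - sp + margin)).

    The leading 0.0 term accounts for the '1 +' inside the logarithm.
    """
    terms = [0.0] + [sn - sp + margin for sp in pos_sims for sn in neg_sims]
    return logsumexp(terms)


if __name__ == "__main__":
    pos = [0.9, 0.8]   # similarities of positive pairs (toy values)
    neg = [0.1, 0.2]   # similarities of negative pairs (toy values)
    print(hinge_pair_loss(pos, neg, 0.5))
    print(softmax_pair_loss(pos, neg, 0.5))
```

With these toy values every pair already satisfies the margin, so the hinge form is exactly zero and gives no gradient, while the log-sum-exp form stays positive and still weights all pairs jointly. This is the sense in which a softmax-style function considers the whole set of pairs rather than each violating pair in isolation.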
Herein, we study the effectiveness of DML methods for the task of RSIR and concentrate on how to design a better loss function for more effective embedding space learning. The experimental results show that our proposed method achieves state-of-the-art performance under the AveP and Recall@K metrics when compared with other common DML methods. We also improve the retrieval performance on SIRI and NWPU over the baselines by a large margin and set new state-of-the-art results on these datasets. However, we achieve only the second-best performance on UCMD and SATREM. It is worth noting that we do not apply any post-processing operations or extra techniques, such as query expansion or an attention mechanism, to our proposed method. As a result, our method does not extract the more informative feature representations that could significantly improve retrieval performance. In future work, we plan to combine an attention network with DML methods and to use post-processing operations to further enhance the performance of RSIR.