High-Rankness Regularized Semi-Supervised Deep Metric Learning for Remote Sensing Imagery

Abstract: Deep metric learning has recently received special attention in the field of remote sensing (RS) scene characterization, owing to its prominent capabilities for modeling distances among RS images based on their semantic information. Most of the existing deep metric learning methods exploit pairwise and triplet losses to learn feature embeddings that preserve semantic similarity, which requires the construction of image pairs and triplets based on supervised information (e.g., class labels). However, generating such semantic annotations becomes a completely unaffordable task in large-scale RS archives, which may eventually constrain the availability of sufficient training data for this kind of model. To address this issue, we reformulate the deep metric learning scheme in a semi-supervised manner to effectively characterize RS scenes. Specifically, we aim at learning metric spaces by utilizing the supervised information from a small number of labeled RS images and exploring the potential decision boundaries for massive sets of unlabeled aerial scenes. In order to reach this goal, a joint loss function, composed of a normalized softmax loss with margin and a high-rankness regularization term, is proposed, as well as its corresponding optimization algorithm. The conducted experiments (including different state-of-the-art methods and two benchmark RS archives) validate the effectiveness of the proposed approach for RS image classification, clustering, and retrieval tasks. The codes of this paper are publicly available.


Introduction
Nowadays, the increasing availability of remote sensing (RS) data offers widespread opportunities in many important application fields, such as urban planning [1][2][3], aerial scene retrieval [4][5][6], change detection [7,8], analysis of the earth's surface [9,10], vegetation mapping [11,12], and remote object detection [13,14]. In these (and many other) important applications, the visual interpretation of RS scenes becomes a particularly challenging task, since a semantic characterization of RS images is required to deal with highly complex spatio-spectral land cover components that lead to high intra-class (and low inter-class) variability [15]. Note that there are specific factors affecting RS data, such as sensing conditions, sensor types, and data volume (among others) that often make semantically similar aerial scenes exhibit very different characteristics, resulting in the so-called large-scale variance problem [16][17][18].
With the improvement of earth observation technologies, different RS image characterization methods have been successfully proposed in the literature to deal with such intricacies [19]. In general, it is possible to distinguish three main types of methods: hand-crafted feature-based [20,21], unsupervised feature learning-based [22,23], and deep feature learning-based methods [24][25][26][27]. Despite the potential advantages of using manually designed features or unsupervised learning techniques, the enormous capability of deep learning models as feature extractors makes these methods the current state-of-the-art technology to effectively characterize RS scenes via convolutional neural networks (CNNs) [28][29][30][31]. Among all the conducted research, deep metric learning has recently been shown to be one of the most relevant image characterization trends, since it pursues to map the input data into a feature space where semantically similar images are projected to nearby locations [32][33][34]. However, this kind of model generally demands massive amounts of annotated data for training, which may severely constrain its practical application in operational RS scenarios with limited labelled data [35].
In order to address the above-mentioned limitation, this paper proposes a novel RS image characterization method, named high-rankness regularized semi-supervised deep metric learning (HR-S²DML), which re-defines the standard deep metric learning framework by using an innovative semi-supervised design. More specifically, the proposed method aims at learning a low-dimensional metric space which is able to capture semantic similarities among aerial scenes from a reduced number of labeled images, while exploiting the potential decision boundaries of massive unlabeled RS images. To achieve this goal, the proposed model includes a newly defined loss function, which is based on two main constitutive components: (i) a normalized softmax loss with margin, which aims at aligning RS images from the same class, as well as enhancing the intra-class compactness and inter-class discrepancy under the semi-supervised framework; and (ii) a high-rankness regularization term, which preserves the discrimination and diversity capabilities of the model over both labeled and unlabeled RS scenes. Additionally, an appropriate optimization mechanism is also proposed to generate consistent features within each training epoch. The extensive experimental comparison conducted in this work, including several state-of-the-art models and two benchmark datasets, validates the effectiveness of the proposed method in the task of characterizing RS scenes on three different applications: classification, clustering, and retrieval. Summarizing, the main contributions of this paper can be listed as follows:

1.
A new semi-supervised deep metric learning model is presented to characterize vast RS image collections in an end-to-end manner, using a reduced amount of annotated data. Specifically, the proposed method has been designed to learn (based on CNN models) a metric space that jointly preserves the discrimination capability for labelled and unlabelled RS scenes.

2.
A new loss function, based on the normalized softmax loss with margin and the high-rankness regularization, is proposed to enhance the feature learning ability under a semi-supervised assumption. Additionally, an optimization mechanism is also defined to produce consistent features within each training epoch.

3.
The extensive experimental evaluation (based on three different RS applications) conducted in this paper compares the performance of the proposed method against different state-of-the-art methods using several datasets. The codes of this paper are publicly available to the research community (https://github.com/jiankang1991).
The rest of the paper is organized as follows. Section 2 introduces some related works, as well as their main limitations for characterizing aerial scenes. Section 3 defines the proposed semi-supervised model for effectively representing RS scenes. Section 4 presents the experimental part of the work, including different benchmark datasets and state-of-the-art methods. Section 5 provides a discussion of the obtained results. Finally, Section 6 concludes the paper with some remarks and hints at plausible future research lines.

Related Work
During the past years, a considerable number of methods have been proposed for characterizing RS images. Generally, these approaches can be categorized into three different types [36]: hand-crafted feature-based, unsupervised feature learning-based, and end-to-end deep learning-based methods. Hand-crafted feature-based techniques make use of different visual descriptors to capture elementary image characteristics, such as color [37], shape [20,38], or texture [21,39]. Alternatively, unsupervised learning methods try to improve these results by using different kinds of unsupervised learning protocols. That is, these approaches pursue to encode the low-level visual descriptors into a higher-level feature space via sparse coding [40,41], topic modeling [42], and auto-encoders [43], among other unsupervised paradigms. However, the lack of supervised information during the learning process often reduces the ability of these techniques to effectively discriminate among complex RS concepts [19]. With the development of deep learning technology, deep learning-based methods have been shown to obtain excellent results for characterizing RS scenes, due to the great potential of CNNs to uncover high-level features from an end-to-end perspective [44]. For example, this is the case of the work in Li et al., who define in [45] a multi-layer feature fusion framework that exploits multiple pre-trained CNN models to represent RS images. Analogously, Piramanayagam et al. proposed in [46] a composite convolutional architecture to fuse multi-sensor data into a single characterization. Moreover, Li et al. presented in [30] a feature extraction network for RS that combines global and local features using the VGGNet [47] model and a recurrent neural network-based attention module, respectively. Other authors, such as Pires et al. in [31] also showed the benefits of considering a transfer learning approach to characterize RS scenes.
Despite the advantages of these and other deep learning-based methods [48], the deep metric learning scheme has recently been shown to be one of the most effective alternatives to characterize RS data [49]. In general, deep metric learning is focused on projecting semantically similar images to nearby locations in feature space, using non-isotropic metrics [50]. Consequently, this scheme is becoming increasingly popular for alleviating the large-scale variance problem in RS since it can naturally model complex semantic similarities. For instance, Cheng et al. defined in [32] a deep metric learning approach (with a regularization term) based on the contrastive embedding framework [51] to learn discriminative CNN-based characterizations for RS images. In [33], Yan et al. developed a cross-domain extension of this contrastive scheme to reduce the bias of the corresponding feature distribution and the spectral shift. Alternative works also contemplate other relationships between RS scenes when learning the feature space. This is the case of Cao et al. who proposed in [52] a deep metric learning method for representing aerial scenes using a predefined CNN model and the triplet loss formulation [53], where both positive and negative samples are used to build the corresponding feature embeddings. Yun et al. presented in [34] a coarse-to-fine deep metric learning technique based on the triangular loss, which also accounts for the differences between negative and positive samples during training to achieve more precise results. Additionally, Kang et al. defined in [54] a deep metric learning framework for characterizing RS images based on scalable neighborhood component analysis [55], in order to better preserve the neighborhood structure in scalable datasets. Hong et al. [56] proposed a novel deep cross-modal network, which improves the classification results based on the cross-modality RS datasets.
Existing deep metric learning methods for RS image characterization are mainly focused on considering tuples of two or three labelled scenes, and then learning their binary relationships to build the corresponding feature space in a supervised manner. However, the availability of such annotations for training is usually rather limited in RS, since obtaining high-quality ground-truth land cover information for vast image archives is very expensive, as well as time-consuming. This fact logically contrasts with the requirement of large amounts of training data to properly train deep metric learning-based image characterization models, which may eventually become an important constraint in RS [18]. Although unsupervised image characterization methods [57,58] are potentially able to relieve this limitation, the high intricacy of the RS image domain still makes unsupervised schemes unable to capture the complex semantic relationships between land cover concepts, because real RS class labels are not taken into account [19]. With these considerations in mind, it seems reasonable to find a trade-off between the supervised and unsupervised scenarios in order to take advantage of both paradigms to effectively characterize RS images from a deep metric learning-based perspective. Precisely, some recent works point out the benefits of using a semi-supervised scheme in this context. For example, Liu et al. defined in [59] a semi-supervised deep metric learning approach specially designed to classify synthetic aperture radar (SAR) data. More specifically, the authors made use of a manifold regularization term to penalize large distances between labeled instances and their nearest unlabeled neighbors; however, the same authors concluded that there is still room for improvement, since more research is required to provide effective solutions for multi-spectral RS data and other target applications.
That is, the increasing complexity of RS images in terms of data volume and semantic understanding [16,18,35] demands new strategies to enhance the capacity of deep metric learning-based characterization methods to distinguish between a broader range of contrasting land cover types using limited amounts of labelled data. More precisely, relieving these important limitations (from a semi-supervised viewpoint) motivates the research conducted in this work.

Proposed Semi-Supervised Deep Metric Learning for Remote Sensing
The proposed HR-S²DML approach, which is specially designed to characterize RS images, is composed of two main parts: (1) a backbone CNN architecture that encodes the RS images into corresponding features in a low-dimensional metric space; and (2) a new joint loss function for guiding the CNN model to learn a metric space in a semi-supervised fashion. Figure 1 illustrates the proposed framework in a graphical way. As can be seen, the proposed end-to-end model is made up of two different segments that make use of the same CNN backbone architecture and share their corresponding weights. On the one hand, the top segment covers the labeled RS scenes with the normalized softmax loss with margin, with the objective of facilitating the generation of a metric space with high intra-class compactness and inter-class discrepancy for the available labelled data. On the other hand, the bottom segment employs the high-rankness regularization over the unlabelled images, preserving the discrimination and diversity capabilities on the unlabeled data. The details of our approach are provided in the following subsections. Nonetheless, we first briefly define the notation used in the paper.
Let us assume that the superscripts L and U identify labelled and unlabelled images, respectively. Let $X^L = \{x^L_1, \cdots, x^L_{M_L}\}$ be an RS image dataset of $M_L$ images with category annotations, and $Y = \{y^L_1, \cdots, y^L_{M_L}\}$ be the corresponding set of labels, where each label is represented by a one-hot vector over the $C$ considered classes. Analogously, let $X^U = \{x^U_1, \cdots, x^U_{M_U}\}$ be a dataset of $M_U$ unlabelled RS images. Deep metric learning aims to learn a CNN model $\mathcal{F}(\cdot)$ that effectively encodes the semantic contents of images with low-dimensional feature embeddings in the produced metric space, where semantically similar images are located close together and semantically dissimilar images are separated. In the context of semi-supervised deep metric learning, the CNN model $\mathcal{F}(\cdot)$ is learned by utilizing both the labeled and unlabeled image datasets, $X^L$ and $X^U$. With respect to an image $x_i$, $f_i \in \mathbb{R}^D$ represents its normalized feature embedding produced by $\mathcal{F}(\cdot)$, i.e., $f_i = \mathcal{F}(x_i)/\|\mathcal{F}(x_i)\|_2$, where $D$ is the dimension of the feature embedding. Using this notation, the following subsections describe the different parts of the proposed joint loss function and the optimization algorithm.
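The unit-norm embedding described above can be sketched in a few lines of PyTorch; here the linear "backbone" is a hypothetical stand-in for the CNN:

```python
import torch
import torch.nn.functional as F

def embed(backbone: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Map a batch of images to unit-norm embeddings f_i = F(x_i)/||F(x_i)||_2."""
    feats = backbone(x)                    # (B, D) raw features
    return F.normalize(feats, p=2, dim=1)  # project onto the unit hypersphere

# Toy check with a linear "backbone" (stand-in for the CNN F(.)).
backbone = torch.nn.Linear(32, 8)
f = embed(backbone, torch.randn(4, 32))
```

After normalization, all embeddings lie on the unit hypersphere, so cosine similarity reduces to a dot product.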

Normalized Softmax Loss with Margin
The softmax loss, also known as the cross-entropy loss, is widely applied for supervised classification:

$$\mathcal{L}_{s} = -\frac{1}{M_L}\sum_{i=1}^{M_L} \log p_i^{c_i},$$

where $M_L$ represents the number of labeled images, $c_i$ is the class of $x^L_i$, and $p_i^{c}$ represents the probability that $x^L_i$ is classified into class $c$, described by:

$$p_i^{c} = \frac{\exp(w_c^{T} f^L_i)}{\sum_{j=1}^{C}\exp(w_j^{T} f^L_i)},$$

where $w_c \in \mathbb{R}^D$ denotes the learnable weight vector associated with the class $c$. Here, the bias term is omitted for simplicity. By minimizing the softmax loss, the images from the same class are aligned with respect to the corresponding weight vector $w_c$ [54,55]. However, the similarity of intra-class images and the diversity of inter-class images cannot be explicitly enforced by the softmax loss [60]. Thus, the metric space produced via a CNN model optimized with the softmax loss cannot sufficiently capture the semantic structures among the images, especially under a semi-supervised learning framework. To overcome this limitation, we utilize the normalized softmax loss with margin to enhance the intra-class compactness and inter-class discrepancy [60]. Specifically, under the assumption that $w_c$ is normalized, i.e., $\|w_c\|_2 = 1$, the loss function can be described as:

$$\mathcal{L}_{s\text{-}m} = -\frac{1}{M_L}\sum_{i=1}^{M_L} \log \frac{\exp\big(\cos(\theta_i^{c_i} + m)/\tau\big)}{\exp\big(\cos(\theta_i^{c_i} + m)/\tau\big) + \sum_{c \neq c_i}\exp\big(\cos(\theta_i^{c})/\tau\big)},$$

where $\theta_i^{c}$ denotes the angle between the feature embedding $f^L_i$ and $w_c$, i.e., $\theta_i^{c} = \arccos(w_c^{T} f^L_i)$, $m$ is the angular margin penalty, and $\tau$ represents the temperature parameter, which regulates the level of concentration in the sample distribution [61]. Compared with the traditional softmax loss, the angular margin penalty $m$ enforces images from the same class to be closer to each other and images from different classes to be pushed away. The effect of the angular margin is illustrated in Figure 2. By minimizing the traditional softmax loss, the feature embeddings within each class are optimized to decrease their cosine distances with respect to the corresponding class prototype vector $w_c$. Therefore, they are enforced to be aligned with respect to each class prototype learned by the CNN model.
However, the image features from different classes lying around the class decision boundaries may still share some similarities. Given such a learned metric space, the out-of-sample images located near such class boundaries cannot be easily categorized. By exploiting the normalized softmax loss with margin, we encourage that the images belonging to different classes are forced to be separated with a certain angular distance, so that the semantic structure of the metric space can be better characterized by the learned CNN model.
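The normalized softmax loss with margin can be sketched in PyTorch as follows. This is a non-authoritative sketch following the ArcFace-style formulation the loss builds on: the clamping constant and the exact logit construction are implementation assumptions, not details given in the text.

```python
import torch
import torch.nn.functional as F

def normalized_softmax_margin_loss(f, labels, W, m=0.5, tau=0.05):
    """Normalized softmax loss with an additive angular margin (sketch).

    f:      (B, D) L2-normalized embeddings
    labels: (B,)   integer class labels
    W:      (C, D) learnable class-prototype matrix (normalized inside)
    """
    W = F.normalize(W, p=2, dim=1)                       # enforce ||w_c||_2 = 1
    cos = f @ W.t()                                      # cos(theta_c) for every class
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))   # clamp for numerical safety
    # Add the angular margin m only to the ground-truth class logit.
    one_hot = F.one_hot(labels, num_classes=W.size(0)).bool()
    logits = torch.where(one_hot, torch.cos(theta + m), cos)
    return F.cross_entropy(logits / tau, labels)
```

Setting m = 0 recovers the plain normalized softmax loss, so the margin's effect can be inspected by comparing the two losses on the same batch.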

High-Rankness Regularization
Although the metrics for the labeled images can be captured by using the normalized softmax loss with margin, the discrepancy between the labeled training images and the unlabeled test images could lead to a poor prediction performance under a semi-supervised learning scenario. Since the CNN model is optimized using just a small number of labelled images, the learned decision boundaries with respect to the unseen test images are often ambiguous. Moreover, when the CNN model is trained on an unbalanced dataset, a few categories typically dominate the images within mini-batches, which can degrade the prediction diversity of the trained CNN model. In order to overcome these limitations, we adopt a high-rankness regularization of the model predictions within each mini-batch, so that the optimized CNN model preserves both discrimination and diversity capabilities [62]. Specifically, given each mini-batch of unlabeled images $X^U_B$ with batch size $B$, the rank of their category prediction matrix is maximized:

$$\max \; \operatorname{Rank}(P^U_B),$$

where $P^U_B \in \mathbb{R}^{B \times C}$ is the probability matrix of the category predictions for the mini-batch, whose entries can be described by:

$$P^U_B(i, c) = \frac{\exp(w_c^{T} f^U_i / \tau)}{\sum_c \exp(w_c^{T} f^U_i / \tau)},$$

where $\sum_c(\cdot)$ denotes a summation along the category direction. Such an optimization is an NP-hard, non-convex problem. However, rank maximization can be relaxed into the maximization of the matrix nuclear norm [63][64][65][66][67][68]. Thus, the above rank maximization can be relaxed by minimizing the following loss function:

$$\mathcal{L}_{HR} = -\|P^U_B\|_*,$$

where $\|\cdot\|_*$ denotes the nuclear norm. This optimization increases the rankness of the predicted class probability matrix $P^U_B$ of each mini-batch. In the case of semi-supervised learning, classifying a large amount of unlabeled images based on class prototypes $W$ learned from a limited number of training images may not be sufficient. Thus, most feature embeddings of unlabeled images may be located around the class decision boundaries.
In other words, the predicted class probability vectors of unlabeled images from different classes are similar to each other. This leads to the low-rankness of the matrix $P^U_B$ and, inevitably, to classification ambiguities for the unlabeled images. By minimizing $\mathcal{L}_{HR}$, the feature embeddings of the unlabeled images are pushed towards the learned class prototypes $W$, and the discrimination and diversity capabilities of the CNN model on massive unlabeled images can be preserved. To this end, the proposed joint loss function for training the CNN model is formulated as:

$$\mathcal{L} = \mathcal{L}_{s\text{-}m} + \lambda \mathcal{L}_{HR},$$

where $\lambda$ balances the two terms.
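The nuclear-norm relaxation above can be sketched as follows. The softmax construction of P from the class prototypes, and the absence of any batch-size scaling on the term, are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F

def high_rankness_loss(f_u, W, tau=0.05):
    """Negative nuclear norm of the unlabeled prediction matrix (sketch).

    f_u: (B, D) L2-normalized embeddings of an unlabeled mini-batch
    W:   (C, D) class-prototype matrix (normalized inside)
    """
    W = F.normalize(W, p=2, dim=1)
    P = F.softmax(f_u @ W.t() / tau, dim=1)          # (B, C) prediction matrix
    return -torch.linalg.matrix_norm(P, ord='nuc')   # minimize -> maximize rankness
```

The joint objective would then be obtained by adding this term, weighted by λ, to the supervised loss on the labeled batch.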
Finally, the corresponding optimization algorithm is described in Algorithm 1.

Algorithm 1 Optimization for HR-S²DML

Require: $x^L_i$, $x^U_i$, and $y^L_i$
1: Initialize $\tau$, $m$, $\lambda$ and $D$
2: for $t = 0$ to maxEpoch do
3:    Sample mini-batches $X^L_B$ and $X^U_B$ from the training and test sets.
4:    Calculate $\mathcal{L}_{s\text{-}m}$ and $\mathcal{L}_{HR}$ based on $X^L_B$ and $X^U_B$, respectively.
5:    Aggregate the two loss terms into the joint loss $\mathcal{L}$.
6:    Update the CNN parameters by back-propagating $\mathcal{L}$.
7: end for
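The optimization loop can be sketched end-to-end on toy tensors as follows. The linear "backbone", dataset sizes, and single pass over the data are illustrative stand-ins (the paper trains a ResNet18 with SGD for 100 epochs), and the exact construction of the two loss terms follows the ArcFace-style and nuclear-norm formulations as assumptions:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
D, C, tau, m, lam = 16, 5, 0.05, 0.5, 1.0
backbone = torch.nn.Linear(32, D)          # stand-in for the CNN F(.)
W = torch.nn.Parameter(torch.randn(C, D))  # class prototypes
opt = torch.optim.SGD(list(backbone.parameters()) + [W], lr=1e-3)

labeled = DataLoader(TensorDataset(torch.randn(64, 32),
                                   torch.randint(0, C, (64,))), batch_size=16)
unlabeled = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=16)

for (x_l, y_l), (x_u,) in zip(labeled, unlabeled):   # sample X_B^L and X_B^U
    f_l = F.normalize(backbone(x_l), dim=1)
    f_u = F.normalize(backbone(x_u), dim=1)
    Wn = F.normalize(W, dim=1)
    # normalized softmax loss with margin on the labeled batch
    cos = f_l @ Wn.t()
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    logits = torch.where(F.one_hot(y_l, C).bool(), torch.cos(theta + m), cos)
    L_sm = F.cross_entropy(logits / tau, y_l)
    # high-rankness regularization on the unlabeled batch
    P = F.softmax(f_u @ Wn.t() / tau, dim=1)
    L_hr = -torch.linalg.matrix_norm(P, ord='nuc')
    # joint loss and parameter update
    loss = L_sm + lam * L_hr
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Each iteration consumes one labeled and one unlabeled mini-batch, matching steps 3–6 of the algorithm.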

Dataset Description
To validate the performance of the proposed semi-supervised deep metric learning approach, this work considers two benchmark RS image archives. A detailed description of these datasets is provided below:

1.
Aerial Image Dataset (AID) [69]: This dataset has been specifically designed for RS image classification and retrieval tasks. Specifically, it contains a total of 10,000 images belonging to the following 30 semantic classes: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct. Figure 3a shows some of its images for illustrative purposes. All the images have a size of 600 × 600 pixels in the RGB space, with a spatial resolution ranging from 8 to 0.5 meters, and each semantic class contains from 220 to 420 images. This collection is available online (AID: https://captain-whu.github.io/AID/).

2.
NWPU-RESISC45 [19]: This archive is a large-scale RS dataset, which is made up of 31,500 images uniformly distributed over the following 45 semantic classes: airplane, airport, baseball diamond, basketball court, beach, bridge, chaparral, church, circular farmland, cloud, commercial area, dense residential, desert, forest, freeway, golf course, ground track field, harbor, industrial area, intersection, island, lake, meadow, medium residential, mobile home park, mountain, overpass, palace, parking lot, railway, railway station, rectangular farmland, river, roundabout, runway, sea ice, ship, snowberg, sparse residential, stadium, storage tank, tennis court, terrace, thermal power station, and wetland. Figure 3b illustrates some examples of this collection. All the images have a size of 256 × 256 pixels in the RGB space, with a spatial resolution varying from 30 to 0.2 m. This dataset is also available online (NWPU-RESISC45: http://www.escience.cn/people/JunweiHan/NWPU-RESISC45.html).
In order to generate a semi-supervised learning scenario for the experimental part of the work, we randomly select, for each dataset, 5%, 10%, 15%, and 20% of the data as labeled images, leaving the remaining 95%, 90%, 85%, and 80% as unlabeled images, respectively. Note that these sets of labeled and unlabeled images also serve as training and test sets in the downstream evaluation tasks.
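One possible way to generate such splits is sketched below; the helper name and the fixed seed are arbitrary choices for reproducibility, not details from the text:

```python
import random

def semi_supervised_split(indices, labeled_ratio, seed=42):
    """Randomly split dataset indices into labeled (training) and unlabeled (test) sets."""
    rng = random.Random(seed)
    idx = list(indices)
    rng.shuffle(idx)
    n_labeled = round(labeled_ratio * len(idx))
    return idx[:n_labeled], idx[n_labeled:]

# e.g., AID (10,000 images) with 5% labeled -> 500 labeled / 9500 unlabeled
labeled, unlabeled = semi_supervised_split(range(10000), 0.05)
```

Fixing the partition once per ratio, as described in the setup, keeps the comparison across methods consistent.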
These RS archives have been selected as benchmark collections due to their challenging complexity (in terms of data volume, semantic intricacy and visual diversity) and also their widespread popularity in other related works [32,33,54]. However, alternative RS datasets with different spectral bands could be used instead by adjusting the number of channels of the considered backbone architecture to the number of bands of the input data.

Evaluation Tasks
For evaluating the effectiveness of the proposed method on the feature embedding generation, we conduct experiments related to three different RS tasks: (1) KNN classification; (2) clustering; and (3) image retrieval.

KNN Classification
Given an out-of-sample image identified by x * , its corresponding feature embedding f * can be generated using the trained CNN model F (·). By measuring the Euclidean distance between f * and the feature embeddings of the training set in the metric space, the top-K nearest neighbors can be retrieved. Then, based on the majority voting of the labels associated with the K nearest neighbors, y * can be calculated. The performance evaluation is done by calculating the overall accuracy figure of merit.
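The KNN prediction step can be sketched with NumPy as follows (a minimal illustration; ties in the majority vote are broken towards the smallest label by `argmax`):

```python
import numpy as np

def knn_classify(f_star, train_feats, train_labels, k=10):
    """Predict the label of an out-of-sample embedding by majority vote over
    its K nearest training embeddings (Euclidean distance)."""
    d = np.linalg.norm(train_feats - f_star, axis=1)  # distances to training set
    nn = np.argsort(d)[:k]                            # indices of top-K neighbors
    votes = np.bincount(train_labels[nn])             # vote count per class
    return int(np.argmax(votes))
```

Overall accuracy is then the fraction of test embeddings whose predicted label matches the ground truth.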

Clustering
The generated feature embeddings of the test set can also be evaluated by carrying out k-means clustering. If they can be perfectly clustered in the metric space, the uncovered clusters match the ground-truth semantic classes. For the performance evaluation, we exploit the Normalized Mutual Information (NMI) and the unsupervised clustering accuracy (ACC) [70]. NMI is defined by:

$$\mathrm{NMI}(Y, C) = \frac{I(Y; C)}{\frac{1}{2}\big(H(Y) + H(C)\big)},$$

where $Y$ denotes the ground-truth labels, $C$ represents the corresponding cluster assignments, and $I(\cdot\,;\cdot)$ and $H(\cdot)$ are the mutual information and entropy functions, respectively. This figure of merit quantifies the agreement between the ground-truth information and the assigned clusters. ACC is defined by:

$$\mathrm{ACC} = \max_{\mathcal{M}} \frac{1}{M_U} \sum_{i=1}^{M_U} \delta\big(l_i = \mathcal{M}(c_i)\big),$$

where $l_i$ denotes the ground-truth class, $c_i$ is the assigned cluster of image $x^U_i$, $\delta(\cdot)$ is the indicator function, and $\mathcal{M}$ represents a mapping function that finds the best correspondence between the uncovered clusters and the ground-truth classes.
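Both figures of merit can be computed with standard tooling; the sketch below assumes scipy and scikit-learn are available, with the Hungarian algorithm providing the best cluster-to-class mapping M:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_acc(y_true, y_pred):
    """Unsupervised clustering accuracy: best one-to-one mapping between
    clusters and classes, found with the Hungarian algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # co-occurrence counts
    row, col = linear_sum_assignment(-cost)   # maximize matched pairs
    return cost[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]  # relabeled but perfect clustering
acc = clustering_acc(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
```

A permutation of cluster labels leaves both scores unchanged, which is exactly why the mapping M is needed.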

Image Retrieval
Given the feature embedding of a query image, the image retrieval task aims to find the images in the dataset with high semantic similarity. Such similarity can be measured by the Euclidean or cosine distance between the feature embedding of the query image and those in the dataset. Logically, the more effective the metric learning technique, the more semantically relevant the images retrieved from its embedding space. For assessment purposes, we make use of the Precision-Recall (PR) curve, which analyzes the precision and recall metrics when varying the total number of retrieved images, and the mean average precision (MAP). The average precision (AP) is defined by:

$$\mathrm{AP} = \frac{1}{Q} \sum_{r=1}^{R} P(r)\,\delta(r),$$

where $Q$ is the number of ground-truth RS images in the dataset that are relevant with respect to the query image, $R$ is the total number of retrieved images, $P(r)$ denotes the precision for the top $r$ retrieved images, and $\delta(r)$ is an indicator function that specifies whether the $r$-th retrieved image is relevant to the query. MAP is then obtained by averaging the AP scores over all queries.
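The AP computation can be sketched as follows (here the binary relevance indicators of the ranked retrieval list are assumed to be known):

```python
import numpy as np

def average_precision(relevant, Q=None):
    """AP = (1/Q) * sum_r P(r) * delta(r), where relevant[r-1] indicates whether
    the r-th retrieved image is relevant and P(r) is precision at rank r."""
    relevant = np.asarray(relevant, dtype=float)
    Q = Q if Q is not None else int(relevant.sum())
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_r = np.cumsum(relevant) / ranks   # P(r) at every rank
    return float((precision_at_r * relevant).sum() / Q)

# retrieved list with relevant items at ranks 1, 3, and 4 (Q = 3)
ap = average_precision([1, 0, 1, 1, 0])
# P(1)=1, P(3)=2/3, P(4)=3/4 -> AP = (1 + 2/3 + 3/4) / 3
```

MAP is simply the mean of these AP values over all query images.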

Experimental Setup
As previously mentioned, the semi-supervised learning scheme is generated by randomly selecting 5%, 10%, 15%, and 20% of the datasets as labeled images (training) and the remaining samples as unlabeled data (test). After fixing these partitions for each dataset, we train the models (once per considered ratio) and perform the corresponding evaluation tasks. The clustering task is conducted on the feature embeddings of the test sets generated by the learned CNN model. For image retrieval purposes, the test set serves as the query set and the training set as the retrieval database. The proposed method is implemented in PyTorch [71]. We use ResNet18 [72] as the CNN backbone for extracting the features. It is worth noting that other CNN architectures could also be applied; we exploit ResNet18 in this paper for the sake of simplicity. The images are all resized to 256 × 256 pixels, and three data augmentation methods are adopted during training: (1) RandomGrayscale, (2) ColorJitter, and (3) RandomHorizontalFlip. For the parameters of our HR-S²DML, we set τ, m, D, and λ to 0.05, 0.5, 128, and 1.0, respectively. The Stochastic Gradient Descent (SGD) optimizer is adopted for training. The initial learning rate is set to 10⁻³, and it is decayed by 0.5 every 30 epochs. The batch size is 256, and we train the CNN model for a total of 100 epochs. For evaluating the effectiveness of the proposed semi-supervised deep metric learning, we compare it with several metric learning methods, including: (1) D-CNN [32]; (2) deep metric learning based on the triplet loss [52,53] (simply termed Triplet hereinafter); and (3) the Normalized Softmax Loss (NSL) [73]. D-CNN is one of the first works on deep metric learning for remote sensing images, where a metric learning regularizer is integrated with the cross-entropy loss for learning discriminative features.
Triplet is one of the most popular losses for deep metric learning, where a triplet of images (one positive image pair and one negative image pair) is constructed for learning the metrics. NSL is exploited for learning the class proxies based on the normalized weights within the framework of the cross entropy loss, and optimizing the metrics of the input images with respect to them. Regarding their parameter configurations, the margin parameter of the triplet loss is selected as 0.2 and the parameters of D-CNN are set to the same values as in the original paper. Additionally, the learning rates of all the compared methods are tuned to be optimal. All the experiments are conducted on an NVIDIA Tesla P100 graphics processing unit (GPU). Table 1 displays the KNN classification accuracies (%) obtained by using all the considered methods, when the percentages of the labeled images are 5%, 10%, 15%, and 20%, respectively, and K = 10.

KNN Classification
Compared with other state-of-the-art methods, our HR-S²DML achieves the best performance on the two considered benchmark datasets. As it is possible to observe, the proposed approach improves the classification accuracy by a margin of 10% and 3% with respect to NSL and D-CNN, respectively. In NSL, the normalized softmax loss is utilized without imposing a margin between the images from different classes. Thus, for a large number of unseen RS images, the class decision boundaries produced by NSL may lead to ambiguous predictions. The contrastive and triplet losses exploited in D-CNN and Triplet require sampling image pairs and triplets, whose numbers grow as $O(|X^L|^2)$ and $O(|X^L|^3)$, respectively; such a requirement cannot easily be satisfied when the CNN model is trained for a limited number of epochs. Thus, the performances of D-CNN and Triplet are limited by the dataset sampling. By enforcing the discrimination and diversity capabilities for both the labeled and unlabeled RS scenes, our HR-S²DML can better generate a low-dimensional metric space where the distances among the images are more accurately captured than with the other tested methods. Table 1. KNN classification accuracies (%) obtained by using the considered methods, when the percentages of the labeled images are 5%, 10%, 15%, and 20%, respectively, and K = 10.

Clustering
Tables 2 and 3 report the NMI and ACC scores obtained on the test sets after conducting K-means clustering on the feature embeddings generated by the different methods. It can be observed that the proposed method provides the most accurate matching between the ground-truth semantic labels and the obtained clusters. This indicates that the intra-class distances among the feature embeddings produced by our HR-S²DML are smaller, and the corresponding inter-class distances larger, than those obtained by the other tested methods, so that more test images can be accurately clustered. Moreover, in Figure 4 we display the feature embeddings projected into the 2-D space via t-SNE on the AID test set. It can be clearly seen that the intra-class compactness of HR-S²DML is higher than that of the other considered methods, and that a larger margin exists between inter-class feature embeddings. From this perspective, a higher clustering accuracy can be guaranteed by the proposed method.

Image Retrieval

Figure 5 displays the PR curves, showing the precision and recall pairs (for different numbers of retrieved images) with respect to the considered methods, when the percentage of labeled images is set to 20%. As in the previous experiments, our HR-S²DML exhibits superior retrieval performance compared to the other tested methods, particularly when the number of retrieved images increases. Therefore, the proposed method groups closer together the images with higher semantic similarities and separates the images with dissimilar patterns in the metric space. In Table 4, we report the MAP scores of the image retrieval results when the percentages of labeled images are 5%, 10%, 15%, and 20%, respectively, using R = 20. Consistently with the above observation, the proposed method obtains the best image retrieval performance for R = 20.
With a limited number of labeled images (5%), the retrieval performance of the other methods degrades significantly. In comparison, the image retrieval performance of our HR-S²DML remains more stable across the different percentages of labeled images, which indicates that the learned CNN model exhibits better generalization capability. For two query images from the benchmark datasets, Figure 6 displays the 1st, 5th, 10th, 15th, and 20th nearest neighbors retrieved by the considered methods. For example, in the results of Triplet on the AID dataset, the pattern of Playground cannot be easily distinguished from BaseballField.

Table 4. MAP scores (%) of the image retrieval results obtained by the considered methods when the percentages of labeled images are 5%, 10%, 15%, and 20%, respectively, using R = 20.
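For reference, the two evaluation metrics used above can be sketched in a few lines: the NMI score (Tables 2 and 3) measures the agreement between cluster assignments and ground-truth labels, and the MAP score (Table 4) averages per-query retrieval precision over the top-R results. These are minimal pure-Python sketches, so the exact normalizations used in the experiments (e.g., library defaults) may differ slightly.

```python
from collections import Counter
from math import log, sqrt

def nmi(labels_true, labels_pred):
    """Normalized mutual information between true labels and cluster assignments."""
    n = len(labels_true)
    pu, pv = Counter(labels_true), Counter(labels_pred)   # marginal counts
    puv = Counter(zip(labels_true, labels_pred))          # joint counts
    mi = sum(c / n * log(c * n / (pu[u] * pv[v])) for (u, v), c in puv.items())
    hu = -sum(c / n * log(c / n) for c in pu.values())    # entropy of true labels
    hv = -sum(c / n * log(c / n) for c in pv.values())    # entropy of clusters
    return mi / sqrt(hu * hv) if hu > 0 and hv > 0 else 0.0

def mean_average_precision(queries):
    """queries: per-query lists of 0/1 relevance flags for the top-R retrieved images."""
    def ap(flags):
        hits, precs = 0, []
        for k, rel in enumerate(flags, start=1):
            if rel:
                hits += 1
                precs.append(hits / k)   # precision at each rank where a hit occurs
        return sum(precs) / len(precs) if precs else 0.0
    return sum(ap(q) for q in queries) / len(queries)
```

A perfect clustering (up to a relabeling of the clusters) yields NMI = 1, while an independent assignment yields NMI = 0; likewise, a query whose relevant images all appear at the top of the ranking contributes an average precision of 1.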

Parameter Sensitivity Analysis
There are two main parameters to be set in the proposed method, namely τ and m, where τ controls the compactness of the sample distribution and m is the introduced angular margin penalty. Tables 5 and 6 display the KNN classification performances for different values of τ and m, when the percentage of labeled images is 15% and K = 10. It can be observed that the best choice of τ lies in the range from 0.05 to 0.2 for the two benchmark datasets. Moreover, optimal classification performance is achieved when m ranges from 0.2 to 0.5. This indicates the effectiveness of the proposed approach and shows that a moderate margin penalty can indeed improve the deep metric learning performance.
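To make the roles of τ and m concrete, the following is a minimal single-sample sketch of a normalized softmax loss with an angular (ArcFace-style) margin; the exact variant used in the paper may differ. Features and class weights are L2-normalized so that the logits are cosine similarities; m penalizes the angle to the target class, and 1/τ acts as the logit scale, so a smaller τ sharpens the softmax distribution.

```python
import numpy as np

def margin_softmax_loss(feat, weights, label, tau=0.1, m=0.4):
    """Single-sample sketch: feat (dim,), weights (classes, dim), label int."""
    f = feat / np.linalg.norm(feat)                            # unit-norm embedding
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = w @ f                                                # cosine logits
    theta_y = np.arccos(np.clip(cos[label], -1.0, 1.0))        # target-class angle
    logits = cos.copy()
    logits[label] = np.cos(theta_y + m)                        # angular margin penalty
    logits = logits / tau                                      # temperature scaling
    logits -= logits.max()                                     # numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[label])
```

For a feature perfectly aligned with its class weight, increasing m from 0 strictly increases the loss, which is exactly the extra pull toward intra-class compactness that the margin penalty provides.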

Discussion
Based on the experimental results from the different tasks, we can observe that the feature embeddings generated by the proposed method exhibit higher intra-class compactness and inter-class discrepancy than those produced by several state-of-the-art methods. The success of the proposed method lies in two points: (1) the precise metric learning for the limited number of labeled images; and (2) the modification of the learned class decision boundaries based on the high-rankness regularization of the unlabeled image features. Even when the percentage of labeled images is low (e.g., 5%), HR-S²DML still generates high-quality features, which benefits the training of CNN models on large-scale unlabeled RS image collections. Although the benchmark datasets investigated in this work contain only RGB bands, the proposed method can also be exploited to encode the semantic contents of multispectral or hyperspectral images; one simple way is to modify the first layer of the CNN models to accept inputs with multiple bands. In addition, the proposed loss functions can also be combined with other state-of-the-art CNN architectures for feature generation. In terms of the possible limitations of HR-S²DML, hyper-parameters such as τ and m need to be carefully tuned. From the experimental results, τ tends toward a small value (e.g., 0.05), whereas m can be set to a relatively large value (e.g., 0.4).
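As an illustration of point (2), one common way to impose high rankness on a batch of unlabeled features is to maximize the nuclear norm (the sum of singular values) of the L2-normalized feature matrix, i.e., to minimize its negative, which discourages the embeddings from collapsing onto a few directions. This is a sketch of the general idea under that assumption, not necessarily the exact regularizer used in this work.

```python
import numpy as np

def high_rankness_penalty(features):
    """features: (batch, dim) unlabeled embeddings; returns -||F||_* (negative nuclear norm)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)  # unit-norm rows
    return -np.linalg.norm(f, ord='nuc')  # minimizing this maximizes rank/diversity
```

In a joint objective, this term would be weighted and added to the supervised loss, so that labeled and unlabeled batches shape the metric space together: a batch of identical embeddings (rank one) incurs a higher penalty than a batch of mutually orthogonal ones.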

Conclusions
This paper presents a novel semi-supervised deep metric learning method specifically designed to characterize RS scenes effectively using a reduced amount of annotated data. Unlike other deep metric learning methods available in the literature, the proposed approach is able to take advantage of the potential decision boundaries of unlabeled RS images to better preserve the semantic similarities in the embedding space. To this aim, a new joint loss function is defined based on two synergistic factors that simultaneously exploit supervised and unsupervised information: (1) a normalized softmax loss with margin for the labeled data, and (2) a high-rankness regularization term for the unlabeled data. Compared with several state-of-the-art metric learning methods, the proposed method demonstrates superior performance when classifying, clustering, and retrieving RS images. The main conclusion arising from this work is the importance of adopting a semi-supervised deep metric learning scheme to alleviate the lack of annotated RS data. Under the proposed framework, the normalized softmax with margin generates a metric space with high intra-class compactness and inter-class discrepancy, whereas the high-rankness regularization preserves the discrimination and diversity capabilities on the unlabeled scenes, which greatly benefits network training on large-scale RS image collections. In the future, we plan to analyze different data modalities and to extend the proposed method to datasets annotated with multiple semantic labels. In addition, we seek to investigate the effectiveness of the Gaussian softmax [74] for discriminative feature learning instead of the utilized normalized softmax loss with margin.
Author Contributions: All authors contributed to this manuscript: Conceptualization, J.K. and R.B.; methodology and software, J.K., Z.Y. and P.G.; experiment and analysis, J.K. and R.B.; data curation, Z.Y. and X.T.; writing-original draft preparation, J.K. and R.B.; supervision and funding acquisition, X.T. and A.P. All authors have read and agreed to the published version of the manuscript.