Coarse-to-Fine Deep Metric Learning for Remote Sensing Image Retrieval

Abstract: Remote sensing image retrieval (RSIR) is the process of searching for identical areas by investigating the similarities between a query image and the database images. RSIR is a challenging task because the acquisition time, viewpoint, and coverage area depend on the shooting circumstances, which results in variations in the image contents. In this paper, we propose a novel method based on a coarse-to-fine strategy, which makes a deep network more robust to the variations in remote sensing images. Moreover, we propose a new triangular loss function that considers the whole relation within the tuple. This loss function improves the retrieval performance and is better at learning the detailed information in complex remote sensing images. To verify our methods, we experimented with the Google Earth South Korea dataset, which contains 40,000 images, using the evaluation metric Recall@n. In all experiments, we obtained better results than those of the existing retrieval training methods. Our source code and the Google Earth South Korea dataset are available online.


Introduction
As high-resolution remote sensing (RS) images have become easily accessible owing to the advancement of Internet technology and remote sensors, there is a growing interest in managing large databases for use in various domains, such as military, navigation, and delivery services. Despite the impressive performance of deep neural networks, the effective management of a large RS database is hindered by many problems caused by time difference, viewpoint, high resolution, and various contents. Success in retrieving large amounts of RS images is an important starting point for effectively managing large volumes of RS data [1][2][3].
Remote sensing image retrieval (RSIR) refers to the search for and return of images of interest in a large database of images [4][5][6][7][8][9][10]. An RSIR system consists of a feature extraction unit and a similarity comparison unit. The two units are closely related, and both determine the success or failure of the system. Feature extraction is a major step in the retrieval process of RS images, as it summarizes images into high-dimensional features. The quality of the extracted features representing the images determines the success or failure of the system. Afterward, the summarized high-dimensional features are compared to determine the similarities between images. Generally, the similarities between the features are expressed as the Euclidean distances between them.
Many traditional computer vision methods have been studied to solve real-world tasks [11][12][13][14][15][16][17][18][19][20][21][22][23][24]. These computer vision methods [25][26][27][28] are also widely used to solve retrieval problems. However, they are time inefficient and have low accuracy, as they do not effectively reflect the similarities between images. With the enhancement of computing power and the availability of large amounts of data, deep learning has gained attention in various areas of computer vision [29][30][31][32]. Accordingly, attempts have been made to apply deep learning in the field of image retrieval [8,10,33].
However, these methods are specific to general image retrieval and are not effective in RSIR. RS images are large and are taken at high altitudes compared with ordinary images. As a result, there are many small objects in one image, making the structure highly complex [6,34]. To resolve this problem, the RS community has introduced various approaches. Yang et al. [35] published a 21-class scene classification dataset called the "UC Merced land use dataset," and Xia et al. [36] published a 30-class scene classification dataset called the "Aerial Image Dataset (AID)". These two datasets have become the most popular RSIR datasets.
However, these datasets and methods are used for image retrieval that is oriented toward scene classification. Buildings in different locations are considered the same under the label "building", and airports in different locations are considered the same under the label "airport". Therefore, classification-oriented learning is not appropriate for image retrieval of identical regions. As a result, attempts were made to use Global Positioning System (GPS) locations for remote sensing image retrieval. One study [37] proposed a method to retrieve aerial images from a ground-view image and estimate the GPS location. In our work, we focus on the task of retrieving remote sensing images from an input aerial image.
Content variation exists owing to the conditional differences between the database images and the query images. For images of the same region, the contents of the images change greatly according to the variation in time, season, and shooting range. For images of different regions, the contents of the images can appear similar. These two factors have a great impact on the degradation of image retrieval performance.
To overcome the aforementioned limitations, we need to consider the relationship between the images of each region when searching for images with identical regions. Deep metric learning is a method of training a network to learn the relationship between predefined images. Contrastive loss [38,39] and triplet loss [5,40,41,42] functions are representative examples of deep metric learning. These methods define tuples of two or three images and learn the binary relationships between the images within the tuple. Deep metric learning avoids the suboptimality of using a truncated network that was trained for image classification, and enables training directly for retrieval. It also connects the feature extraction and the similarity comparison stages, in order to achieve higher performance. In addition, the recently introduced log ratio loss function [43] goes beyond the binary relationship of previous metric learning and uses the continuous relationships between the images to effectively train the network.
Figure 2. Illustration of the feature embedding concept. First, the regions between images are distinguished in a binary manner, which denotes the same region or not. Then, the degree of similarity is defined as a label, which indicates how much the contents of the images overlap each other. The distance of the features is rearranged according to the similarity.
We propose a framework based on deep metric learning to search the database for images that cover the same region as a query image. We deal with the situation in which the query image and the database image share 50% to 100% of their area. Our purpose of learning is to embed features, as shown in Figure 2. The framework is trained using the coarse-to-fine method. In the coarse step, the network learns binary information for regional differentiation. It is trained to extract similar features for the same region despite changes within that region. Images with time differences at exactly the same location are used for training. By learning the similarities between these images, the network can extract features that are robust to time and season variations. In the fine step, the network learns to extract features that depend on the similarity of contents between the images. We use three images with varying parallel shifts in one region and define the similarities using continuous values. The similarities are calculated by comparing how much the regions of the images overlap one another. Moreover, we propose a new loss function for the fine step. Whereas existing methods consider only part of the relationships within the tuple, this function considers all of them. By learning the entire relationship within the tuple, the proposed function achieves a more delicate embedding and thus improved RSIR. Robust feature extraction is possible even when the shooting range differs.
The main contributions of this paper are as follows:
• We propose a coarse-to-fine deep metric learning method for RSIR. In the first step, the network learns the binary relationship and is then retrained to learn the continuous relationship in the second step. The proposed method is end-to-end trainable using the similarity between images.
• We introduce a new loss function to learn continuous information. The triangular ratio loss function considers all the detailed relationships within the tuples, and, therefore, it is appropriate for highly complex RS images that contain many objects.
• Using the intersection-over-union (IOU) ratio between images, we confirmed that it is effective to define a continuous similarity between images.

Image Retrieval for the Conventional CNN Method
Image retrieval using the conventional CNN method is performed as follows [7]: (i) Train the classification task in the conventional CNN structure. (ii) Extract the features from the convolutional layer or fully connected layer of the trained network. The convolutional layer abstracts the image information through window sliding, using shared filter values, and converts it into feature map information. The outputs of the convolutional layer are the feature maps of the image. These feature maps are known to have spatial information and local information about the image. The feature maps are flattened to obtain a set of feature vectors. Let n and m be the number and the size of the feature maps, respectively. The descriptor can be defined as
$$X = [x_1, x_2, \ldots, x_m],$$
where $x_i$ ($i = 1, 2, 3, \ldots, m$) is an n-dimensional feature vector. However, the dimension of this descriptor is so large that it is difficult to summarize it efficiently for use in image retrieval. The fully connected layer generates a high-level feature by considering all relationships within the feature summarized through the convolutional layer. The fully connected layer feature is known to lose spatial information and local information because the nodes in the current layer are linked to all nodes in the previous layer. However, the fully connected layer summarizes the entire contents of the image and reduces the dimension of the features. Thus, it is effective in summarizing large images concisely. The feature extractor using the fully connected layer is shown in Figure 3.

Image Retrieval for the Deep Metric Learning Method
The use of the conventional classification CNN method for image retrieval has problems, however. The retrieval process consists of two stages: feature extraction and similarity comparison, as shown in Figure 1. The previous methods do not have these two stages connected. Moreover, the purpose of the training is not for retrieval because only part of the network trained on classification is used as a feature extractor. To overcome these problems, deep metric learning trains the network through the Euclidean distance between features, which is a way to compare the similarity.
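The similarity comparison stage described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that features are plain lists of floats and that the database is a dictionary from image identifiers to features. The names `squared_euclidean` and `rank_database` are hypothetical.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def rank_database(query_feat, database):
    """Return database image ids sorted from most to least similar.

    database: dict mapping image id -> feature vector (assumed layout).
    A smaller distance in feature space is treated as a higher similarity.
    """
    scored = [(squared_euclidean(query_feat, feat), image_id)
              for image_id, feat in database.items()]
    scored.sort()
    return [image_id for _, image_id in scored]


if __name__ == "__main__":
    db = {"x": [0.0, 0.0], "y": [1.0, 1.0], "z": [0.2, 0.1]}
    print(rank_database([0.0, 0.1], db))
```

Because the ranking depends only on feature distances, training the network directly on those distances ties the two stages together.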
Sun et al. [39] proposed the adoption of a metric learning function to a deep network. They trained a deep network through the sample relationship rather than through the class label unit. They utilized a traditional contrastive loss [38] function for the deep network. The authors trained the network by using the relationship between the samples. If two samples had the same attributes, they minimized the Euclidean distance between these samples, and if two samples had different attributes, they maximized the Euclidean distance between these samples. Using this method, they eliminated the need to match the number of nodes in the fully connected layer with the number of labels and solved the suboptimal problem of using only part of the network. Therefore, the contrastive loss function has achieved excellent results in the deep learning field. However, this method does not consider how close/far the samples are to/from each other.
Hoffer and Ailon [42] proposed a triplet loss function. In triplet loss, the distance between samples with the same attributes is defined as the positive distance, whereas the distance between samples with different attributes is defined as the negative distance. Using these distances, the function trains the network so that the negative distance is greater than the positive distance in the feature space. Contrastive loss uses the absolute distance of the features, whereas triplet loss uses the relative distance of the features for training. Therefore, in general cases, triplet loss is known to perform better than contrastive loss in the deep learning field. However, defining relationships as binary limits the representation of attributes to just two: they are either the same or different. In addition, network performance varies greatly depending on the sampling method, and the network is difficult to train because of the margin hyperparameter, which specifies how much larger the negative distance must be than the positive distance.
As a result, Kim et al. [43] proposed a log ratio loss function that trains the network using a continuous relationship. The log ratio loss function defines the similarity between images as a continuous value between 0 and 1. Then, the network is trained so that the ratio of the distances between the features follows the ratio of the defined continuous similarities. Compared with the previous loss functions, this function can search for images with more similar content by learning the relationship between images as a continuous variable, and it does not require hyperparameters.
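The margin-based comparison of positive and negative distances can be illustrated with a short sketch. It assumes the common hinge form max(0, D_pos - D_neg + margin) with squared Euclidean distances, which is not necessarily the exact variant used in [42] or in the experiments of this paper. The function names and the default margin are hypothetical.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_anchor, f_pos, f_neg, margin=1.0):
    """Hinge-style triplet loss (assumed standard form).

    The loss is zero once the negative distance exceeds the positive
    distance by at least `margin`; otherwise the shortfall is penalized.
    """
    d_pos = squared_euclidean(f_anchor, f_pos)
    d_neg = squared_euclidean(f_anchor, f_neg)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    # Negative is far away: the constraint is satisfied, so the loss is zero.
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0]))
    # Negative is closer than the positive: the loss is positive.
    print(triplet_loss([0.0, 0.0], [2.0, 0.0], [1.0, 0.0]))
```

Only the difference between the two distances matters here, which is why the triplet loss is described as using relative rather than absolute distances.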

Coarse-to-Fine Deep Metric Learning
In this section, we describe a novel network architecture for RSIR. Training is done in two steps using the coarse-to-fine method. In the coarse step, binary information is taught to distinguish the difference of regions. The purpose of this training step is to extract features as similar as possible in identical regions and as different as possible in different regions while ignoring the changes in time instances between the same regions. However, the network will not have the ability to properly distinguish the differences between images of similar regions and identical regions just by this step. Therefore, additional fine learning is necessary to identify the variations of same location regions. In the fine step, the continuous information is learned to realize how much the regions are related. The purpose of this step is to extract features that depend on the similarity of contents between the images. The network is trained according to the change in contents owing to the variation in time instances and parallel shifts of the images with the same region. Once the learning has been completed through these two steps, retrieval can be performed robustly at different time instances or coverage areas between the query image and the database images.
The overall network is a conventional CNN whose last layer is modified. ResNet-34 [30] uses a global average pooling layer to synthesize the channel's information into the average value of the entire feature map. However, in the case of remote sensing images, spatial information is important, because there are many local features and various objects. Therefore, we applied cross-channel pooling [48] to our network. An illustration of the process is shown in Figure 4. Cross-channel pooling is a type of pooling that reduces the dimension of a feature; however, it works differently from conventional maximum pooling and average pooling. Conventional maximum and average pooling methods compress the information within each feature map. They preserve the number of channels but reduce the width and height of the feature maps. In contrast, cross-channel pooling preserves the width and height but reduces the number of channels by convolving the elements at the same locations of each channel. Through cross-channel pooling, we can reduce the dimension of the features and retain important spatial and local information about the remote sensing images. The fully connected layer outputs a vector with a dimension of 512, which is the same as that in conventional ResNet-34. Moreover, we initialized the network with weights pretrained on the ImageNet [49] dataset. The overall architecture is shown in Figure 5.
In addition, the process of defining the tuple is known to be important in metric learning. In the coarse step, we define image i as an image of a location whose longitude and latitude are exactly the same as those of the anchor image, and we define image j as an image of a different location from that of the anchor image. In the fine step, we create three images so that the IOU region overlap is at least 26% for any two of the three images. Through this process, the network can learn automatically without any human annotation.
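To make the contrast with spatial pooling concrete, the following sketch treats cross-channel pooling as a weighted sum over channels at each spatial location, which is equivalent to a 1 × 1 convolution. This interpretation and the fixed weights are assumptions made for illustration. In the actual network the weights would be learned, and the function name `cross_channel_pool` is hypothetical.

```python
def cross_channel_pool(feature_maps, weights):
    """Reduce the channel count while keeping height and width.

    feature_maps: nested list with shape [channels][height][width].
    weights: nested list with shape [out_channels][channels]; each row
             mixes all input channels at a single spatial location.
    Returns a nested list with shape [out_channels][height][width].
    """
    channels = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    pooled = []
    for w_out in weights:
        out_map = [
            [sum(w_out[c] * feature_maps[c][y][x] for c in range(channels))
             for x in range(width)]
            for y in range(height)
        ]
        pooled.append(out_map)
    return pooled


if __name__ == "__main__":
    maps = [
        [[1.0, 2.0], [3.0, 4.0]],      # channel 0
        [[10.0, 20.0], [30.0, 40.0]],  # channel 1
    ]
    # Two input channels are merged into one; the 2 x 2 layout is preserved.
    print(cross_channel_pool(maps, [[0.5, 0.5]]))
```

Global average pooling would instead collapse each 2 × 2 map to a single number and discard where in the image each response occurred.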

Step 1: Coarse Deep Metric Learning
For training, we constructed data tuples with three images involved. The data tuples consisted of two images from the same region and one from another region. Images from the same region had a time difference of approximately 1 year. The data tuple example of Step 1 is shown in Figure 6. Moreover, the dense triplet mining method [43] was used for efficient learning by organizing images from another region. This method is a way of defining the data tuples for effective metric learning. Using this method, we composed image j, which had the closest Euclidean distance to the anchor image for each mini-batch B.
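The selection of image j within a mini-batch can be sketched as below. This only illustrates the rule stated above (pick the image from another region that is closest to the anchor in feature space) and is not a reproduction of the sampling procedure in [43]. The batch layout and the name `closest_negative` are assumptions.

```python
import math


def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def closest_negative(anchor_feat, anchor_region, batch):
    """Return (index, distance) of the nearest image from another region.

    batch: list of (region_id, feature) pairs for one mini-batch
           (assumed layout). Returns (None, inf) if no other region exists.
    """
    best_index, best_dist = None, math.inf
    for index, (region, feat) in enumerate(batch):
        if region == anchor_region:
            continue  # same region, so it cannot serve as image j
        dist = squared_euclidean(anchor_feat, feat)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index, best_dist


if __name__ == "__main__":
    batch = [("A", [0.1, 0.0]), ("B", [5.0, 5.0]), ("C", [1.0, 0.0])]
    print(closest_negative([0.0, 0.0], "A", batch))
```

Choosing the nearest image from another region gives the network the most confusable example available in the batch.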
In Step 1, the network is taught end to end using contrastive loss. Through contrastive loss learning, the network can distinguish information between the regions. As a result, the network extracts more similar features for the same region and more different features for different regions. However, this type of training leads to poor performance when the database image and query images do not have completely identical regions, and additional training must be done to resolve this problem.

Contrastive Loss
Contrastive loss [38,39] is the basic loss function for deep metric learning. This function is appropriate for distinguishing a given pair of images. For training, a data pair $\{I_i, I_j\}$ is required. The contrastive loss function minimizes the distance in the embedding space when the labels are the same and separates the features by a fixed margin when the labels are different. When the contrastive loss training is done, features with the same properties are embedded into the same space, although they have different visual structures. This has been a drawback for other tasks; however, we were able to use it to cope with the variation in content over time that occurred in the same region. The loss function is defined as follows:
$$L_{ct}(I_i, I_j) = y_{i,j}\, D(f_i, f_j) + (1 - y_{i,j})\,[m - D(f_i, f_j)]_+$$
Here, $I_i$ and $I_j$ are the image pair, and $f_i$ and $f_j$ denote the features of each image. $y_{i,j}$ is the indicator of the label: if two images have the same label, $y_{i,j}$ becomes 1, and 0 otherwise. $[\,\cdot\,]_+$ is the hinge function, $D(\cdot)$ is the squared Euclidean distance between two features, and $m$ is the margin parameter.
Originally, this loss function was structured to learn from two random samples and their labels. However, because our data tuple consists of three images, the loss function was defined as follows:
$$L_{ct}(I_a, I_i, I_j) = D(f_a, f_i) + [m - D(f_a, f_j)]_+$$
Here, $[I_a, I_i, I_j]$ is an image tuple, and $f_a$, $f_i$, and $f_j$ are the features of each image. The hinge function, the distance $D(\cdot)$, and the margin $m$ are the same as in the pairwise form.
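A minimal sketch of the pairwise and tuple forms of the contrastive loss is given below. It assumes the standard formulation built from the quantities named in the text (hinge function, squared Euclidean distance, margin), without any additional scaling constant. The function names and the default margin are hypothetical.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_pair_loss(f_i, f_j, same_label, margin=1.0):
    """Pairwise contrastive loss (assumed standard form).

    Same label: penalize the distance itself.
    Different label: penalize only if the distance is below the margin.
    """
    dist = squared_euclidean(f_i, f_j)
    if same_label:
        return dist
    return max(0.0, margin - dist)


def contrastive_tuple_loss(f_a, f_i, f_j, margin=1.0):
    """Tuple form: image i shares the anchor's region, image j does not."""
    return (contrastive_pair_loss(f_a, f_i, True, margin)
            + contrastive_pair_loss(f_a, f_j, False, margin))


if __name__ == "__main__":
    anchor, same_region = [0.0, 0.0], [0.5, 0.0]
    print(contrastive_tuple_loss(anchor, same_region, [0.6, 0.0]))  # near j
    print(contrastive_tuple_loss(anchor, same_region, [3.0, 0.0]))  # far j
```

Once image j lies beyond the margin, only the anchor-to-i distance contributes, so the loss keeps pulling images of the same region together regardless of their time difference.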

Step 2: Fine Deep Metric Learning
In the previous step, the network learns the binary relationship that separates the same and different regions. Even if the same area is taken, changes in the shooting range and time instance cause a variation in image contents. Moreover, remote sensing images may have similar contents, even though different regions are captured. With only the first step, the relationship within the region is not learned and the exact regions cannot be retrieved. In addition, the network responds sensitively to small image changes in the same region. To solve this problem, we further train the network to learn the similarity within the region. For the fine step, a tuple with three images is necessary. Three images are extracted from two images of the same area at different times. In each image, one or two crops are performed to obtain three images. The data tuple example of Step 2 is shown in Figure 7. Moreover, the IOU overlaps that measure the amount of common areas between the images are generated. The network is trained through the IOU values obtained.
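The IOU overlap between two crops can be computed as in the sketch below. It assumes that each crop is an axis-aligned rectangle given as (x0, y0, x1, y1) in a shared pixel coordinate system. The function name `iou` and the example shift are illustrative and are not taken from the dataset construction.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    crop = (0, 0, 1080, 1080)
    shifted = (270, 0, 1350, 1080)  # same size, shifted 270 px to the right
    print(iou(crop, crop))
    print(iou(crop, shifted))
```

For two equally sized crops, a purely horizontal shift of a quarter of the width leaves an IOU of 0.6, so moderate parallel shifts already produce a wide range of continuous similarity values.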

Triangular Loss
The triangular ratio loss function is a modification of the log ratio loss function. The log ratio loss function learns the relationship between the anchor and the other two images using the continuous label. The log ratio loss function is defined as follows:
$$L_{lr}(a, i, j) = \left\{ \log \frac{D(f_a, f_i)}{D(f_a, f_j)} - \log \frac{D(y_a, y_i)}{D(y_a, y_j)} \right\}^2$$
Here, $f_a$ is the feature for the anchor image ($I_a$), $f_i$ is the feature for image i ($I_i$), and $f_j$ is the feature for image j ($I_j$). $D(f_\alpha, f_\beta)$ is the feature distance (squared Euclidean distance) between two samples $(\alpha, \beta)$. Moreover, $D(y_\alpha, y_\beta)$ is the label distance between two samples $(\alpha, \beta)$.
The log ratio loss function has the advantage of being able to learn the continuous relationship, compared with the previous functions that learn the binary relationship. By approximating the ratios between continuous label distances instead of the binary label, the log ratio loss function can learn more accurately in a feature space. Moreover, ideally, the distance between two images in the feature space will be proportional to their label distance. However, because only two of the relationships among the three images are used for training, it has the disadvantage of not considering the relationship between samples i and j.
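The behavior of the log ratio loss can be checked numerically with the sketch below. It assumes the form from [43], in which the log ratio of two feature distances is matched to the log ratio of the corresponding label distances and the difference is squared. The small constant `eps` and the argument names are assumptions added for numerical safety and readability.

```python
import math


def log_ratio_loss(d_ai, d_aj, y_ai, y_aj, eps=1e-12):
    """Log ratio loss for one tuple (assumed form).

    d_ai, d_aj: feature distances from the anchor to images i and j.
    y_ai, y_aj: label distances from the anchor to images i and j.
    eps: small constant that guards against log(0) (an added assumption).
    """
    feature_term = math.log(d_ai + eps) - math.log(d_aj + eps)
    label_term = math.log(y_ai + eps) - math.log(y_aj + eps)
    return (feature_term - label_term) ** 2


if __name__ == "__main__":
    # Feature distances are proportional to label distances: loss near zero.
    print(log_ratio_loss(2.0, 4.0, 0.1, 0.2))
    # Equal feature distances although the labels differ: loss is positive.
    print(log_ratio_loss(1.0, 1.0, 0.1, 0.2))
```

The distance between images i and j never enters this computation, which is the limitation discussed above.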
These problems arise from the special role of the anchor image. If learning is done in a single step, the definition of the anchor image determines the order of the images being trained. Moreover, the anchor also plays a key role in bridging the relationship between the other two images in this loss function. However, this consideration is not necessary in our coarse-to-fine training. Thus, instead of treating the anchor specially, we added the relationship between images i and j. We let every image in the tuple act as a key factor in the function and modify the formula accordingly. Our triangular ratio loss is defined as follows:
$$L_{Tgr}(a, i, j) = \mathrm{mean}\{L_{lr}(a, i, j),\ L_{lr}(i, a, j),\ L_{lr}(i, j, a)\}.$$
Comparing this function to the log ratio function, we observe that the log ratio function directly matches the relationship between a and i to the relationship between a and j. However, our proposed function is taught like the length ratio of a triangle, by considering all relationships in the three images.
The main advantage of the triangular ratio loss function is that it considers all relationships within the tuple regardless of the order of the anchor and the other images. The log ratio loss function does not train the relationship between i and j. The proposed loss function trains the relationships covered by the log ratio loss function and also the relationship between i and j, leading to a more precise embedding. Learning with this function prevents the distance between i and j from diverging during the continuous learning, so that images of the same region are embedded more closely to each other.
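The effect of including the relationship between i and j can be seen in a small sketch. The argument orders of the three terms follow the triangular ratio formula exactly as printed in the text, and the log ratio term assumes the form from [43]. The helper `symmetric`, the distance tables, and all numeric values are hypothetical.

```python
import math


def symmetric(table):
    """Wrap a dict keyed by frozenset pairs as a symmetric distance lookup."""
    return lambda p, q: table[frozenset((p, q))]


def log_ratio(d_feat, d_label, anchor, first, second):
    """Log ratio term with `anchor` as the reference image (assumed form)."""
    feature_term = math.log(d_feat(anchor, first) / d_feat(anchor, second))
    label_term = math.log(d_label(anchor, first) / d_label(anchor, second))
    return (feature_term - label_term) ** 2


def triangular_ratio_loss(d_feat, d_label, a, i, j):
    """Mean of three log ratio terms, ordered as in the printed formula."""
    terms = [
        log_ratio(d_feat, d_label, a, i, j),
        log_ratio(d_feat, d_label, i, a, j),
        log_ratio(d_feat, d_label, i, j, a),
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    ai, aj, ij = frozenset("ai"), frozenset("aj"), frozenset("ij")
    feat = symmetric({ai: 2.0, aj: 4.0, ij: 6.0})
    # Label distance between i and j does not match the feature geometry.
    label = symmetric({ai: 1.0, aj: 2.0, ij: 6.0})
    print(log_ratio(feat, label, "a", "i", "j"))              # anchor-only view
    print(triangular_ratio_loss(feat, label, "a", "i", "j"))  # all three sides
```

In this example the anchor-based log ratio term is zero because the two anchor distances have the right ratio, whereas the triangular ratio loss is positive because the i-j distance is inconsistent with its label distance.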

Implementation Details
For the implementation of our network, we utilized the ResNet-34 architecture [30]. We used a model pretrained on the ImageNet dataset [49] and fine-tuned it to extract the features suitable for RSIR. There is no remote sensing image dataset available for searching for identical regions; therefore, we generated training image pairs and retrieval image pairs from the Google Earth South Korea dataset. A detailed description of the data is given below. The network was trained end to end, and all parameters of the network were fine-tuned. In training the fine step, we set a sufficiently small learning rate, because a learning rate that is too large undoes what was learned in the coarse step. We trained the network for 100 epochs with a batch size of 20, which took approximately 2 days on a single GPU (GTX 1080Ti) and CPU (Intel i7-8700) for each step. Retrieval took an average of 1.7 seconds per image: 1.3 seconds for feature extraction and 0.4 seconds for similarity comparison (for 10,000 database images). Our source code and the Google Earth South Korea dataset are available online at https://github.com/sioot.

Google Earth South Korea Dataset
We created a multi-temporal dataset for RSIR using Google Earth [50] images, which were captured by the Landsat 7 and 8 satellites. We acquired three-band (RGB) images from South Korea, excluding the islands. South Korea is a challenging research area because various environmental changes exist, depending on the four seasons, and also because of the existence of various regions, such as cities, mountains, agricultural lands, and the sea. Moreover, the images were geographically registered; however, they were not orthophotos, and, therefore, tall buildings looked different at different time instances. There were also color variations. We divided all satellite images into 1080 × 1080 images. We gathered satellite images that were taken in 2016, 2017, 2018, and 2019, so that each area was covered by four images taken at different time instances. Of the 120,000 images, we excluded those that were ambiguous (e.g., mostly composed of sea, cloud, river, or forest) and kept 40,000 images. Figure 8 shows a few examples of these images.

Experimental Results
To assess our method and experiments, we followed the standard retrieval and place recognition evaluation procedure [9,25,51,52,53]. The query image was deemed to have been correctly retrieved if at least one of the top n retrieved database images overlapped at least 50% of the contents of the query image. The percentages of correctly retrieved queries (recall) were estimated when n = 1, 5, 10, and 100. The extracted feature dimension was 512. We conducted additional experiments with feature dimensions of 128, 1024, and 3113 to evaluate the performance at various dimensions. The feature dimension of 128 caused a large drop in performance, whereas feature dimensions greater than 512 were excluded because they did not bring any clear improvement in performance.
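The Recall@n criterion described above can be sketched as follows. The data layout (a ranked list of database ids per query) and the `is_correct` callback, which would encode the overlap criterion, are assumptions made for illustration. The function name `recall_at_n` is hypothetical.

```python
def recall_at_n(ranked_results, is_correct, n_values=(1, 5, 10, 100)):
    """Percentage of queries with at least one correct image in the top n.

    ranked_results: dict mapping query id -> list of database ids, sorted
                    from most to least similar (assumed layout).
    is_correct: callable (query_id, db_id) -> bool that encodes the
                overlap criterion.
    """
    recalls = {}
    for n in n_values:
        hits = sum(
            1 for query_id, ranked in ranked_results.items()
            if any(is_correct(query_id, db_id) for db_id in ranked[:n])
        )
        recalls[n] = 100.0 * hits / len(ranked_results)
    return recalls


if __name__ == "__main__":
    ranked = {"q1": ["d1", "d2", "d3"], "q2": ["d9", "d2", "d7"]}
    truth = {("q1", "d1"), ("q2", "d7")}
    print(recall_at_n(ranked, lambda q, d: (q, d) in truth, n_values=(1, 3)))
```

A query counts as a hit as soon as one correct database image appears among its top n results, so the recall can only stay the same or increase as n grows.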

Quantitative Experiments
We used half of the images for learning (2016 and 2017) and the other half for retrieval (2018 and 2019). We divided the learning set into the training set (90%) and the validation set (10%). In the retrieval set, we used half as the database (2018) and part of the other half as query images (2019). To select the query images, we randomly sampled 500 different regions. Then, we excluded 111 regions where the contents of the areas were too ordinary to be distinguishable. As a result, image retrieval was conducted with a total of 389 query images. Moreover, each query image had an IOU overlap of more than 50% with the corresponding database image.

• The Conventional Learning Methods
To add objectivity to our experiments, we used two types of tuples in the conventional methods. Training was done with each of the loss functions. The first type consisted of an anchor image, a positive image with at least a 50% IOU overlap with the anchor image, and a negative image in a different region from that of the anchor image. The first type of tuple is marked with "†" in Table 1. The second type consisted of an anchor image, a positive image in exactly the same region as that of the anchor image, and a negative image in a different region from that of the anchor image. We used the transformed triplet loss of Fan et al. [54] for the comparison, because the basic form has performance degradation problems. The results for the first tuple type showed that the overall performance improved for both triplet loss and contrastive loss compared with that of the baseline model. In the detailed comparison of the two loss functions, triplet loss scored 18.3% for r@1, performing better than contrastive loss, which scored 8.7%. However, triplet loss scored only 40.1% for r@100, performing worse than contrastive loss, which scored 64.5%. For the second tuple type, the overall improvement in performance was greater for contrastive loss than for triplet loss. Looking into the results for each type of tuple, we observed that the second type showed significantly better performance, because exactly identical regions were matched between images by training with deep metric learning.
We also conducted an ablation study to see the effects of cross-channel pooling. Table 2 shows the performance evaluation results of excluding cross-channel pooling from the method used in Table 1. Most of the loss functions showed a noticeable drop in performance. This shows that cross-channel pooling effectively preserves the spatial information and has a positive effect on aerial image retrieval performance.
• Coarse-to-Fine Learning Method
We conducted an experiment with additional learning on top of the first coarse deep metric learning model. From the results of the coarse deep metric learning, we observed that using the second type of tuple, which defines the positive image as an image in the exact same region as that of the anchor image, led to better performance.
Table 3 shows the retrieval results using the coarse-to-fine method. We can confirm the effectiveness of the coarse-to-fine training. Performance improved in the triplet-loss-based model and in the contrastive-loss-based model.
Table 3. Result of the coarse-to-fine learning method trained on the Google Earth South Korea dataset (Recall@n). Columns: Step 1, Step 2, and Recall@n (%) for n = 1, n = 5, n = 10, and n = 100.
The performance improved more for the former model than that for the latter model. Triplet loss forces the distance between the anchor and the positive to be smaller than that between the anchor and the negative. Contrastive loss maximizes the distance between the anchor and the negative, while minimizing the distance between the anchor and the positive. Therefore, contrastive loss identifies the distance between features more strictly than the triplet loss. In addition, it was shown that coarse-to-fine learning resulted in better performance when trained with strictly defined features. Moreover, the overall performance showed greater improvement when using triangular loss than when using log ratio loss. Figure 9 shows examples of the coarse model and coarse-to-fine model retrieval results based on contrastive loss with the best performance.
We integrated the steps of the proposed method in Table 4 and evaluated the performance after training. The overall recall degraded when the training was conducted in this integrated manner. This shows that the proposed two-step approach is more effective than the integrated approach.
• Comparison with State-Of-The-Art
As shown in Table 5, we compared various SOTA methods in the field of image retrieval with our coarse-to-fine method. The models used for comparison were LDCNN [7], R-MAC descriptor [45], NetVLAD [51], triplet loss, and contrastive loss. These methods are widely used in image retrieval. LDCNN uses the low-level features of a CNN through mlpconv layers. The R-MAC method identifies the activations of the convolutional feature maps and uses the features of the parts determined to be important. LDCNN and R-MAC are classification-oriented methods; therefore, a classification-labeled dataset is needed to train these two models. We used the AID dataset, whose characteristics (altitude, picture quality, etc.) are most similar to those of the Google Earth South Korea dataset. The NetVLAD method is primarily used in a smartphone environment for GPS location image retrieval and is robust to light changes and obscuring objects. NetVLAD was trained under the same conditions as those used for coarse learning, with dense triplet mining as the sampling method. The backbone network used for the comparison methods was ResNet-34, except for LDCNN, which used the VGG-16 [34] network because of the number of parameters. As a result, LDCNN used 490-dimensional feature vectors and the other models used 512-dimensional feature vectors. We confirmed that the coarse-to-fine method shows a noticeable difference in performance compared with the other SOTA methods.
Figure 9. Examples of the retrieval results on the Google Earth South Korea dataset. Each row corresponds to one test case: the query is shown in the first column, the baseline in the second column, the coarse model in the third column, and the coarse-to-fine models in the fourth and fifth columns. Our trained network shows relatively stable retrieval results despite the variations in time instance and shooting range.
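For readers who wish to reproduce comparisons of this kind, the following is a minimal sketch of how Recall@n can be computed from feature vectors. It assumes the convention common in retrieval work, in which a query counts as correct if at least one true match appears among its n nearest database features under the Euclidean distance. The function name `recall_at_n`, the `positives` structure, and the toy data are hypothetical.

```python
import numpy as np


def recall_at_n(query_feats, db_feats, positives, ns=(1, 5, 10, 100)):
    """Fraction of queries with at least one true match in the top-n results.

    query_feats: (Q, D) array, db_feats: (N, D) array,
    positives: list of sets; positives[q] holds the database indices
    that are correct matches for query q (assumed ground-truth format).
    """
    # Pairwise Euclidean distances, shape (Q, N).
    diff = query_feats[:, None, :] - db_feats[None, :, :]
    dists = np.sqrt((diff ** 2).sum(axis=-1))
    ranking = np.argsort(dists, axis=1)  # nearest database items first
    result = {}
    for n in ns:
        hits = [bool(set(ranking[q, :n].tolist()) & positives[q])
                for q in range(len(positives))]
        result[n] = float(np.mean(hits))
    return result


# Toy example with three database features and two queries (assumed values).
db = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
queries = np.array([[0.1, 0.0], [0.9, 0.2]])
truth = [{0}, {2}]  # the second query's true match is ranked last
scores = recall_at_n(queries, db, truth, ns=(1, 2, 3))
print(scores)
```

The brute-force distance matrix keeps the sketch short; for a database of 40,000 images, an approximate or batched nearest-neighbor search would be a more practical choice.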

Visualization of the Feature Descriptor
We compared the feature distribution of a query image with that of the corresponding database image. The database image and the query image shared the same location. However, the two images had field variations due to the difference in time instance and content-wise variations due to the difference in shooting range. To retrieve successfully under these circumstances, the network must recognize the two images as identical. To observe how the baseline network and the network trained by the proposed method interpreted the two images, we visualized their feature descriptors. Of the 512 feature dimensions of each network, we plotted every fourth dimension. As shown in Figure 10, the feature descriptors of the baseline network overlap less for the two images, whereas those of the proposed network overlap considerably more. This comparison indicates that the proposed network recognizes the two images as the same region.
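The subsampling used for this visualization can be sketched as follows. The descriptors here are synthetic stand-ins generated at random, the starting index of the subsampling (0) is an assumption, and the mean absolute gap is only a hypothetical numeric proxy for the visual overlap discussed above; it is not a measure reported in our experiments.

```python
import math
import random

DIM, STEP = 512, 4  # 512-dimensional descriptor, every fourth dimension plotted

rng = random.Random(0)
# Synthetic stand-ins for descriptors (not real network outputs).
database_desc = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
similar_query = [v + rng.gauss(0.0, 0.05) for v in database_desc]
unrelated_query = [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def subsample(desc, step=STEP):
    """Keep dimensions 0, step, 2*step, ... for a readable plot."""
    return desc[::step]


def mean_abs_gap(a, b):
    """Average per-dimension gap; a small gap corresponds to strong overlap."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


db_plot = subsample(database_desc)
gap_similar = mean_abs_gap(db_plot, subsample(similar_query))
gap_unrelated = mean_abs_gap(db_plot, subsample(unrelated_query))
print(len(db_plot), round(gap_similar, 3), round(gap_unrelated, 3))
```

Plotting `db_plot` against the subsampled query descriptor on the same axes would give a figure of the same kind as Figure 10, with 128 points per curve.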

Location Wise t-SNE
To investigate the embedding distribution of our trained network, we performed an additional analysis on the Google Earth South Korea dataset. First, we selected 10 different regions from the dataset. Then, for intraregional variation, we selected images of those regions at different times (2016, 2017, 2018, and 2019) and with different scopes (0.26 ≤ IOU ≤ 1), as shown in Figure 11. The t-SNE results are shown in Figure 12a,b. The baseline model often extracted different features for the same region, depending on the time instance and the shooting range of the images. In particular, if a green field is included in the image, the gray-colored class is placed closer to the sky blue-colored class. Moreover, the green-colored class cannot be clustered properly within the t-SNE distribution because the overall color of the images changes with the time instance. In contrast, the model trained with the proposed method shows regionally distinct distributions of the images in the t-SNE distribution despite the variations in time instance and shooting range. The baseline model t-SNE result is shown in Figure 12b.
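The scope constraint 0.26 ≤ IOU ≤ 1 can be made concrete with a short sketch that computes the intersection over union of two image footprints. It assumes that footprints are axis-aligned rectangles given as (x_min, y_min, x_max, y_max) in arbitrary ground units; the function names and the example coordinates are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def in_scope(box_a, box_b, low=0.26, high=1.0):
    """True if the overlap falls in the range used for the t-SNE analysis."""
    return low <= iou(box_a, box_b) <= high


reference = (0.0, 0.0, 100.0, 100.0)
shifted = (30.0, 30.0, 130.0, 130.0)   # same size, shifted diagonally
distant = (90.0, 90.0, 190.0, 190.0)   # barely overlapping footprint

print(round(iou(reference, shifted), 4), in_scope(reference, shifted))
print(round(iou(reference, distant), 4), in_scope(reference, distant))
```

Under these assumptions, a 30% diagonal shift of an equally sized footprint still falls within the selected scope, whereas a footprint that overlaps only at a corner does not.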

Recognition of Differences Between Images
We conducted an experiment to determine what the proposed network considers an important difference. Figure 13 shows the difference heat maps. In each row, the two images on the left are the input images and the two images on the right show what the network considers to be the difference. The heat map is obtained by calculating the squared difference between the convolutional layer feature maps of the two images. Because the network was trained on images with time variation, it did not treat changes in field color or vehicles as differences, since these are natural changes. However, changes in important information, such as new buildings, were considered pivotal differences.
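The heat map computation can be sketched as follows, assuming that each feature map is an array of shape (channels, height, width). Summing the squared differences over the channel axis and normalizing to [0, 1] are assumptions made for display purposes; the function name and the synthetic feature maps are hypothetical.

```python
import numpy as np


def difference_heat_map(feat_a, feat_b):
    """Per-location squared difference between two (C, H, W) feature maps.

    The squared differences are summed over channels (an assumed choice)
    and scaled so that the largest response equals 1.
    """
    squared = (feat_a - feat_b) ** 2
    heat = squared.sum(axis=0)           # shape (H, W)
    peak = heat.max()
    return heat / peak if peak > 0 else heat


# Synthetic feature maps: identical except for one localized change,
# standing in for a structural change such as a new building.
rng = np.random.default_rng(0)
feat_before = rng.normal(size=(8, 6, 6))
feat_after = feat_before.copy()
feat_after[:, 2:4, 2:4] += 3.0

heat = difference_heat_map(feat_before, feat_after)
print(np.round(heat, 2))
```

In practice the low-resolution heat map would be upsampled to the input image size before being overlaid on the images, as in Figure 13.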

Discussion
We used the Google Earth South Korea dataset to confirm the effectiveness of coarse-to-fine learning and the proposed loss function. In the quantitative results, the coarse-to-fine learning models performed better than the single (coarse) learning method, and the proposed model showed the best performance among the models. Comparing the two types of loss functions used for the coarse method, we observed that contrastive loss performed better than triplet loss. The model that used the contrastive loss function showed the best performance, with a score of 49.1% for r@1 on the Google Earth South Korea dataset.
We conducted three qualitative assessments to further analyze our trained network. For the first assessment, we compared the trained network and the baseline network by visualizing the feature descriptors. The trained network extracted relatively similar features for identical regions that had different time instances and parallel shifts. For the second assessment, we selected 10 locations and four different time instances from the Google Earth South Korea dataset and varied the shooting range to visualize the distribution of the features. The t-SNE figure shows that identical regions with different time instances were recognized as relatively similar to each other. Finally, we analyzed what the trained model recognized as the differences between two images of an identical region taken at different time instances. The analysis used the differences between the features in the convolutional layer to observe what the network recognized as the differences between the images. Natural changes, such as changes in grasslands and the presence of cars, were ignored, and major changes, such as the presence of buildings, were recognized as differences.

Conclusions
We proposed a novel method based on end-to-end trainable deep metric learning for remote sensing image retrieval. The proposed method is trained using a coarse-to-fine strategy. In the coarse step, the network is trained with image pairs that are differentiated in a binary manner. In the fine step, features are extracted depending on the overlap of contents between images, which is expressed using continuous values.
Furthermore, our network is end-to-end trainable using the similarity between images. Moreover, we proposed a new loss function for deep metric learning. This function provides a more precise training signal than the previous methods because it reflects the entire relationship within the triplet. As a result, our method can cope with changes in both the time instance and the shooting range between the database image and the query image. The experimental results show that the proposed coarse-to-fine method improved performance compared with other methods on the Google Earth South Korea dataset.
In this study, we dealt with the variations in the time instance and shooting range of the images. However, for the proposed method to be widely applicable, rotational and torsional transformations should also be considered. We plan to extend this method to deal with these transformations in the future.