Article

A Contrastive Learning Based Multiview Scene Matching Method for UAV View Geo-Localization

1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 School of Civil Engineering and Architecture, Wuhan University of Technology, Wuhan 430070, China
3 School of Computer Science, University of Nottingham Ningbo China, Ningbo 315100, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(16), 3039; https://doi.org/10.3390/rs16163039
Submission received: 17 June 2024 / Revised: 16 August 2024 / Accepted: 16 August 2024 / Published: 19 August 2024
(This article belongs to the Section AI Remote Sensing)

Abstract

Multi-view scene matching refers to the establishment of a mapping relationship between images captured from different perspectives, such as those taken by unmanned aerial vehicles (UAVs) and satellites. This technology is crucial for the geo-localization of UAV views. However, the geometric discrepancies between images from different perspectives, combined with the inherent computational constraints of UAVs, present significant challenges for matching UAV and satellite images. Additionally, the imbalance of positive and negative samples between drone and satellite images during model training can lead to instability. To address these challenges, this study proposes a novel and efficient cross-view geo-localization framework called MSM-Transformer. The framework employs the Dual Attention Vision Transformer (DaViT) as the core architecture for feature extraction, which significantly enhances the modeling capacity for global features and the contextual relevance of adjacent regions. The weight-sharing mechanism in MSM-Transformer effectively reduces model complexity, making it highly suitable for deployment on embedded devices such as UAVs and satellites. Furthermore, the framework introduces a Symmetric Decoupled Contrastive Learning (DCL) loss function based on contrastive learning, which effectively mitigates the issue of sample imbalance between satellite and UAV images. Experimental validation on the University-1652 dataset demonstrates that MSM-Transformer achieves outstanding performance, delivering optimal matching results with a minimal number of parameters.

1. Introduction

In recent years, with the rapid advancement of UAV technology, its applications in fields such as agriculture, geological surveys, environmental monitoring, and urban planning have become increasingly widespread [1,2,3,4]. The high-resolution geospatial data provided by UAVs place extremely high demands on the accuracy and real-time performance of geographic information systems (GIS), thereby posing greater challenges for related geo-location technologies [5]. Conventional geo-location methodologies depend on navigation systems such as GPS. Nevertheless, in specific environments, such as dense urban areas or instances of signal obstruction, the reliability of these technologies may be compromised [6].
To address this issue, researchers have initiated an investigation into the potential of computer vision techniques to assist in UAV geo-location. This involves the matching of UAV view images with the most similar satellite images representing the same geographic target from another angle. Satellite images typically contain precise GPS coordinates, which can be employed to ascertain the precise location of the UAV [7,8]. Essentially, it can be understood as image retrieval between two different sources of images. By indirectly utilizing the geographic information from satellite images, UAVs can achieve geo-location and navigation without GPS devices, as shown in Figure 1.
In the early stages of research into cross-view geo-location, traditional methods primarily relied on feature matching and landmark recognition techniques. These methods determine geo-location by using manually designed features and calculating the similarity between features [9]. Although these methods can perform well under specific conditions, they are generally constrained by the accuracy of feature extraction and the impact of environmental changes [10]. Over time, dynamic changes in the environment and visual disturbances in the scenes have significantly increased the challenges of cross-view matching.
In recent years, the rapid advancement of deep learning technology has brought revolutionary progress to cross-view geo-location research. In particular, methods based on Convolutional Neural Networks (CNNs) have been widely applied in scene-matching tasks for UAV and satellite images. These methods utilize convolution operations to extract invariant features from images taken from different perspectives, effectively enhancing the accuracy and efficiency of geo-location [11]. Although CNNs excel in feature extraction, they still tend to extract local features, which may lead to reduced matching accuracy due to the presence of similar interfering features in the environment, such as buildings with the same shape or color [12]. To solve this problem, researchers have started exploring methods based on Vision Transformers (ViTs). Unlike CNNs, ViTs rely on global feature modeling, which provides an advantage when dealing with complex scenes that contain highly similar local features. By utilizing global attention mechanisms, ViTs can effectively reduce the impact of local feature interference, thus enhancing the accuracy and robustness of scene-matching tasks [13]. However, these methods are generally structurally complex and require multiple networks for training, consuming a substantial amount of computing resources. Additionally, due to the large size and complexity of the networks, they are not suitable for deployment on edge embedded devices such as UAVs and satellites.
Despite the numerous methods proposed for cross-view geo-location, the unique characteristics of image data mean that UAV and satellite cross-view geo-location still face many challenges. One of the main issues is the imbalance in data sample ratios. The uneven distribution of samples leads the model to primarily focus on the more prevalent samples, making it challenging to learn the distribution of minority class images. Researchers have attempted multiple strategies to address this problem. For instance, some studies have balanced sample ratios by augmenting minority class samples, such as oversampling the minority class or generating new samples using synthetic techniques [14]. Additionally, some studies focus on improving loss functions. For example, Zhai et al. proposed a new triplet loss (QUITLoss) that embeds all potential positive samples into the original triplet loss, making the model focus more on minority class samples during training, thereby optimizing the model’s classification accuracy for geolocation [15]. Triplet loss, as the standard loss in multi-view geolocation, uses only one negative example per batch. When using hard negatives, it can easily cause the model to collapse [16].
In this paper, we propose a weight-sharing MSM-Transformer architecture for multi-view geo-location tasks involving UAVs and satellites. To achieve precise geo-location, it is essential to enhance the global modeling capability of images to reduce similar interfering features between different views. To address this issue, we propose an end-to-end ViT architecture that captures global contextual information while maintaining computational efficiency. Inspired by DaViT [17], we use the self-attention mechanism of spatial tokens and channel tokens to ensure consistency in global information extraction, which is advantageous for multi-view scene matching. Given the limited computational capacity of embedded devices like UAVs and satellites, our MSM-Transformer employs a weight-sharing design, thus achieving excellent performance with minimal parameters. To tackle the sample imbalance issue, we innovatively introduce a decoupled contrastive learning strategy, which symmetrically computes the loss, further resolving the sample imbalance problem. Our main contributions are summarized as follows:
  • To better capture global information while keeping model complexity low, we propose the MSM-Transformer architecture, which employs a weight-sharing strategy to reduce the number of parameters and a self-attention mechanism to fuse features and capture global contextual information. The framework thus significantly improves inference speed and global context modeling while remaining lightweight.
  • To address the sample imbalance between drone and satellite images, we propose a symmetric DCL loss function that decouples positive and negative samples, eliminating their coupling effect. Compared with existing loss functions, our approach better handles the low proportion of positive samples during training, thereby improving the overall performance and robustness of the model.
  • We validated the effectiveness of our method on the University-1652 benchmark dataset, achieving the highest accuracy with the least number of parameters.

2. Related Works

2.1. UAV to Satellite Cross-View Geo-Localization

The objective of the research on cross-view geo-localization using UAV and satellite images is to ascertain the geographical location of an image by means of a comparative analysis of images from disparate perspectives. This problem is challenging due to the significant differences in perspective, scale, and appearance between UAV and satellite images. As deep learning and computer vision technologies advance, various methods have been proposed to address these challenges. Early cross-view geo-localization methods were primarily dependent on handcrafted features and traditional image matching techniques. For instance, Bansal et al. [18] introduced a scale normalization method similar to SIFT features, achieving scale invariance for similar structure representations, demonstrating the potential of feature-based methods in this field. However, these methods often encountered difficulties in handling large-scale and perspective variations between UAV and satellite images. With the advent of deep learning, UAV to satellite cross-view geo-localization has made significant progress. A multitude of studies have proposed convolutional neural network (CNN) architectures with the objective of learning features that are robust to perspective changes. Zhai et al. [19] introduced a Siamese Network architecture that employs a shared CNN to extract features from UAV and satellite images. Subsequently, contrastive loss is employed to ensure that matching image pairs have similar feature representations. This approach has been demonstrated to significantly improve matching accuracy in comparison to traditional techniques. Vo et al. [20] investigated the potential of metric learning in cross-view geo-localization. They developed a cross-view training framework that employs triplet loss to jointly train UAV and satellite images, thereby enhancing the discriminative power of feature representations. Shao et al. [21] introduced a style alignment strategy (SAS), converting the diverse visual styles of UAV images into a unified satellite image visual style, resulting in more accurate cross-view matching. Regmi et al. [22] proposed a framework based on Generative Adversarial Networks (GANs) to achieve cross-view matching by generating images consistent with the target view. This method leverages generative models for matching in unconstrained environments, further enhancing localization robustness. Cai et al. [23] proposed a network based on an attention mechanism, which improves matching performance by focusing on key areas in the images. Zhu et al. [24] proposed a purely Transformer-based method for cross-view geo-localization, aiming to overcome the limitations of existing CNN methods, especially in global information modeling and explicit location information encoding. Despite their effectiveness in solving specific problems, the aforementioned methods are parameter-complex and time-consuming in computation, making them particularly challenging to use on edge devices with limited computational capabilities, such as UAVs or satellites. To address this challenge, we propose a weight-sharing MSM-Transformer architecture. By utilising a weight-sharing mechanism, this architecture significantly reduces model parameters and enhances computational efficiency, enabling efficient operation on edge devices with limited computational resources.

2.2. Transformer in Vision

The attention mechanism of the Transformer [25] was initially proposed to address issues in the field of natural language processing. Subsequently, the powerful visual capabilities of the Transformer highlighted the superiority of its architecture. ViT (Vision Transformer) [26] was first proposed by the Google Research team, with the core idea of dividing an image into fixed-size patches and then treating each patch as a word, inputting it into the Transformer for processing. In contrast to traditional CNNs, ViT is capable of directly capturing global information, which enables it to excel on large-scale datasets. To mitigate ViT’s dependence on large-scale data, Touvron et al. [27] proposed Data-efficient Image Transformers (DeiT), which incorporates knowledge distillation technology. This involves using a pre-trained CNN model as a teacher model to guide ViT training, significantly enhancing data efficiency. Chen et al. [28] introduced the Hybrid Network Transformer, which integrates CNNs with transformers to leverage the strengths of CNNs in local feature extraction and transformers in global relationship modeling. In the cross-view domain, Yang et al. [29] proposed a layer-to-layer transformer (L2LTR), which leverages the positional encoding of the transformer to assist L2LTR in understanding and aligning the geometric configuration between ground and aerial images. Dai et al. [13] proposed a straightforward and efficient Transformer-based architecture, named Feature Segmentation and Region Alignment (FSRA), which aimed to enhance the model’s comprehension of contextual information and instance distribution. Ding et al. [17] developed a simple and effective Dual Attention Vision Transformer (DaViT) architecture, which employed the self-attention mechanism of spatial and channel tokens, enabling the model to capture global context while maintaining computational efficiency. Given that the pre-trained DaViT model has fewer parameters and higher training efficiency compared to the pre-trained ViT model, this paper utilises the advantages of this network and incorporates it into our method’s pipeline.

2.3. Image Retrieval Methods for Contrastive Learning

The geo-location of UAV views is fundamentally an image retrieval task, which involves the matching of images from UAVs or satellites to corresponding images. Image retrieval is a crucial task in computer vision, with the objective of retrieving images similar to the query image from a large dataset. Contrastive learning is a critical component of image retrieval, as it effectively learns image representations, thus enhancing retrieval accuracy and efficiency. SimCLR is an unsupervised learning framework introduced by Chen et al. [30]. It creates positive sample pairs via data augmentation, using different images as negative sample pairs, and optimises image representations through a contrastive loss function. MoCo, proposed by He et al., is a contrastive learning method that enhances the stability and efficiency of contrastive learning by incorporating a momentum encoder and a dynamically updated negative sample queue [31]. Supervised Contrastive Learning, proposed by Khosla et al. [32], extends contrastive learning into the realm of supervised learning by using class labels to construct positive and negative sample pairs, greatly enhancing image retrieval accuracy. This approach incorporates class information into the contrastive loss, resulting in more compact representations for similar images and more dispersed representations for dissimilar images. Multi-View Contrastive Learning, proposed by Tian et al. [33], employs a range of data augmentation strategies from different perspectives to generate multi-view positive sample pairs, thereby enhancing image representations. Wu et al. [34] employ a non-parametric approach to instance discrimination, treating each instance as its own class. This method effectively extracts image features through contrastive learning without the need for complex labels, facilitating rapid nearest neighbour retrieval. Cao et al. [35] proposed a novel unsupervised deep hashing framework, known as Fine-Grained Similarity Preserving Contrastive Learning Hashing (FSCH), which comprehensively explores fine-grained semantic similarities between different images and their augmented views. Huang et al. [36] proposed a supervised contrastive learning approach called SCFR, which is based on the fusion of global and local features. The objective of this approach is to enhance the ability to distinguish between images with the same semantic information but different visual representations. Yeh et al. [37] introduced Decoupled Contrastive Learning (DCL), which significantly enhances the network’s learning efficiency by eliminating the pronounced Negative-Positive Coupling (NPC) effect in the InfoNCE loss. This paper adopts and refines this concept by proposing Symmetrical Decoupled Contrastive Loss, which addresses the positive and negative sample mismatch problem in UAV and satellite views.

3. Proposed Method

3.1. Overview

The geo-location of a UAV view can be regarded as an image retrieval task. Given a UAV image, satellite images are retrieved for the purpose of determining the current latitude and longitude, thereby achieving UAV geo-location. The method proposed in this paper, as shown in Figure 2, comprises three main parts: data augmentation, a weight-sharing MSM-Transformer architecture, and a symmetrical DCL loss function. Firstly, in Section 3.2, we present the data augmentation strategies employed to address the issue of insufficient input data during network training, thereby facilitating effective network training. Secondly, in Section 3.3, we provide a detailed account of the proposed weight-sharing MSM-Transformer architecture for the task of UAV image geo-location. By sharing weights, this architecture significantly reduces network complexity while effectively maintaining accuracy, with the assistance of DaViT’s excellent global feature extraction capabilities. Finally, in Section 3.4, we introduce our proposed symmetrical DCL loss function. This loss function significantly improves learning efficiency by eliminating the Negative-Positive Coupling (NPC) effect, and successfully resolves the positive and negative sample mismatch problem between satellite and UAV images using a contrastive learning loss framework.

3.2. Data Augmentation

In the context of neural network training, insufficient input data often results in inadequate sample diversity, which in turn fails to sufficiently describe the complete sample distribution. The introduction of data augmentation techniques can enhance sample diversity, which helps to improve the model’s generalization ability and robustness [38]. In the University-1652 dataset, the number of UAV images is significantly greater than that of satellite images, and this imbalance in sample proportions makes learning useful information challenging. To address the issue of data imbalance, we applied data augmentation techniques to both UAV and satellite images. To preserve the geometric structures of buildings and roads in the images, we employed several augmentation methods. Specifically, to enhance the model’s generalization ability and avoid overfitting during training, we applied synchronized horizontal flipping and rotation to the satellite images, with corresponding adjustments to the UAV images so that the orientation of the two views remains consistent. To prevent the model from focusing on specific regions of some images, we also employed grid dropout and coarse dropout, and to further enhance generalization, we applied color jitter. As illustrated in Figure 3, the upper section shows the augmentation effects on UAV images, while the lower section shows the augmentation effects on satellite images. Because the University-1652 dataset involves one-to-many and many-to-one tasks, a custom batch processing strategy was employed to prevent multiple images from the same class from appearing in a batch. The symmetric DCL loss treats all other samples in a batch as negatives; without this batching strategy, additional positive samples would mistakenly be treated as negatives.
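A minimal sketch of the batch-construction idea is shown below (in PyTorch, the framework used in this work). It only illustrates the constraint that no two images of the same building appear in one batch; the class name and the exact sampling policy are illustrative, not the implementation used in our experiments.

```python
import random
from torch.utils.data import Sampler


class UniqueClassBatchSampler(Sampler):
    """Builds batches in which each building (class) appears at most once, so
    that every non-matching pair inside a batch is a true negative for the
    symmetric DCL loss."""

    def __init__(self, labels, batch_size):
        self.labels = list(labels)      # class (building) label of each sample
        self.batch_size = batch_size

    def __iter__(self):
        indices = list(range(len(self.labels)))
        random.shuffle(indices)
        batch, seen = [], set()
        for idx in indices:
            if self.labels[idx] in seen:   # this building is already in the batch
                continue
            batch.append(idx)
            seen.add(self.labels[idx])
            if len(batch) == self.batch_size:
                yield batch
                batch, seen = [], set()
        if batch:
            yield batch

    def __len__(self):
        # Upper bound; skipped duplicates make the true number of batches smaller.
        return (len(self.labels) + self.batch_size - 1) // self.batch_size
```

The sampler is passed to a DataLoader through its batch_sampler argument, so every mini-batch contains at most one image per building.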

3.3. MSM-Transformer

Images captured by UAVs from disparate vantages may exhibit considerable dissimilarities in their image characteristics. This variation in perspective presents considerable challenges for image feature extraction and matching, particularly in cross-view image retrieval tasks [39]. Traditional methods based on CNN have achieved notable outcomes in image feature extraction. However, these methods frequently fall short in handling global contextual information, resulting in suboptimal feature matching accuracy in complex scenarios. In contrast, Transformer-based network architectures have demonstrated superior capabilities in encoding global contextual information. Based on this, this paper proposes a weight-sharing MSM-Transformer framework based on the Transformer architecture, specifically designed for the task of UAV view geo-location, as shown in Figure 4.
Due to the limited computing power of edge embedded devices such as UAVs and satellites, model design must consider computational complexity and resource consumption. Image retrieval models based on the Transformer architecture often have large parameter counts and high computational demands, making them difficult to run efficiently on these devices [40]. To address this challenge, we adopted the weight-sharing strategy of the Siamese Network, which not only effectively reduces the number of network parameters but also significantly lowers training and inference time costs. In the MSM-Transformer, we further optimised model efficiency by sharing the weight parameters of the two sub-networks. This method not only preserves the Transformer model’s capacity to capture global contextual information but also significantly reduces the model’s complexity. Consequently, we can achieve efficient geo-location tasks on resource-limited edge embedded devices.
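As a concrete illustration of the weight-sharing idea, the following minimal PyTorch sketch encodes both views with a single backbone instance; the backbone construction, the 1000-dimensional head, and the normalization step are illustrative simplifications rather than the exact model code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedWeightSiamese(nn.Module):
    """Both views are encoded by the *same* backbone instance, so the backbone
    parameters are stored and updated only once."""

    def __init__(self, backbone: nn.Module, backbone_dim: int, embed_dim: int = 1000):
        super().__init__()
        self.backbone = backbone                        # e.g., a DaViT-tiny feature extractor
        self.head = nn.Linear(backbone_dim, embed_dim)  # MLP head producing the embedding

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)                         # (B, backbone_dim) pooled features
        return F.normalize(self.head(feat), dim=-1)     # L2-normalized embedding

    def forward(self, uav_img: torch.Tensor, sat_img: torch.Tensor):
        # The same weights process both views; only the activations differ.
        return self.encode(uav_img), self.encode(sat_img)
```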
The specific process of MSM-Transformer is as follows: first, the UAV and satellite images are divided into small blocks of 16 × 16 pixels. Each image block is then passed through a fully connected layer for linear projection, converting it into a fixed-length vector. Since the Transformer architecture itself lacks the ability to handle sequence order, the model also introduces position embedding vectors. The position encodings employ an absolute position encoding method, which effectively encodes the positional information of each image block and preserves the relative positional relationships between different blocks. This design is particularly important when the model processes cross-view information. After the embedding vectors of the image blocks have been added to the corresponding position vectors, the resulting output is fed into the Transformer encoder. The encoder comprises multiple layers with a uniform structure, each incorporating a multi-head self-attention mechanism (MSA) and a feedforward neural network (FFN). In the multi-head self-attention mechanism, the query matrix $Q \in \mathbb{R}^{n \times d_k}$ is multiplied with the key matrix $K \in \mathbb{R}^{n \times d_k}$ and normalized by the softmax function, where $n$ is the length of the input sequence and $d_k$ is the dimension of the key matrix. The normalized result is then multiplied by the value matrix $V \in \mathbb{R}^{n \times d_v}$, where $d_v$ is the dimension of the value matrix, to generate the final attention output. The attention matrix is calculated as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{1}$$
After passing through multiple Transformer encoding layers, the output of the class token is passed to an MLP Head to generate the embedding feature vectors representing the UAV and satellite images. The dimensions of these embedding vectors are determined by the MLP Head, with the dimension set to 1000 in this study. During the training phase, the model calculates the symmetrical DCL loss between the two embedding feature vectors and performs backpropagation to optimize the overall model parameters. In the testing phase, the final retrieval results are obtained by comparing the similarity between the embedding feature vectors.
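As a reference for Equation (1), the following sketch computes scaled dot-product attention for a single head; the multi-head splitting, class token, and feedforward network are omitted.

```python
import math
import torch


def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V, as in Equation (1).

    Q, K have shape (n, d_k); V has shape (n, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n, n) token-to-token similarities
    weights = torch.softmax(scores, dim=-1)            # normalize over the key dimension
    return weights @ V                                  # (n, d_v) attention output
```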
The sub-network of MSM-Transformer employs the DaViT architecture, which enhances image processing capabilities by alternately applying Spatial Window Attention and Channel Group Attention mechanisms. Spatial Window Attention divides tokens along the spatial dimension into windows and performs self-attention within each window, capturing relationships between different positions. Channel Group Attention divides tokens along the channel dimension into groups and performs self-attention within each group, capturing relationships between different features. By combining these two types of self-attention, DaViT can consider both spatial and channel information simultaneously, achieving more comprehensive global modeling. Furthermore, the window and group mechanisms substantially reduce the network’s computational complexity.
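To make the channel-side mechanism concrete, the sketch below applies self-attention along the channel dimension within channel groups. It is a deliberately simplified illustration of the idea: the scaling factor is our choice, and the window-attention counterpart, normalization layers, and FFNs of the actual DaViT blocks are omitted.

```python
import torch
import torch.nn as nn


class ChannelGroupAttention(nn.Module):
    """Self-attention along the channel axis within channel groups; a
    simplified illustration of the channel attention idea in DaViT."""

    def __init__(self, dim: int, groups: int = 8):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, C)
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(t):
            # Split channels into groups: (B, groups, N, C // groups)
            return t.reshape(B, N, self.groups, C // self.groups).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # Attention over the *channel* dimension: transpose tokens and channels.
        # Scaling by sqrt(N) is illustrative; DaViT's exact scaling may differ.
        attn = torch.softmax(q.transpose(-2, -1) @ k / (N ** 0.5), dim=-1)   # (B, g, C/g, C/g)
        out = (attn @ v.transpose(-2, -1)).transpose(-2, -1)                 # (B, g, N, C/g)
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```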

3.4. Symmetric DCL Loss

UAV cross-view geo-localization typically uses supervised metric learning methods, defaulting to the triplet loss function [41] and making improvements, such as the soft-margin triplet loss [20] and the weighted soft-margin triplet loss [42]. The purpose of the triplet loss is to reduce the distance between positive samples and the anchor while increasing the distance between negative samples and the anchor, making similar samples closer in the embedding space while dissimilar samples are relatively farther apart. However, to effectively apply the triplet loss, appropriate triplets need to be sampled in advance. Randomly selecting triplets may result in low training efficiency, so techniques such as Hard Negative Mining are needed to select effective negative samples. In light of this, we innovatively adopted an unsupervised contrastive learning method to fully utilize multiple hard negative samples.
The fundamental principle of contrastive learning is to develop effective representations by maximizing the similarity between positive samples and minimizing the similarity between negative samples. The InfoNCE loss function is frequently employed to maximize the mutual information between positive sample pairs. The specific formula is as follows:
$$\mathcal{L}(q, G)_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\left(q \cdot g^{+} / \tau\right)}{\sum_{i=0}^{|G|} \exp\!\left(q \cdot g_i / \tau\right)} \tag{2}$$
where q represents the UAV image encoded by the Transformer, referred to as the query. G is a set of satellite images encoded by the Transformer, known as the Gallery. The Gallery set contains a positive sample g + that matches the query q, as well as several mismatched negative samples. g i denotes all samples in the Gallery set. In the InfoNCE loss function, similarity is computed by the dot product of the query q and the images in the Gallery. Specifically, the numerator in the InfoNCE loss function is the exponential similarity between the query q and its positive sample g + , while the denominator is the sum of the exponential similarities of the positive sample pair and all negative sample pairs. Thus, minimizing the InfoNCE loss means maximizing the similarity between the query and the positive sample while minimizing the similarity between the query and any negative samples. This process essentially involves optimizing the inter-view similarity by adjusting the relative sizes of the numerator and denominator. Additionally, a temperature parameter τ is introduced in the loss calculation, which can be set as a learnable or fixed value as a hyperparameter [43]. Adjusting the temperature parameter significantly impacts the smoothness and separation of the feature space learned by the model. By adjusting τ , we can control the model’s sensitivity to similarity differences, thereby further optimizing its performance.
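A minimal sketch of Equation (2) over a mini-batch is given below, assuming L2-normalized embeddings in which the i-th UAV image matches the i-th satellite image (which the batch strategy of Section 3.2 guarantees); the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(query: torch.Tensor, gallery: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE over a batch of L2-normalized embeddings.

    query:   (B, D) UAV embeddings.
    gallery: (B, D) satellite embeddings; row i is the positive for query i."""
    logits = query @ gallery.t() / tau                       # (B, B) scaled similarities
    targets = torch.arange(query.size(0), device=query.device)
    # Cross-entropy with the diagonal as the target class equals
    # -log( exp(q_i . g_i / tau) / sum_j exp(q_i . g_j / tau) ).
    return F.cross_entropy(logits, targets)
```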
However, InfoNCE loss exhibits a negative-positive coupling (NPC) effect. When the coupling between positive and negative samples is too high, it becomes challenging for the model to distinguish between true positive samples and random negative samples [37]. Therefore, it is necessary to decouple the InfoNCE loss. Inspired by [37], we decoupled the denominator part of the InfoNCE loss, and the resulting decoupled formula is shown below:
$$\mathcal{L}(q, G)_{\mathrm{InfoNCE}} = -\log \frac{\exp\!\left(q \cdot g^{+} / \tau\right)}{\exp\!\left(q \cdot g^{+} / \tau\right) + \sum_{i=1}^{|G|} \exp\!\left(q \cdot g_i^{-} / \tau\right)} \tag{3}$$
where $g_i^{-}$ denotes the negative samples in the Gallery set that do not match the query q. The $\exp(q \cdot g^{+}/\tau)$ term in the denominator is the positive-pair term, also known as the NPC multiplier. Removing this term from the denominator of Equation (3) yields the decoupled contrastive learning loss:
$$\mathcal{L}(q, G)_{\mathrm{DCL}} = -\,q \cdot g^{+} / \tau + \log \sum_{i=1}^{|G|} \exp\!\left(q \cdot g_i^{-} / \tau\right) \tag{4}$$
where the first term, $-\,q \cdot g^{+}/\tau$, is the attractive term, and the second, $\log \sum_{i=1}^{|G|} \exp(q \cdot g_i^{-}/\tau)$, is the repulsive term. Additionally, to better balance the two terms, we introduce a weight coefficient $\lambda$ on the attractive term. The weighted decoupled contrastive learning loss is as follows:
$$\mathcal{L}(q, G)_{\mathrm{DCL}} = -\lambda \left(q \cdot g^{+} / \tau\right) + \log \sum_{i=1}^{|G|} \exp\!\left(q \cdot g_i^{-} / \tau\right) \tag{5}$$
In the field of unsupervised learning, contrastive learning loss functions are frequently applied in an asymmetric form in image processing tasks [44,45]. However, research has demonstrated that during multimodal pre-training, a symmetric loss function design can effectively reduce the differences between different modalities [46]. In light of these findings, we also adopted a similar symmetric strategy to optimise our DCL loss function in order to ensure that information flow in both directions could be fully utilised, namely from UAV images to satellite images and from satellite images to UAV images. The symmetric DCL loss calculation formula is as follows:
$$\mathcal{L}_{\mathrm{Symmetric\,DCL}} = \left( -\lambda \left(q \cdot g^{+} / \tau\right) + \log \sum_{i=1}^{|G|} \exp\!\left(q \cdot g_i^{-} / \tau\right) \right) + \gamma \left( -\lambda \left(g \cdot q^{+} / \tau\right) + \log \sum_{i=1}^{|Q|} \exp\!\left(g \cdot q_i^{-} / \tau\right) \right) \tag{6}$$
where q represents the UAV image encoded by the Transformer, known as the query. g represents the satellite image encoded by the Transformer, known as the gallery. G is a set of satellite images encoded by the Transformer, referred to as the Gallery. Q is a set of UAV images encoded by the Transformer, called the Query. γ is the weight coefficient of the symmetric term, set to 0.5 in this paper.
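The following sketch implements Equation (6) for a mini-batch of paired, L2-normalized embeddings; the default values of τ and λ are illustrative (γ = 0.5 as stated above), and the function is a simplified reading of the loss rather than the exact training code.

```python
import torch


def symmetric_dcl_loss(q: torch.Tensor, g: torch.Tensor,
                       tau: float = 0.07, lam: float = 1.0, gamma: float = 0.5) -> torch.Tensor:
    """Symmetric decoupled contrastive loss of Equation (6).

    q: (B, D) UAV embeddings; g: (B, D) satellite embeddings; rows with the
    same index form the positive pair (q_i, g_i)."""
    sim = q @ g.t() / tau                      # (B, B) scaled similarity matrix
    pos = sim.diagonal()                       # q_i . g_i^+ / tau
    diag = torch.eye(q.size(0), dtype=torch.bool, device=q.device)

    def dcl(s: torch.Tensor) -> torch.Tensor:
        # Repulsive term: log-sum-exp over negatives only, i.e., the positive
        # pair is decoupled from (removed from) the denominator.
        repulsive = torch.logsumexp(s.masked_fill(diag, float('-inf')), dim=1)
        return (-lam * pos + repulsive).mean()

    loss_q2g = dcl(sim)        # UAV -> satellite direction
    loss_g2q = dcl(sim.t())    # satellite -> UAV direction
    return loss_q2g + gamma * loss_g2q
```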

4. Experiments

4.1. Datasets

The University-1652 dataset [47] is a benchmark for cross-view geo-location tasks. Unlike CVUSA [48] and CVACT [49], which focus on matching ground panoramic images with satellite images, it offers images of buildings captured from both satellite and UAV perspectives, making it the sole dual-view benchmark dataset in this domain. Given that the primary focus of this study is UAV view geo-location, the ground-view data in the dataset are not used. The dataset consists of geo-location images of 1652 buildings from 72 universities worldwide, where each building has one satellite view and 54 UAV views. A total of 701 buildings are used for training, with another 701 buildings serving as the test set. Furthermore, 250 satellite images of additional buildings are included as distractors in the reference test set to increase task complexity. University-1652 supports two evaluation tasks: UAV view target localization (UAV -> Satellite) and UAV navigation (Satellite -> UAV). In the UAV view target localization task, the query set comprises 37,855 UAV view images, while the gallery includes 701 satellite view images that are true matches and 250 satellite view images that serve as distractors. In this setting, each query is associated with a single true matching satellite view image. The dataset contains far more UAV images than satellite images, and this imbalance in sample proportions poses a challenge to the learning process; we address it with the data augmentation and batch processing strategies described in Section 3.2 (synchronized flipping and rotation, grid and coarse dropout, and color jitter).

4.2. Evaluation Metric

To comprehensively evaluate the efficacy of UAV view geo-location, we employ Top-k recall ($R@k$) as the primary assessment metric. $R@k$ is defined as:
$$R@k = \frac{T_p@k}{T_p@k + F_n@k} \tag{7}$$
where $T_p@k$ is the number of true positive samples among the top k results and $F_n@k$ is the number of false negative samples among the top k results. This study calculates recall for k = 1, 5, and 10 to evaluate the model’s retrieval performance. To address the many-to-one matching problem in the University-1652 dataset, we also adopt Average Precision ($AP$) as a precision metric. The formula for $AP$ is as follows:
$$AP = \frac{\sum_{k=1}^{N} P(k) \cdot rel(k)}{M} \tag{8}$$
where M is the total number of query images, N is the total number of images relevant to the query, $rel(k)$ is an indicator function that equals 1 if the k-th retrieved image is relevant and 0 otherwise, and $P(k)$ is the precision of the top k retrieved results. To evaluate the lightweight nature of our method, we also report the parameter count (Params/M) and the inference time.
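A minimal sketch of how R@k and AP can be computed from cosine similarities is given below, assuming L2-normalized embeddings and a single true-match gallery image per query (the UAV -> Satellite setting); the function and variable names are illustrative.

```python
import torch


def recall_and_ap(query_emb: torch.Tensor, gallery_emb: torch.Tensor,
                  gt_index: torch.Tensor, ks=(1, 5, 10)):
    """query_emb: (Nq, D), gallery_emb: (Ng, D) L2-normalized embeddings;
    gt_index: (Nq,) index of the true-match gallery image for each query."""
    sim = query_emb @ gallery_emb.t()                  # (Nq, Ng) cosine similarities
    ranks = sim.argsort(dim=1, descending=True)        # gallery indices sorted by similarity
    # 0-based rank position of the ground-truth match for each query.
    hit_rank = (ranks == gt_index.unsqueeze(1)).float().argmax(dim=1)
    recalls = {k: (hit_rank < k).float().mean().item() for k in ks}
    # With a single relevant image per query, AP reduces to 1 / (rank + 1).
    ap = (1.0 / (hit_rank.float() + 1.0)).mean().item()
    return recalls, ap
```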

4.3. Implementation Details

This paper designs and implements the algorithm using the PyTorch [50] deep learning framework on an Ubuntu 20.04.2 system. We used the equipment parameters shown in Table 1 to train the model. For the feature extraction backbone network, we selected DaViT-tiny and DaViT-small, both of which were pre-trained on ImageNet [51].
During the training process, the input image size was uniformly set to 224 × 224 pixels, and 40 training epochs were conducted for all models. The AdamW optimizer was used for parameter optimization; it decouples weight decay from the gradient update, addressing the ineffectiveness of standard L2 regularization with adaptive learning rate optimizers [52]. The specific parameter settings are as follows: the momentum is set to 0.9 and the weight decay coefficient to 0.0005. Additionally, the batch size is set to 128, the number of workers is set to 8, and multithreaded data loading is used to improve data processing efficiency. For learning rate adjustment, a dynamic strategy is employed: the initial learning rate is set to 0.01 and is subsequently adjusted with a warmup phase followed by cosine annealing [53] to stabilize early training and improve convergence.
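A minimal sketch of the optimizer and learning-rate schedule described above is shown below; the warmup length and the mapping of "momentum 0.9" onto AdamW's beta1 are our assumptions, and the helper function itself is illustrative.

```python
import math
import torch


def build_optimizer_and_scheduler(model: torch.nn.Module, epochs: int = 40,
                                  warmup_epochs: int = 2, base_lr: float = 0.01,
                                  weight_decay: float = 0.0005):
    # "Momentum 0.9" is taken here as AdamW's beta1; beta2 is left at its default.
    optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr,
                                  betas=(0.9, 0.999), weight_decay=weight_decay)

    def lr_lambda(epoch: int) -> float:
        if epoch < warmup_epochs:                        # linear warmup
            return (epoch + 1) / warmup_epochs
        # Cosine annealing from the base learning rate down to zero.
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler   # call scheduler.step() once per epoch
```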

4.4. Ablation Study

To assess the efficacy of the proposed MSM-Transformer framework, we conducted ablation studies on the University-1652 dataset to evaluate the influence of each improvement (DaViT, Share Weighted DaViT, symmetric DCL loss) on its performance. The results of the ablation experiment are presented in Table 2.
In the ablation study, we used a Siamese network with ResNet-50 as the feature extraction backbone as our experimental baseline, with the loss function being triplet loss. Method (1) shows that when we replace the baseline network’s backbone with DaViT-tiny to enhance the network’s feature representation, the network’s AP increased by 11.31%. This is because convolutional neural networks’ feature extraction capability is still inadequate and dependent on local feature extraction, whereas the Transformer-based DaViT has powerful global feature extraction capabilities. However, due to the large number of parameters in DaViT-tiny, it requires substantial computational cost, which is not friendly to edge computing devices like UAVs and satellites. Method (2) shows that when we use the parameter-shared DaViT-tiny, the network’s parameters can be significantly reduced. Due to the parameter sharing, the network can better learn the correlation between UAV and satellite views, enhancing accuracy while reducing the number of parameters by half. Method (3) shows that introducing symmetric DCL loss based on the weight-shared DaViT-tiny can significantly improve network accuracy without adding any parameter cost. This is because the proposed symmetric DCL loss can effectively alleviate the sample imbalance problem in the University-1652 dataset, further demonstrating the superiority of contrastive learning in UAV view geo-location tasks.

4.5. Comparison with Other State-of-the-Art Methods

To comprehensively analyze the superiority of our proposed method, we compared it with the following ten mainstream methods on the University-1652 test set, as detailed below.
(1)
VGG-16 + contrastive loss [54]: A Siamese network with VGG-16 as the backbone and contrastive loss as the loss function.
(2)
ResNet-50 + triplet loss [55]: A Siamese network with ResNet-50 as the backbone and triplet loss as the loss function.
(3)
ResNet-50 + instance loss [56]: A Siamese network with ResNet-50 as the backbone and instance loss as the loss function.
(4)
LCM [57]: As a solution to the multi-view problem, LCM uses a multi-query strategy to supplement missing features in single-view images during the query process, enhancing the algorithm’s robustness.
(5)
LPN [58]: In similarity measurement, LPN adopts a square ring feature segmentation strategy, focusing on the distance from the image center. It simplifies block matching, enabling block representation learning and similarity measurement.
(6)
F3-Net [59]: After multi-view feature learning, F3-Net uses a Feature Alignment and Unification (FAU) module with EM distance to compute the similarity of misaligned features.
(7)
PCL [60]: To address the geometric spatial relationship between UAV and satellite views, PCL proposes an end-to-end cross-view matching method that integrates a cross-view synthesis module and a geo-location module, fully considering the spatial correspondence and surrounding area information of UAV-satellite views.
(8)
FSRA [13]: To tackle the sample imbalance problem, FSRA proposes a multi-sampling strategy to overcome the differences in the number of satellite images and images from other sources.
(9)
TransFG [8]: In response to the cross-view image matching problem, TransFG introduces a transformer pipeline that integrates Feature Aggregation (FA) and Gradient Guidance (GG) modules to effectively balance feature representation and alignment.
(10)
GeoFormer [61]: To significantly enhance the accuracy and efficiency of drone cross-view geo-localization, GeoFormer introduces a Transformer-based Siamese network that achieves a balance between matching accuracy and efficiency through efficient Transformer feature extraction, multi-scale feature aggregation, and a semantic-guided region segmentation module.
Since different methods use different image resolutions, and excessively high resolutions consume more computational resources, the image resolution in this paper is set to 224 × 224 to further reduce the computational cost of training and inference. Table 3 presents the experimental results of the aforementioned ten mainstream methods and our proposed method. The evaluation metrics include Params/M, R@1, R@5, R@10, and AP. As shown in Table 3, our method achieved state-of-the-art results with the fewest parameters. Compared to the best CNN-based method, PCL, our method accounts for global feature extraction through the DaViT structure, resulting in better performance than CNN-based methods. Compared to the previously best-performing Transformer-based method, FSRA [13], evaluated at a resolution of 256 × 256, the tiny version of our proposed method improved R@1 by 6.97 and AP by 6.22. This improvement is due to our proposed symmetric DCL loss, which effectively addresses the sample imbalance problem between UAV and satellite images, making the loss computation more reasonable. In comparison to Transformer-based Siamese networks such as TransFG and GeoFormer, our method requires fewer parameters and achieves higher matching accuracy. Our method significantly reduces network parameters through the parameter-sharing strategy while achieving better performance; thanks to the weight-sharing architecture, even the small version does not consume many parameters. This architecture allows our proposed method to achieve an AP of 92.23 at a resolution of 224 × 224. In summary, our method achieves better results on lower-resolution images than previous methods, further demonstrating the effectiveness of the proposed approach.

4.6. Retrieval Results

To visually demonstrate the effectiveness of the algorithm, we randomly selected several UAV view images from the test set; the results are shown in Figure 5. The algorithm proposed in this paper precisely matches most scenes in the University-1652 dataset. In the eighth row of the figure, strong visual similarity caused by similar image elements (longitudinal buildings surrounded by highways) makes the scene difficult to distinguish even for geographic professionals. Nonetheless, the algorithm still returns the correct match within the top five results (R@5). Furthermore, we observe that the algorithm can achieve correct matches even when the UAV view images are not captured in the standard orientation. For instance, in the second and fifth rows of Figure 5, the true north direction of the correctly matched UAV view images is not parallel to the buildings. Additionally, the visual appearance of false positive samples is very similar to that of the positive samples; for geographic professionals without ground-truth labels, it may take some time to determine whether a match is correct, which may even lead to incorrect assessments.

4.7. Algorithm Efficiency Test

Real-time capability is essential for UAVs. Compared to the Siamese network method, which uses convolutional neural networks as the backbone for feature extraction, transformer-based methods are generally more complex. However, due to the advantage of weight sharing, our proposed method shows good real-time performance and lower model parameters. To evaluate the advantages of our proposed MSM-Transformer network structure in terms of time performance and model parameters, we conducted a comparative time consumption test on the University-1652 dataset on an RTX 3090 computing platform. The algorithm measured the time consumption for all 701 regions in the University-1652 test set, matching 54 UAV view images from different viewpoints with 951 satellite images. The results of the test are shown in Table 4.
As shown in Table 4, the contrastive loss model has a relatively long inference time because VGG-16 is large and has many parameters. In comparison, the models using triplet loss and LPN adopt the ResNet-50 structure, and their inference speed is faster than the VGG-16-based models due to the reduced number of parameters. The backbone of the FSRA model adopts the ViT-small structure; although it has fewer parameters, its high computational complexity significantly reduces inference speed. Our method adopts a weight-sharing strategy, which allows the Siamese network to avoid initializing the model twice, thereby significantly improving inference speed while reducing model parameters. The tiny model achieved an inference time of only 1 min and 13 s, almost five times faster than FSRA. This advantage is particularly important for edge computing devices such as UAVs and satellites.

5. Discussion

This paper presents a weight-sharing MSM-Transformer network architecture to solve the geo-localization problem in UAV views. The proposed data augmentation and batch processing strategies effectively enhance the model’s generalization ability and robustness. The Transformer-based network architecture markedly enhances the network’s global feature modeling ability and the contextual information correlation in adjacent regions. The weight-sharing model construction strategy enables the MSM-Transformer to greatly reduce complexity and increase inference speed. The symmetric DCL loss based on contrastive learning can effectively alleviate the sample imbalance problem between satellite images and UAV images. Nevertheless, the proposed algorithm is not without shortcomings. In the case of UAV and satellite images exhibiting significant discrepancies in perspective and scale, the model’s capacity for feature matching remains inadequate. This deficiency can be attributed to the absence of sufficient feature alignment between the UAV and satellite images.
In future work, we will attempt new feature alignment methods on UAV and satellite images to improve the model’s feature matching ability. Additionally, to better evaluate the effects of varying light, weather, and seasons on the model’s generalization, our next task is to build a cross-view geo-localization dataset of UAV and satellite images under different climatic conditions, providing a more thorough assessment of the model while validating it across more scenarios.

6. Conclusions

This paper proposes a weight-sharing MSM-Transformer network architecture to address the cross-view geo-localization problem between UAV and satellite images. The network utilizes a weight-sharing DaViT network for feature extraction, enhancing the understanding of global image features and significantly improving the network’s ability to model global features and the contextual relevance of adjacent regions. This method also significantly reduces the network parameters, thereby effectively improving inference speed. For UAV and satellite images, we propose data augmentation and batch processing strategies to ease the calculation of symmetric decoupled contrastive learning loss in subsequent steps. To effectively address the issue of sample size imbalance between satellite and UAV images, we employed a symmetric DCL loss function based on contrastive learning to alleviate this problem. Our method achieved R@1 and AP scores of 91.81 and 92.23, respectively, on the publicly available University-1652 dataset, reaching state-of-the-art results. Moreover, the tiny version of the model has only 28.4 M parameters, a significant reduction compared to previous models.

Author Contributions

Conceptualization, Q.H. and Z.Y.; Data curation, Q.H., A.X. and Z.Y.; Formal analysis, A.X. and Y.Z.; Funding acquisition, Q.H. and Z.Y.; Methodology, A.X.; Project administration, A.X. and Y.Z.; Resources, Q.L.; Software, Y.Z.; Supervision, Z.Y. and Q.L.; Validation, A.X., W.Z. and R.X.; Visualization, A.X. and Y.Z.; Writing—original draft, A.X.; Writing—review and editing, R.X. and Q.L. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the funding of the following science foundation: the National Natural Science Foundation of China (No. 42201464).

Data Availability Statement

The public sources of the data mentioned in this study are described in the paper.

Acknowledgments

The authors appreciate the editors and anonymous reviewers for their valuable recommendations.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Radoglou–Grammatikis, P.; Sarigiannidis, P.; Lagkas, T.; Moscholios, I. A compilation of uav applications for precision agriculture. Comput. Netw. 2020, 172, 107148. [Google Scholar] [CrossRef]
  2. Ružić, I.; Benac, Č.; Jovančević, S.D.; Radišić, M. The application of uav for the analysis of geological hazard in krk island, croatia, mediterranean sea. Remote Sens. 2021, 13, 1790. [Google Scholar] [CrossRef]
  3. Wang, H.; Yao, Z.; Li, T.; Ying, Z.; Wu, X.; Hao, S.; Liu, M.; Wang, Z.; Gu, T. Enhanced open biomass burning detection: The brantnet approach using uav aerial imagery and deep learning for environmental protection and health preservation. Ecol. Indic. 2023, 154, 110788. [Google Scholar] [CrossRef]
  4. Qadir, Z.; Ullah, F.; Munawar, H.S.; Al-Turjman, F. Addressing disasters in smart cities through uavs path planning and 5g communications: A systematic review. Comput. Commun. 2021, 168, 114–135. [Google Scholar] [CrossRef]
  5. Wei, J.; Dong, W.; Liu, S.; Song, L.; Zhou, J.; Xu, Z.; Wang, Z.; Xu, T.; He, X.; Sun, J. Mapping super high resolution evapotranspiration in oasis-desert areas using uav multi-sensor data. Agric. Water Manag. 2023, 287, 108466. [Google Scholar] [CrossRef]
  6. Cui, Z.; Zhou, P.; Wang, X.; Zhang, Z.; Li, Y.; Li, H.; Zhang, Y. A novel geo-localization method for uav and satellite images using cross-view consistent attention. Remote Sens. 2023, 15, 4667. [Google Scholar] [CrossRef]
  7. Ji, S.; Wei, S.; Lu, M. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens. 2018, 57, 574–586. [Google Scholar] [CrossRef]
  8. Zhao, H.; Ren, K.; Yue, T.; Zhang, C.; Yuan, S. Transfg: A cross-view geo-localization of satellite and uavs imagery pipeline using transformer-based feature aggregation and gradient guidance. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4700912. [Google Scholar] [CrossRef]
  9. Ye, Y.; Bruzzone, L.; Shan, J.; Bovolo, F.; Zhu, Q. Fast and robust matching for multimodal remote sensing image registration. IEEE Trans. Geosci. Remote Sens. 2019, 57, 9059–9070. [Google Scholar] [CrossRef]
  10. Li, J.; Yang, C.; Qi, B.; Zhu, M.; Wu, N. 4scig: A four-branch framework to reduce the interference of sky area in cross-view image geo-localization. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703818. [Google Scholar] [CrossRef]
  11. Zhu, Y.; Sun, B.; Lu, X.; Jia, S. Geographic semantic network for cross-view image geo-localization. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4704315. [Google Scholar] [CrossRef]
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  13. Dai, M.; Hu, J.; Zhuang, J.; Zheng, E. A transformer-based feature segmentation and region alignment method for uav-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4376–4389. [Google Scholar] [CrossRef]
  14. Volpi, R.; Namkoong, H.; Sener, O.; Duchi, J.C.; Murino, V.; Savarese, S. Generalizing to unseen domains via adversarial data augmentation. In Proceedings of the Advances in Neural Information Processing Systems 31: 31st Annual Conference on Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018. [Google Scholar]
  15. Zhai, Q.; Huang, R.; Cheng, H.; Zhan, H.; Li, J.; Liu, Z. Learning quintuplet loss for large-scale visual geolocalization. IEEE MultiMedia 2020, 27, 34–43. [Google Scholar] [CrossRef]
  16. Wu, C.-Y.; Manmatha, R.; Smola, A.J.; Krahenbuhl, P. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2840–2848. [Google Scholar]
  17. Ding, M.; Xiao, B.; Codella, N.; Luo, P.; Wang, J.; Yuan, L. Davit: Dual attention vision transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 74–92. [Google Scholar]
  18. Bansal, M.; Daniilidis, K.; Sawhney, H. Ultrawide baseline facade matching for geo-localization. In Large-Scale Visual Geo-Localization; Springer: Berlin/Heidelberg, Germany, 2016; pp. 77–98. [Google Scholar]
  19. Zhai, M.; Bessinger, Z.; Workman, S.; Jacobs, N. Predicting ground-level scene layout from aerial imagery. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 867–875. [Google Scholar]
  20. Vo, N.N.; Hays, J. Localizing and orienting street views using overhead imagery. In Proceedings of the ECCV 2016: 14th European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Cham, Switzerland, 2016; pp. 494–509. [Google Scholar]
  21. Shao, J.; Jiang, L. Style alignment-based dynamic observation method for uav-view geo-localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 3000914. [Google Scholar] [CrossRef]
  22. Regmi, K.; Shah, M. Bridging the domain gap for ground-to-aerial image matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 470–479. [Google Scholar]
  23. Cai, S.; Guo, Y.; Khan, S.; Hu, J.; Wen, G. Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8391–8400. [Google Scholar]
  24. Zhu, S.; Shah, M.; Chen, C. Transgeo: Transformer is all you need for cross-view image geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1162–1171. [Google Scholar]
  25. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  26. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  27. Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
  28. Chen, D.; Miao, D.; Zhao, X. Hyneter: Hybrid network transformer for multiple computer vision tasks. IEEE Trans. Ind. Inform. 2024, 20, 8773–8785. [Google Scholar] [CrossRef]
  29. Yang, H.; Lu, X.; Zhu, Y. Cross-view geo-localization with layer-to-layer transformer. Adv. Neural Inf. Process. Syst. 2021, 34, 29009–29020. [Google Scholar]
  30. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  31. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  32. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  33. Tian, Y.; Krishnan, D.; Isola, P. Contrastive multiview coding. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XI 16. Springer: Cham, Switzerland, 2020; pp. 776–794. [Google Scholar]
  34. Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3733–3742. [Google Scholar]
  35. Cao, H.; Huang, L.; Nie, J.; Wei, Z. Unsupervised deep hashing with fine-grained similarity-preserving contrastive learning for image retrieval. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 4095–4108. [Google Scholar] [CrossRef]
  36. Huang, M.; Dong, L.; Dong, W.; Shi, G. Supervised contrastive learning based on fusion of global and local features for remote sensing image retrieval. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5208513. [Google Scholar] [CrossRef]
  37. Yeh, C.H.; Hong, C.Y.; Hsu, Y.C.; Liu, T.L.; Chen, Y.; LeCun, Y. Decoupled contrastive learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 668–684. [Google Scholar]
  38. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  39. Zeng, Z.; Wang, Z.; Yang, F.; Satoh, S. Geo-localization via ground-to-satellite cross-view image retrieval. IEEE Trans. Multimed. 2022, 25, 2176–2188. [Google Scholar] [CrossRef]
  40. Li, X.; Xiang, Y.; Li, S. Combining convolutional and vision transformer structures for sheep face recognition. Comput. Electron. Agric. 2023, 205, 107651. [Google Scholar] [CrossRef]
  41. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  42. Hu, S.; Feng, M.; Nguyen, R.M.H.; Lee, G.H. Cvm-net: Cross-view matching network for image-based ground-to-aerial geo-localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7258–7267. [Google Scholar]
  43. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  44. Güldenring, R.; Nalpantidis, L. Self-supervised contrastive learning on agricultural images. Comput. Electron. Agric. 2021, 191, 106510. [Google Scholar] [CrossRef]
  45. Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies 2020, 9, 2. [Google Scholar] [CrossRef]
  46. Shubodh, S.; Omama, M.; Zaidi, H.; Parihar, U.S.; Krishna, M. Lip-loc: Lidar image pretraining for cross-modal localization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2024; pp. 948–957. [Google Scholar]
47. Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1395–1403. [Google Scholar]
  48. Workman, S.; Souvenir, R.; Jacobs, N. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 3961–3969. [Google Scholar]
  49. Liu, L.; Li, H. Lending orientation to neural networks for cross-view geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5624–5633. [Google Scholar]
  50. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
51. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  52. Loshchilov, I.; Hutter, F. Fixing weight decay regularization in adam. In Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  53. Loshchilov, I.; Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983. [Google Scholar]
  54. Deng, W.; Zheng, L.; Ye, Q.; Kang, G.; Yang, Y.; Jiao, J. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 994–1003. [Google Scholar]
  55. Liu, H.; Feng, J.; Qi, M.; Jiang, J.; Yan, S. End-to-end comparative attention networks for person re-identification. IEEE Trans. Image Process. 2017, 26, 3492–3506. [Google Scholar] [CrossRef]
  56. Zheng, Z.; Zheng, L.; Garrett, M.; Yang, Y.; Xu, M.; Shen, Y.-D. Dual-path convolutional image-text embeddings with instance loss. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 51. [Google Scholar] [CrossRef]
  57. Ding, L.; Zhou, J.; Meng, L.; Long, Z. A practical cross-view image matching method between uav and satellite for uav-based geo-localization. Remote Sens. 2020, 13, 47. [Google Scholar] [CrossRef]
  58. Wang, T.; Zheng, Z.; Yan, C.; Zhang, J.; Sun, Y.; Zheng, B.; Yang, Y. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 867–879. [Google Scholar] [CrossRef]
  59. Sun, B.; Liu, G.; Yuan, Y. F3-net: Multi-view scene matching for drone-based geo-localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610611. [Google Scholar]
  60. Tian, X.; Shao, J.; Ouyang, D.; Shen, H.T. Uav-satellite view synthesis for cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4804–4815. [Google Scholar] [CrossRef]
  61. Li, Q.; Yang, X.; Fan, J.; Lu, R.; Tang, B.; Wang, S.; Su, S. Geoformer: An effective transformer-based siamese network for uav geo-localization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9470–9491. [Google Scholar] [CrossRef]
Figure 1. The UAV-view geo-localization process uses multi-view UAV images as queries, retrieves the matching satellite images through image retrieval, and takes the geographic location of the retrieved satellite images as the geo-localization result for the UAV.
Figure 2. Overview of the proposed method, which consists of three parts: data augmentation, the weight-sharing MSM-Transformer architecture, and the symmetric DCL loss function.
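To make the interplay of the three parts concrete, the sketch below outlines one possible training step in PyTorch: both augmented views pass through the same weight-sharing encoder, and the resulting embeddings are scored by a symmetric contrastive loss. All names (`shared_encoder`, `loss_fn`, the batch variables) are illustrative placeholders, not the authors' code.

```python
import torch.nn.functional as F

def train_step(shared_encoder, optimizer, uav_batch, sat_batch, loss_fn):
    """One illustrative training step of the pipeline sketched in Figure 2.

    uav_batch / sat_batch: augmented image tensors of shape (B, 3, 224, 224)
    whose i-th elements depict the same geographic target.
    """
    shared_encoder.train()
    optimizer.zero_grad()

    # The same weight-sharing backbone embeds both views; embeddings are L2-normalized.
    uav_emb = F.normalize(shared_encoder(uav_batch), dim=-1)   # (B, D)
    sat_emb = F.normalize(shared_encoder(sat_batch), dim=-1)   # (B, D)

    # Symmetric contrastive loss over the in-batch similarity matrix.
    loss = loss_fn(uav_emb, sat_emb)
    loss.backward()
    optimizer.step()
    return loss.item()
```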
Figure 3. Comparison of data augmentation effects. The UAV images are augmented with combined horizontal flipping and rotation, coarse dropout, and color jittering, while the satellite images are augmented with combined horizontal flipping and rotation, and grid dropout.
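The two pipelines of Figure 3 could be written, for example, with the Albumentations library as below; the probabilities, dropout sizes, and jitter strengths are illustrative assumptions rather than the settings used in the paper, and the parameter names follow older Albumentations releases.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# UAV-view pipeline: flip + rotation, coarse dropout, and color jitter (values are illustrative).
uav_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=90, p=0.5),
    A.CoarseDropout(max_holes=8, max_height=24, max_width=24, p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.5),
    A.Normalize(),
    ToTensorV2(),
])

# Satellite-view pipeline: flip + rotation and grid dropout only.
sat_transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=90, p=0.5),
    A.GridDropout(ratio=0.4, p=0.5),
    A.Normalize(),
    ToTensorV2(),
])

# Usage: augmented = uav_transform(image=uav_image_np)["image"]
```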
Figure 4. The MSM-Transformer architecture feeds the UAV and satellite images into the weight-sharing DaViT model to obtain the corresponding embedding vectors; the symmetric DCL loss is then computed between the two sets of embeddings.
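For intuition, the following is a minimal sketch of a symmetric decoupled contrastive loss in PyTorch: as in Yeh et al. [37], the positive pair is removed from each InfoNCE denominator, and the UAV-to-satellite and satellite-to-UAV directions are averaged. It illustrates the idea only and is not the authors' implementation; the temperature value is an assumption.

```python
import torch

def symmetric_dcl_loss(uav_emb, sat_emb, temperature=0.1):
    """Decoupled contrastive loss averaged over both matching directions.

    uav_emb, sat_emb: (B, D) L2-normalized embeddings; row i of each tensor
    corresponds to the same geographic target (the positive pair).
    """
    logits = uav_emb @ sat_emb.t() / temperature           # (B, B) similarity matrix
    pos = torch.diag(logits)                               # positive-pair logits

    # Decoupling: exclude the positive term from each denominator.
    mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    neg_u2s = logits.masked_fill(mask, float("-inf"))      # UAV -> satellite negatives
    neg_s2u = logits.t().masked_fill(mask, float("-inf"))  # satellite -> UAV negatives

    loss_u2s = (-pos + torch.logsumexp(neg_u2s, dim=1)).mean()
    loss_s2u = (-pos + torch.logsumexp(neg_s2u, dim=1)).mean()
    return 0.5 * (loss_u2s + loss_s2u)
```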
Figure 5. The top-k recall rate, denoted R@k, evaluates the ranking position of the positive sample among all reference samples.
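The metrics of Figure 5 can be computed directly from a query-to-gallery similarity matrix. The sketch below assumes a single true satellite match per UAV query, in which case AP reduces to the reciprocal rank of that match; the function and variable names are illustrative.

```python
import torch

def recall_at_k_and_ap(sim, gt_index, ks=(1, 5, 10)):
    """sim: (Q, G) query-to-gallery similarity; gt_index: (Q,) index of the true match.

    With a single relevant gallery image per query, AP reduces to 1 / rank.
    """
    order = sim.argsort(dim=1, descending=True)        # (Q, G) gallery ids by similarity
    hit = order == gt_index.unsqueeze(1)               # True where the positive appears
    rank = hit.float().argmax(dim=1) + 1               # 1-based rank of the positive

    recalls = {f"R@{k}": (rank <= k).float().mean().item() for k in ks}
    ap = (1.0 / rank.float()).mean().item()
    return recalls, ap
```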
Table 1. Experimental equipment and environmental settings.
Parameter | Configuration
CPU | Intel Core i7-12700KF
GPU | NVIDIA GeForce RTX 3090 (24 GB)
CUDA version | CUDA 11.7
Python version | Python 3.9.13
Deep learning framework | PyTorch 2.0.0
Operating system | Ubuntu 22.04.2 LTS
Table 2. Results of ablation experiments.
Method | DaViT-Tiny | Share Weight | Symmetric DCL Loss | Params/M | R@1 | R@5 | R@10 | AP
baseline | – | – | – | 51.2 (25.6 × 2) | 58.19 | 79.00 | 85.21 | 62.90
Method (1) | ✓ | – | – | 56.8 (28.4 × 2) | 72.79 | 87.97 | 92.02 | 76.24
Method (2) | ✓ | ✓ | – | 28.4 | 81.96 | 93.74 | 95.68 | 84.58
Method (3) | ✓ | ✓ | ✓ | 28.4 | 89.22 | 97.29 | 98.18 | 91.04
Table 3. Comparison with other algorithms.
Method | Backbone | Params/M | R@1 | R@5 | R@10 | AP
Contrastive loss (224 × 224) | VGG-16 | 276.8 (138.4 × 2) | 41.70 | 63.61 | 73.98 | 46.93
triplet loss (224 × 224) | ResNet-50 | 51.2 (25.6 × 2) | 58.19 | 79.00 | 85.21 | 62.90
instance loss (224 × 224) | ResNet-50 | 51.2 (25.6 × 2) | 58.49 | 79.40 | 85.31 | 63.31
LCM (224 × 224) | ResNet-50 | 51.2 (25.6 × 2) | 66.65 | 84.93 | 90.02 | 70.82
LPN (512 × 512) | ResNet-50 | 51.2 (25.6 × 2) | 77.71 | / | / | 80.80
F3-Net (384 × 384) | SF | / | 78.64 | 91.71 | 94.57 | 81.60
PCL (512 × 512) | ResNet-50 | 51.2 (25.6 × 2) | 83.27 | 90.32 | 95.56 | 87.32
FSRA (256 × 256) | ViT-small | 44.2 (22.1 × 2) | 82.25 | / | / | 84.82
FSRA (512 × 512) | ViT-small | 44.2 (22.1 × 2) | 85.50 | / | / | 87.53
TransFG (512 × 512) | ViT-small | 44.2 (22.1 × 2) | 87.92 | / | / | 89.90
GeoFormer (224 × 224) | E-Swin-large | 157.0 | 89.08 | 96.83 | 98.09 | 90.83
Ours (224 × 224) | DaViT-tiny | 28.4 | 89.22 | 97.29 | 98.18 | 91.04
Ours (224 × 224) | DaViT-small | 49.7 | 91.81 | 98.83 | 99.27 | 92.23
Table 4. Results of algorithm efficiency test.
Method | Backbone | Params/M | FLOPs/G | Time Consumed
Contrastive loss | VGG-16 | 276.8 | 31.00 | 6 min 30 s
triplet loss | ResNet-50 | 51.2 | 8.24 | 4 min 14 s
LPN | ResNet-50 | 51.2 | 8.24 | 4 min 23 s
FSRA | ViT-small | 44.2 | 9.20 | 5 min 11 s
Ours | DaViT-tiny | 28.4 | 4.54 | 1 min 13 s
Ours | DaViT-small | 49.7 | 8.80 | 1 min 52 s
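The Params/M column can be checked for any PyTorch backbone by counting parameters directly; the sketch below uses the `timm` implementation of DaViT-Tiny, whose model name and availability are assumptions of this illustration.

```python
import timm
import torch

# DaViT-Tiny backbone from timm (model name assumed for illustration).
model = timm.create_model("davit_tiny", pretrained=False, num_classes=0)

params_m = sum(p.numel() for p in model.parameters()) / 1e6
print(f"Params: {params_m:.1f} M")   # should land near the 28.4 M reported for DaViT-Tiny

# Quick forward pass on a single 224 x 224 image to confirm the embedding output.
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    emb = model(x)
print(emb.shape)
```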