1. Introduction
Cross-view geo-localization is a task that achieves target localization through multi-source image matching, commonly involving perspectives such as UAV imagery, satellite imagery, and ground-level imagery. It finds applications in autonomous driving [1], event detection, 3D reconstruction, and other fields. Typically, this process involves using an image from one perspective (query image) to search for the most similar image within a large dataset (gallery) captured from another perspective. The gallery images are usually pre-annotated with geographic coordinates, enabling the derivation of the query image's location through successful matching. This localization approach functions as an image retrieval method and serves as a robust supplementary positioning solution when GNSS signals are weak or unavailable [2,3].
Early research on cross-view image matching predominantly focused on single ground-level views [4,5]. However, ground-level imagery suffers from limitations such as small spatiotemporal coverage, incomplete geolocation metadata, and high manual annotation costs. Satellite imagery, with its inherent advantages of broad spatial coverage and geotagged metadata, has positioned multi-view cross-view matching as a prominent research focus. A seminal advancement in this field is the cross-view feature translation method proposed by Lin et al. [6], which represents a critical milestone in cross-view matching research. Traditional methods relied on handcrafted feature descriptors for matching. Castaldo et al. designed descriptors to robustly capture semantic concepts and spatial layouts [7], yet manually engineered features demonstrated limited robustness and matching accuracy. Subsequently, Support Vector Machine (SVM)-based image classification methods emerged. With the rapid advancements in deep learning, Convolutional Neural Networks (CNNs) have shown superior performance in feature representation and are now widely adopted in cross-view image matching. Zhai et al. proposed a semantic feature extraction strategy for satellite views based on the VGG16 architecture, which projects satellite image features onto ground-view perspectives for direct comparison with ground-level image features, enabling cross-view image matching and localization [8]. Pan et al. proposed a CNN-based visual navigation method for UAVs, designing a fully convolutional network model integrated with saliency features and a neighborhood saliency reference localization strategy to achieve multi-scale aerial image localization [9].
The advancement of cross-view image matching has necessitated the creation of specialized datasets. Workman et al. introduced the large-scale CVUSA (Cross-View USA) dataset to support training and developed a CNN-based framework for transforming ground-level image features into aerial representations [10]; Liu and Li proposed the CVACT dataset and incorporated orientation-aware modeling into their framework, enhancing matching accuracy between satellite and ground-level views [11]. In recent years, with the widespread adoption of UAVs, researchers have leveraged drones as critical platforms for data collection. Building on this, Zheng et al. introduced the University-1652 tri-view dataset, utilizing dual-branch and triple-branch CNNs with category labels for cross-view image matching and localization by incorporating UAVs as auxiliary platforms [12]. In the University-1652 dataset, UAV images capture fewer obstructions compared to ground-level imagery, providing a broader field-of-view (FOV). Additionally, each platform offers multi-view imagery, with an average of 71 images per location, which enhances the model's ability to comprehend target structures and learn viewpoint-invariant features. Therefore, this dataset is selected to investigate UAV-to-satellite image matching and localization [13], offering a robust foundation for cross-view geo-localization research.
Most existing cross-view image matching studies adopt viewpoint-agnostic processing, where end-to-end learning extracts highly discriminative features from different views and distinguishes similar images based on inter-class variations. Ding et al. improved Zheng's baseline model by reframing image retrieval as a classification problem, proposing the LCM (Location Classification Method) [14]. Zhuang et al. designed the MSBA (Multiscale Block Attention) model using a self-attention mechanism, partitioning images into multiscale blocks for feature extraction to achieve more effective metric learning [15]. Inspired by human visual observation patterns, Wang et al. introduced the LPN (Local Pattern Network) based on square-ring partitioning [16]. This strategy captures geographic target information and contextual details, enhancing matching performance and robustness to image rotation. However, as building distributions in images are irregular, fixed square-ring partitions may introduce noise from varying UAV flight heights and angles. Equal weighting of all four partitions amplifies such noise.
To address this limitation, we revisit human observation habits; humans typically focus first on central structures in images. If the central region lacks informative content or lacks a dominant structure, attention shifts to surrounding salient features. Therefore, images can be partitioned into central subjects and peripheral environments, aligning with human observation patterns while reducing noise interference. In addition, given the variability in the position and size of the central subject across different images, we propose incorporating heatmaps to achieve precise localization of the central subject while dynamically optimizing the weighting of the central region through hyperparameter adaptation. This approach effectively enhances the accuracy of image partitioning.
Beyond enhancing feature discriminability and maximizing inter-class differences, reducing intra-class distances offers another pathway to improve matching accuracy. By treating multi-view images of the same building as a single class, minimizing feature discrepancies between views effectively reduces intra-class variance. To address large cross-view perspective gaps, Huang et al. proposed a Conditional Generative Adversarial Network (CGAN)-based prediction module, generating auxiliary ground-level information from satellite images for feature alignment [17]. Tian et al. integrated spatial correspondences between satellite images and surrounding areas, first transforming UAV oblique views to nadir perspectives via perspective projection, then using CGANs to adapt UAV images to satellite-like views [18]. Shao et al. introduced a Style Alignment Strategy (SAS) to harmonize UAV image RGB distributions with satellite imagery, mimicking their stylistic properties [19].
Observations reveal significant stylistic discrepancies between UAV and satellite views due to differences in capture time and angles. Satellite images exhibit uniform coloration owing to long-distance imaging, while UAV images display vivid colors and inconsistent lighting due to variable flight heights and angles. Therefore, to reduce intra-class variance, we align UAV image brightness with satellite references, harmonizing stylistic discrepancies between the two views to enhance matching accuracy.
Deep learning-based drone geo-localization algorithms fundamentally rely on constructing cross-modal aerial-satellite matching models, which necessitate large-scale training data. To address this, researchers have begun integrating traditional methodologies with deep learning frameworks to compensate for inherent limitations and enhance robustness in complex environments. Nassar et al. [20,21] developed a hybrid framework combining conventional computer vision techniques with CNNs; initial registration is achieved through SIFT and ORB feature matching, followed by U-Net semantic segmentation to extract precise building and road masks. Kenvin et al. [22] proposed a hybrid-feature non-rigid correspondence estimation method for multi-perspective remote sensing imagery, where SIFT point features initially characterize images before resolving terrain undulations and viewpoint variations via mixed-feature correspondence estimation.
While prior studies typically employed traditional methods for preliminary registration, such approaches often introduce false correspondences and incur high computational costs. To overcome these limitations, our method implements a CNN-based coarse matching stage to identify candidate satellite images, followed by geometric verification for fine-grained alignment. This hierarchical strategy significantly improves matching accuracy.
In summary, the primary contributions of this paper are as follows:
We propose an adaptive threshold-guided ring partitioning framework that divides images into central subjects and peripheral environments. By dynamically selecting the central subject’s position via heatmaps and optimizing its size as a learnable hyperparameter, this method resolves noise issues introduced by fixed four-layer partitioning, achieving more accurate and flexible feature division.
During testing, we introduce a keypoint matching-based re-ranking mechanism for the top 5 candidate images. By refining rankings based on matched keypoint counts, this approach addresses the low accuracy of top 1 correct matches, significantly improving retrieval precision.
By adjusting the brightness of UAV images to align with satellite references, we address the challenge of significant stylistic discrepancies between cross-view images of the same target, harmonizing their visual styles and thereby reducing the complexity of cross-view matching. Simultaneously, we propose integrating an additive angular margin loss (ArcFace Loss) with the cross-entropy loss, which enhances the discriminability of visual features by enforcing intra-class compactness and inter-class separation in angular space, effectively overcoming the difficulty of extracting salient features.
We rigorously evaluate our method on the University-1652 dataset. Both qualitative analysis and quantitative evaluations demonstrate that the proposed algorithm achieves competitive performance, confirming its effectiveness in cross-view geo-localization tasks.
3. Proposed Method
The architecture of the proposed adaptive threshold-guided partition-based matching algorithm is illustrated in Figure 1 and Figure 2.
During the training phase, the dataset is first preprocessed with brightness alignment to reduce matching complexity. A dual-branch CNN serves as the baseline model for feature extraction from UAV and satellite images. For feature extraction, Squeeze-and-Excitation (SE) attention modules are integrated after each convolutional layer in the ResNet-50 backbone. Building on the square-ring partition strategy proposed by Wang et al., we enhance the partitioning reliability and contextual information utilization by incorporating heatmap-guided adaptive positioning to dynamically determine region boundaries. The extracted features are then fed into a Generalized Mean (GeM) pooling layer, and the model is optimized using a combined loss function of cross-entropy loss and ArcFace loss to ensure robust convergence.
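For concreteness, the sketch below shows how an SE attention block and a GeM pooling layer of the kind described above can be implemented in PyTorch; the reduction ratio, the initial GeM exponent, and the module names are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Illustrative SE attention block and GeM pooling layer (assumed hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise re-weighting of a feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                      # squeeze: global average pooling -> (B, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)  # excitation -> (B, C, 1, 1)
        return x * w                                # re-scale each channel

class GeMPooling(nn.Module):
    """Generalized-mean pooling with a learnable exponent p."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x):                           # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)
        x = F.adaptive_avg_pool2d(x, 1).pow(1.0 / self.p)
        return x.flatten(1)                         # (B, C) image descriptor
```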
During the testing phase, images undergo brightness-aligned preprocessing before visual features are extracted using the trained model for both query and gallery images. Retrieval is then performed by ranking candidates based on feature similarity scores, with each query image returning the top 5 most similar matches. To further enhance accuracy, keypoint matching is applied to the top 5 candidates, refining their rankings based on the count of geometrically consistent keypoints. This re-ranking strategy enables rapid and precise cross-view geo-localization, effectively bridging the domain gap between UAV and satellite perspectives.
The following sections elaborate on the technical details of the algorithm.
3.1. Brightness Alignment
Analysis of the University-1652 dataset reveals substantial cross-view appearance discrepancies between satellite and drone imagery due to divergent acquisition platforms, capture times, and illumination conditions. These variations significantly increase the difficulty of cross-view image retrieval. To address this, Shao et al.’s SAS first computes cumulative distributions across RGB channels in satellite images to generate transformation mappings. Drone images are subsequently transformed using these mappings to align their visual style with satellite references. Inspired by this principle, we propose a streamlined solution in HSV color space that minimizes appearance gaps by adjusting only the brightness component. This single-parameter alignment achieves comparable style normalization with substantially reduced computational complexity compared to RGB-based transformations.
Therefore, we implement brightness normalization as a preprocessing step. Drone image brightness values are normalized to match corresponding satellite reference targets. As demonstrated in Figure 3, which shows the satellite image, the original UAV image, and the processed UAV image, this simple yet effective adjustment harmonizes luminance between cross-view images and achieves approximate style alignment. This preprocessing step significantly reduces intra-class variance caused by illumination differences, thereby substantially improving the model's matching accuracy.
The brightness alignment strategy is implemented in two steps. First, each satellite image I is converted to a grayscale image G, and the mean brightness value B(I) of the satellite image is calculated as defined in Equation (1):

$$ B(I) = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} G(x, y) \quad (1) $$

where W and H are the width and height of the image, and G(x, y) represents the pixel value at position (x, y) in the grayscale image. In Equation (2), the brightness values of all N satellite images are averaged, where B(I_i) denotes the mean brightness value of the i-th satellite image:

$$ \bar{B} = \frac{1}{N} \sum_{i=1}^{N} B(I_i) \quad (2) $$
Finally, the brightness of each UAV image J is adjusted to this global mean B̄. The brightness of UAV image J, calculated using the same method as for satellite images in Equation (1), is denoted as B(J). The adjustment process is expressed in Equation (3):

$$ J'(x, y) = J(x, y) \times \frac{\bar{B}}{B(J)} \quad (3) $$

where J(x, y) denotes the pixel value at position (x, y) in the original UAV image, and J'(x, y) denotes the adjusted pixel value after brightness alignment.
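The following minimal Python sketch illustrates Equations (1)-(3) in practice; the use of OpenCV for grayscale conversion, the multiplicative scaling form of Equation (3) as reconstructed above, and the clipping to the 8-bit range are implementation assumptions.

```python
# Minimal sketch of the brightness-alignment preprocessing (Equations (1)-(3)).
import cv2
import numpy as np

def mean_brightness(img_bgr):
    """Equation (1): mean grey-level of one image."""
    grey = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return float(grey.mean())

def align_brightness(uav_bgr, global_mean):
    """Equation (3): scale UAV pixel values so that the image's mean
    brightness matches the global satellite mean (assumed multiplicative form)."""
    b_j = mean_brightness(uav_bgr)
    scale = global_mean / max(b_j, 1e-6)
    return np.clip(uav_bgr.astype(np.float32) * scale, 0, 255).astype(np.uint8)

# Equation (2): global mean brightness over all satellite images (paths assumed).
# satellite_imgs = [cv2.imread(p) for p in satellite_paths]
# global_mean = np.mean([mean_brightness(im) for im in satellite_imgs])
# uav_aligned = align_brightness(cv2.imread(uav_path), global_mean)
```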
3.2. Adaptive Threshold-Guided Ring Partition Framework (ATRPF)
Unlike the four-layer square-ring partition strategy proposed by Wang et al., Shao et al.'s dual-layer approach proves more effective for datasets featuring prominent architectural structures. As illustrated in Figure 4, the four-layer segmentation results in excessively fragmented regions where primary buildings and surrounding environments are divided into disproportionately sized patches, compromising spatial coherence. Conversely, the dual-layer configuration optimally preserves structural integrity by minimizing building fragmentation while maintaining contextual relationships. Additionally, this methodology facilitates subsequent adaptive zoning operations. Consequently, we adopt the dual-ring partitioning strategy as our preferred approach for architectural feature extraction.
After establishing the dual-layer square-ring partitioning strategy, we optimize the partitioning scheme. In the conventional four-layer approach, the positions and sizes of square rings are fixed (uniform distribution), assigning equal computational weights to all partitions. This rigid uniformity amplifies noise interference from peripheral regions. To address this, our adaptive dual-layer square-ring partition strategy dynamically adjusts the size and position of the central square box, as depicted in Figure 5.
Firstly, the size ratio between the central box (Wc) and the original image (W), defined as Wc/W, is initialized as a trainable hyperparameter. This transforms fixed partitions into an automatically learnable mechanism, allowing distinct computational weights for each partition. Furthermore, heatmap visualization highlights regions of high attention (e.g., building locations) within the image. The central box is dynamically positioned over these high-attention regions, enabling instance-specific adaptive partitioning for precise localization. This dual mechanism optimizes both partition geometry and semantic focus, enhancing robustness against environmental noise.
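A simplified sketch of this mechanism is given below; deriving the heatmap as the channel-wise mean of the feature map and pooling each partition by averaging are illustrative assumptions, not the exact implementation.

```python
# Sketch of heatmap-guided dual-ring partitioning (assumed heatmap and pooling).
import torch

def ring_partition(feature_map, ratio):
    """Split a feature map into a central box (around the heatmap peak) and the
    surrounding ring, returning one pooled descriptor per partition.

    feature_map: (B, C, H, W) tensor; ratio: scalar in (0, 1) corresponding to Wc / W.
    """
    B, C, H, W = feature_map.shape
    heat = feature_map.mean(dim=1)                       # (B, H, W) attention heatmap
    centre_desc, ring_desc = [], []
    for b in range(B):
        # locate the high-attention peak for this sample
        idx = torch.argmax(heat[b])
        cy, cx = int(idx // W), int(idx % W)
        half = max(1, int(ratio * W / 2))
        y0, y1 = max(0, cy - half), min(H, cy + half)
        x0, x1 = max(0, cx - half), min(W, cx + half)
        mask = torch.zeros(H, W, dtype=torch.bool, device=feature_map.device)
        mask[y0:y1, x0:x1] = True
        fm = feature_map[b]                              # (C, H, W)
        centre_desc.append(fm[:, mask].mean(dim=1))      # pool the central subject
        ring_desc.append(fm[:, ~mask].mean(dim=1))       # pool the peripheral ring
    return torch.stack(centre_desc), torch.stack(ring_desc)
```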
3.3. Loss Functions
In the dual-branch network, discriminative feature learning is achieved by jointly employing cross-entropy loss and ArcFace Loss.
The cross-entropy loss focuses on the predicted probability of the ground-truth class and is commonly used in image classification tasks. It is defined as:

$$ L_{CE} = -\sum_{i} p_i \log(q_i) $$

where p represents the true probability distribution of known classes in the training dataset: when the sample image belongs to the i-th class, p_i = 1, and 0 otherwise. q denotes the predicted probability distribution from the classifier.
ArcFace loss, an angular margin-based loss function, enhances feature discriminability by introducing an additive angular margin constraint in the angular space. It compacts intra-class features while separating inter-class features in angular space. The loss is formulated as:

$$ L_{Arc} = -\log \frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\, j \neq y_i}^{N} e^{s\cos\theta_j}} $$

where x_i represents the deep feature of the i-th sample, which belongs to the y_i-th class, and the total number of classes is N. W_j denotes the j-th column of the weight matrix. θ_j is the angle between the weight W_j and the feature x_i, while θ_{y_i} represents the angle between the feature x_i and the true class center W_{y_i}. m is the angular margin, and s is the feature scale.
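The sketch below illustrates how the two losses can be combined in PyTorch; the margin m = 0.5, the scale s = 30, and the equal weighting of the two terms are assumptions for illustration, not the exact training configuration.

```python
# Hedged sketch of the joint cross-entropy + ArcFace objective (assumed s, m, weighting).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    def __init__(self, feat_dim, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m

    def forward(self, features, labels):
        # cosine of the angle between each feature x_i and each class centre W_j
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only to the ground-truth class
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)

def total_loss(class_logits, arc_head, features, labels):
    # Combined objective; the 1:1 weighting of the two terms is an assumption.
    return F.cross_entropy(class_logits, labels) + arc_head(features, labels)
```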
3.4. Feature Re-Matching
To address the issue of low accuracy in top 1 image matching, we introduce feature point matching as a secondary criterion for ranking during the testing phase, where the number of matched points serves as the sorting metric. This approach reorders the initial top 5 images selected by the network, significantly enhancing the accuracy of top 1 image matching and enabling precise localization of UAV imagery.
Feature matching between two images can generally be categorized into traditional matching methods (SIFT, ORB, etc.) and deep learning-based approaches. SIFT matching identifies stable keypoints through scale-space extrema detection, generates descriptors using gradient orientation information from keypoint neighborhoods, and finally establishes point-to-point correspondences based on descriptor similarity. ORB feature matching combines FAST corner detection with BRIEF binary descriptors, achieving faster matching speeds than SIFT through scale pyramid construction and rotation-invariant feature descriptors. The LoFTR (Local Feature Transformer) method employs a Transformer architecture to enhance feature descriptor matching capabilities by learning long-range dependencies among feature points through self-attention mechanisms [40]. SuperPoint [41] extracts keypoints and descriptors from images, while SuperGlue [42] establishes contextual relationships between features through graph neural networks (GNNs) and attention mechanisms to achieve robust optimal matching. Notably, LoFTR emphasizes precise local feature matching, whereas the SuperPoint+SuperGlue (SP+SG) framework prioritizes global-view feature matching optimization, demonstrating particular efficacy in complex scenes.
Figure 6 illustrates point-to-point matching results between UAV and satellite images using the four aforementioned methods. After applying Random Sample Consensus (RANSAC) to eliminate outliers, SIFT yielded five valid matching pairs, whereas ORB produced fifteen pairs with noticeable mismatches. Transformer-based methods demonstrated superior performance, generating significantly more accurate matches. Notably, SP+SG achieved finer image detail matching and higher reliability compared to LoFTR.
By integrating feature point matching, this strategy not only improves top 1 matching precision but also enables cross-view image registration. The established spatial transformation model between UAV and satellite imagery lays a critical foundation for subsequent high-accuracy localization of small targets in UAV images. This advancement ensures robust geolocation capabilities in challenging environments, aligning with the demands of remote sensing applications.
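The following sketch outlines the re-ranking procedure; SIFT with RANSAC outlier rejection is used here only as a simple stand-in for the SP+SG matcher adopted in our experiments, and the Lowe ratio and RANSAC threshold are illustrative values.

```python
# Sketch of keypoint-based re-ranking of the top-5 candidates (SIFT+RANSAC stand-in).
import cv2
import numpy as np

def inlier_count(query_img, cand_img):
    """Number of geometrically consistent keypoint matches between two images."""
    sift = cv2.SIFT_create()
    kq, dq = sift.detectAndCompute(query_img, None)
    kc, dc = sift.detectAndCompute(cand_img, None)
    if dq is None or dc is None:
        return 0
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(dq, dc, k=2)
    # Lowe ratio test (0.75 is an assumed threshold)
    matches = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(matches) < 4:
        return len(matches)
    src = np.float32([kq[m.queryIdx].pt for m in matches])
    dst = np.float32([kc[m.trainIdx].pt for m in matches])
    _, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)  # RANSAC outlier rejection
    return int(inliers.sum()) if inliers is not None else 0

def rerank_top5(query_img, top5_imgs):
    """Re-order the five retrieved candidates by their inlier counts."""
    scores = [inlier_count(query_img, img) for img in top5_imgs]
    return [i for _, i in sorted(zip(scores, range(len(top5_imgs))), reverse=True)]
```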
4. Experiments
4.1. Datasets and Evaluation Protocol
The experiments utilize the multi-source, multi-view University-1652 dataset, which comprises 1652 buildings across 72 universities worldwide, distributed across diverse geographical locations and climatic zones. Significant variations exist in architectural styles and surrounding environments—for instance, universities in urban centers feature densely clustered buildings with busy road networks, while those in suburban areas exhibit more dispersed structures. This diverse data distribution provides robust support for evaluating algorithm performance across heterogeneous real-world scenarios. The training set includes 701 buildings, while the test set comprises the remaining 951 buildings, with no overlapping buildings between the two sets. Each building is represented by three perspectives: one satellite image, fifty-four UAV images captured at varying heights and angles, and multiple ground-level images. UAV images are synthetically generated from 3D models provided by Google Earth, simulated as video recordings at 30 frames per second along spiral trajectories. The dataset also incorporates 21,099 co-view images collected from Google Images as an additional training set.
Figure 7 illustrates representative cross-view image samples of a specific building within the dataset, demonstrating its multi-perspective coverage. For this study, we focus exclusively on cross-view geo-localization between satellite and UAV images.
Proposed by Zheng et al., this dataset supports two UAV-centric tasks:
UAV Target Geo-Localization (UAV → Satellite): Retrieving the geographic location of a UAV-captured query image from a satellite gallery.
UAV Navigation (Satellite → UAV): Guiding a UAV to a target area using a satellite query image.
The experiments employ Recall@K and Average Precision (AP) as metrics to evaluate matching model performance. After the model generates a ranked list of candidate images for a query, Recall@K is assigned a value of 1 if the ground-truth match appears within the top K ranked results; otherwise, it is 0. Common values for K include 1, 5, and 10. Higher Recall@K values indicate superior retrieval performance. However, Recall@K is highly sensitive to the position of the first true match in the ranked list. For UAV navigation tasks, where multiple valid matches may exist for a query target, AP, calculated as the area under the Precision–Recall (P-R) curve, is adopted as an additional metric. AP comprehensively reflects both precision and recall, ensuring a more accurate and holistic assessment of retrieval performance.
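For clarity, a minimal computation of the two metrics for a single query is sketched below; the binary relevance list (1 for a true match at that rank, 0 otherwise) is an assumed input.

```python
# Illustrative Recall@K and AP computation for one ranked retrieval list.
import numpy as np

def recall_at_k(ranked_relevance, k):
    """1 if any true match appears in the top-k results, else 0."""
    return int(np.any(np.asarray(ranked_relevance[:k]) > 0))

def average_precision(ranked_relevance):
    """Area under the precision-recall curve for one ranked list."""
    rel = np.asarray(ranked_relevance, dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at_hits = np.cumsum(rel) / (np.arange(len(rel)) + 1)
    return float((precision_at_hits * rel).sum() / rel.sum())

# Example: the single true match is ranked 2nd among 5 candidates.
# recall_at_k([0, 1, 0, 0, 0], 1) -> 0;  recall_at_k([0, 1, 0, 0, 0], 5) -> 1
# average_precision([0, 1, 0, 0, 0]) -> 0.5
```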
4.2. Implementation Details
The experiments were conducted on an Ubuntu 16.04 operating system using the PyTorch 1.7.0 framework with Python 3.6. The hardware configuration featured an Intel(R) Core(TM) i9-10900K CPU operating at 3.70 GHz base frequency. The backbone network, ResNet-50, was pre-trained on the ImageNet dataset and modified for our task; the stride of the second convolutional layer was reduced from 2 to 1, the original classification layer was removed, and a new fully connected (FC) layer followed by a classification layer was inserted after the pooling layer. Input images were resized to 384 × 384 pixels and augmented through random cropping, affine transformations, and horizontal flipping. The initial learning rate was set to 0.001 for the backbone network and 0.01 for the newly added layers. Training spanned 120 epochs, with the learning rate decaying by a factor of 0.1 every 80 epochs, and a dropout rate of 0.5 was applied. The model was optimized using Stochastic Gradient Descent (SGD) with a momentum of 0.9, weight decay of 0.0005, and parameters updated iteratively to ensure convergence.
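An illustrative PyTorch sketch of this backbone modification is given below; the choice of which stride to reset (the last down-sampling block of layer4) and the 512-dimensional embedding layer are assumptions rather than the exact architecture used.

```python
# Rough sketch of the modified ResNet-50 branch (assumed stride location and embedding size).
import torch.nn as nn
from torchvision import models

def build_branch(num_classes, feat_dim=512):
    backbone = models.resnet50(pretrained=True)          # ImageNet pre-trained weights
    # keep more spatial resolution: set a down-sampling stride from 2 to 1
    backbone.layer4[0].conv2.stride = (1, 1)
    backbone.layer4[0].downsample[0].stride = (1, 1)
    backbone.fc = nn.Identity()                           # drop the original classifier
    embed = nn.Sequential(                                # new FC (embedding) layer
        nn.Linear(2048, feat_dim),
        nn.BatchNorm1d(feat_dim),
        nn.ReLU(inplace=True),
        nn.Dropout(0.5),
    )
    classifier = nn.Linear(feat_dim, num_classes)         # new classification layer
    return backbone, embed, classifier
```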
During the testing phase, the similarity between cross-view images was computed using cosine distance, and the retrieval process ranked candidates based on these similarity scores. Given the absence of ground-level images as auxiliary data in UAV geo-localization applications, this experiment employed the dual-branch baseline model, where weights were updated and shared across branches after each training epoch.
4.3. Comparison with Other Methods
Table 1 presents the evaluation metrics of existing methods and the proposed algorithm on the University-1652 dataset. Zheng et al. introduced a baseline model using ResNet50 for this dataset. The LCM augments satellite data to enhance robustness. The LPN employs square-ring partitioning on extracted features to fully leverage contextual information. The SAFA (Spatial-Aware Feature Aggregation) [43] utilizes a two-step prior-informed approach: first, applying polar coordinate transformation to warp aerial images to approximate ground-level perspectives, and second, incorporating a spatial attention mechanism to align deep features in the embedding space. The USAM (Unit Subtraction Attention Module) [44] combines representation learning with keypoint detection, automatically identifying representative keypoints from feature maps and directing attention to salient regions. Tian et al.'s PCL methodology executes a three-stage workflow: Perspective Projection Transformation (PPT) converts oblique UAV imagery to nadir views, CGAN-based synthesis enhances satellite-style realism, and LPN facilitates discriminative feature extraction for classification. The FSRA framework integrates Transformer architectures to partition images according to attention heatmaps, establishes cross-view correspondences through region-wise paired alignment, and ultimately consolidates localized features into unified representations. Wang et al. [45] developed Dynamic Weighted Decorrelation Regularization (DWDR) to address feature redundancy in cross-view image geo-localization.
The proposed algorithm is based on a dual-branch architecture tailored to the UAV and satellite perspectives, and does not employ common-view images collected from Google Images during training. As shown in Table 1, our method outperforms most existing approaches. Specifically, it achieves Recall@1 of 82.50% and AP of 84.28% for UAV→Satellite geo-localization, and Recall@1 of 90.87% with AP of 80.25% for Satellite→UAV navigation. The proposed algorithm achieves the highest recall rates in both drone geo-localization and navigation tasks, though its AP is marginally lower than that of the FSRA method. Nevertheless, holistic evaluation demonstrates the overall superiority of our approach.
Since the UAV images in the University-1652 dataset are synthesized from 3D models, their results exhibit inherent limitations; hence, we validate the effectiveness of our algorithm on real-world imagery. In this experiment, we employ the SUES-200 dataset captured at varying altitudes and conduct comparative studies with multiple algorithms on these authentic drone images. As shown in Table 2, we compare LCM, the SUES-200 baseline, LPN, and our proposed method. The results demonstrate that our algorithm achieves 73.72% Recall@1 and 76.93% AP in drone geolocation tasks, where the Recall@1 outperforms the other compared methods; however, its AP slightly trails LPN, which might be attributed to LPN's superior capability in capturing global image features.
During training, the model converged in approximately 38 min when trained on the 701-category dataset. For inference, the UAV-to-satellite task took 10 min when processing the 700-category test set, while the satellite-to-UAV task required 13 min.
4.4. Ablation Studies
In the ablation study, we first established a baseline model as the control group. The baseline employs a ResNet-50 backbone with a SE attention mechanism appended after each convolutional layer. GeM pooling was adopted for feature aggregation, and the loss function solely utilized cross-entropy loss. As shown in Table 3, the baseline achieved Recall@1 and AP scores of 66.17% and 70.33%, respectively, for the UAV geolocation task, and Recall@1 and AP scores of 77.43% and 64.79% for the UAV navigation task.
When matching a single UAV image against a satellite gallery, only limited angular information is available. To address this, we implement the multi-query strategy proposed by Zheng et al., utilizing multiple drone images captured from diverse perspectives. This approach enriches feature representation by averaging descriptors across multi-angle drone captures before cross-view retrieval against the satellite database, as sketched below. As presented in Table 3, the multi-query method substantially outperforms single-query baselines in the UAV-to-satellite task, achieving 76.71% Recall@1 and 79.95% AP. These results represent improvements of 10.54% in Recall@1 and 9.62% in AP over conventional single-image matching approaches. Notably, this strategy is applicable only to UAV→Satellite geo-localization (not Satellite→UAV navigation), since each target is represented by only a single satellite image and no multi-view query set exists in that direction.
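A minimal sketch of this multi-query averaging is shown below; the tensor shapes are assumed inputs, while the cosine similarity over normalized descriptors follows the retrieval setup described in Section 4.2.

```python
# Sketch of the multi-query strategy: average several drone-view descriptors, then rank.
import torch
import torch.nn.functional as F

def multi_query_rank(drone_feats, satellite_feats):
    """drone_feats: (M, D) features of M views of one target;
    satellite_feats: (G, D) gallery features.
    Returns gallery indices ranked by cosine similarity to the averaged query."""
    query = F.normalize(drone_feats.mean(dim=0, keepdim=True), dim=1)   # (1, D)
    gallery = F.normalize(satellite_feats, dim=1)                        # (G, D)
    sims = (query @ gallery.t()).squeeze(0)                              # (G,) cosine scores
    return torch.argsort(sims, descending=True)
```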
The proposed adaptive double-layer square-ring partition strategy dynamically adjusts the position of the central subject based on heatmap predictions and optimizes its size through hyperparameter iteration to maximize contextual information utilization. To validate the necessity of this strategy, we integrated it into the baseline multi-query framework. Table 3 demonstrates that this integration improved Recall@1 to 81.31% (geolocation) and 82.71% (navigation), with AP increasing to 83.94% and 70.93%, respectively. Compared to the baseline, this corresponds to absolute gains of 4.60% and 5.28% in Recall@1, and 3.99% and 6.14% in AP, highlighting its effectiveness.
To enhance feature discriminability, we augmented the original cross-entropy loss with ArcFace loss. As evidenced in Table 3, the addition of ArcFace loss further boosted Recall@1 by 3.40% and AP by 3.03% for the UAV-to-satellite geolocation task. For the satellite-to-UAV navigation task, Recall@1 and AP improved by 2.43% and 2.12%, respectively, confirming the efficacy of ArcFace loss in refining feature separability.
An illumination alignment strategy was applied during dataset preprocessing to mitigate style discrepancies between cross-view images. Building upon the previous experiments, its inclusion elevated Recall@1 by 2.74% and AP by 2.47% for UAV-to-satellite geolocation, while Recall@1 and AP for satellite-to-UAV navigation increased by 5.73% and 5.20%, respectively. These results validate that illumination alignment effectively reduces intra-class variations and enhances matching precision.
To improve top 1 recall, we introduced an additional SP+SG feature point matching step during testing, which re-ranks the top 5 candidates based on the number of matched keypoints. Due to computational constraints, this method was only applied to the UAV geolocation task. As shown in Table 3, this refinement increased Recall@1 by 2.76% and AP by 1.82% for multi-image matching in geolocation.
This ablation study quantitatively validates the effectiveness of each proposed strategy in improving matching accuracy on the University-1652 dataset. After integrating the adaptive double-layer square-ring partition strategy, the Recall@1 metric exhibits a significant improvement, primarily attributed to the strategy’s ability to precisely localize the target subject within the image and extract more representative features. Furthermore, when combined with complementary strategies such as ArcFace loss, these components synergistically enhance feature expressiveness and discriminability, thereby achieving higher matching accuracy. Notably, the illumination alignment strategy serves as a foundational preprocessing step, which collaborates synergistically with other feature extraction and learning frameworks to collectively elevate the algorithm’s overall performance.
4.5. Impact of Square Ring Proportion on Matching Accuracy
In the proposed adaptive double-layer square-ring partition strategy, the size of the central main body is controlled through a proportional relationship with the original image, where this ratio is defined as a hyperparameter with an initial value. To investigate the impact of this initial ratio (denoted as ratio = Wc/W, where Wc represents the side length of the central region and W the original image dimension) on matching accuracy, we conducted experiments with three distinct initial ratio values: 0.25, 0.5, and 0.75. Additional control groups without feature partitioning (ratio = 0 or 1) were included for comparison. As shown in Table 4, the highest matching accuracy for both tasks was achieved when the initial ratio was set to 0.5, yielding Recall@1 scores of 77.99% and 90.87%, along with AP values of 81.22% and 80.25%, respectively. Compared to the control groups without feature partitioning, these results demonstrate performance improvements of 0.79% and 2.3% in Recall@1, as well as 0.82% and 2.99% in AP for the respective tasks.
4.6. Impact of Different Feature Point Matching Methods on Accuracy
In the feature re-matching strategy, feature point matching methods were applied to re-evaluate the top 5 images, with the number of matched points serving as the criterion for re-ranking. To investigate the influence of different feature point matching methods on matching accuracy, we employed four approaches: SIFT, ORB, LoFTR, and SP+SG, while experiments without feature point matching served as the control group. As shown in Table 5, all methods except ORB improved the Recall@1 and AP, with SP+SG achieving the most significant enhancement. For UAV geo-localization, the single-image and multi-image matching Recall@1 values using SP+SG reached 82.50% and 90.21%, respectively, while the corresponding AP values were 90.21% and 96.15%. Compared to the control group, these results represent improvements of 4.51% and 2.76% in Recall@1, and 3.06% and 1.82% in AP for the two settings. The inferior performance of ORB may stem from its higher rate of mismatched points, whereas SP+SG exhibited superior accuracy and finer detail matching capabilities compared to the other methods.
4.7. Qualitative Results
Figure 8 and Figure 9 present the retrieval results of our proposed algorithm on the University-1652 dataset, showcasing performance for both UAV→Satellite geo-localization and Satellite→UAV navigation tasks, with LPN-based matching provided as a baseline. In Figure 8, for UAV geo-localization, our algorithm significantly outperforms LPN in ranking ground-truth matches higher. For instance, in rows 1 and 3, the true matches ascend from third to first position. Notably, our method retrieves previously undetected true matches (e.g., row 2, where LPN failed to rank the true match within the top 5, while our algorithm places it second).
Figure 9 highlights the multi-match nature of the dataset, where one satellite image corresponds to fifty-four UAV images from varying altitudes and angles. For UAV navigation tasks, our algorithm retrieves more true matches within the top 5 compared to LPN. In rows 1, 3, 4, and 5, all top 5 candidates are correct matches. However, in row 2, only the top 1 result is correct, likely due to high similarity among white-roofed buildings causing mismatches.
Figure 10 demonstrates the robustness of our algorithm under diverse challenging conditions. In rows 1–2, despite significant seasonal appearance discrepancies—vegetation variations in row 1 and distinct bare-soil exposure in row 2—between query drone images and satellite references, our architecture maintains consistent localization precision by prioritizing architecturally stable features unaffected by seasonal land cover changes, confirming season-invariant performance. Row 3 exhibits partial occlusion of the primary structure due to oblique drone acquisition angles, while rows 4–5 feature off-center principal buildings. The adaptive dual-layer partitioning strategy preserves matching accuracy through dynamic bounding box repositioning that compensates for compositional shifts. The final row demonstrates successful multi-target localization, where the algorithm proactively identifies high-attention regions corresponding to semantically significant features—effectively navigating complex scenes through saliency-guided feature weighting to achieve accurate retrieval despite target multiplicity.
Figure 8 (rows 2 and 4) reveals instances where our algorithm fails to retrieve ground-truth matches at the top 1 position. Consequently, Figure 11 documents representative top 1 failure scenarios. Rows 1–2 exhibit significant structural modifications (notably complete building replacement in row 1) that compromise recognition; row 3 demonstrates severe occlusion hindering architectural feature extraction; row 4 features dominant vegetation obscuring primary structures. The final row presents high-confusion failures where ground-truth matches display striking spectral–spatial similarity to competing candidates in structural and chromatic characteristics, compounded by spatial duplication of targets across satellite imagery, which is the predominant failure mechanism in cross-view geo-localization systems.
This qualitative analysis robustly validates that our algorithm effectively extracts discriminative features, achieving accurate cross-view image matching even under significant viewpoint and illumination variations.
5. Discussion
As demonstrated by the ablation studies in Section 4.4 and the qualitative results in Section 4.7, our proposed method achieves robust matching performance, validating its effectiveness. First, the brightness alignment of UAV images based on satellite references harmonizes stylistic differences between the two views, significantly reducing intra-class variance and thereby lowering matching complexity while improving accuracy. Second, after feature extraction, the heatmaps of features clearly highlight the spatial distribution of key structures (e.g., buildings), enabling dynamic square-ring partitioning and adaptive learning of the central region size (via learnable hyperparameters), which enhances local feature discriminability. Furthermore, during testing, the keypoint matching step leverages the number of geometrically consistent keypoints as a re-ranking criterion, refining the top 1 accuracy by revisiting initially mismatched candidates. Finally, in the ablation experiments, the integration of ArcFace loss amplifies inter-class feature separability, substantially boosting matching precision for both UAV geo-localization and navigation tasks.
However, in practical applications, the algorithm may underperform in certain scenarios (e.g., desert or Gobi regions with sparse structural features, or suburban areas lacking dominant buildings) compared with the retrieval results reported here. This limitation potentially stems from the adaptive ring partitioning strategy's reduced efficacy in feature extraction from such feature-scarce environments. Future work will incorporate more diverse and complex imagery for algorithmic refinement through comprehensive characteristic analysis.
To address the performance degradation of current algorithms under scenarios with significant illumination variations, future research could focus on leveraging advanced illumination compensation algorithms or deep learning-based image enhancement techniques to further enhance the robustness of cross-view image matching. Additionally, the integration of prior geospatial information—such as topographic maps, building height data, and 3D urban models—into cross-view geolocation frameworks presents a critical direction for exploration. For instance, one promising approach involves incorporating topographic map information into the feature extraction pipeline of CNNs, or utilizing building height data to facilitate 3D reconstruction and matching of targets in imagery. Such strategies could enable sub-meter-level geolocation accuracy by synergizing multi-modal geospatial data with advanced computational models.