SkyPin: Benchmarking Target Geo-Localization from UAV Imagery on 2.5D Maps

Wang, Zhaochen; Wu, Rouwan; Liu, Yuxiang; Huang, Yudong; Yan, Shen; Zhang, Maojun

doi:10.3390/drones10070500


by Zhaochen Wang ¹, Rouwan Wu ¹, Yuxiang Liu ¹,*, Yudong Huang ², Shen Yan ¹ and Maojun Zhang ¹

¹ College of Systems Engineering, National University of Defense Technology, Changsha 410005, China

² National Key Laboratory of Intelligent Spatial Information, Beijing 102488, China

* Author to whom correspondence should be addressed.

Drones 2026, 10(7), 500; https://doi.org/10.3390/drones10070500

Submission received: 12 May 2026 / Revised: 20 June 2026 / Accepted: 25 June 2026 / Published: 30 June 2026


Highlights

What are the main findings?

We introduce SkyPin, the first multi-modal benchmark dataset for UAV passive target geo-localization. It integrates 2.5D maps with centimeter-level RTK ground truth, effectively filling the gap in high-quality evaluation data.
We propose a visual-geometric localization pipeline that reformulates the highly pose-sensitive projection problem into a cross-view feature alignment task. Through comprehensive evaluation, we establish RoMa combined with raytracing as the current best-performing baseline.

What are the implications of the main findings?

The proposed framework significantly reduces the heavy reliance on highly accurate prior UAV poses, providing a practical technical path for the real-world deployment of low-cost drones in complex environments.
The error sources identified in the benchmark—particularly under large-tilt, low-texture, and thermal cross-modal conditions—clearly define the limitations of existing methods and provide specific directions for future research to enhance localization robustness.

Abstract

Accurate geo-localization of ground targets from unmanned aerial vehicles (UAVs) is critically limited by pose estimation errors and the scarcity of active ranging sensors. To address these challenges, we propose a pipeline that integrates reference image cropping, robust cross-view matching, and geographic projection to estimate real-world coordinates using 2.5D reference maps. For evaluation, we introduce SkyPin, the first large-scale benchmark of its kind, designed to comprehensively test UAV-based localization methods. It comprises UAV imagery from eight diverse environments, featuring both visible and thermal infrared modalities under a wide range of conditions, including variations in weather, time of day, flight altitude, and camera perspective. All ground targets are annotated with centimeter-accuracy Real-Time Kinematic (RTK) coordinates. We establish a comprehensive benchmark by evaluating a series of feature matching methods combined with different projection strategies, allowing systematic comparison of algorithm performance. Representative results show that RoMa combined with PnP-based raytracing achieves the best overall performance, reaching a median 2D error of 0.87 m and Recall@5m values of 0.94 and 0.98 on RGB and thermal infrared UAV-map settings, respectively. Further analysis reveals that performance degrades in challenging mountainous scenes and under large viewing-angle variations, highlighting terrain relief and UAV perspective changes as remaining critical challenges for robust target geo-localization. The full dataset and implementation code will be made publicly available to facilitate future research in UAV-based geo-localization.

Keywords:

UAV; target geo-localization; benchmark dataset

1. Introduction

UAV target geo-localization aims to estimate the precise geographical coordinates of captured ground targets using observed drone data (see Figure 1) [1,2]. This task is crucial for a wide range of applications, including disaster response, resource exploration, and urban surveying.

Existing methods can be broadly classified into active and passive approaches. In both cases, the drone’s 6-DoF pose in the world coordinate system is first acquired. Active methods [3,4,5,6] employ sensors such as LiDAR to directly measure the distance to the target, obtaining its coordinates via depth back-projection. However, the high cost, size, and power consumption of these sensors make them unsuitable for most civilian small-scale drones, limiting their use in routine missions. In contrast, passive methods [7,8] estimate target coordinates by intersecting back-projected rays with a geographic elevation map, making them more suitable for real-world applications. Nevertheless, because UAV target geo-localization is typically conducted using oblique-view aerial images, these methods are highly sensitive to pose accuracy. Even small pose errors can lead to significant deviations between the back-projected location and the true ground target point.

To address these limitations, we present a pipeline based on 2.5D maps (e.g., reference orthoimages with elevation data). Our pipeline consists of three stages: First, a sub-image is cropped from the reference orthoimages by combining the drone’s intrinsic parameters and the prior pose recorded by its onboard sensors. Second, feature matching is performed between the aerial query imagery and the cropped sub-image to establish feature correspondences. Finally, the matched pixel coordinates are transformed into geographic coordinates through a geometric projection, which outputs the target’s precise latitude, longitude, and elevation.

To the best of our knowledge, there is currently no publicly available dataset in this research domain. To facilitate further studies and validate the proposed pipeline, we introduce the SkyPin dataset. The dataset comprises 2.5D maps derived from 3D reconstructions using drone and satellite imagery, along with multi-modal aerial query images at multiple altitudes (100–200 m) and angles ($-30^\circ$ to $-60^\circ$). Additionally, the dataset provides centimeter-level accurate geographic coordinates of ground targets, obtained using Real-Time Kinematic (RTK) surveying.

Based on the proposed pipeline, we benchmarked various feature matching and geometric projection methods on the SkyPin dataset. The evaluated feature matching approaches include the most prevalent sparse [9,10], semi-dense [11], and dense methods [12,13,14]. Geometric projection techniques comprise homography-based warping and PnP-based raytracing. Experimental results indicate that RoMa matching combined with raytracing achieves optimal accuracy in both RGB and TIR modalities. However, even this top-performing combination struggles under challenging conditions such as shallow viewing angles, low-texture regions, and thermal infrared imagery. These observations highlight promising directions for future research.

Our main contributions are summarized as follows:

1. We introduce the first dataset for UAV passive target geo-localization, which includes 2.5D maps, aerial multi-modal imagery from drones, and centimeter-level ground-truth coordinates of the targets.
2. We propose an image-matching-based pipeline and apply it to UAV-based target geo-localization within 2.5D maps.
3. Using the proposed pipeline, we extensively evaluated multiple feature matching and geometric projection methods, and derived practical recommendations to guide future research.

All data and the associated pipeline code will be made publicly available to support subsequent research and impartial evaluation.

2. Related Work

2.1. Target Geo-Localization

UAV-based target geo-localization methods can be broadly classified into active [3,4,5,6] and passive approaches, depending on whether they employ active signal transmission. These two paradigms differ fundamentally in both information acquisition and localization principles: active methods emit signals and infer target positions directly from distance or angle measurements, whereas passive methods estimate geographic coordinates indirectly by exploiting image-based geometric relationships. The following paragraphs present the representative methodologies and distinctive characteristics of each category.

Active positioning determines the target’s spatial position by transmitting and receiving signals to measure its distance or angle relative to the sensor. Typical active positioning methods employ laser range finders (LRF) [15,16], radio frequency (RF) signals, and multi-sensor fusion [17] techniques. By accurately measuring the target–platform distance or angle and integrating these measurements with imaging geometry models and coordinate transformations [18], the absolute target position can be computed in the geographic coordinate system.

However, the effective ranging capability of active positioning methods is constrained by atmospheric scattering, signal attenuation, and terrain occlusion, and high-precision equipment is often expensive, power-intensive, and bulky, placing strict demands on flight payload, endurance, and environmental adaptability. These factors limit the efficient deployment of active sensors on lightweight civilian UAV platforms. In practical UAV missions—such as disaster response, terrain mapping, resource surveying, and in-field inspection—operations often take place in complex terrain with limited communication and unstable flight conditions, which further constrain signal propagation and energy supply. Consequently, although active positioning provides high accuracy and real-time performance, its reliance on specialized hardware and the challenges of integrating such systems with lightweight UAV platforms limit its deployment in these scenarios.

Passive positioning methods do not require active signaling and instead exploit the spatial correspondences between UAV imagery and known geographic information to indirectly infer the target’s geographic coordinates through techniques such as image matching and attitude estimation. As a result, they offer high flexibility and environmental adaptability on low-power, lightweight platforms, making them particularly suitable for complex field missions. Technically, passive positioning is generally implemented using the Earth ellipsoid model [2,3] or digital elevation models (DEM) [19,20,21,22,23,24,25], leveraging the projection relationships between UAV imagery and geographic coordinates to derive the target location from camera parameters and attitude angles. The method relies solely on onboard cameras and positioning/attitude measurement systems (POS/GPS), ensuring compatibility with UAV platforms. However, the overall performance remains constrained by the accuracy of the UAV’s own positioning and attitude measurements. Even small errors in attitude angles, position, or other pose parameters (e.g., a $0.01^\circ$ angular error or a few meters of positional offset) can result in geographic deviations on the order of tens of meters, indicating that passive positioning still faces challenges such as limited accuracy and sensitivity to measurement errors in practical applications.

2.2. Target Localization Dataset

Publicly available UAV image datasets provide relatively few samples suitable for target geo-localization. Most existing datasets primarily focus on general vision tasks such as object detection, tracking, or cross-modal matching, rather than on precisely localizing ground targets in geographic coordinates. Representative datasets include SARD [26], UAV123 [27], VisDrone [28], UAV-VisLoc [29], and Boson-Nighttime [30] (see Table 1). The resulting lack of publicly accessible datasets with target-level geographic annotations limits the reproducibility and comparability of localization experiments.

On the other hand, most UAV-based target localization studies rely on self-collected datasets. While the reference maps used in these studies are often obtained from publicly available sources, the corresponding query images are generally not released [8,19,31]. Consequently, these datasets cannot be considered publicly available in their entirety, which hinders reproducibility. Although self-collected datasets can cover specific scenes and task requirements, their limited availability impedes the establishment of a community-wide benchmark and introduces subjectivity in algorithm evaluation.

2.3. UAV–Satellite Image Matching

UAV–satellite image matching is a critical component in UAV-based target geo-localization, aiming to spatially align UAV-captured imagery with satellite maps to infer the geographic positions of targets. Due to significant differences in viewpoint, resolution, scale, and illumination between UAV and satellite images, this task is commonly formulated as a cross-view matching problem.

Early studies primarily relied on template matching methods [32,33] and keypoint-based approaches [34,35]. Template matching identifies optimal correspondence regions through sliding windows or dense correlation searches but incurs high computational cost, while keypoint-based methods depend on highly repeatable local features that exhibit limited robustness under cross-view or cross-modal conditions. With the development of deep learning, researchers have proposed learned matching methods that significantly improve the accuracy and applicability of cross-view UAV–satellite image matching [36,37,38,39,40]. Overall, research on UAV–satellite image matching has gradually shifted from local-feature-based approaches toward end-to-end geometric estimation and deep feature alignment methods.

3. SkyPin Dataset

3.1. Dataset Overview

For the target geo-localization task, we construct a dataset comprising reference maps, aerial query photographs, and annotated target points. The reference 2.5D maps are generated in two ways: drone photogrammetric reconstruction and multi-relative reconstruction from satellite images. Aerial query photographs are acquired through multi-view imaging using drones, and targets are annotated on each image, as shown in Figure 2.

Specifically, the maps comprise eight distinct scene types, which are consolidated into four regions as detailed in Table 2. Corresponding aerial photographs are captured using multi-view imaging. Cameras of different modalities (RGB and thermal) are used under varying temperature, humidity, and day–night conditions, as shown in Figure 3. Furthermore, each aerial photograph is manually annotated with target points, including vehicles, pedestrians, and ground corner points.

3.2. Dataset Collection and Processing

For the reference 2.5D maps, satellite imagery and UAV imagery are processed independently via photogrammetric reconstruction to generate the corresponding Digital Surface Models (DSMs) and Digital Orthophoto Maps (DOMs). The maps are resampled to a ground resolution of 0.5 m/pixel, and the DOM and DSM are then precisely aligned to ensure consistency between imagery and terrain data.

For aerial query photographs, we fly DJI M4T and DJI M4TD drones around each target to record RGB and thermal infrared video at varying altitudes (100–200 m) and pitch angles ($-30^\circ$ to $-60^\circ$), as illustrated in Figure 4. Meanwhile, the ground-truth latitude, longitude, and elevation of each target are separately acquired with centimeter-level accuracy using a Hi-Target Huaxing A30 RTK receiver in a "Fixed Solution" state. To ensure strict physical-to-pixel mapping, we specifically measure the standing ground points for pedestrians, designated tire contact points for vehicles, and geometric centers for ground corners. Annotators then manually click the corresponding locations on zoomed-in UAV images, with any inaccuracies corrected via visual overlay checks and manual cross-reviews. Frames are extracted from the videos at fixed time intervals, discarding frames in which the targets are not visible, while maintaining the pairing between RGB and thermal data. Notably, the videos serve only as data acquisition sources: the evaluation is conducted on the extracted frames, with each query image independently annotated and evaluated.

3.3. Dataset Scale and Characteristics

In summary, the proposed dataset comprises 16 maps, 8 of which are drone maps and 8 are satellite maps. The aerial imagery comprises 4192 images, including 1517 RGB and 2675 thermal infrared images, captured under varying conditions. By time of capture, 3822 images are taken during daytime and 370 at night. By weather, 1422 are captured under sunny, 675 under foggy, 1016 under cloudy, 232 under overcast, and 848 under light-rain conditions. Additionally, the dataset provides precise coordinate annotations for 6375 target points, averaging 2.4 points per image.

Compared with existing datasets, the main innovations of this dataset include:

1. Multi-modal query imagery: The dataset provides synchronously captured infrared and RGB data, establishing a foundation for cross-modal UAV-based target geo-localization research.
2. Precise target geo-localization labeling: Centimeter-level positioning for target points is achieved using RTK equipment, a level of accuracy that surpasses most existing datasets.
3. Multi-source map and multi-perspective queries: The dataset incorporates both satellite and UAV reference maps, complemented by aerial queries captured from multiple altitudes and viewpoints around each target, thus providing a rich experimental environment.

4. Method

Given a single UAV image $I$, the target pixel coordinates $(u_t, v_t)$, the prior pose $\xi$, and a 2.5D map $M_{2.5D} = \{D, S\}$ (DOM and DSM, respectively), our method estimates the target’s geographic coordinates $(\varphi_t, \lambda_t, h_t)$. An overview of the proposed pipeline is provided in Figure 5.

4.1. Reference Map Cropping

Based on the intrinsic and extrinsic parameters of the query image, we extract the DOM and DSM regions from $M_{2.5D}$ that cover its field of view, thereby constructing the reference map $R_{ref}$. This selective cropping reduces the search space for feature matching, mitigates false correspondences, and improves both matching efficiency and accuracy, while retaining the full content of the UAV image.

To determine a suitable terrain height for initial back-projection, we first classify the terrain into either plains or mountainous regions based on statistics computed from the $S$ elevations. The terrain type $T$ is defined as

$$T = \begin{cases} 1, & \sigma_S \geq 15\ \mathrm{m} \land \gamma_S \geq 0.2, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

where $T$ denotes the terrain type ($T = 1$ for mountainous and $T = 0$ for plain), $\sigma_S$ is the standard deviation of the DSM elevation values, and $\gamma_S$ represents the skewness of the elevation distribution. These thresholds were empirically determined from the DSM statistics of the reference regions in this study and are therefore treated as dataset-specific settings rather than universal constants.
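As a minimal sketch of Equation (1), the classification can be written in a few lines of Python. The function name `classify_terrain` and the use of population (rather than sample) estimators for the standard deviation and skewness are assumptions made here for illustration; the text does not specify which estimator is used.

```python
import math


def classify_terrain(elevations, sigma_thresh=15.0, skew_thresh=0.2):
    """Return 1 (mountainous) or 0 (plain) following Equation (1).

    `elevations` is a flat list of DSM heights in metres. The default
    thresholds are the dataset-specific values quoted in the text.
    """
    n = len(elevations)
    mean = sum(elevations) / n
    # Population standard deviation of the DSM elevations (sigma_S).
    sigma = math.sqrt(sum((z - mean) ** 2 for z in elevations) / n)
    if sigma == 0.0:
        # Perfectly flat surface: skewness is undefined, treat as plain.
        return 0
    # Population skewness (gamma_S): the third standardized moment.
    skew = sum((z - mean) ** 3 for z in elevations) / n / sigma ** 3
    return 1 if (sigma >= sigma_thresh and skew >= skew_thresh) else 0
```

Both conditions must hold, so a region with large but symmetric relief (high $\sigma_S$, near-zero $\gamma_S$) is still treated as plain under this rule.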

Next, we transform the geographic coordinates of the query image camera into the CGCS2000 projection coordinate system. To rapidly estimate the approximate position within the drone’s field of view, we perform a preliminary backward projection. Specifically, for each of the four corner pixels, we first map their pixel coordinates $(u_i, v_i)$ into the normalised camera coordinate system using the intrinsic matrix $K$. This yields the preliminary ray direction in the camera frame:

$$d_{c,i} = K^{-1} [u_i, v_i, 1]^{\top}, \tag{2}$$

where $d_{c,i}$ denotes the unnormalised ray direction of the $i$-th corner pixel in the camera coordinate system.

We then normalise $d_{c,i}$ to obtain the unit direction vector:

$$\hat{d}_{c,i} = \frac{d_{c,i}}{\lVert d_{c,i} \rVert}, \tag{3}$$

where $\hat{d}_{c,i}$ is the unit-length ray direction of the $i$-th corner pixel, and $\lVert d_{c,i} \rVert$ denotes the Euclidean norm.

Each corner pixel in the query image defines a unit ray $\hat{d}_{c,i}$ in the camera coordinate system. This ray is transformed into the world coordinate system using the prior rotation matrix $R_p$ and camera centre $C_p$, and its intersection with a horizontal plane at height $z$ is computed as

$$d_i = C_p + \lambda_i d_{w,i}, \qquad \lambda_i = \frac{z - C_{p,z}}{d_{w,i,z}}, \tag{4}$$

where $d_{w,i} = R_p \hat{d}_{c,i}$ is the ray direction in the world coordinate system, $C_{p,z}$ and $d_{w,i,z}$ denote the z-coordinates of the prior camera centre and ray direction, respectively, $z$ is the plane height, and $d_i$ is the resulting intersection point in world coordinates. This provides the approximate world coordinates of the four corner points for the query image.
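Equations (2)–(4) can be chained into a single back-projection routine. The sketch below is illustrative only: the function name `backproject_to_plane` is hypothetical, and it assumes a zero-skew pinhole model so that $K^{-1}$ can be applied in closed form from `fx`, `fy`, `cx`, `cy`. `R_p` is taken as the camera-to-world rotation, matching $d_{w,i} = R_p \hat{d}_{c,i}$.

```python
import math


def backproject_to_plane(u, v, fx, fy, cx, cy, R_p, C_p, z):
    """Intersect the ray through pixel (u, v) with the plane Z = z.

    R_p: 3x3 camera-to-world rotation (nested lists); C_p: camera centre.
    Assumes a zero-skew pinhole intrinsic matrix (an illustrative choice).
    """
    # Eq. (2): K^{-1} [u, v, 1]^T for a zero-skew K.
    d_c = [(u - cx) / fx, (v - cy) / fy, 1.0]
    # Eq. (3): normalise to a unit direction.
    norm = math.sqrt(sum(c * c for c in d_c))
    d_hat = [c / norm for c in d_c]
    # Rotate into the world frame: d_w = R_p d_hat.
    d_w = [sum(R_p[r][k] * d_hat[k] for k in range(3)) for r in range(3)]
    if abs(d_w[2]) < 1e-12:
        raise ValueError("ray is parallel to the plane")
    # Eq. (4): solve C_p,z + lambda * d_w,z = z for lambda.
    lam = (z - C_p[2]) / d_w[2]
    if lam <= 0:
        raise ValueError("plane lies behind the camera")
    return [C_p[i] + lam * d_w[i] for i in range(3)]
```

Calling this for the four image corners gives the approximate ground footprint of the query image. The explicit checks matter for oblique views, where rays near the horizon can be almost parallel to the plane.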

The height of the plane is determined based on a coarse terrain classification. For mountainous terrain, the height is set as the intersection of the back-projected ray from the image center with the DSM surface. For flat terrain, the median value of the DSM elevations, denoted as $z_{DSM}^{median}$, is used.

Based on the ground coordinates of the four corner points, we transform them into the pixel coordinate system of the $D$ and $S$. The minimum and maximum values of the horizontal and vertical coordinates are then used to define a preliminary cropping window. To preserve the effective content of the UAV imagery and ensure reliable feature matching, the width and height of this window are expanded such that the UAV field of view occupies approximately 40% of the cropping window area.
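The window expansion can be sketched as follows. This is one possible reading, not necessarily the authors' implementation: it approximates the field-of-view area by the axis-aligned bounding box of the four projected corners and scales width and height by the same factor $\sqrt{1/0.4}$ about the box centre. The function name `crop_window` and the optional `map_size` clamp are hypothetical.

```python
import math


def crop_window(corners_px, fov_fraction=0.4, map_size=None):
    """Return (x0, y0, x1, y1) of the expanded cropping window in map pixels.

    corners_px: four (x, y) map-pixel positions of the projected corners.
    fov_fraction: share of the window area the footprint box should occupy.
    map_size: optional (width, height) used to clamp the window to the map.
    """
    xs = [p[0] for p in corners_px]
    ys = [p[1] for p in corners_px]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    # Scaling both sides by sqrt(1 / f) makes the box cover a fraction f.
    scale = math.sqrt(1.0 / fov_fraction)
    half_w = (x_max - x_min) / 2.0 * scale
    half_h = (y_max - y_min) / 2.0 * scale
    x0, y0, x1, y1 = cx - half_w, cy - half_h, cx + half_w, cy + half_h
    if map_size is not None:
        width, height = map_size
        x0, y0 = max(x0, 0.0), max(y0, 0.0)
        x1, y1 = min(x1, float(width)), min(y1, float(height))
    return x0, y0, x1, y1
```

The returned bounds are floating-point; rounding to integer pixel indices is left to the caller. The margin gives the matcher room to absorb errors in the prior pose, since the true footprint may be shifted relative to the predicted one.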

Finally, based on the pixel coordinates of the cropping window, the corresponding $D$ and $S$ regions are extracted. The cropped $D$ is saved as an image $I_D$ without georeferencing, while the cropped $S$ region is converted into a point cloud $P_S$, which preserves the georeferenced information. Together, these form the reference map, denoted as $R_{ref}$.

4.2. Feature Matching Module

We develop a feature matching module that takes $I$ and $I_D$ as input, and outputs the geometric correspondences. These correspondences serve as foundational data for the subsequent localization, directly influencing the overall localization accuracy.

In this module, we first compute the rotation matrix $M$ based on the prior pose $\xi$ of the query image. Using this matrix, we rotate $I_D$, which is initially assumed to be oriented toward true north, to align it with the viewing direction of the query image. This rotation compensates for variations in the UAV’s capture angle, thereby improving matching accuracy.
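The following sketch illustrates one way such an alignment could work on pixel coordinates, assuming that $M$ is an in-plane rotation about the image centre by the prior heading angle. That interpretation, the helper names, and the sign convention of the angle are assumptions for illustration. The inverse transform is included because matched points must later be mapped back to the unrotated image.

```python
import math


def rotation_about(center, angle_deg):
    """2x3 affine matrix rotating pixel coordinates about `center`."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    cx, cy = center
    # p' = R (p - center) + center, written as p' = R p + t.
    return [[c, -s, cx - c * cx + s * cy],
            [s, c, cy - s * cx - c * cy]]


def apply_affine(M, pt):
    """Apply a 2x3 affine matrix to a pixel coordinate (x, y)."""
    x, y = pt
    return (M[0][0] * x + M[0][1] * y + M[0][2],
            M[1][0] * x + M[1][1] * y + M[1][2])


def invert_rotation(M):
    """Inverse of a rotation-only affine: p = R^T p' - R^T t."""
    c, s = M[0][0], M[1][0]
    tx, ty = M[0][2], M[1][2]
    return [[c, s, -(c * tx + s * ty)],
            [-s, c, -(-s * tx + c * ty)]]
```

Applying `invert_rotation(M)` to a point produced by `apply_affine(M, p)` returns `p` up to floating-point error.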

The module is designed in a modular fashion, supporting sparse, semi-dense, and dense feature matching algorithms. A unified interface guarantees interoperability across different algorithms, allowing researchers to perform fair comparisons under consistent data and evaluation protocols, while also facilitating the integration of novel matching methods.

4.3. Coordinate Estimation Algorithms

To determine the precise geographic coordinates of the target in the query image, we first apply a geometric transformation to its pixel location $(u_t, v_t)$, leveraging the matching results and the DSM. This process recovers the target’s coordinates $(\varphi_t, \lambda_t, h_t)$ in the world coordinate system. To balance localization accuracy and computational efficiency, we adopt two complementary strategies: homography-based warping and PnP-based raytracing.

Homography-based warping

The homography matrix $H$ is determined by matching pixels on the query image with their corresponding pixels on the reference image:

$$\begin{bmatrix} u_r^i \\ v_r^i \\ 1 \end{bmatrix} \sim H \begin{bmatrix} u_q^i \\ v_q^i \\ 1 \end{bmatrix}, \quad i = 1, \dots, N, \tag{5}$$

where $(u_q^i, v_q^i)$ and $(u_r^i, v_r^i)$ denote the coordinates of the $i$-th matched feature points in the query and reference images, respectively.

Finally, the target pixel $(u_t, v_t)$ is transformed to the reference image using the homography matrix and subsequently mapped to $P_S$ to obtain $(\varphi_t, \lambda_t, h_t)$.
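A minimal sketch of the warping step is given below, assuming $H$ has already been estimated from the matches. The second helper stands in for the lookup in $P_S$: it assumes a north-up raster whose top-left pixel sits at projected coordinates (`origin_e`, `origin_n`) with square pixels of size `res`, and it uses a nearest-neighbour DSM read. These conventions and both function names are hypothetical, and the final conversion from projected to geodetic coordinates is omitted.

```python
def warp_point(H, u, v):
    """Apply the 3x3 homography of Eq. (5) to a query pixel (u, v)."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under H")
    # Divide by the homogeneous coordinate to get reference pixels.
    return x / w, y / w


def pixel_to_geo(u_r, v_r, origin_e, origin_n, res, dsm):
    """Map a reference pixel to (easting, northing, height).

    Assumes a north-up grid: easting grows with columns, northing
    shrinks with rows. `dsm` is a nested list indexed as dsm[row][col].
    """
    easting = origin_e + u_r * res
    northing = origin_n - v_r * res
    # Nearest-neighbour elevation lookup in the cropped DSM.
    height = dsm[int(round(v_r))][int(round(u_r))]
    return easting, northing, height
```

Because a homography models a single plane, this route is exact only where the scene is approximately planar, which is consistent with its weaker performance in mountainous terrain.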

PnP-based raytracing

First, the matched points $(u_q, v_q)$ in the query image are associated with their corresponding positions in the rotated reference image $I_D$. The inverse of the rotation matrix $M$ is then applied to these reference points to recover their original pixel coordinates $(u_{r,rot}, v_{r,rot})$. Subsequently, the 3D geographic coordinates $(X_r, Y_r, Z_r)$ of each pixel are obtained from the point cloud $P_S$, establishing a 2D-3D correspondence between $(u_q, v_q)$ and $(X_r, Y_r, Z_r)$.

Based on the established 2D-3D correspondences, the camera pose of the query image is estimated using the EPnP algorithm within a RANSAC framework. Let the estimated rotation and translation be denoted as $(\hat{R}, \hat{t})$. The projection of the 3D world points onto the query image is described by:

$$s_i \begin{bmatrix} u_q^i \\ v_q^i \\ 1 \end{bmatrix} = K \, [\hat{R} \mid \hat{t}] \begin{bmatrix} X_r^i \\ Y_r^i \\ Z_r^i \\ 1 \end{bmatrix}, \quad i = 1, \dots, N, \tag{6}$$

where $s_i$ is a scale factor, $K$ is the intrinsic matrix, $(u_q^i, v_q^i)$ are the pixel coordinates in the query image, and $(X_r^i, Y_r^i, Z_r^i)$ are the corresponding 3D world coordinates. The RANSAC framework is used to robustly select inlier correspondences by iteratively estimating $(\hat{R}, \hat{t})$ from minimal subsets and rejecting outliers based on reprojection error.
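The inlier test at the core of this procedure can be sketched as follows. The code does not implement EPnP itself; it only evaluates Equation (6) for a candidate pose and flags correspondences whose reprojection error is below a threshold. The function names and the 3-pixel default threshold are illustrative assumptions, not values taken from the paper.

```python
import math


def project(K, R, t, X):
    """Project world point X with Eq. (6); return None if behind the camera."""
    # Camera-frame coordinates: X_c = R X + t. The depth X_c[2] is s_i.
    Xc = [sum(R[r][k] * X[k] for k in range(3)) + t[r] for r in range(3)]
    if Xc[2] <= 0:
        return None
    # Apply the upper-triangular intrinsic matrix, then divide by depth.
    x = K[0][0] * Xc[0] + K[0][1] * Xc[1] + K[0][2] * Xc[2]
    y = K[1][1] * Xc[1] + K[1][2] * Xc[2]
    return x / Xc[2], y / Xc[2]


def select_inliers(K, R, t, pts2d, pts3d, thresh_px=3.0):
    """Indices of 2D-3D pairs whose reprojection error is within thresh_px."""
    inliers = []
    for i, (uv, X) in enumerate(zip(pts2d, pts3d)):
        proj = project(K, R, t, X)
        if proj is None:
            continue
        err = math.hypot(proj[0] - uv[0], proj[1] - uv[1])
        if err <= thresh_px:
            inliers.append(i)
    return inliers
```

In a RANSAC loop, the pose hypothesis that maximizes the inlier count would be retained and typically refined on all of its inliers.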

Subsequently, rays are cast from the estimated camera centre $C_{pnp}$ through the target pixel $(u_t, v_t)$, and multiple points are sampled along each ray:

$$r_t(\lambda) = C_{pnp} + \lambda d_t, \quad \lambda \in [0, \lambda_{\max}], \tag{7}$$

where $d_t$ is the unit direction vector of the ray through the target pixel.

For each sampled point, the corresponding 3D geographic coordinates are computed and projected into the DSM pixel coordinate system using the horizontal components $(X_t(\lambda), Y_t(\lambda))$. The point exhibiting the minimal height difference relative to the DSM surface is selected as the intersection of the ray with the DSM:

$$\lambda^{*} = \arg\min_{\lambda} \left| Z_t(\lambda) - Z_S\big(X_t(\lambda), Y_t(\lambda)\big) \right|, \tag{8}$$

where $Z_S(\cdot, \cdot)$ denotes the DSM elevation at the corresponding pixel. The intersection point $r_t(\lambda^{*})$ is then converted into geodetic coordinates $(\varphi_t, \lambda_t, h_t)$, yielding the target’s latitude, longitude, and elevation in the world coordinate system.
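Equations (7) and (8) amount to a sampled search along the ray. The sketch below is a minimal version under stated assumptions: the DSM is exposed as a callable `dsm_height(x, y)`, samples are uniformly spaced by `step`, and the function name `ray_dsm_intersection` is hypothetical. The sampling scheme is not specified in the text.

```python
def ray_dsm_intersection(C, d, dsm_height, lam_max, step):
    """Return (lambda*, point) minimizing |Z_t - Z_S| along the ray.

    C: camera centre; d: unit ray direction in world coordinates.
    dsm_height: callable (x, y) -> surface elevation (assumed interface).
    """
    best_lam, best_gap = None, float("inf")
    n_steps = int(lam_max / step)
    for k in range(n_steps + 1):
        lam = k * step
        # Eq. (7): sample point on the ray.
        x, y, z = (C[i] + lam * d[i] for i in range(3))
        # Eq. (8): vertical gap between the ray sample and the DSM.
        gap = abs(z - dsm_height(x, y))
        if gap < best_gap:
            best_lam, best_gap = lam, gap
    point = [C[i] + best_lam * d[i] for i in range(3)]
    return best_lam, point
```

The step size bounds the attainable precision along the ray, so a coarse pass followed by a finer search around the best sample is a natural refinement. Note that the global minimum in Equation (8) can fall on a later crossing when the ray meets the surface more than once.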

Overall, raytracing markedly improves positioning accuracy in complex environments, such as mountainous terrain, while homography provides superior computational speed. Consequently, integrating both strategies within a unified framework not only enhances the system’s versatility but also establishes a consistent benchmark for evaluating algorithmic performance.

5. Results and Discussion

In this section, we provide a comprehensive evaluation of the proposed target geo-localization framework. Section 5.1 describes the experimental methodology, including the setup, data modalities, map sources, matching methods, and evaluation metrics. Section 5.2 and Section 5.3 present the localization performance for RGB and thermal infrared query images on UAV and satellite maps. The evaluation considers not only accuracy and robustness but also the cross-modal generalization capability of the matching and localization algorithms, accounting for factors such as humidity and variations in the camera pitch angle.

5.1. Experimental Methodology

Experimental Setup: Query images are first categorized by modality (infrared or visible light) and further subdivided by map type (UAV or satellite). For each subset, all methods are evaluated on the same query–reference image pairs, data split, target annotations, and evaluation metrics. We compare five representative matching methods—RoMa [12], ELoFTR [11], LoFTR [13], XoFTR [14], and SP-LightGlue [9,10]. During inference, each matcher follows the default configuration of its public implementation. The pre-trained weights released by the MINIMA [41] framework are used for RoMa, ELoFTR, LoFTR, XoFTR, and the LightGlue module of SP-LightGlue. This framework is designed for cross-modal matching between infrared and visible images. The SuperPoint module of SP-LightGlue uses the original public SuperPoint weights. The matched pixel coordinates produced by each method are then passed to the same downstream localization pipeline, where homography-based warping and PnP-based raytracing are used to perform target geo-localization. Homography-based warping is faster and suitable for approximately planar scenes, but is less reliable in non-planar terrain; PnP-based raytracing handles terrain relief better using DSM geometry, but requires reliable 2D–3D correspondences under large viewpoint changes.

Runtime Measurement: All runtime measurements are conducted on the same hardware platform, consisting of an Intel Core i9-14900K CPU, one NVIDIA GeForce RTX 4090 GPU with 24 GB VRAM, and 128 GB system memory. Table 3 reports the average runtime and peak GPU memory consumption under this unified setting. The reported pipeline time is decomposed into reference-map cropping, feature matching, and geographic coordinate estimation. Specifically, “Crop” denotes the average time for extracting the reference region, “Match” denotes the inference time of the matching model, and “Homo.” and “Ray.” denote the coordinate estimation time of homography-based warping and PnP-based raytracing, respectively. “Full Homo.” and “Full Ray.” therefore represent the complete runtime of the corresponding pipelines, including cropping, matching, and coordinate estimation.

Table 3 shows that the overall runtime is dominated by the matching model rather than by the downstream coordinate estimation step. RoMa achieves the best localization accuracy in most settings, but it also incurs the highest computational cost. For example, its full PnP-based raytracing pipeline requires 449.6 ms for RGB images and 394.9 ms for thermal infrared images, with a peak GPU memory usage of 7174 MB. In contrast, SP-LightGlue requires only 124.4 ms and 89.5 ms for the same two modalities, with a peak GPU memory usage of 2874 MB. ELoFTR provides an intermediate trade-off, requiring 142.8 ms and 90.9 ms for RGB and thermal infrared images, respectively.

These results indicate that the choice of matcher and localization strategy should depend on deployment requirements. When offline accuracy is the primary objective and sufficient computational resources are available, RoMa combined with PnP-based raytracing is preferred. For speed-sensitive or memory-constrained UAV applications, SP-LightGlue or ELoFTR is more practical, especially when combined with input-resolution control and model acceleration. From the localization-strategy perspective, homography-based warping introduces lower additional computational overhead and is suitable for relatively flat terrain or near-real-time key-frame localization, whereas PnP-based raytracing is more suitable for complex terrain because it explicitly uses camera pose and DSM geometry, at the cost of additional computation. Current experimental results indicate that the proposed pipeline can support offline evaluation and near-real-time single-frame or key-frame localization analysis. However, without further engineering optimization, especially when dense matching methods such as RoMa are used, the proposed method cannot be directly claimed to meet the requirements of real-time UAV video processing.

Metrics: The localization performance is evaluated using three metrics: median 2D distance error, median height error, and 2D recall within 5 m. These metrics provide a comprehensive assessment of both accuracy and robustness in the target geo-localization task.

Median 2D Error ($\mathrm{error}_{2D}$, m): The median Euclidean distance between the predicted and ground-truth XY coordinates:

$$\mathrm{error}_{2D} = \underset{i = 1, \dots, N}{\operatorname{median}} \sqrt{(\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2} \tag{9}$$

Median Height Error (${error}_{Z}$, m): The median absolute difference between the predicted and ground-truth heights:

$${error}_{Z} = \underset{i = 1, \dots, N}{\operatorname{median}} \left| \hat{z}_{i} - z_{i} \right| \tag{10}$$

2D Recall@5m (Recall@5m): The fraction of queries whose predicted XY positions fall within 5 m of the ground truth:

$$\mathrm{Recall@5m} = \frac{1}{N} \sum_{i = 1}^{N} 1\left(\sqrt{(\hat{x}_{i} - x_{i})^{2} + (\hat{y}_{i} - y_{i})^{2}} \leq 5\ \mathrm{m}\right) \tag{11}$$

where $1 (\cdot)$ is the indicator function that returns 1 if the condition is satisfied and 0 otherwise. In geo-localization tasks such as search-and-rescue, a 5-m XY tolerance is generally sufficient for practical operations, making Recall@5m a meaningful metric.
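The three metrics in Equations (9)–(11) can be sketched in a few lines. This is a minimal illustration rather than the benchmark's evaluation script: the function name `localization_metrics` is hypothetical, and it assumes predictions and ground truth are already expressed in a common metric frame (for example, projected easting, northing, and height in metres) rather than in latitude and longitude.

```python
import math
from statistics import median


def localization_metrics(pred_xyz, gt_xyz, threshold_m=5.0):
    """Return median 2D error, median height error, and 2D recall.

    pred_xyz, gt_xyz: equal-length sequences of (x, y, z) tuples in metres.
    """
    if len(pred_xyz) != len(gt_xyz) or len(pred_xyz) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    # Horizontal (XY) Euclidean distance per query, as in Equation (9).
    d2 = [math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in zip(pred_xyz, gt_xyz)]
    # Absolute height difference per query, as in Equation (10).
    dz = [abs(p[2] - g[2]) for p, g in zip(pred_xyz, gt_xyz)]
    # Fraction of queries within the threshold, as in Equation (11).
    recall = sum(1 for d in d2 if d <= threshold_m) / len(d2)

    return {"error_2d": median(d2), "error_z": median(dz), "recall": recall}


if __name__ == "__main__":
    gt = [(0.0, 0.0, 0.0)] * 4
    pred = [(3.0, 4.0, 1.0), (0.0, 1.0, 2.0), (6.0, 8.0, 0.5), (0.0, 0.0, 0.0)]
    print(localization_metrics(pred, gt))
```

Because the median ignores the magnitude of outliers while the recall counts them, reporting both separates typical accuracy from failure rate.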

5.2. Localization Performance on RGB Data

Table 4 summarizes the overall performance of different matching and localization methods, while Figure 6 illustrates the height-error component of localization performance across different scenarios and map types. We evaluated the localization accuracy of different algorithms on both satellite and UAV maps, with the results presented in Table 5 and Table 6. The visualization of some matches and the final localization results are shown in Figure 7 and Figure 8.

Matching Methods: Table 4 shows the overall performance ranking in the RGB target geo-localization task: RoMa > ELoFTR > SP-LightGlue > XoFTR > LoFTR. RoMa achieves the best performance and produces the largest number of matching points. Using the raytracing method on the UAV map, RoMa attains an ${error}_{2D}$ of approximately 1 m.

ELoFTR ranks second, maintaining a stable Recall@5m of approximately 0.7 across multiple satellite map scenarios. SP-LightGlue and XoFTR show unstable performance; for example, SP-LightGlue’s Recall@5m ranges from 0.1 to 0.7 across different scenes, indicating sensitivity to scene texture.

On the UAV map, SP-LightGlue narrows the performance gap with ELoFTR, even surpassing it in some scenarios, suggesting potential under small viewing-angle variations. XoFTR and LoFTR lag behind on both map types, with low Recall@5m and large ${error}_{2D}$ on satellite maps, indicating severe feature degradation due to viewpoint and resolution differences.

Localization Algorithms: Table 4 shows that the raytracing method generally outperforms homography in localization. On both satellite and UAV maps, Recall@5m improves when using raytracing. This improvement arises because raytracing explicitly models the three-dimensional geometric relationship between the UAV and the map, reducing planar projection offsets caused by pose estimation errors or terrain variations and thus enhancing overall localization accuracy.

Table 5 further indicates that the performance gain of raytracing varies across scenarios. In flat or densely built-up areas (e.g., Lakeside, Pond, Square, Campus, and Residential), where ground elevation changes are minimal, homography still provides a reasonable approximation, resulting in negligible differences compared to raytracing. In some cases (e.g., RoMa in the Pond scene), raytracing may even slightly underperform.
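The homography route can be sketched as follows, under stated assumptions. The homography `H` is taken to map query-image pixels to pixels of the cropped reference orthophoto (in practice it could be estimated from the matches with a robust estimator such as OpenCV's `cv2.findHomography`); the reference raster is assumed to be north-up with a square ground sampling distance `gsd`; and the height is read from the nearest DSM cell. All function and parameter names are hypothetical.

```python
def apply_homography(H, u, v):
    """Map pixel (u, v) through a 3x3 homography given as nested lists."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w


def pixel_to_world(col, row, origin_x, origin_y, gsd):
    """North-up raster: x grows with the column, y shrinks with the row."""
    return origin_x + col * gsd, origin_y - row * gsd


def localize_by_homography(H, target_uv, dsm, origin_x, origin_y, gsd):
    """Return (x, y, h) for a target pixel under the planar assumption."""
    col, row = apply_homography(H, target_uv[0], target_uv[1])
    x, y = pixel_to_world(col, row, origin_x, origin_y, gsd)
    # Nearest-cell height lookup, clamped to the raster bounds.
    r = min(max(int(round(row)), 0), len(dsm) - 1)
    c = min(max(int(round(col)), 0), len(dsm[0]) - 1)
    return x, y, dsm[r][c]


if __name__ == "__main__":
    H = [[0.5, 0.0, 10.0], [0.0, 0.5, 20.0], [0.0, 0.0, 1.0]]
    dsm = [[float(r) for _ in range(100)] for r in range(100)]
    print(localize_by_homography(H, (100.0, 60.0), dsm, 1000.0, 2000.0, 0.5))
```

Because a single plane is fitted to all matches, this mapping is only as good as the flatness of the scene, which is consistent with the small gap to raytracing reported for the flat scenes.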

In contrast, raytracing shows clear advantages in mountainous regions with pronounced topographical variations. For example, on the satellite map in the Mountain scenario, RoMa’s Recall@5m increases from 0.11 to 0.67, while ${error}_{2D}$ decreases from 27.09 m to 3.71 m. Elevation differences in such terrain cause severe perspective distortions and projection offsets, which homography cannot accurately capture. By combining pose matrices with depth information, raytracing computes true projection intersections, substantially reducing terrain-induced matching errors.
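The text does not specify how the ray and surface intersection is computed, so the following is only one plausible sketch: march along the viewing ray in fixed steps until it drops below the surface, then refine the crossing by bisection. The camera centre and ray direction are assumed to come from the PnP pose, and `height_fn` is a hypothetical stand-in for DSM interpolation.

```python
import math


def intersect_ray_with_dsm(origin, direction, height_fn,
                           step=1.0, max_range=2000.0, refine_iters=30):
    """Return the first (x, y, z) where the ray meets the surface, or None.

    origin: camera centre (x, y, z) in metres.
    direction: viewing ray in the same frame (need not be unit length).
    height_fn: callable (x, y) -> surface height, standing in for a DSM lookup.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = [c / norm for c in direction]

    def point(t):
        return tuple(o + t * c for o, c in zip(origin, d))

    def clearance(t):
        x, y, z = point(t)
        return z - height_fn(x, y)  # positive while the ray is above ground

    if clearance(0.0) <= 0.0:
        raise ValueError("ray origin is not above the surface")

    t_prev, t = 0.0, step
    while t <= max_range:
        if clearance(t) <= 0.0:
            lo, hi = t_prev, t  # the crossing lies in (lo, hi]
            for _ in range(refine_iters):
                mid = 0.5 * (lo + hi)
                if clearance(mid) > 0.0:
                    lo = mid
                else:
                    hi = mid
            return point(hi)
        t_prev = t
        t += step
    return None  # no intersection within max_range


if __name__ == "__main__":
    # Flat terrain 10 m above datum, camera at 110 m, ray pointing 45 degrees down.
    hit = intersect_ray_with_dsm((0.0, 0.0, 110.0), (1.0, 0.0, -1.0), lambda x, y: 10.0)
    print(hit)
```

In this toy case the ray meets the 10 m surface 100 m ahead of the camera, whereas intersecting a zero-height plane would place the target 110 m ahead; the 10 m difference is the kind of terrain-induced offset a planar model cannot remove. Fixed-step marching can skip thin structures narrower than `step`, so the step should be chosen relative to the DSM resolution.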

Map Types: Table 4 shows that localization accuracy on UAV maps generally surpasses that on satellite maps. As shown in Figure 6, the elevation error in UAV maps is generally much smaller than that in satellite maps. For the same matching method, UAV maps consistently achieve higher Recall@5m and lower ${error}_{2D}$. For example, RoMa, ELoFTR, and SP-LightGlue demonstrate an average improvement of approximately 0.1–0.2 in Recall@5m on UAV maps compared to satellite maps. Figure 7 and Figure 8 further illustrate that the number of correspondences on UAV maps is substantially greater. SP-LightGlue, in particular, shows significant performance gains on UAV maps, suggesting that cross-source matching is more feasible when the reference and query maps share similar spatial resolution, radiometric characteristics, and geometric configuration.

However, satellite maps can typically be obtained in advance via online sources, whereas UAV maps require on-site data collection and modeling. For emergency applications, such as search-and-rescue, the creation of UAV maps and 3D models is more time-consuming and poses greater organizational and technical challenges.

Scenario Types: Table 5 and Table 6 show that localization performance varies significantly across different scenarios. Overall, Lakeside, Square, Industrial, and Farmland achieve superior results, followed by Pond, Campus, and Residential, while Mountain exhibits the poorest performance.

On the satellite map, for example, ELoFTR attains a Recall@5m of 0.7 and an ${error}_{2D}$ of approximately 3 m in Lakeside, Square, Industrial, and Farmland. This suggests that regions with well-defined terrain textures and structural features (e.g., water edges or urban blocks) enable local feature matching models to establish cross-domain correspondences with relative stability. In contrast, in Pond, Campus, and Residential, performance deteriorates markedly: Recall@5m drops to roughly half, and ${error}_{2D}$ exceeds 30 m. These results indicate that texture repetition and low surface reflectance increase matching uncertainty.

Performance further degrades in complex terrains such as the Mountain scenario. Using RoMa with the homography method, Recall@5m reaches only 0.11, and ${error}_{2D}$ rises to 27.09 m, indicating that planar projections cannot effectively capture geometric shifts caused by elevation differences. When employing the raytracing method, Recall@5m improves to 0.68, but still remains lower than in flat scenes. These observations demonstrate that terrain undulations and viewpoint variations are critical factors affecting the stability of cross-domain matching models and highlight key directions for future robustness improvements.

5.3. Localization Performance on Thermal Data

Table 7 and Table 8 present the localization performance of each method on satellite and UAV maps, respectively. The visualization of selected matches and the final localization results are shown in Figure 9 and Figure 10.

Matching Methods: Table 4 shows that the overall performance ranking of the evaluated methods is: RoMa > ELoFTR > XoFTR ≈ SP-LightGlue > LoFTR. On the UAV map, XoFTR and SP-LightGlue exhibit performance gains, approaching ELoFTR, whereas LoFTR remains the least effective. RoMa consistently outperforms all other methods across all metrics, maintaining a two-dimensional positioning error below 4 m on both satellite and UAV maps. Figure 9 and Figure 10 indicate that RoMa achieves significantly higher matching point counts, demonstrating superior stability and robustness.

ELoFTR ranks second. On the satellite map, its Recall@5m is roughly half that of RoMa, while the ${error}_{2D}$ remains below 10 m. On the UAV map, it remains highly competitive, trailing only RoMa. In contrast, XoFTR exhibits pronounced scene dependency. On the satellite map, in Square, Farmland, and Industrial scenarios (particularly with the raytracing method), XoFTR attains a Recall@5m of approximately 0.5 and an ${error}_{2D}$ of around 5 m. However, its performance deteriorates sharply in Pond, Campus, and Residential, with Recall@5m falling below 0.3 and ${error}_{2D}$ exceeding 20 m, indicating significant instability.

SP-LightGlue demonstrates relatively balanced performance on satellite imagery. Using the homography method in Pond, Square, Farmland, and Mountain, its ${error}_{2D}$ is approximately 30 m. In contrast, LoFTR underperforms on both map types, with Recall@5m approaching zero (Table 4), confirming its limitations in cross-modal matching tasks.

Localization Algorithms: Table 4 shows that the raytracing method generally outperforms the homography method, with the performance gain being more pronounced in complex terrains, particularly in the Mountain scenario.

Specifically, as shown in Table 7, the raytracing method improves localization accuracy across most matching algorithms and scenarios. In Campus and Residential, it increases Recall@5m by approximately 0.1 on average. In the Mountain scenario, the improvement is especially significant, with Recall@5m rising from 0.47 to 0.77.

This improvement is mainly due to the raytracing method’s ability to more accurately model the three-dimensional perspective relationship between the camera and the ground, including terrain undulations. As a result, it provides more precise spatial projection estimates under varying viewing angles and topographical conditions, effectively reducing errors caused by the planar projection assumption.
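A first-order geometric sketch, not taken from the paper's own derivation, illustrates why the planar assumption costs more over relief and at shallow viewing angles. Assume a target lies $\Delta h$ above or below the plane implied by the homography and is observed along a ray with depression angle $\theta$ below the horizontal (so $\theta = 60^{\circ}$ for a pitch of $-60^{\circ}$). Intersecting the ray with that plane instead of the true surface then shifts the estimated position horizontally by approximately

$$\Delta d \approx \frac{|\Delta h|}{\tan \theta}.$$

For an illustrative $\Delta h = 5$ m, this gives $\Delta d \approx 2.9$ m at $\theta = 60^{\circ}$, $5.0$ m at $45^{\circ}$, and $8.7$ m at $30^{\circ}$, so the same height discrepancy costs three times more horizontal accuracy at the shallowest pitch than at the steepest. Raytracing against the DSM removes this term only to the extent that the DSM and the estimated pose are themselves accurate.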

Map Types: Compared to satellite maps, UAV maps generally provide improved localization accuracy and stability. This is because UAV perspectives are closer to those of the query images, reducing cross-view differences. In complex scenes, UAV maps mitigate matching challenges caused by sparse textures and complex terrain in satellite maps, facilitating more reliable cross-modal localization. As shown in Figure 6, the height errors on satellite maps are consistently larger than those on UAV maps. Table 4 shows that across all algorithms, Recall@5m improves and ${error}_{2D}$ decreases on UAV maps. Particularly when using the raytracing method, both RoMa and ELoFTR achieve Recall@5m exceeding 0.9, highlighting the crucial role of viewpoint consistency in cross-modal geo-localization tasks.

Scenario Types: Overall, all algorithms perform most stably in open, structurally clear, and highly textured areas, while performance degrades in scenes with highly undulating terrain or sparse textures. Table 7 shows that in regular areas with distinct structures and textures—such as Lakeside, Square, Industrial, and Farmland—the Recall@5m of RoMa, ELoFTR, and XoFTR is significantly higher than in other scenes. In contrast, in mountainous terrain, complex topography, repetitive textures, and differences between thermal infrared images and reference maps collectively lead to substantial declines in matching performance. Both raytracing and homography methods yield Recall@5m below 0.8, indicating that traditional feature matching algorithms struggle to maintain stable localization in complex terrain or low-texture environments.

Pitch Angle Comparison: To evaluate the impact of perspective variations on localization performance, 168 sample groups were constructed at a fixed altitude of 200 m. Each group contains three images with pitch angles of $-60^{\circ}$, $-45^{\circ}$, and $-30^{\circ}$ and similar yaw angles, yielding a total of 504 thermal infrared images.

As shown in Table 9, localization performance on both satellite and UAV maps follows the overall trend $-60^{\circ}$ > $-45^{\circ}$ > $-30^{\circ}$. Steeper pitch angles (closer to $-60^{\circ}$) are closer to the orthographic view of the map, resulting in more stable feature matching. For instance, on the UAV map, XoFTR achieves Recall@5m exceeding 0.9 at both $-60^{\circ}$ and $-45^{\circ}$, with ${error}_{2D}$ around 1 m. Conversely, at $-30^{\circ}$, the raytracing method’s Recall@5m drops to 0.4, while the homography method achieves only 0.2, with ${error}_{2D}$ exceeding 40 m.

Similar trends are observed for ELoFTR, XoFTR, and SP-LightGlue, all showing better performance at $-60^{\circ}$ than at $-45^{\circ}$ and $-30^{\circ}$. On satellite imagery, overall performance degrades, yet the relative trends remain consistent. These results indicate that tilting the camera away from nadir (from $-60^{\circ}$ toward $-30^{\circ}$) widens the viewing-angle gap to the orthographic map, and the resulting geometric mismatch is the primary factor affecting cross-modal matching accuracy.

Humidity Comparison: To analyze the impact of humidity on cross-modal localization performance, 1868 thermal infrared images were selected. These images cover all scenarios in the dataset and were chosen with similar viewing angles (yaw angle difference less than $5^{\circ}$) at identical altitudes and pitch angles.

As shown in Table 10, lower humidity generally leads to superior localization results. On UAV maps, RoMa, ELoFTR, and XoFTR all perform better under low humidity conditions. For instance, using SP-LightGlue, Recall@5m reaches 0.65 in low humidity scenarios, but drops to 0.50 under high humidity conditions. ELoFTR demonstrates comparable performance under both humidity levels, indicating a degree of robustness.

On satellite maps, overall accuracy declines for all methods, and the effect of humidity variations becomes less pronounced. For example, SP-LightGlue maintains Recall@5m between 0.1 and 0.2 under both humidity conditions. Overall, high humidity reduces infrared texture contrast and feature stability, thereby lowering matching accuracy. This effect is more pronounced on UAV maps where viewing angle variations are minimal.

6. Conclusions

This paper presents SkyPin, the first benchmark dataset for UAV passive target geo-localization. We develop a localization pipeline that performs geometric refinement through image matching and 2.5D maps. Using SkyPin, we systematically evaluate multiple feature matching algorithms together with two geometric projection methods. Experimental results show that RoMa combined with raytracing achieves the best localization accuracy in both RGB and thermal infrared modalities. However, notable performance degradation persists under large pitch angles and complex mountainous terrain, highlighting the critical challenges posed by topographical relief and UAV attitude variations.

To advance research in this field, we will publicly release the complete SkyPin dataset and benchmark code. This will provide a unified, reproducible evaluation platform, thereby facilitating the development and deployment of more robust passive UAV target localization methods in complex real-world scenarios. Future work will focus on two main directions. First, we will explore explicit uncertainty propagation by incorporating feature-matching covariance, PnP pose uncertainty, and DSM noise into coordinate-level confidence estimation. Second, while the current benchmark assumes stable geographic structures, extending SkyPin to dynamic scenes through map updating, change detection, and temporal filtering remains an important avenue for future research.

Author Contributions

Conceptualization, Z.W., Y.L., Y.H., S.Y. and M.Z.; methodology, Z.W., R.W. and Y.L.; software, Z.W. and R.W.; investigation, Y.H.; writing—original draft preparation, Z.W.; writing—review and editing, S.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article. The code and the SkyPin dataset developed for this research are openly available at https://github.com/nudt-sawlab/Skypin (accessed on 4 June 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

UAV	Unmanned aerial vehicle
RTK	Real-Time Kinematic
TIR	Thermal infrared
DOM	Digital orthophoto map
DSM	Digital surface model
PnP	Perspective-n-Point

Figure 1. UAV target geo-localization aims to estimate the precise geographical coordinates of captured ground targets using observed drone data.

Figure 2. Overview of the dataset. The dataset covers different scenes, modalities (RGB-TIR), flight altitudes, and camera pitch angles. Here, the icon Drones 10 00500 i001 indicates that the scene includes both daytime and nighttime images, and the icon Drones 10 00500 i002 indicates multiple weather conditions.

Figure 3. Samples of SkyPin. Query images with similar yaw angles are shown. The left two columns present RGB query images, while the right two columns show thermal query images. From top to bottom, the pitch angles are $-60^{\circ}$, $-45^{\circ}$, and $-30^{\circ}$, with nadir (vertical downward) corresponding to $-90^{\circ}$. The pink and blue boxes correspond to flight altitudes of 100 m and 200 m, respectively.

Figure 4. Query image collection. Images were captured during $360^{\circ}$ circular UAV flights around targets, acquiring both RGB and thermal data at varying flight altitudes and camera pitch angles.

Figure 5. Overview of the localization pipeline. The pipeline consists of three main steps: (1) cropping a reference map $R_{ref}$ (containing the cropped DOM $I_{D}$ and DSM $P_{S}$) based on the drone’s prior pose $\xi$ and the 2.5D map $M_{2.5D}$; (2) performing feature matching between the query image $I$ and $I_{D}$ to obtain 2D–2D correspondences; (3) computing the geometric transformation from the matching results and DSM, and mapping the target pixel coordinates $(u_{t}, v_{t})$ into the world coordinate system to obtain the target’s latitude, longitude, and elevation $(\varphi_{t}, \lambda_{t}, h_{t})$.

Figure 6. Median height error of localization results. Subplots show the outcomes for RGB and thermal query images on both satellite and UAV maps.

Figure 7. Example localization results of RGB query images on the satellite map. The figure contains 8 × 5 subplots arranged by scene and method (labels are shown in the figure). For each method, blue boxes indicate the matching point pair relationships between the query and reference images; pink boxes show the query image warped onto the reference via homography; green boxes show the reference map. In the homography and raytracing images, Drones 10 00500 i003