SAVL: Scene-Adaptive UAV Visual Localization Using Sparse Feature Extraction and Incremental Descriptor Mapping

Liu, Ganchao; Li, Zhengxi; Gao, Qiang; Yuan, Yuan

doi:10.3390/rs17142408

Open AccessArticle

SAVL: Scene-Adaptive UAV Visual Localization Using Sparse Feature Extraction and Incremental Descriptor Mapping

¹

School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China

²

Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China

³

General Department IV, Xi’an Institute of Applied Optics, Xi’an 710072, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(14), 2408; https://doi.org/10.3390/rs17142408

Submission received: 12 May 2025 / Revised: 6 July 2025 / Accepted: 7 July 2025 / Published: 12 July 2025

(This article belongs to the Special Issue Advances in Computer Vision and Machine Learning Applications on Remote Sensing Images)

Download

Browse Figures

Versions Notes

Abstract

In recent years, the use of UAVs has become widespread. Long distance flight of UAVs requires obtaining precise geographic coordinates. Global Navigation Satellite Systems (GNSS) are the most common positioning models, but their signals are susceptible to interference from obstacles and complex electromagnetic environments. In this case, vision-based technology can serve as an alternative solution to ensure the self-positioning capability of UAVs. Therefore, a scene adaptive UAV visual localization framework (SAVL) is proposed. In the proposed framework, UAV images are mapped to satellite images with geographic coordinates through pixel-level matching to locate UAVs. Firstly, to tackle the challenge of inaccurate localization resulting from sparse terrain features, this work proposes a novel feature extraction network grounded in a general visual model, leveraging the robust zero-shot generalization capability of the pre-trained model and extracting sparse features from UAV and satellite imagery. Secondly, in order to overcome the problem of weak generalization ability in unknown scenarios, a descriptor incremental mapping module was designed, which reduces multi-source image differences at the semantic level through UAV satellite image descriptor mapping and constructs a confidence-based incremental strategy to dynamically adapt to the scene. Finally, due to the lack of annotated public datasets, a scene-rich UAV dataset (RealUAV) was constructed to study UAV visual localization in real-world environments. In order to evaluate the localization performance of the proposed framework, several related methods were compared and analyzed in detail. The results on the dataset indicate that the proposed method achieves excellent positioning accuracy, with an average error of only 8.71 m.

Keywords:

UAV visual localization; image matching; incremental learning; basic model

1. Introduction

Recently, the rapid advancement of UAV technology has led to the emergence of various new UAVs. In contrast to manned aircraft, UAVs have garnered widespread attention due to their advantages of low cost, enhanced safety, flexibility, and convenience. As a result, they are now widely employed across diverse applications, ranging from agriculture to emergency response, and have become increasingly significant [1]. With improvements in onboard computing power and lightweight algorithms, various visual algorithms—such as target tracking [2], target detection [3], and Simultaneous Localization and Mapping (SLAM) [4]—have been extensively used in UAVs. A critical factor in achieving the multifunctionality of UAVs is precise positioning, typically provided by Global Navigation Satellite Systems (GNSS). However, GNSS signals are vulnerable to interference from environmental factors. Therefore, research into autonomous navigation and positioning technologies for UAVs holds considerable practical significance [5]. Navigation and localization are inherently connected tasks, and many navigation systems rely on accurate localization as a foundation. Visual localization, as a passive and self-contained positioning method, is immune to electromagnetic interference and serves as a key alternative or complement to GNSS, especially in challenging scenarios [6].

In the domain of autonomous navigation and localization, the use of visual information has emerged as a critical area of research. Visual localization systems can be broadly categorized into two types based on different prior information: relative visual localization methods and absolute visual localization methods. Common relative visual localization techniques include Visual Odometry (VO) [7], Visual-Inertial Odometry (VIO) [8,9], and Simultaneous Localization and Mapping (SLAM) [10]. VO estimates relative motion by analyzing the variation of image features between adjacent frames, while VIO [11] further improves upon this by combining data from visual sensors and an Inertial Measurement Unit (IMU) to estimate device orientation and trajectory. These methods extract significant features solely from the observed visual images, without the need to generate geographic reference maps for simultaneous localization and mapping (SLAM), which has been widely adopted in the field of autonomous control. SLAM enables the generation of three-dimensional maps during operation to facilitate positioning [12]. These relative localization techniques are also fundamental to many vision-based navigation systems, which use visual information to support autonomous path following, obstacle avoidance, and control in GPS-denied environments [13]. Vision-based navigation typically integrates modules for localization, mapping, and planning to achieve real-time closed-loop control.

As shown in Figure 1, scene matching visual localization is a system that maps UAV images onto satellite images to obtain absolute positional information [14,15]. Unlike relative visual localization methods, absolute visual positioning technology enables the precise determination of the geographic location of UAVs, thereby eliminating cumulative errors. As a result, absolute visual localization methods present a challenging yet valuable area of research.

In image-matching-based visual localization systems, the images captured by UAVs and satellites often differ in platform, season, and scene, leading to inconsistencies in style and perspective. Furthermore, some images exhibit sparse textures that are difficult to describe. Examples of this are shown in Figure 2. The heterogeneity of multi-source images [16] presents significant challenges in image feature extraction and matching. This results in sparse feature representations and inaccurate scene localization, as well as weak generalization to unknown scenes. Moreover, the lack of labeled real-world scene datasets poses a considerable challenge for algorithm training and evaluation.

To address the above challenges, many researchers have conducted corresponding work. The image matching method based on manual feature descriptors has been applied in this field. Generally speaking, they detect specific keypoints from the image, such as lines/edges [17] and corners [18]. Then, patch descriptors are constructed around key points using local information and described in high-dimensional vectors. However, the design of manual descriptors relies on experience and specific application conditions and cannot extract stable matching descriptions in complex dynamic environments.

In recent years, feature extraction and matching methods based on deep learning have been applied in scene classification [19], cross-modal retrieval [20,21,22,23], multi-source image matching [24], and localization [25,26]. It has experienced rapid development. Some works attempt to extract global features from the entire image and use image retrieval methods to solve localization problems. In [27], UAV views were introduced for cross-view geolocation, and tasks such as UAV field of view target localization and UAV navigation were proposed, expanding the application of UAV geolocation. DenseUAV [28] uses dense sampling to fix the UAV’s perspective as a top–down view, enabling the UAV to obtain more accurate geographic locations. LPN [29] proposes a square partitioning strategy that extracts useful information from edge and partitioning schemes, significantly improving performance. However, these methods focus on low-altitude multi-view issues and cannot obtain accurate geographic locations. The generalization ability of the scene is weak and cannot meet the working requirements of UAVs in various scenarios.

Unlike methods based on global feature extraction, keypoint matching can obtain pixel-level localization results. LIFT [30] attempts to achieve keypoint detection, direction estimation, and feature description simultaneously through end-to-end convolutional neural networks. SuperPoint [31] proposed a self-supervised framework to train interest point detectors and descriptors. Xfeat [32] reduces the weight of the model by separating the descriptor and feature point encoder parts, greatly improving the running speed without sacrificing accuracy. However, these methods only consider scenes with rich textures and perform poorly in scenes with weaker textures.

This article proposes a multi-scale feature fusion visual localization framework, SAVL, to address the above issues. Firstly, to address the issue of poor performance in weakly textured scenes, SAVL has designed a feature reconstruction module that combines coarse and fine features to effectively capture low-texture features in complex scenes and obtain feature maps with richer semantic information. In addition, given the weak generalization ability of unknown scenes, SAVL designed a descriptor incremental mapping module that reduces multi-source image differences at the semantic level through UAV satellite image descriptor mapping, and constructs a confidence-based incremental strategy to dynamically adapt to the scene. Finally, a scene-rich UAV dataset (RealUAV) was constructed to study UAV visual localization in real-world environments.

In summary, our work makes the following contributions:

This article proposes a UAV adaptive visual localization framework, SAVL. In the proposed SAVL framework, a feature reconstruction module was designed that combines coarse and fine features, effectively capturing low-texture features in complex scenes and obtaining feature maps with richer semantic information.
A descriptor incremental mapping module was established based on incremental learning. This module dynamically focuses on salient features in different scenarios, reduces multi-source image differences at the semantic level, and dynamically adapts to different scenarios based on an incremental strategy of confidence.
Extensive real-world experiments were conducted to demonstrate the effectiveness of the algorithm proposed in this article in real-world scenarios. The dataset used for these experiments has been released and is called RealUAV.

The remainder of this article is structured as follows: relevant prior work is reviewed in Section 2. The proposed model architecture is described in Section 3, followed by an introduction to the dataset and the design of a novel evaluation metric in Section 4. Experimental results and visualization analyses are presented in Section 5. Finally, conclusions are drawn in Section 6.

2. Realted Work

This section outlines the key issues and related work in UAV visual positioning technology. Specifically, Self-supervised visual models enhance feature extraction. Image feature description techniques improve generalization. Absolute visual localization enables precise geographic positioning of UAVs. The relevant work is introduced separately below.

2.1. Self-Supervised Vision Models

Inspired by the language model used in Transformers [33], pre-trained base models on large-scale datasets have recently shown great potential for learning universal features for various visual tasks through self-supervised learning. Caron et al. IBOT [34] integrates Masked Image Modeling (MIM) within the self-distillation framework to develop semantically rich visual markers that produce robust features, which perform well in various dense downstream tasks. Similarly, DINOv2 [35] emphasizes that self-supervised methods can generate multifunctional visual features that can be applied across different image distributions and tasks, provided they are trained on sufficiently large datasets without the need for fine-tuning. In OmniGlue [36] leverages coarse features to guide feature propagation, suppressing irrelevant keypoints without direct use. RoMa [37] and MatchAnything [38] build feature pyramids to extract dense, fine-grained features and improve matching precision. However, these methods often rely on deeper and more complex networks, leading to higher computational costs that are unsuitable for resource-constrained UAV platforms. Moreover, the focus on fine details may come at the expense of coarse-grained structural cues, which are essential for robust localization across wide-baseline UAV and satellite views. Balancing coarse and fine features, while keeping the model lightweight, is key to reducing localization errors in practical UAV scenarios.

2.2. Image Feature Description

Recently, learning-based local features have been developed in an attempt to replace manually crafted descriptors. Dense pixel features naturally emerge from CNN and provide powerful representations for image matching and localization [39,40]. However, due to limited computing power, it is difficult to match dense features. Sparse learning features composed of keypoints and descriptors provide an attractive alternative to handcrafted features. CNN keypoint detection has also been proven to be superior to classical methods, although they are clearly difficult to learn. DELF [41] uses an attention mechanism to optimize landmark recognition tasks. However, these methods have limited generalization ability, and their performance rapidly declines when there are changes in lighting and ground conditions. Inspired by style transformation and dynamic increment in [42,43], we adopt descriptor increment mapping to reduce the differences between UAV and satellite images, thereby improving the robustness of the model.

2.3. Geo-Localization Datasets

Absolute visual localization refers to the estimation of a UAV’s absolute position within a scene based on the analysis of images. This technology typically employs pre-built maps or reference datasets, achieving localization by matching visual features obtained from sensors with those in the map. Zheng et al. [27] introduced a cross-geolocation dataset and proposed tasks such as target localization and UAV navigation within the field of view of UAVs. In [28,44], the dataset was constructed using unmanned platforms at varying heights and perspectives to enrich application scenarios, bringing them closer to real-world environments. UAV-VisLoc [45] provides diverse visual data captured under different heights and brightness conditions. However, some datasets primarily focus on texture-rich areas, such as buildings, and their diversity remains limited. Additionally, the collection conditions of certain datasets are inconsistent. The creation of a multi-scenario dataset to better align with real-world applications remains an urgent challenge.

3. Method

In this section, we provide a detailed explanation of the proposed SAVL. The flowchart of the visual positioning framework is shown in Figure 3. It primarily consists of two parts: multi-source image feature extraction and descriptor incremental mapping. The multi-source image feature extraction is performed using a feature extraction network that simultaneously extracts key points and descriptors from both UAV and satellite images. This process establishes a matching relationship between features from different sources. In descriptor incremental mapping, a confidence evaluation mechanism is employed to assess the reliability of the matching results. Based on the high-confidence matching results, the descriptors are dynamically adjusted to reduce the differences between the multi-source images. Below, this article provides a detailed introduction to the implementation principles and specifics of this method, focusing on three aspects: problem definition, feature reconstruction, descriptor incremental mapping, and localization inference.

3.1. Problem Definition

UAV visual positioning aims to determine the position of a UAV through image matching between satellite and UAV images. Given a set of n consecutive UAV images

Q = {q_{1}, q_{2}, \dots, q_{n}}

, the task can be expressed as finding their corresponding positions

P = {p_{1}, p_{2}, \dots, p_{n}}

using a large-scale orthorectified satellite image M, where each pixel is annotated with longitude and latitude coordinates. These n UAV images are typically captured consecutively during the UAV’s flight, and the ground coordinates of each frame are represented as

{pcoords}_{i}

, which denotes the UAV’s current position. The satellite image includes GPS coordinates, LLcoords and URcoords, corresponding to the lower left and upper right corners of the map, respectively. By matching UAV images with satellite images and calculating the homography matrix, the geographic location of the UAV can be obtained.

3.2. Feature Reconstruction

Keypoint extraction is a critical issue, ranging from manually designed SIFT [46] to current deep learning-based methods such as Xfeat [32]. These methods perform well in indoor and architectural settings. However, due to limitations in high-quality, large-scale annotated datasets, model parameters, and model architecture, their performance is poor in low-texture scenarios. In addition to establishing additional feature extraction modules at key points, there may be another approach: Can we introduce a self-supervised feature extraction network trained on a large amount of data to solve the problem of sparse feature extraction?

Recently, general visual models such as self-supervised visual backbones [47] and diffusion models [48] have demonstrated powerful semantic understanding capabilities. By training on big data, these models demonstrate strong cross-domain generalization ability [49], which enables them to obtain rough patch-level matching results. However, using the image features extracted by these models for matching has limited performance on regions/keypoints without strong semantic information. So, how to combine the extensive domain knowledge encoded in the model has become the key to the problem. To solve this problem, the FR module proposed in this work incorporates the coarse semantic features from general visual models into the generation of keypoints and descriptors. By fusing coarse (semantic-level) and fine (local texture) features, the FR module enhances the generalization capability of the network across different domains and improves feature extraction performance in low-texture or structurally sparse environments.

The input RGB image is processed by SAVL, which uses a pre-trained ViT/S as a coarse feature encoder to extract coarse patch feature maps with a height and width of 28, and a channel count of 384. This approach is effective because extensive knowledge helps guide the model to capture overall semantic information by identifying domain-specific features that were not seen during training. Next, a fine-grained encoder is employed to extract detailed features, resulting in a more accurate fine-grained feature map with a reduction in both length and width by a factor of 1/8, and a channel count of 128. This module focuses on texture features, such as corners and lines in the image, in order to capture local semantic information. In the feature reconstruction module, the coarse patches are first upsampled using bilinear interpolation. Then, the number of channels is adjusted using a

1 \times 1

convolution to align the feature maps. After channel concatenation and feature fusion, the fused feature map is obtained by performing weighted fusion with the fine-grained feature maps. This process enhances the robustness and richness of the feature representation.

Feature descriptors and feature points are generated through a dual-branch decoder. In the feature point branch, the network adjusts the number of feature map channels to 65. Here, 65 indicates that each pixel of the feature map represents a local area of

8 \times 8

in the original image, with an additional “garbage bin” channel for regions where no feature points are detected. Using Softmax and reshaping operations, the final features are restored to the size of the original image. In the feature descriptor generation branch, the dimensionality of the feature map channel is adjusted to M, where M represents the dimensionality of the output features. By employing channel normalization and bilinear interpolation based on feature point positions, feature vectors are generated, and descriptors are obtained.

The loss function comprises two components. The first part corresponds to the loss of the feature point decoder:

L_{p} (X, Y) = \frac{1}{H \times W} \sum l_{p} (x; y)

(1)

let X be the output of the feature point decoder and Y be the corresponding keypoint labels. Let x represent an

8 \times 8

pixel cell, and y denote the entity corresponding to x in Y. The cross-entropy loss

l_{p}

for a cell

x \in X

is defined as follows:

l_{p} (x; y) = - \log (softmax (x_{y})) .

(2)

The descriptor loss is defined as follows:

L_{d} (D, D^{'}, S) = \frac{1}{{(H \times W)}^{2}} \sum \sum l_{d} (d, d^{'}; s)

(3)

here,

D^{'}

represents the labeled descriptors corresponding to D. S is an indicator matrix that specifies whether d corresponds to

d^{'}

. The hinge loss

l_{d}

used to constrain descriptor consistency expression is defined as follows:

\begin{matrix} l_{d} (d, d^{'}; s) & = λ_{d} * s * \max (0, m_{p} - d^{T} d^{'}) \\ + (1 - s) * \max (0, d^{T} d^{'} - m_{n}) \end{matrix}

(4)

among them

λ_{d}

is 250,

m_{p}

is 1,

m_{n}

is 0.2.

This article uses a homography adaptive self-monitoring strategy. This process generates new image pairs with self-labeled keypoints and trains the model using the aforementioned loss function, enabling the network to effectively focus on features in sparse scenes and enhancing the localization capability in low-texture areas.

3.3. Descriptor Increment Mapping

Due to the influence of land cover changes, seasonal variations, and imaging methods, there are differences between UAV images and satellite images, which lead to differences in descriptors. The descriptor increment module proposed in this paper aims to maintain a unified semantic space, capture invariant salient features, and enhance the invariant measurement capability of descriptors.

In UAV and satellite imagery, the update frequency of satellite images for a given area is relatively low. Therefore, this paper proposes a strategy to enhance the descriptors extracted from UAV images. Specifically, the descriptors from UAV images are processed through a multi-layer MLP network for feature mapping. Starting with 256-dimensional descriptors as input, the network first maps them to 512 dimensions, and then maps back to 256 dimensions to learn the residual differences between UAV and satellite image features. These residuals are then merged and added to the input to obtain descriptors in a unified semantic space.

Due to the lack of rich, labeled multi-source image matching data, this paper adopts a self-supervised training method to learn the residuals from the existing matching results. In a matching process, sort all feature point matching pairs based on the confidence score:

T = {(p_{i}, q_{j}) ∣ score (i, j)}, where score (i, j) \in [0, 1]

(5)

Select the top k matching pairs with the highest confidence and consider these feature points to be correctly matched.

T_{geo} = {(p_{i}, q_{j}) \in S_{k} ∣ ∥ q_{j} - H p_{i} ∥ \leq ϵ}

(6)

Among them,

ϵ

is the preset projection error threshold, H is the homography matrix used for this matching calculation, and a loss function is then used to bridge the gaps in the feature space mapping. The loss function is defined as follows:

L_{d s p} = \frac{1}{k} \sum_{i = 1}^{K} {∥ d_{i} - d_{i}^{'} ∥}_{2}^{2}

(7)

where

d_{i}

is the descriptor extracted from UAV images, and

d_{i}^{'}

is the descriptor extracted from satellite images. Through this loss function, the network learns the mapping relationship and constructs a unified metric space. During inference, high-confidence matching points are used to perform low-rate network updates, allowing dynamic adaptation to changes in scene content, lighting, and other environmental factors. This online adaptation serves as a lightweight form of incremental learning, achieved through a compact two-layer MLP with minimal computational cost. It enhances the network’s feature representation and improves localization accuracy, particularly in dynamic or previously unseen environments.

3.4. Localization Inference

To obtain end-to-end positioning results, this paper utilizes the algorithm flow shown in Algorithm 1, which can output accurate pixel-level geographic locations for the input UAV images. At the start of the mission, a coarse initial position is assumed to be available, providing sufficient prior information to guide the selection of the initial satellite image for localization. To balance real-time performance and localization accuracy, the localization radius is constrained to 50 m, achieving a trade-off between inference latency and positioning precision while ensuring compatibility with typical UAV motion speeds. During subsequent steps, the system continuously updates the satellite search region by selecting images centered on the previously estimated positions.

For a series of UAV images

Q = {q_{1}, q_{2}, \dots, q_{n}}

, each image

q_{i}

is associated with the corresponding satellite image block

{p_{i}, \dots, p_{j}}

. First, feature points and descriptors are extracted separately from the images, followed by the incremental updating of the descriptors of the UAV images, and finally, the images are matched. Feature points with the highest confidence are selected for matching to form a pair

{q_{i}, p_{k}}

, which provides the key-point matching relationship between the UAV images and satellite image patches. The RANSAC algorithm is then used to estimate the homography matrix. After completing this positioning process, the descriptor incremental mapping network is dynamically updated by selecting high-quality descriptors based on matching confidence.

The per-frame computational complexity of the proposed UAV localization pipeline can be expressed as follows: Feature extraction for each image is denoted as

α (image)

, representing the time to extract keypoints and descriptors. Feature matching using SuperGlue is denoted as

β (n \cdot n^{'})

, where n and

n^{'}

are the numbers of keypoints in the UAV and satellite images, respectively. Homography estimation has complexity

O (\max | M_{i} |)

, where

| M_{i} |

is the number of matches for the selected satellite image. Coordinate transformation and satellite image selection are both constant-time operations:

O (1)

. Assuming K candidate satellite tiles per frame, the total time complexity per frame is as follows:

T_{pre_locate} = O (α (I^{d}) + K \cdot α (S_{i}) + K \cdot β (n \cdot n^{'}) + \max |M_{i}| + 1)

(8)

where

I^{d}

is the UAV image and

S_{i}

are the candidate satellite images.

As shown by the homography formula, the homography matrix H is a

3 \times 3

matrix, but it has 8 degrees of freedom, as it is typically normalized with

h_{22} = 1

. Since Equation (9) applies to all corresponding point sets in the same plane, if there are more than 4 corresponding point sets between the image pairs, H can be robustly fitted in an optimal way. Given that the UAV image is orthorectified, the center of the image

C_{c} = (x_{c}, y_{c})

represents the current UAV position, and the predicted point

C_{p} = (x_{p}, y_{p})

in the satellite image can be obtained using the following:

[\begin{matrix} x_{p} \\ y_{p} \\ 1 \end{matrix}] = H [\begin{matrix} x_{c} \\ y_{c} \\ 1 \end{matrix}] = [\begin{matrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{matrix}] [\begin{matrix} x_{c} \\ y_{c} \\ 1 \end{matrix}]

(9)

Algorithm 1: UAV Visual Localization for Image Sequence

Subsequently, the coordinates

C_{p} = (x_{p}, y_{p})

in the satellite image can be calculated based on LLcoords and URcoords, yielding the predicted position

p_{i}

, thereby enabling visual localization.

4. The RealUAV Dataset

In this section, this article introduces the key features and collection methods of the proposed RealUAV dataset. Then, the evaluation indicators of the dataset were introduced in detail.

4.1. Dataset Description

The proposed RealUAV dataset is designed to facilitate pixel-level self-localization for UAVs, addressing a gap in UAV visual localization within real-world scenarios, which has not been covered by prior datasets. Detailed information about RealUAV can be found in Table 1. This includes a thorough analysis of various aspects, such as image sources, application platforms, sampling methods, data sources, and relevant evaluation metrics. The following section highlights some significant features of RealUAV at the data level.

As shown in Figure 4, RealUAV includes various scene types, including campus buildings, parks, schools, lakes, public buildings, grasslands, forests, etc. Rich scenes are closer to the working state of UAVs in real environments. The dataset is organized sequentially, with details on the number of localization instances and the flight distance for each sequence. Different sequences are tailored to various scenarios, with varying levels of localization difficulty. A total of 15 sequences are provided, covering 6044 localization instances and a total flight distance of 24,063 m.

Flying platform. The dataset was collected using a DJI Mavic 3 Classic UAV equipped with an integrated camera system. The camera is mounted on a three-axis mechanical gimbal, ensuring stable imaging with an angular jitter of ±0.007°. It features a 4/3 CMOS sensor capable of capturing 20-megapixel images. For ground-truth positioning, we use the onboard GNSS module, which integrates signals from GPS, Galileo, and BeiDou constellations. The positioning accuracy is approximately 0.5 m under open-sky conditions.

Dataset Collection. The UAV conducts flights at a constant altitude of 200 m above ground level, with a downward-facing camera (pitch angle −90°) and a fixed heading angle of 0°. It maintains a stable speed of 6 m/s. The onboard camera records video at 60 frames per second with a resolution of

1920 \times 1080

pixels. Frames are sampled every 15 frames (i.e., at 4 Hz), and each selected frame is center-cropped to

1080 \times 1080

pixels before being downsampled to

490 \times 490

pixels to reduce computational load.

The corresponding satellite imagery is retrieved from Google Maps with a zoom level of 19, corresponding to a ground sampling distance (GSD) of approximately 0.247 m per pixel. To facilitate matching, each satellite image is divided into overlapping patches of

512 \times 512

pixels using a sliding window with a stride of 100 pixels. Each UAV image is then paired with a candidate set of satellite patches located within a 50 m radius centered around its GNSS-based ground truth location.

During testing, pixel-level geographic location information is obtained by matching the UAV images with the satellite image blocks. Using a satellite map with real latitude and longitude data, the pixel coordinates are converted into latitude and longitude values, completing the UAV’s visual localization. This process mirrors real-world scenarios.

4.2. Evaluation Protocol

This section introduces a real UAV evaluation system. A new and more robust evaluation metric is proposed, which includes the success rate, position error (PE), and evaluation criteria for acceptable error standards across different systems, namely positioning distance measurement(PDM@K).

In UAV visual positioning, a positioning error greater than 50 m is considered a failure [50,51]. The positioning success rate is defined as follows:

R = \frac{N}{N + M}

(10)

here, N denotes the number of correct matches, and M the number of incorrect matches. A match is considered incorrect if the homography matrix cannot be computed (i.e., fewer than four matching feature points) or if the error exceeds 50 m, making the match unreliable.

P E = \frac{\sum_{i = 1}^{N} l_{d} (A, A^{'}) + M \cdot P}{N + M}

(11)

l_{d}

represents the difference between the UAV’s real geographic location and the algorithmic location. A and

A^{'}

are UAV visual positioning and GPS latitude and longitude. Calculate the actual distance between two points using the Haversine formula:

l_{d} (A, A^{'}) = 2 R \arcsin (\sqrt{Δ ϕ^{2} + \cos ϕ \cos ϕ^{'} Δ λ^{2}})

(12)

where

Δ ϕ = \frac{ϕ^{'} - ϕ}{2} \cdot \frac{π}{180}

is half the latitude difference in radians, and

Δ λ = \frac{λ^{'} - λ}{2} \cdot \frac{π}{180}

is half the longitude difference is calculated in radians, with the radius value set at 6378.137 km. Next, the PDM evaluation criteria are defined as follows:

P D M @ K = \frac{1}{N} \sum \frac{1}{e^{S \times \max (0, l_{d} - K)}}

(13)

S is a constant, set to 0.1, and K is the threshold error. Positioning is considered successful if the error is less than K. These indicators provide a comprehensive evaluation of the UAV’s positioning capability across multiple dimensions, bridging the gap with its actual operational environment. The new features of the Real-UAV dataset are summarized as follows:

Continuous frames: RealUAV is a continuous UAV visual dataset that closely resembles the real working environment of UAVs.
These indicators comprehensively evaluate the positioning capability of UAVs in various operational scenarios. Covering most scenarios of UAV operation.
Containing sequences of different difficulty levels, it can become a universal evaluation criterion.

5. Experiments

In this section, after introducing the implementation details in A, the superiority and generalization ability of the method on RealUAV are demonstrated in B. The analysis includes matching accuracy, average positioning error, PDM, and single positioning time, with a detailed discussion of the results. In C, ablation experiments are conducted to investigate the impact of the feature reconstruction module (FR) and descriptor incremental mapping module (DIM) on the final outcomes.

5.1. Implement Details

The proposed method uses the pre-trained DINO-ViT/S model with frozen weights to extract coarse features. Fine features are extracted using a 6-layer convolutional network with 3 layers of max pooling, where the kernel size is 3, and the number of channels is [64, 64, 64, 64, 128, 128]. The pooling layer has a stride of 2, and the pre-trained weights of SP are used for initialization. The training was conducted on the simulated LA-10000 dataset [21], where satellite images were sourced from Google Maps, and UAV images were collected using the Google Earth simulator. Representative sample images are shown in Figure 5.

The optimizer used for the neural network is SGD (momentum = 0.9), with the initial learning rates for the feature reconstruction network and descriptor mapping network set to 0.1 times and 1 times the base learning rate, respectively. During the training process, the DIM module is initially disregarded. After training the feature point extraction network, the network is frozen, and the DIM is trained by constructing self-supervised feature point pairs on the LA-10000 dataset. The model is built using the PyTorch v2.2.2 framework. The experiments in B and C were performed on an NVIDIA RTX 4090 GPU, while the experiment in D was conducted on an Nvidia Orin Nano.

5.2. Comparative Experiments with Mainstream Methods

The proposed method was compared with other methods, including SuperPoint [31] (baseline), ORB [52], Alike [53], and Xfeat [32], to evaluate its performance in terms of accuracy, efficiency, and other aspects. Nearest Neighbor (NN) and SuperGlue [54] methods were used as matching tools, respectively. Due to the limited size of the training dataset, retraining the models could lead to poor performance due to overfitting. To ensure a fair comparison, the pre-trained weights of the models were used, as they were trained on larger datasets with better generalization ability.

The quantitative results are shown in Table 2, where positioning error (PE) and positioning distance measurement (PDM) are used as evaluation metrics. The best results are highlighted in bold. When using SuperGlue as a matching tool, the proposed method outperforms all other methods across all metrics, with a localization error of only 8.71 m and most localization distances below 5 m. Compared to SuperPoint (the baseline method), the proposed method demonstrates significant improvement in sparse scenes. When using nearest neighbors for matching, the method proposed in this study also achieves the best results. This article proposes a method that effectively captures features in sparse scenes, greatly enhancing the ability to extract and describe feature points, thereby reducing localization errors.

To better highlight the differences in localization performance, the results of using SuperGlue as a matching tool are visualized. Figure 6 shows the localization trajectory of our method in urban, village, park, and school scenarios, covering areas such as teaching buildings, residential buildings, wastelands, squares, forests, lakes, and more. These scenarios represent the majority of environments encountered during UAV operations. In the figure, green represents the GPS location, red represents the location determined by our proposed method, blue represents the location determined by the SuperPoint method, and the highlighted area within the yellow dashed box corresponds to regions with sparse texture. In the dashed box area of (a), the terrain has changed due to seasonal influences. In the dashed box area of (b), there has been a significant change in the scene. The dashed box area in (c) and (d) is primarily composed of scenes such as trees and forests, where features are sparse and difficult to capture. It can be observed that our method effectively extracts low-texture features from these scenes through the feature reconstruction and descriptor increment modules, successfully addressing the challenge of inaccurate matching in sparse environments. The positioning error for urban scenes is 6.02 m, for village scenes is 3.07 m, for park scenes is 3.41 m, and for school scenes is 6.01 m, achieving strong localization results.

Figure 7 shows the results of the proposed method and the SuperPoint method when using SuperGlue to match feature points. The matching results for four scenes are displayed, including both richly textured and sparsely textured areas. The method proposed in this study constructs a feature map with rich semantics, capturing both coarse-grained and fine-grained features of the image. It is evident that, although the results are similar in areas with dense textures, more detailed features are effectively extracted in sparse scenes, resulting in more feature points and stronger metric descriptors, thereby achieving better matching results.

To evaluate the computational efficiency of the UAV and the reliability of the positioning algorithm, this paper measures the correct positioning frequency, FPS, and the time required for keypoint extraction. The experimental results are presented in Table 3 and Table 4, where the bolded values indicate the best results. The FPS calculation considers the total time required for multiple matches in a single positioning process of the UAV, including post-processing tasks such as I/O operations, keypoint extraction, and keypoint matching. The proposed method in this study achieved the highest positioning accuracy with a slight reduction in operating speed. Notably, the ORB method runs very fast; however, its positioning accuracy is poor, making it unsuitable for UAV localization.

Summary and Reflection. SuperPoint, Alike, and XFeat improve the generation of feature points and descriptors through architectural modifications to the network, yielding significant performance gains over traditional handcrafted operators such as ORB. However, these methods are mainly designed for structured environments like indoor and outdoor buildings, and they tend to underperform in regions with sparse features due to their limited texture representation capability. In contrast, the method proposed in this study leverages a general visual model to extract coarse-grained patch features and integrates them with fine-grained details, resulting in semantically enriched feature maps. This design enhances robustness in texture-sparse scenes and contributes to superior localization performance.

Comprehensive experimental results demonstrate that the proposed method consistently outperforms existing approaches across multiple evaluation metrics. Specifically, it achieves the lowest average positioning error (8.71 m), the highest localization reliability (R = 0.95), and maintains real-time performance with an average frame rate of 1.95 FPS. Although this reflects a slight decrease from the 2.15 FPS achieved by the SuperPoint baseline, it yields a substantial 33.46% improvement in localization accuracy. Compared to other methods such as ORB+NN, Alike+NN, and Xfeat+NN, our approach also demonstrates superior robustness and efficiency. As shown in Figure 8, the proposed method maintains moderate GPU and memory consumption in addition to achieving high accuracy, demonstrating its practicality for drone localization in resource-limited or real-time scenarios.

5.3. Ablation Experiment

In this section, we explored the roles of the feature reconstruction (FR) module and the descriptor incremental mapping (DIM) module. Use superglue for matching, keep other experimental settings consistent with Section 5.2.

Role of the Feature Reconstruction. The feature point extraction network based on deep learning can effectively extract texture features from images, but it often overlooks sparse features. This issue is particularly prominent when there is a large number of repetitive features and similar semantic information in the image, which can result in significant errors in feature matching and localization. To address this, this paper proposes a feature reconstruction module that captures low-texture areas by integrating rough patch features from the general visual model. As shown in Figure 9, the feature reconstruction module effectively identifies the features of areas such as lawns, farmland, and village roads, and extracted more keypoints.

Role of the Descriptor Incremental Mapping. For multi-source images, factors such as imaging methods, lighting conditions, and seasonal changes often cause significant variations in ground objects, leading to measurement discrepancies. This study employs descriptor mapping technology to align UAV and satellite images within the same metric space, reducing these differences. PCA dimensionality reduction was used for visualization. As shown in Figure 10, there was a difference in the descriptors between the UAV and satellite images before using the descriptor increment. After using the descriptor increment, this difference decreased. It is evident that the method proposed in this article effectively brings the descriptors generated from multiple sources of images closer to each other.

Experiments were conducted on all sequences, and the quantitative results are shown in Table 5. This paper proposed a feature reconstruction module and a descriptor incremental mapping module for achieving improved results. Among them, the feature reconstruction module focuses on extracting sparse features, while the descriptor increment module focuses on reducing multi-source image differences.

Role of the Search Radius. To analyze the effect of search radius on localization performance, we conducted experiments across a range of radii from 10 m to 100 m. As shown in Table 6 and Figure 11, increasing the search radius improves accuracy, thus reducing the average positioning error (PE), boosting PDM@K scores, and increasing reliability (R). However, this also leads to a notable drop in frame rate (FPS), reflecting higher computational cost. Among all settings, a radius of 50 achieves the best trade-off, with a low error (8.71 m), high reliability (R = 0.95), and real-time speed (1.95 FPS). Larger radii offer only marginal gains in accuracy while significantly reducing efficiency, making radius = 50 the optimal choice in our system. In future work, range estimation techniques will be investigated to reduce the localization radius without compromising accuracy. This is expected to improve real-time performance and better accommodate fast UAV motion.

5.4. Real-Time Edge Device Performance Evaluation

This section examines the performance of the proposed method and visual localization framework on edge devices, with a primary focus on power consumption and visual processing. The evaluation was conducted on the Nvidia Orin Nano, which provides robust computing power and efficient power management, making it well-suited for embedded system applications. Its real-time processing capability was assessed while running the visual localization framework to ensure stable operation in UAV localization and practical application scenarios.

As shown in Table 7, the proposed method achieves nearly 1 FPS with a power consumption of 15W. Future work will explore optimizations through model quantization, model distillation, and the use of TensorRT during deployment and inference.

6. Conclusions

This paper proposes a scene adaptive UAV visual localization framework (SAVL). In the proposed framework, UAV images are mapped onto satellite images with geographic coordinates through pixel-level matching to locate UAVs. First, to address the issue of inaccurate localization caused by sparse terrain features in a scene, a feature extraction network based on a general visual model is introduced to extract sparse features from UAV and satellite images. Second, to overcome the problem of weak generalization ability in unknown scenarios, a descriptor incremental mapping module is designed to reduce multi-source image differences at the semantic level through UAV-satellite image descriptor mapping, and a confidence-based incremental strategy is constructed to dynamically adapt to different scenes. Finally, due to the lack of annotated public datasets, a scene-rich UAV dataset (RealUAV) is developed to study UAV visual localization in real-world environments.

Experimental results on real data demonstrate that, compared with other representative methods, this approach not only significantly reduces localization errors but also exhibits strong scene generalization ability, improving localization performance in sparse scenes. However, in some highly complex environments, such as areas with drastic land cover changes, the proposed method is not yet sufficiently stable. Fully utilizing limited visual information remains a challenge.

Future research will focus on refining feature extraction strategies and optimizing descriptor mapping algorithms by integrating environmental awareness and real-time feedback, thereby enhancing stability in extreme environments. The impact of perspective distortion, particularly in urban environments with high-rise buildings and lens distortion on positioning accuracy, will also be investigated. More comprehensive geometric models and correction mechanisms will be considered to mitigate these effects and further enhance positioning accuracy. Moreover, the feasibility of real-time localization under diverse UAV motion conditions will be further investigated, emphasizing the trade-off between accuracy and computational efficiency to enable practical deployment in dynamic environments. In addition, LiDAR, IMU, and other sensors will be integrated through multi-modal fusion techniques such as Visual Odometry (VO) and Visual-Inertial Odometry (VIO), enhancing system robustness and enabling autonomous navigation in complex scenarios. Further efforts will also focus on reducing localization latency by refining the range estimation process, aiming to improve responsiveness without compromising accuracy during dynamic flight.

Author Contributions

Conceptualization, G.L. and Z.L.; Methodology, G.L. and Z.L.; Validation, G.L., Z.L. and Q.G.; Formal analysis, G.L., Z.L. and Q.G.; Investigation, G.L., Z.L. and Q.G.; Writing—original draft, G.L. and Z.L.; Writing—review & editing, G.L., Z.L. and Y.Y.; Supervision, G.L. and Y.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by National Key R&D Program of China (No. 2022ZD0160601), National Natural Science Foundation of China (No. 62273282), Research on Core Algorithm Technologies for Mixed Reality, Open Research Fund of Laboratory of Intelligent Control (2024-ZKSYS-KF04-03), Open Research Fund of The State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS2024113), and Taicang Basic Research Program—General Project (TC2024JC23).

Data Availability Statement

The RealUAV dataset is available at https://github.com/Arancew/RealUAV (accessed on 11 May 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Li, X. Vicinagearth. Vicinagearth 2024, 1, 1. [Google Scholar] [CrossRef]
Deng, A.; Han, G.; Chen, D.; Ma, T.; Liu, Z. Slight aware enhancement transformer and multiple matching network for real-time UAV tracking. Remote Sens. 2023, 15, 2857. [Google Scholar] [CrossRef]
Chhapariya, K.; Ientilucci, E.; Buddhiraju, K.M.; Kumar, A. Target Detection and Characterization of Multi-Platform Remote Sensing Data. Remote Sens. 2024, 16, 4729. [Google Scholar] [CrossRef]
Zhao, Y.; Chen, L.; Zhang, X.; Xu, S.; Bu, S.; Jiang, H.; Han, P.; Li, K.; Wan, G. RTSfM: Real-time structure from motion for mosaicing and DSM mapping of sequential aerial images with low overlap. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607415. [Google Scholar] [CrossRef]
Tong, P.; Yang, X.; Yang, Y.; Liu, W.; Wu, P. Multi-UAV collaborative absolute vision positioning and navigation: A survey and discussion. Drones 2023, 7, 261. [Google Scholar] [CrossRef]
Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [Google Scholar] [CrossRef]
Geneva, P.; Eckenhoff, K.; Lee, W.; Yang, Y.; Huang, G. Openvins: A research platform for visual-inertial estimation. In Proceedings of the International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 4666–4672. [Google Scholar]
Gao, Y.; Li, G. A GNSS instrumentation covert directional spoofing algorithm for UAV equipped with tightly-coupled GNSS/IMU. IEEE Trans. Instrum. Meas. 2023, 72, 8501413. [Google Scholar] [CrossRef]
Li, J.; Yang, B.; Cong, Y.; Cao, L.; Fu, X.; Dong, Z. 3D forest mapping using a low-cost UAV laser scanning system: Investigation and comparison. Remote Sens. 2019, 11, 717. [Google Scholar] [CrossRef]
Wan, X.; Shao, Y.; Zhang, S.; Li, S. Terrain aided planetary UAV localization based on geo-referencing. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4602018. [Google Scholar] [CrossRef]
Luo, H.; Li, G.; Zou, D.; Li, K.; Li, X.; Yang, Z. UAV navigation with monocular visual inertial odometry under GNSS-denied environment. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1001615. [Google Scholar] [CrossRef]
Sonugür, G. A Review of quadrotor UAV: Control and SLAM methodologies ranging from conventional to innovative approaches. Robot. Auton. Syst. 2023, 161, 104342. [Google Scholar] [CrossRef]
Huang, L.; Wang, Z.; Xiong, Q.; Qu, R.; Yao, C.; Li, C. Mamba-VNPS: A Visual Navigation and Positioning System with State-Selection Space. Drones 2024, 8, 663. [Google Scholar] [CrossRef]
Gu, J.; Stefani, E.; Wu, Q.; Thomason, J.; Wang, X.E. Vision-and-language navigation: A survey of tasks, methods, and future directions. arXiv 2022, arXiv:2203.12667. [Google Scholar]
Yuan, Y.; Sun, B.; Liu, G. Drone-based scene matching visual geo-localization. Acta Autom. Sin. 2024, 51, 287–311. [Google Scholar]
Sun, B.; Liu, G.; Yuan, Y. Dimensionally Unified Metric Model for Multi-Source and Multi-View Scene Matching. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5624411. [Google Scholar]
Etezadifar, P.; Farsi, H. A new sample consensus based on sparse coding for improved matching of SIFT features on remote sensing images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 5254–5263. [Google Scholar] [CrossRef]
Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.; Tardós, J.D. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
Zheng, X.; Yuan, Y.; Lu, X. A deep scene representation for aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 4799–4809. [Google Scholar] [CrossRef]
Sun, B.; Liu, G.; Yuan, Y. F3-Net: Multiview scene matching for drone-based geo-localization. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5610611. [Google Scholar] [CrossRef]
Liu, G.; Liu, C.; Yuan, Y. Locate where you are by block joint learning network. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6507005. [Google Scholar] [CrossRef]
Liu, G.; Li, C.; Zhang, S.; Yuan, Y. VL-MFL: UAV Visual Localization Based on Multi-Source Image Feature Learning. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5618612. [Google Scholar]
Li, S.; Hu, M.; Xiao, X.; Tu, Z. Patch similarity self-knowledge distillation for cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 5091–5103. [Google Scholar] [CrossRef]
Han, X.; Leung, T.; Jia, Y.; Sukthankar, R.; Berg, A.C. Matchnet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3279–3286. [Google Scholar]
Li, C.; Liu, G.; Yuan, Y. A multi-source image matching network for UAV visual location. In Proceedings of the International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 1651–1655. [Google Scholar]
He, M.; Liu, J.; Gu, P.; Meng, Z. Leveraging Map Retrieval and Alignment for Robust UAV Visual Geo-Localization. IEEE Trans. Instrum. Meas. 2024, 73, 2523113. [Google Scholar] [CrossRef]
Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM international conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1395–1403. [Google Scholar]
Dai, M.; Zheng, E.; Feng, Z.; Qi, L.; Zhuang, J.; Yang, W. Vision-based UAV self-positioning in low-altitude urban environments. IEEE Trans. Image Process. 2023, 33, 493–508. [Google Scholar] [CrossRef]
Wang, T.; Zheng, Z.; Yan, C.; Zhang, J.; Sun, Y.; Zheng, B.; Yang, Y. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 867–879. [Google Scholar] [CrossRef]
Yi, K.M.; Trulls, E.; Lepetit, V.; Fua, P. Lift: Learned invariant feature transform. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 467–483. [Google Scholar]
DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
Potje, G.; Cadar, F.; Araujo, A.; Martins, R.; Nascimento, E.R. XFeat: Accelerated Features for Lightweight Image Matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 2682–2691. [Google Scholar]
Devlin, J. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
Zhou, J.; Wei, C.; Wang, H.; Shen, W.; Xie, C.; Yuille, A.; Kong, T. ibot: Image bert pre-training with online tokenizer. arXiv 2021, arXiv:2111.07832. [Google Scholar]
Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. Dinov2: Learning robust visual features without supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
Jiang, H.; Karpur, A.; Cao, B.; Huang, Q.; Araujo, A. OmniGlue: Generalizable Feature Matching with Foundation Model Guidance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19865–19875. [Google Scholar]
Edstedt, J.; Sun, Q.; Bökman, G.; Wadenbäck, M.; Felsberg, M. RoMa: Robust dense feature matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19790–19800. [Google Scholar]
He, X.; Yu, H.; Peng, S.; Tan, D.; Shen, Z.; Bao, H.; Zhou, X. MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training. arXiv 2025, arXiv:2501.07556. [Google Scholar]
Novotny, D.; Albanie, S.; Larlus, D.; Vedaldi, A. Self-supervised learning of geometrically stable features through probabilistic introspection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3637–3645. [Google Scholar]
Taira, H.; Okutomi, M.; Sattler, T.; Cimpoi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7199–7209. [Google Scholar]
Noh, H.; Araujo, A.; Sim, J.; Weyand, T.; Han, B. Large-Scale Image Retrieval with Attentive Deep Local Features. In Proceedings of the International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 3476–3485. [Google Scholar]
Liu, G.; Yuan, Y.; Zhang, Y.; Dong, Y.; Li, X. Style transformation-based spatial–spectral feature learning for unsupervised change detection. IEEE Trans. Geosci. Remote Sens. 2020, 60, 5401515. [Google Scholar] [CrossRef]
Mirza, M.J.; Micorek, J.; Possegger, H.; Bischof, H. The norm must go on: Dynamic unsupervised domain adaptation by normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14765–14775. [Google Scholar]
Zhu, R.; Yin, L.; Yang, M.; Wu, F.; Yang, Y.; Hu, W. SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 4825–4839. [Google Scholar] [CrossRef]
Xu, W.; Yao, Y.; Cao, J.; Wei, Z.; Liu, C.; Wang, J.; Peng, M. UAV-VisLoc: A Large-scale Dataset for UAV Visual Localization. arXiv 2024, arXiv:2405.11936. [Google Scholar]
Ng, P.C.; Henikoff, S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003, 31, 3812–3814. [Google Scholar] [CrossRef]
Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9650–9660. [Google Scholar]
Hedlin, E.; Sharma, G.; Mahajan, S.; Isack, H.; Kar, A.; Tagliasacchi, A.; Yi, K.M. Unsupervised semantic correspondence using stable diffusion. Adv. Neural Inf. Process. Syst. 2024, 36, 8266–8279. [Google Scholar]
Jiang, H.; Ramakrishnan, S.K.; Grauman, K. Single-stage visual query localization in egocentric videos. Adv. Neural Inf. Process. Syst. 2024, 36, 24143–24157. [Google Scholar]
Couturier, A.; Akhloufi, M.A. A Review on Deep Learning for UAV Absolute Visual Localization. Drones 2024, 8, 622. [Google Scholar] [CrossRef]
Xiao, J.; Tortei, D.; Roura, E.; Loianno, G. Long-range uav thermal geo-localization with satellite imagery. In Proceedings of the International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 5820–5827. [Google Scholar]
Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the International Conference on Computer Vision (ICCV), Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
Zhao, X.; Wu, X.; Miao, J.; Chen, W.; Chen, P.C.; Li, Z. Alike: Accurate and lightweight keypoint detection and descriptor extraction. IEEE Trans. Multimed. 2022, 25, 3101–3112. [Google Scholar] [CrossRef]
Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. SuperGlue: Learning Feature Matching With Graph Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4937–4946. [Google Scholar]

Figure 1. The Framework of UAV Visual Localization. At a height of 200 m and a distance of 24 km, the positioning error is 8.71 m.

Figure 2. Typical differences between UAV images and satellite images: (a) seasonal differences; (b) perspective differences; (c) sparse texture.

Figure 3. Structure of Proposed Methods. Given a UAV image and a satellite image, each is first processed independently through a dual-branch feature extraction network to obtain both coarse and fine features. These branches ensure that each image maintains its own set of hidden features. The extracted features are then jointly processed via a feature reconstruction module to generate rich semantic representations that incorporate both global context and local details. Subsequently, keypoints and descriptors are decoded separately for each image, enabling accurate cross-view image matching. Finally, a coordinate transformation is applied to the matched keypoints to estimate the precise geographic location of the UAV.

Figure 4. The RealUAV dataset is rich in scenes, including scenes with different texture features.

Figure 5. Representative UAV image samples, Simulated LA-10000 dataset.

Figure 6. The green trajectory represents the GPS location, the red trajectory represents the location determined by OUR method, the blue trajectory represents the location determined by the SP method, and the area highlighted by the yellow dashed box corresponds to the region with sparse texture.

Figure 7. Comparison between our method and SP method in sparse texture. On the left is our method, and on the right is the SP method. Our method can achieve better matching results in sparse scenes.

Figure 8. Performance comparison of different visual localization methods. (a) Relationship between positioning error and inference latency. (b) Resource consumption during inference. The proposed method achieves the lowest positioning error and maintains competitive inference latency while requiring only moderate computational resources.

Figure 9. UAV and satellite images extract feature points and feature maps for visualization: (a,c) Baseline; (b,d) Baseline + FR. In the feature map, the more activated a feature is, the redder the color becomes. If it is suppressed, the color turns bluer. The method presented in this article places greater emphasis on sparse scenes.

Figure 10. Visualize descriptor metric space: (a) no descriptor increment was used; (b) use descriptor increment.

Figure 11. Impact of search radius on localization performance. Increasing the radius reduces the positioning error but leads to a significant drop in processing speed (FPS). Radius = 50 m achieves a good balance between accuracy and real-time performance.

Table 1. Comparison of datasets.

Datasets	Target	Platform	Task	Flight Route	Collection Strategy	Evaluation
CVUSA	Street	Ground-Satellite	Retrieval	Separate Images	Inconsistent	R@K
university-1652	Building	Ground-UAV-Satellite	Retrieval	Spiral Curves	Consistent	R@K and mAP
DenseUAV	University	UAV-Satellite	Retrieval	Separate Images	Consistent	R@K and SDM@K
SUES-200	University	UAV-Satellite	Retrieval	Separate Images	Consistent	R@K, AP, RB, PF
UAV-VisLoc	Various	UAV-Satellite	Locate	11 Trajectories	Inconsistent	–
RealUAV (Ours)	Various	UAV-Satellite	Locate	15 Trajectories	Consistent	PE and PDM

Table 2. Results under SG and NN Matchers (PE and PDM).

Method	PE↓	PDM@1↑	PDM@3↑	PDM@5↑
OURS + SG	8.71	0.64	0.76	0.84
SP + SG	13.09	0.58	0.69	0.77
OURS + NN	17.01	0.53	0.64	0.71
SP + NN	20.94	0.46	0.55	0.61
Xfeat + NN	18.87	0.52	0.62	0.69
Alike + NN	18.82	0.51	0.61	0.68
ORB + NN	46.72	0.12	0.15	0.23

Table 3. Comparison of execution time of keypoint extraction methods (ms).

Method	OURS	SP	Xfeat	Alike	ORB
Time (ms)	13.75	4.01	3.86	3.42	0.78

Table 4. Comparison of SuperGlue+ and NN+ Matching Methods (R and FPS).

Method	R ↑	FPS ↑
OURS + SG	0.95	1.95 ± 0.15
SP + SG	0.92	2.15 ± 0.15
OURS + NN	0.86	2.2 ± 0.2
SP + NN	0.81	2.39 ± 0.3
Xfeat + NN	0.85	2.53 ± 0.2
Alike + NN	0.84	2.93 ± 0.15
ORB + NN	0.36	10 ± 0.5

Table 5. Results of ablation experiments using SG as a matching tool and FR and DIM.

Method	PE↓	PDM@1↑	PDM@3↑	PDM@5↑
baseline	13.09	0.58	0.69	0.77
baseline + FR	10.79	0.62	0.74	0.82
baseline + FR + DIM	8.71	0.64	0.76	0.84

Table 6. Effect of search radius on localization performance.

Radius	PE↓	PDM@1↑	PDM@3↑	PDM@5↑	R↑	FPS↑
10	28.30	0.38	0.45	0.49	0.42	5.78 ± 0.15
25	18.20	0.51	0.62	0.68	0.82	4.18 ± 0.15
30	13.60	0.57	0.68	0.76	0.91	3.93 ± 0.15
50	8.71	0.64	0.76	0.84	0.95	1.95 ± 0.15
75	8.62	0.64	0.77	0.85	0.95	1.05 ± 0.15
100	8.56	0.64	0.77	0.86	0.95	0.59 ± 0.15

Table 7. Comparison of Power Consumption and FPS.

Device	Power	FPS ↑
Nvidia Orin Nano	15 W	0.76 ± 0.1
Nvidia Orin Nano	7 W	0.47 ± 0.1

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, G.; Li, Z.; Gao, Q.; Yuan, Y. SAVL: Scene-Adaptive UAV Visual Localization Using Sparse Feature Extraction and Incremental Descriptor Mapping. Remote Sens. 2025, 17, 2408. https://doi.org/10.3390/rs17142408

AMA Style

Liu G, Li Z, Gao Q, Yuan Y. SAVL: Scene-Adaptive UAV Visual Localization Using Sparse Feature Extraction and Incremental Descriptor Mapping. Remote Sensing. 2025; 17(14):2408. https://doi.org/10.3390/rs17142408

Chicago/Turabian Style

Liu, Ganchao, Zhengxi Li, Qiang Gao, and Yuan Yuan. 2025. "SAVL: Scene-Adaptive UAV Visual Localization Using Sparse Feature Extraction and Incremental Descriptor Mapping" Remote Sensing 17, no. 14: 2408. https://doi.org/10.3390/rs17142408

APA Style

Liu, G., Li, Z., Gao, Q., & Yuan, Y. (2025). SAVL: Scene-Adaptive UAV Visual Localization Using Sparse Feature Extraction and Incremental Descriptor Mapping. Remote Sensing, 17(14), 2408. https://doi.org/10.3390/rs17142408

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

SAVL: Scene-Adaptive UAV Visual Localization Using Sparse Feature Extraction and Incremental Descriptor Mapping

Abstract

1. Introduction

2. Realted Work

2.1. Self-Supervised Vision Models

2.2. Image Feature Description

2.3. Geo-Localization Datasets

3. Method

3.1. Problem Definition

3.2. Feature Reconstruction

3.3. Descriptor Increment Mapping

3.4. Localization Inference

4. The RealUAV Dataset

4.1. Dataset Description

4.2. Evaluation Protocol

5. Experiments

5.1. Implement Details

5.2. Comparative Experiments with Mainstream Methods

5.3. Ablation Experiment

5.4. Real-Time Edge Device Performance Evaluation

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI