CPFL: Resilient Continuous UAV Localization via Cross-View Perception and Particle Filtering

Su, Chao; Yuan, Jiayu; Zheng, Enhui; Xu, Wangpin; Liu, Zhanghua; Hu, Jianhong

doi:10.3390/drones10060437

Open AccessArticle

CPFL: Resilient Continuous UAV Localization via Cross-View Perception and Particle Filtering

by

Chao Su

,

Jiayu Yuan

,

Enhui Zheng

^*

,

Wangpin Xu

,

Zhanghua Liu

and

Jianhong Hu

School of Mechanical and Electrical Engineering, China Jiliang University, Hangzhou 310018, China

^*

Author to whom correspondence should be addressed.

Drones 2026, 10(6), 437; https://doi.org/10.3390/drones10060437

Submission received: 9 April 2026 / Revised: 26 May 2026 / Accepted: 1 June 2026 / Published: 3 June 2026

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

A vision-only Cross-view Particle Filter Localization (CPFL) framework is proposed. By fusing homography-based visual motion propagation with a Dual-Granularity Adaptive Gating (DGAG) network, it restrains cumulative drift into a bounded error during continuous localization.
Experiments on the real-world MAFS dataset demonstrate that CPFL achieves a 5.28 m mean error and an 89.7% success rate (under a 10 m threshold), outperforming mainstream vision-only and IMU-fusion baselines in global accuracy and long-term robustness.

What are the implications of the main findings?

By addressing the perceptual aliasing problem in homogeneously textured regions and constraining cumulative drift via the filtering mechanism, this approach provides a resilient, map-dependent solution for continuous UAV localization in outdoor GNSS-denied environments.
The introduction of the TCI@d metric establishes a more rigorous standard for evaluating long-term localization stability. Furthermore, the CPFL framework provides a new self-localization paradigm with engineering deployment potential for resource-constrained UAVs executing long-endurance missions in complex dynamic environments.

Abstract

Achieving long-term, continuous, and accurate localization for Unmanned Aerial Vehicles (UAVs) in outdoor GNSS-denied environments where pre-existing reference maps are available is challenging. To this end, this paper proposes a Cross-view Particle Filter Localization (CPFL) framework. Unlike existing particle filter approaches that rely on inertial sensors for state propagation or sparse semantic labels for observation updates, CPFL is a vision-driven solution. This framework introduces specific adaptations into the two core stages of particle filtering: In the motion propagation stage, it achieves visual state transition by calculating a feature-based inter-frame homography mapping to estimate the 2D global relative motion components, eliminating the dependency on inertial priors; in the observation correction stage, a Dual-Granularity Adaptive Gating (DGAG) cross-view network is designed to mitigate perceptual aliasing and generate discriminative absolute position weights for the particles. By fusing these two stages through a filter mechanism, the framework transforms unbounded cumulative drift into bounded absolute localization errors. Furthermore, addressing the measurement deficiencies of traditional single-frame metrics, this paper also proposes a Trajectory Continuity Index (TCI@d) tailored for continuous localization tasks. Experiments on the real-world MAFS dataset confirm that this framework achieves a mean localization error of 5.28 m and a localization success rate of 89.7% under a 10-m threshold. Compared with mainstream vision-only algorithms and IMU-fusion baselines, this framework demonstrates lower mean errors and improved trajectory continuity, validating its effectiveness for long-term robustness.

Keywords:

UAV; localization; particle filter; cross-view perception; trajectory continuity

1. Introduction

In recent years, Unmanned Aerial Vehicles (UAVs) have been utilized in fields such as agricultural monitoring, intelligent transportation, logistics distribution, and emergency rescue [1]. A fundamental requirement for executing these complex tasks is that UAVs must possess accurate and reliable autonomous positioning capabilities. Reliable global pose estimation is critical for intelligent path planning and autonomous obstacle avoidance, facilitating safe operations in complex and dynamic environments [2]. However, current autonomous UAV flights rely on Global Navigation Satellite Systems (GNSS). In complex application scenarios such as urban high-rise building clusters, dense forest canopies, or strong electromagnetic interference, GNSS signals face attenuation, multipath effects, and even complete denial (GNSS-denied) [3,4]. Once external satellite signals are unavailable, UAVs are susceptible to positioning failure. Consequently, cross-view geo-localization [5,6] has been proposed as an alternative to compensate for GNSS limitations. By matching real-time local top-down images with pre-stored satellite maps containing absolute coordinates, this technology enables planar pose estimation without external signals, providing technical support for stable UAV flight in complex environments.

When applying cross-view techniques to continuous flight missions, the challenge lies in maintaining stable trajectory tracking with controllable cumulative errors. Whether the temporal consistency of localization results can be guaranteed determines if a UAV can complete long-term autonomous missions in map-dependent GNSS-denied environments. Existing mainstream cross-view localization methods essentially belong to the “discrete” localization paradigm. Although these methods have achieved progress in single-frame localization accuracy, their independent processing of each image frame ignores sequential temporal correlations, making localization results susceptible to inter-frame jitter [7]. To introduce temporal constraints, researchers have integrated probabilistic filter frameworks. For example, SWA-PF [8] drives particle propagation using motion priors provided by an Inertial Measurement Unit (IMU) and updates weights based on semantic consistency. However, this sensor fusion approach introduces hardware dependency, and its sparse semantic observation model is prone to perceptual aliasing in homogeneously textured regions (e.g., grasslands or water surfaces), leading to filter divergence.

The core research question this paper seeks to answer is: How can we effectively constrain cumulative drift and overcome perceptual aliasing to achieve resilient, long-term UAV tracking in GNSS-denied environments, relying solely on visual observations? To this end, we propose a vision-only Cross-view Particle Filter Localization (CPFL) framework tailored for extended missions.

By replacing inertial priors with a vision-driven motion propagation mechanism and introducing a dual-granularity feature fusion network for robust observation correction, CPFL integrates sequential state transitions with absolute spatial anchors. Through this probabilistic filtering mechanism, CPFL transforms the unbounded cumulative drift inherent in visual dead reckoning into bounded absolute localization errors, ensuring systemic resilience, as illustrated in Figure 1.

Furthermore, existing UAV visual localization methods predominantly follow discrete evaluation metrics commonly used in image retrieval tasks (e.g., Recall@K [9]) or distance measurements based on single-frame errors (e.g., SDEM [8]). These methods fail to capture the error accumulation characteristics and overall fault tolerance of UAVs during continuous operation. To address this measurement deficiency, we propose the Trajectory Continuity Index (TCI@d), a trajectory-level metric designed to measure tracking stability.It is crucial to clarify the application scope of this work: our CPFL framework is a map-dependent absolute localization solution. Its effectiveness relies on the availability of pre-existing satellite maps, restricting its application to outdoor environments.

The contributions of this work are summarized as follows:

1.: A vision-only Cross-view Particle Filter Localization (CPFL) framework. By fusing visual motion propagation with cross-view observation corrections, it effectively bounds cumulative drift and achieves continuous global localization;
2.: A Dual-Granularity Adaptive Gating (DGAG) cross-view feature extraction network. It utilizes dynamic routing to adjust the fusion ratio of global semantics and local details, which mitigates perceptual aliasing and improves cross-view matching robustness in complex environments;
3.: A trajectory-level evaluation metric, the Trajectory Continuity Index (TCI@d). By incorporating temporal smoothness constraints and cumulative error suppression penalties, it provides an objective standard specifically tailored for evaluating the long-term stability of continuous UAV localization.

The remainder of this paper is organized as follows: Section 2 reviews related work in UAV cross-view geo-localization and feature representation learning. Section 3 details the methodology of the proposed CPFL framework, including the motion inference path, the DGAG network, and the formulation of the TCI@d metric. Section 4 presents experimental results, comparative studies, and ablation analyses on the real-world MAFS dataset. Finally, Section 5 draws the conclusions and discusses future research directions.

2. Related Work

This section provides the context and background for the proposed Cross-view Particle Filter Localization (CPFL) framework. Developing a vision-only continuous localization system for UAVs in GNSS-denied environments involves addressing the challenges associated with discrete single-frame matching and hardware-dependent sequential filtering. The purpose of this section is to review the evolution of relevant technologies and identify current methodological gaps—specifically, the difficulty of bounding cumulative trajectory drift without relying on Inertial Measurement Units (IMUs) or experiencing perceptual aliasing. This review serves as the foundation for the proposed CPFL architecture and outlines how the current study builds upon existing approaches to support continuous state estimation.

The remainder of this section is organized into three subsections to present the theoretical basis of the methodology. Section 2.1 reviews existing UAV cross-view geo-localization methods, including image retrieval, dense coordinate regression, and probabilistic filtering, to outline the transition from discrete location recognition to continuous state estimation. Section 2.2 examines feature representation learning techniques in cross-view tasks, focusing on the extraction of features required for filter observation updates. Finally, Section 2.3 discusses inter-frame visual displacement and motion estimation techniques, which establish the geometric basis for the IMU-free state propagation model used in this study. Together, these subsections transition from general localization strategies to specific visual algorithms, connecting prior literature with the methods detailed in the subsequent sections.

2.1. UAV Cross-View Geo-Localization Methods

2.1.1. Image Retrieval

Cross-view geo-localization [5,6] is the core perception technology for achieving autonomous UAV navigation in GNSS-denied environments. Its task is to determine the geographic coordinates of the UAV by matching the downward-facing view captured in real-time with a satellite reference map. Image retrieval-based methods represent the foundational paradigm in this field, with the fundamental idea of framing the localization problem as a task of searching for the most similar patch within a large-scale satellite image database. Workman et al. [10] introduced hand-crafted geometric correction methods using polar coordinate transformations, establishing the basic framework for image pair matching and initially verifying the feasibility of cross-view retrieval. As the research focus shifted towards UAV platforms, Zheng et al. [9] proposed the University-1652 dataset in 2020. This dataset provided a multi-view benchmark for UAV perspectives for the first time, prompting retrieval methods to evolve from hand-crafted features to deep learning. CVM-Net, proposed by Relja et al. [11], adopted a Siamese network and the NetVLAD descriptor to pull cross-view positive sample pairs closer in the feature space via metric learning, thereby introducing deep feature matching into the retrieval task.

However, the mismatch between discrete landmark retrieval and the requirements of continuous navigation triggered a fundamental rethinking of the paradigm. The DenseUAV dataset proposed by Dai et al. [12] redefined the retrieval task itself through dense sampling: transitioning from “recognizing salient buildings” to “distinguishing highly similar adjacent scenes.” This shift directly drove a methodological revolution. Subsequently, Li et al. [13] proposed an incomplete N-Pair loss function, enabling the model to learn correct associations with multiple overlapping reference images, thereby resolving the “one-to-many” matching ambiguity caused by reference map patching.

Nevertheless, image retrieval methods are inherently a discrete localization paradigm, presenting a fundamental trade-off between scalability and precision. Their localization accuracy is limited by the sampling density of the pre-constructed database, where high-density sampling demands prohibitive memory, and sparse sampling causes spatial jumps. Furthermore, in real-world scenarios, UAVs often operate with a limited Field-of-View (FoV) [14], making discrete patch matching highly susceptible to discontinuous errors. Consequently, the inherent sparsity of retrieval boundaries renders it inadequate for high-precision, sub-meter continuous UAV navigation.

2.1.2. FPI

To overcome the database and computational limitations of traditional image retrieval, UAV cross-view localization has been redefined as an end-to-end dense regression problem. This paradigm, known as FPI (Finding Point with Image), draws inspiration from response maps in object tracking. Discarding database construction, FPI processes UAV and satellite images simultaneously. Specifically, the UAV image is encoded into a local feature vector, while the satellite map is encoded into a spatial feature map. A cross-view matching mechanism (e.g., cross-attention or correlation) is then used to measure feature similarity, generating a 2D response heatmap. The heatmap’s peak dictates the UAV’s estimated position, which is directly mapped to geographic coordinates, significantly reducing resource demands. A representative work in this direction is SSPT [15], which employs a Transformer to model global spatial context for precise coordinate prediction.

However, the practical applicability of existing FPI methods is hindered by their reliance on static, single-frame matching. Processing frames in isolation makes the coordinate regression vulnerable to sudden jitters caused by dynamic distractors or illumination shifts. While recent breakthroughs attempt to mitigate perspective distortions by introducing 3D geometric perception and Bird’s-Eye-View (BEV) transformations [16], the inability to utilize explicit temporal information during continuous flight still inevitably leads to error accumulation and trajectory discontinuity. This highlights the critical need for paradigms that effectively leverage sequential temporal constraints.

2.1.3. Particle Filter

As a probabilistic inference framework based on Bayesian estimation and Monte Carlo sampling, the particle filter models localization as a sequential state estimation problem, providing a new paradigm for resolving state estimation issues during continuous UAV operation. Early research applied particle filter to visual UAV localization within known maps. Van Dalen et al. [17] achieved matching and localization between UAV imagery and pre-collected aerial reference maps by combining image alignment with a particle filter. Patel [18] constructed a particle filter visual localizer based on normalized information distance in GNSS-denied environments. These works validated the feasibility of particle filter in UAV planar pose estimation. However, the “maps” they processed were typically high-altitude oblique or orthographic aerial images, fundamentally belonging to same-view or near-view matching tasks.

To address the robustness issue in cross-view matching, the research by Miller et al. [19] heuristically utilized LiDAR point cloud semantics to match with satellite semantic maps, demonstrating the effectiveness of semantic features in cross-view localization. Building upon this, the Semantic-Weighted Adaptive Particle Filter (SWA-PF) proposed by Yuan et al. [8] systematically applied the particle filter framework to the “UAV-to-satellite image” cross-view continuous localization task. This method no longer relies on the direct matching of a single image and a satellite map; instead, it transforms the localization problem into a state estimation problem. Based on the pose estimation from the previous time step, it introduces an inertial prior provided by an IMU to explicitly compensate for the missing visual scale. It performs motion diffusion on the particle swarm with an absolute physical scale using IMU data, and then updates the weights of the particles using the semantic consistency between the current frame’s UAV image and the corresponding satellite image regions of each particle. Among these, the semantic weighting mechanism effectively overcomes the interference of appearance differences on matching, while the propagation and resampling of particles in the state space explicitly introduce temporal continuity, achieving continuous state estimation in four degrees of freedom (longitude, latitude, altitude, and yaw).

Despite these structural advantages, existing filtering paradigms introduce restrictive hardware dependencies and observation vulnerabilities. Relying heavily on an Inertial Measurement Unit (IMU) for absolute scale recovery and motion propagation shifts the operational burden to external sensors, thereby increasing payload constraints and system calibration complexity. Furthermore, utilizing exclusively sparse semantic labels for observation updates often triggers perceptual aliasing in map regions with homogeneous textures (e.g., dense forests or water surfaces), frequently leading to filter divergence. Achieving a robust, resilient continuous track without these dependencies remains a significant challenge.

2.2. Feature Representation Learning Methods in Cross-View Tasks

In cross-view geo-localization tasks, the core objective of feature representation learning is to extract essential features robust to viewpoint, illumination, and seasonal changes from UAV and satellite views with enormous appearance differences. The quality of these features directly determines the accuracy of matching and localization. Early methods primarily employed hand-crafted descriptors such as SIFT [20] and SURF [21], establishing matches by detecting keypoints and computing their local gradient histograms. However, such local appearance-based features struggle to establish reliable cross-view correspondences and lack an understanding of high-level scene semantics. With the rise of deep learning, Convolutional Neural Networks (CNNs) were introduced to learn more discriminative representations. Arandjelovic et al. [11] aggregated local features into global image descriptors by introducing a differentiable VLAD layer on top of CNN feature maps.

To mine finer-grained spatial information, Wang et al. [22] divided the image into square ring regions diffusing from the center, performing feature learning and matching independently for each partition to enhance the model’s perception of local details and spatial structures. Shi et al. [23] proposed the Spatial-Aware Feature Aggregation (SAFA) method, attempting to explicitly align the spatial layout between the two views through polar coordinate transformations and attention mechanisms.

In recent years, Vision Transformers (ViTs) [24], relying on their global self-attention mechanism, have provided a new paradigm for global semantic understanding between cross-view images. Zhu et al. [25] adopted an attention-guided non-uniform cropping strategy, utilizing attention maps to identify high-response regions in the image, thereby improving feature discriminability while reducing computational cost. Dai et al. [26] utilized the thermal distribution of Transformer feature maps to automatically segment the image into semantically meaningful regions, thereby achieving region-level feature alignment across views. Concurrently, to further enhance robustness, recent architectures have introduced multi-scale partitioning and attention-enhanced Transformers [27] to dynamically balance localized and holistic features. Furthermore, the field is experiencing a shift towards zero-shot deep feature matching [28], which leverages foundation models to bypass the need for extensive domain-specific fine-tuning, demonstrating the growing emphasis on generalized representation in complex environments.

2.3. Motion Estimation Based on Inter-Frame Visual Displacement

In the research of visual state estimation for UAVs, utilizing inter-frame image displacement for motion estimation is the key to achieving visual localization and dead reckoning. Based on multi-view geometry, the correspondence between camera motion and image points in planar scenes is fundamentally described by a homography matrix. Building upon the mathematical foundation of decomposing rotation and translation parameters from this matrix, recent advancements, such as the fast and robust deep homography estimation model proposed by Nguyen et al. [29], have further modernized the process of calculating UAV motion from image displacement. Tracing back the practical implementations, Amidi et al. [30] in their pioneering visual odometry calculated the inter-frame displacement of image patches through template matching, and then directly integrated 3D displacement combined with binocular depth, validating the feasibility of using visual sensors as odometers. With the rise of UAV platforms, Caballero’s team [31] explicitly simplified the high-altitude UAV downward-looking scene into a planar motion model, systematically constructed a visual odometry and SLAM framework based on the homography matrix, and introduced online mosaic maps as global references to correct cumulative errors.

Meanwhile, the independent development of feature extraction and robust estimation technologies provided reliable guarantees for inter-frame displacement computation. Lowe, Bay, and Rublee et al. successively proposed local feature descriptors such as SIFT [20], SURF [21], and ORB [32], making feature matching possible under large displacements. More importantly, as emphasized in the comprehensive study by Ma et al. [33], the RANSAC algorithm and its subsequent variants have become the standard pipeline for robustly estimating homography or essential matrices from feature correspondences fraught with mismatches (outliers), ensuring the reliability of motion calculation under dynamic interference environments.

In recent years, deep learning technology has brought fundamental changes to feature extraction. Matching methods based on deep neural networks, such as SuperPoint [34] and LoFTR [35], have surpassed traditional hand-crafted features in performance within weakly textured regions. While deep homography networks continue to advance, recent studies highlight their limitations in uncertainty estimation under severe domain shifts [36]. Consequently, classical feature matching paired with robust estimation remains a highly practical and computationally efficient baseline for edge devices.

In synthesis, while the existing literature provides robust foundations for single-frame cross-view matching and relative motion estimation, a distinct gap remains in achieving vision-only, long-term continuous localization. Retrieval and FPI methods are constrained by spatial discretization and single-frame isolation, whereas current sequential filters rely heavily on inertial hardware and suffer from semantic aliasing in homogeneous map regions. Addressing these limitations motivates the development of our CPFL framework, which fuses an IMU-free visual motion propagation mechanism with a dual-granularity adaptive gating network, aiming to explicitly bound cumulative drift and ensure resilient trajectory tracking.

3. Method

This section details the methodology of the proposed Cross-view Particle Filter Localization (CPFL) framework. The primary objective is to formulate a continuous, vision-only state estimation pipeline that bounds the cumulative drift typical of visual dead reckoning. To achieve this without relying on inertial sensors, the framework integrates a visual motion propagation model with an adaptive cross-view observation model. This section explains the mathematical and algorithmic construction of these components, outlining how relative inter-frame motion and absolute satellite map observations are fused within a probabilistic filtering structure to compute globally consistent planar poses.

The remainder of this section is organized into three main components that mirror the sequential processing pipeline. Section 3.1 delineates the cross-view feature extraction network, detailing the dual-branch Vision Transformer baseline and the Dual-Granularity Adaptive Gating (DGAG) head used to generate observation weights. Section 3.2 formulates the particle filter localization framework, specifying the procedures for state prediction via homography-based visual odometry, cross-view observation updates, and density-based resampling. Finally, Section 3.3 defines the Trajectory Continuity Index (TCI@d), an evaluation metric introduced to mathematically quantify long-term tracking stability over continuous flight trajectories.

3.1. Cross-View Feature Extraction Network

As the core component of the UAV visual localization system, the cross-view feature extraction network undertakes the crucial task of mapping visual information with significant spatial geometric differences (UAV views and satellite views) into a unified feature space. The quality of the features extracted by this network directly determines the accuracy and robustness of the subsequent particle filter localization. The complete architecture of this network is illustrated in Figure 2. This section systematically elaborates on its construction process: first, the baseline model based on a dual-branch Siamese ViT is introduced, establishing the basic feature extraction framework; subsequently, addressing the baseline model’s deficiency in utilizing fine-grained features, our designed Dual-Granularity Adaptive Gating (DGAG) head is detailed; finally, the calculation method of feature similarity is clarified, which will serve as the core input of the particle filter observation model, providing the resampling basis for UAV planar pose estimation.

3.1.1. Baseline Model: Dual-Branch Siamese ViT Architecture

To achieve precise matching between UAV and satellite views, we employ a dual-branch Siamese model based on the Vision Transformer (ViT) as our cross-view feature extraction network. The core idea of this model is to map images from two different views into a unified feature space through a weight-sharing ViT backbone network, thereby ensuring that the feature vectors of images from different viewpoints of the same geographic location are as close as possible. Its complete workflow primarily comprises the following stages:

Input and Preprocessing

The UAV downward-facing image and the corresponding candidate region image cropped from the satellite map serve as the model inputs. Before feeding into the network, both images undergo identical standardization and resizing processes to ensure consistency in input conditions.

Backbone Network and Feature Extraction

The Vision Transformer is adopted as the feature extraction backbone to construct the weight-sharing dual-branch Siamese architecture. After the input images undergo patching, linear projection, position encoding, and Transformer encoding, the [CLS] token is extracted as the global feature representation. Both branches share the same ViT parameters, focusing on mining the essential scene structural information common to both views, and forcing the model to learn a cross-view shared feature representation.

Feature Processing and Output

The baseline model follows the design paradigm of the Vision Transformer, utilizing the global feature of the [CLS] token as the final output. This global feature encapsulates the semantic summary information of the image. The output feature vectors will be used for loss calculation and parameter optimization during the training phase, as well as for cross-view similarity measurement during the inference phase.

Supervised Training and Loss Functions

This baseline model employs a joint supervision strategy, combining representation learning and metric learning for loss calculation.

The total loss function

L_{total}

is defined as:

L_{total} = L_{CE} + λ L_{triplet}

(1)

where

L_{CE}

is the cross-entropy loss, which treats each geographic location as an independent class to enhance feature discriminability:

L_{CE} = - \frac{1}{B} \sum_{i = 1}^{B} log P (y_{i} ∣ f_{i})

(2)

where B is the training batch size,

f_{i}

is the feature vector of the i-th sample, and

y_{i}

is its corresponding ground-truth geographic location label.

L_{triplet}

is a triplet loss incorporating soft-margin optimization [37], aiming to further optimize the clustering structure of the feature space by pulling positive sample pairs closer and pushing negative sample pairs further apart:

L_{triplet} = softplus (d (A, P) - d (A, N) + margin)

(3)

where

d (A, P)

and

d (A, N)

represent the cosine distances between the Anchor feature and the Positive and Negative sample features, respectively; margin is a hyperparameter controlling the boundary margin between positive and negative sample pairs.

softplus (\cdot)

is a smooth activation function capable of providing smooth gradient backpropagation near the distance boundary. The hyperparameter

λ

is used to balance the overall contribution of the cross-entropy loss and the triplet loss.

Through the above process, the baseline model can transform UAV and satellite images with large viewpoint spans into directly comparable deep feature vectors. The feature similarity computed from these will provide the core resampling basis for the observation model of the subsequent particle filter.

3.1.2. Dual-Granularity Adaptive Gating (DGAG) Head Design

In cross-view localization, significant viewpoint and appearance differences exist between UAV views and satellite maps. While the baseline model solely relies on the [CLS] token to extract global semantics—which effectively captures the macroscopic attributes of the scene—it inevitably loses the high-frequency local details that are crucial for sub-pixel-level precise localization. Conversely, purely utilizing the patch token sequence easily leads to a lack of global contextual constraints when facing large-scale homogeneous texture regions (e.g., water surfaces, forest canopies), thereby plunging the system into severe “perceptual aliasing.” To this end, this paper designs the Dual-Granularity Adaptive Gating (DGAG) head. This module adaptively adjusts the fusion ratio of global semantics and local details through a dynamic gating network, thereby constructing a composite feature representation that possesses both macroscopic perception and microscopic precise anchoring capabilities.

Local Feature Aggregation: Generalized Mean (GeM) Pooling [38]

To effectively extract local geometric information from patch tokens, this paper employs Generalized Mean (GeM) pooling for aggregation. Given the feature map

X \in R^{N \times H \times W \times C}

, where N is the batch size,

H \times W

is the spatial dimension, and C is the number of channels, the GeM operator is defined as:

f_{GeM} = {(\frac{1}{| Ω |} \sum_{u \in Ω} x_{u}^{p})}^{\frac{1}{p}}

(4)

where

Ω

denotes the spatial region, and p is a learnable pooling parameter. The core motivation for introducing GeM lies in its ability to adaptively adjust the aggregation behavior according to scene characteristics:

When

p \to 1

, GeM degenerates into Average Pooling (AvgPool), and the network treats all spatial regions equally. A large area of uninformative background will dominate the feature vector, diluting crucial landmark signals.

When

p \to \infty

, GeM degenerates into Max Pooling (MaxPool), and the network exhibits extremum-dominant characteristics, retaining only the local extrema with the strongest responses. While this highlights salient features, it strips away contextual constraints, making it susceptible to dynamic distractors.

During end-to-end joint training, through the gradient backpropagation of the loss function, the parameter p can converge to a finite intermediate state. This non-linear aggregation can effectively amplify discriminative local landmark features while preserving moderate contextual information to suppress noise. The aggregated local feature is denoted as

f_{local} \in R^{N \times C}

.

Adaptive Fusion Mechanism for Cross-Granularity Features

After obtaining the global feature

f_{cls} \in R^{N \times C}

and the local feature

f_{local} \in R^{N \times C}

, considering the complex terrain the UAV traverses during long-endurance cruising, a fixed feature concatenation ratio cannot adapt to the information entropy fluctuations of all scenes. Therefore, DGAG designs a channel interaction-based adaptive gating network to dynamically weight the two.

First, the two types of features are concatenated along the channel dimension as

f_{concat} = [f_{cls} ‖ f_{local}] \in R^{N \times 2 C}

. Subsequently, a two-layer Multilayer Perceptron (MLP) with a bottleneck layer (dimensionality reduction ratio

r = 4

) is used to calculate the correlation of the cross-view features, and normalized weights

w \in R^{N \times 2}

are output via a function:

{[w_{cls}, w_{local}]}^{T} = Softmax (W_{2} δ (W_{1} f_{concat} + b_{1}) + b_{2})

(5)

where

W_{1} \in R^{\frac{C}{r} \times 2 C}

and

W_{2} \in R^{2 \times \frac{C}{r}}

are linear projection matrices, and

δ

is the ReLU activation function. The final composite feature is expressed as:

f_{fused} = w_{cls} f_{cls} + w_{local} f_{local}

(6)

This gating module provides the system with dynamic scene adaptation capabilities. When the UAV flies over large industrial parks with regular layouts and severe homogenization of individual buildings, local geometric details are prone to trigger mismatches. At this time, the network tends to assign a higher weight to

w_{cls}

, relying on the macroscopic scene topological layout to maintain spatial constraints. Conversely, when entering regions with rich textures and high heterogeneity (e.g., urban intersections), the network automatically tilts the weight towards

w_{local}

, fully utilizing the high-frequency details extracted by GeM to execute precise cross-view anchoring. This adaptive mechanism guarantees the temporal robustness of the localization system under varying environmental complexities from the perspective of feature representation.

3.1.3. Feature Similarity Calculation

After obtaining the feature representations of the UAV and satellite views, it is necessary to evaluate their matching degree through a similarity metric. We adopt cosine similarity as the fundamental metric criterion. Specifically, for the L2-normalized UAV view feature vector

f_{drone}

and satellite view feature vector

f_{satellite}

, their cosine similarity is directly computed via the dot product:

similarity = f_{drone} \cdot f_{satellite}

(7)

where · denotes the vector dot product operation. The range of cosine similarity is

[- 1, 1]

, and a value closer to 1 indicates a higher directional similarity between the two feature vectors.

The obtained similarity score will serve as the core input for the observation model in the subsequent particle filter framework, constituting the particle weight distribution and guiding the resampling process.

3.2. Particle Filter Localization Framework

3.2.1. Particle Set Construction

The continuous UAV localization problem in complex environments can be formalized as a sequential state estimation problem. Based on Bayesian filter theory [39], its core lies in utilizing recursive estimation to fuse motion priors and visual observations, thereby approximating the posterior probability distribution of the true UAV planar pose. At time t, this posterior distribution

p (x_{t} ∣ z_{1 : t})

can be recursively expressed as:

p (x_{t} ∣ z_{1 : t}) \propto p (z_{t} ∣ x_{t}) \int p (x_{t} ∣ x_{t - 1}, u_{t}) p (x_{t - 1} ∣ z_{1 : t - 1}) d x_{t - 1}

(8)

where

x_{t} = {[x_{t}, y_{t}, θ_{t}]}^{T}

denotes the 3-Degree-of-Freedom (3-DoF) state vector of the UAV, encompassing the planar position

(x_{t}, y_{t})

and the heading angle

θ_{t}

within the map coordinate system,

z_{t}

represents the current visual observation, and

u_{t}

is the control input.

p (x_{t} ∣ x_{t - 1}, u_{t})

is the state transition model describing the evolution of the state subject to the control input;

p (z_{t} ∣ x_{t})

is the observation model measuring the likelihood of acquiring the visual observation given the current state.

This paper adopts the particle filter method, approximating the posterior distribution through a weighted particle set

{x_{t}^{i}, w_{t}^{i}}_{i = 1}^{N}

. Each particle

x_{t}^{i}

is an independent hypothesis sample in the state space, and its weight

w_{t}^{i}

represents the likelihood of that sample, satisfying

\sum_{i = 1}^{N} w_{t}^{i} = 1

. The particle filter achieves Bayesian updating through the following recursive steps:

State Prediction: Each particle undergoes independent state propagation according to the state transition model $p (x_{t} ∣ x_{t - 1}, u_{t})$ .
Observation Update: The weight $w_{t}^{i}$ of each particle is calculated and updated according to the observation model $p (z_{t} ∣ x_{t})$ .
Resampling: The particle set is filtered based on the weight distribution to maintain particle diversity and approximate the true posterior.

During initialization, the particle set is randomly generated within a pre-defined 2D search space: the planar positions

(x^{i}, y^{i})

cover the entire satellite map area, and the heading angles

θ^{i}

are randomly distributed within

[0, 2 π)

. All particles share the unified reference altitude

H_{ref}

.

3.2.2. State Prediction

In the state prediction step, each particle

x_{t - 1}^{i}

propagates independently according to the state transition model

p (x_{t} ∣ x_{t - 1}, u_{t})

. To achieve visual localization, this paper constructs a state inference path based on visual inter-frame displacement. By processing two consecutive UAV downward-facing images,

I_{t - 1}

and

I_{t}

, this path directly estimates their 2D relative motion and utilizes this motion prior in the image domain as the foundation for constructing the subsequent global control input

u_{t}

.

Specifically, the visual inter-frame displacement estimation algorithm calculates the relative motion components through the following steps:

Feature Extraction and Matching

After preprocessing the images with grayscaling and downsampling, the SIFT algorithm [20] is employed to extract keypoints and their descriptors. Subsequently, the FLANN [40] matcher is used for coarse feature point matching, and Lowe’s [20] ratio test is applied to filter out a high-precision set of matching pairs.

Robust Estimation of Motion Parameters

Based on the filtered matching pairs, the RANSAC algorithm [33] is adopted to robustly estimate the homography matrix H between the two images. The homography matrix [29] describes the projective mapping relationship between the image planes:

[\begin{matrix} x^{'} \\ y^{'} \\ 1 \end{matrix}] \sim H [\begin{matrix} x \\ y \\ 1 \end{matrix}] = [\begin{matrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{matrix}] [\begin{matrix} x \\ y \\ 1 \end{matrix}]

(9)

The system extracts the relative yaw angle

Δ θ_{t}

around the optical axis from the rotation submatrix of H:

Δ θ_{t} = \arctan 2 (h_{21}, h_{11})

(10)

Simultaneously, to more robustly represent the global translation, let the center pixel coordinates of the image be

P_{c} = {[u_{c}, v_{c}, 1]}^{T}

. By calculating the coordinate difference of this center point before and after the mapping by H, the relative translation vector

(Δ x_{img, t}, Δ y_{img, t})

of the UAV in the image coordinate system can be directly obtained:

[\begin{matrix} Δ x_{img, t} \\ Δ y_{img, t} \\ 0 \end{matrix}] = H [\begin{matrix} u_{c} \\ v_{c} \\ 1 \end{matrix}] - [\begin{matrix} u_{c} \\ v_{c} \\ 1 \end{matrix}]

(11)

Through the aforementioned solution, the system acquires a high-confidence relative motion prior in the image domain, which serves as the system’s global control input

u_{t} = {[Δ x_{img, t}, Δ y_{img, t}, Δ θ_{t}]}^{T}

.

To clarify the direct extraction of planar motion parameters from the homography coefficients formulated in Equations (9) and (10), the underlying hardware configuration must be specified. During flight, UAVs experience roll and pitch variations due to aerodynamic disturbances. The CPFL framework assumes the use of a 3-axis stabilized gimbal to compensate for these dynamics and maintain a downward-looking perspective. Under the condition of an orthogonal projection onto a predominantly flat ground plane, the general 3D homography reduces to a 2D Euclidean transformation. This geometric property provides the mathematical basis for extracting translation and yaw parameters directly from the matrix coefficients, serving as an approximation for full-pose decomposition.

3.2.3. Cross-View Coordinate Transformation and Scale Mapping

The

{[Δ x_{img, t}, Δ y_{img, t}, Δ θ_{t}]}^{T}

output from Step 2 is a relative observation defined in the adjacent UAV image coordinate system. To drive the state propagation of particles in the global map coordinate system, a mapping relationship from image pixel displacement to global map displacement needs to be constructed. In monocular visual dead reckoning, the unobservability of depth leads to inherent Scale Ambiguity [41]. However, under high-altitude (e.g., 200 m) vertically downward-looking flight conditions, the Ground Sample Distance (GSD) of the UAV images remains stable. Because the terrain relief displacement is macroscopically negligible relative to the flight altitude, and the RANSAC algorithm used in inter-frame displacement estimation can robustly extract the motion parameters of the dominant ground plane while rejecting elevation outliers, the complex depth recovery problem can be circumvented. Consequently, a globally fixed linear scaling relationship can be directly established between the UAV image plane and the satellite map plane.

The system takes the image-domain displacement

u_{t} = {[Δ x_{img, t}, Δ y_{img, t}, Δ θ_{t}]}^{T}

output by the visual front-end as the globally shared control input. To maintain the diversity of the particle swarm during state propagation, each particle

x_{t - 1}^{i}

needs to independently resolve

u_{t}

into a particle-specific map displacement increment based on its own heading angle

θ_{t - 1}^{i}

and the scale mapping factor

β

:

[\begin{matrix} Δ x_{map, t}^{i} \\ Δ y_{map, t}^{i} \end{matrix}] = β \cdot [\begin{matrix} cos (θ_{t - 1}^{i}) & - sin (θ_{t - 1}^{i}) \\ sin (θ_{t - 1}^{i}) & cos (θ_{t - 1}^{i}) \end{matrix}] [\begin{matrix} Δ x_{img, t} \\ Δ y_{img, t} \end{matrix}]

(12)

where

β

is the Cross-View Scale Mapping Factor. It represents the fixed proportional coefficient between the UAV image pixels and the global satellite map pixels under the stable GSD condition, which can be obtained through pre-calibration in engineering implementation.

Based on the aforementioned cross-view mapping mechanism, when the system receives the control input

u_{t}

, combining the particle’s own heading angle

θ_{t - 1}^{i}

, the state propagation equation of the particle is specifically formalized as:

x_{t}^{i} = x_{t - 1}^{i} + {[Δ x_{map, t}^{i}, Δ y_{map, t}^{i}, Δ θ_{t}]}^{T} + ϵ

(13)

where

ϵ = {[ϵ_{x}, ϵ_{y}, ϵ_{θ}]}^{T} \sim N (0, Σ)

is the 3D additive Gaussian process noise, used to absorb residual terrain elevation undulations in the physical world, minor flight altitude jitter, and RANSAC estimation residuals. This design dimensionally reduces the complex 3D scale recovery problem into a 2D spatial scaling, enabling each particle to establish diversified planar pose propagations at an extremely low computational cost while maintaining scale stability.

However, gimbal stabilization may not fully eliminate attitude variations, and ground surfaces often exhibit elevation changes. Uncompensated roll or pitch introduces perspective distortions, particularly in environments with significant vertical relief (e.g., urban skyscrapers). The current framework addresses these geometric deviations through two standard procedures. First, during the feature matching stage (Section 3.2.2), the RANSAC algorithm rejects feature pairs that severely violate the planar constraint, treating them as spatial outliers. Second, residual biases in translation or heading are modeled as system uncertainty using the additive Gaussian process noise

ϵ

in Equation (13). Through the probabilistic distribution of the particle swarm and subsequent weight updates based on absolute cross-view observations (Section 3.2.4), the filter adjusts for these localized systematic errors, which limits error accumulation over time. While this processing pipeline allows the system to operate under minor perspective distortions, the motion estimation accuracy is structurally bounded in environments with severe elevation changes or unconstrained maneuvers, representing a known limitation of the planar assumption.

3.2.4. Observation Update

In the observation update stage, each particle

x_{t}^{i}

updates its weight based on the observation likelihood

p (z_{t} ∣ x_{t})

. This likelihood is obtained by computing the cross-view feature similarity between the real-time UAV downward-facing view and the corresponding satellite image patch of the particle. For the planar pose hypothesis represented by particle

x_{t}^{i}

, the system executes the following steps to calculate its matching degree with the current UAV view: First, based on the particle’s planar position

(x_{t}^{i}, y_{t}^{i})

and the unified reference altitude

H_{ref}

, an image region with a fixed physical receptive field is cropped from the satellite map. This design ensures that all particles possess a consistent absolute scale reference during cross-view feature extraction. Considering the difference between the particle’s heading angle and the UAV’s actual heading, each particle aligns with the pre-rotated UAV view according to its heading angle

θ_{t}^{i}

. After obtaining the aligned UAV view and the satellite image patch, they are fed into the aforementioned cross-view feature extraction network. Based on the dual-branch Siamese ViT architecture, this network maps the cross-view images into a unified feature space through a weight-sharing backbone:

f_{drone}^{i} = Φ (I_{drone}^{rot (θ_{t}^{i})}; W), f_{sat}^{i} = Φ (I_{sat}^{crop (x_{t}^{i}, y_{t}^{i}, H_{ref})}; W)

(14)

where

Φ (\cdot; W)

represents the feature extraction network parameterized by W, and

f_{drone}^{i}, f_{sat}^{i} \in R^{d}

represent the L2-normalized deep feature vectors.

After obtaining the feature vector pairs, the base cosine similarity

sim (f_{drone}^{i}, f_{sat}^{i}) \in [- 1, 1]

is first obtained by calculating their dot product. In long-term cross-view matching tasks, due to the texture repetitiveness of natural scenes (e.g., grasslands, water surfaces), the base similarity distribution is often relatively smooth, easily leading to insufficient discriminability among particle weights. To this end, this paper adopts a shifted non-linear power function to construct the observation likelihood, thereby updating the particle weight

w_{t}^{i}

:

w_{t}^{i} = p (z_{t} ∣ x_{t}^{i}) = \frac{{(sim (f_{drone}^{i}, f_{sat}^{i}) + 1)}^{γ}}{\sum_{j = 1}^{N} {(sim (f_{drone}^{j}, f_{sat}^{j}) + 1)}^{γ}}

(15)

where the constant

+ 1

shifts the cosine similarity to the interval

[0, 2]

to satisfy the non-negativity requirement of the likelihood function; the exponent

γ

is an adjustment parameter used to control the sharpening degree of the likelihood distribution.

Mathematically, the raw cosine similarity output by the visual front-end provides a bounded metric, which does not naturally constitute a valid probability density function. By applying this exponential transformation coupled with

γ

, the observation likelihood formulation functions analogously to a Softmax operation with a temperature scaling factor (where temperature

T \propto 1 / γ

), or a Boltzmann distribution in statistical mechanics. This theoretically justified transformation converts the bounded similarity scores into a strictly positive, appropriately peaked probability distribution, enabling the Bayesian update step to effectively differentiate correct particles from noise.

By introducing the power operation of

γ > 1

, the system can non-linearly amplify the weight differences of high-confidence particles, prompting the filter to accelerate convergence under ambiguous observations. The observation update stage thus completes the injection of the current visual observation information into the probabilistic framework of the particle filter.

3.2.5. Resampling and State Estimation

After completing the observation update, the particle set must undergo a resampling mechanism to resolve the particle degradation problem. This paper adopts the systematic resampling strategy, constructing a cumulative distribution function based on the normalized particle weights. By generating a sequence of uniformly distributed random numbers, the system probabilistically selects high-weight particles and discards low-weight particles. During the resampling process, the system dynamically adjusts the particle population to a preset number, maintaining computational efficiency while ensuring particle diversity.

To obtain the final planar pose estimation, the system evaluates the convergence of the resampled particle set. When the particle covariance drops below a preset threshold, the DBSCAN clustering algorithm is employed to identify high-density regions within the particle set. Based on the principle of density reachability, this algorithm merges spatially adjacent particles into the same cluster using the neighborhood radius and minimum points parameters, while effectively distinguishing core particles from outliers.

After obtaining multiple candidate clusters, the system calculates the centroid coordinates of each cluster. By evaluating the average weight and the spatial distribution compactness of the particles within each cluster, the cluster center with the highest comprehensive score is selected as the final planar pose estimation.

To clarify the sequential execution and the closed-loop logic of the proposed CPFL framework, the overall pipeline—from motion inference to cross-view observation and resampling—is summarized in Algorithm 1.

Algorithm 1 Cross-View Particle Filter Localization (CPFL)

Require:: Real-time UAV Image Stream ${I_{t}}_{t = 1}^{T}$ , Satellite Map M, Number of Particles N, Scale Mapping Factor $β$
Ensure:: Estimated Global UAV Poses ${{\hat{x}}_{t}}_{t = 1}^{T}$
1:: Initialization: Randomly distribute N particles ${x_{0}^{i}}_{i = 1}^{N}$ uniformly over the 2D search space in M, with initial weights $w_{0}^{i} = 1 / N$ .
2:: for $t = 1$ Tdo
3:: {Stage 1: Motion Inference Path (State Prediction)}
4:: Extract SIFT features from consecutive frames $I_{t - 1}$ and $I_{t}$ .
5:: Robustly estimate inter-frame Homography matrix H via RANSAC.
6:: Decompose H to extract image-domain relative motion $u_{t} = {[Δ x_{i m g, t}, Δ y_{i m g, t}, Δ θ_{t}]}^{T}$ .
7:: for $i = 1$ N do
8:: Resolve $u_{t}$ into map displacement ${[Δ x_{m a p, t}^{i}, Δ y_{m a p, t}^{i}]}^{T}$ using $β$ and particle heading $θ_{t - 1}^{i}$ .
9:: Propagate state: $x_{t}^{i} \leftarrow x_{t - 1}^{i} + {[Δ x_{m a p, t}^{i}, Δ y_{m a p, t}^{i}, Δ θ_{t}]}^{T} + ϵ$ ${ϵ \sim N (0, Σ)}$
10:: end for
11:: {Stage 2: Observation Path (Cross-View Update)}
12:: for $i = 1$ N do
13:: Crop candidate satellite patch $I_{s a t}^{i}$ from M based on particle position $(x_{t}^{i}, y_{t}^{i})$ and $H_{r e f}$ .
14:: Align UAV view $I_{t}$ using particle heading $θ_{t}^{i}$ to obtain $I_{d r o n e}^{i}$ .
15:: Extract composite features $f_{d r o n e}^{i}$ and $f_{s a t}^{i}$ via DGAG network.
16:: Calculate base cosine similarity: $s i m_{t}^{i} = f_{d r o n e}^{i} \cdot f_{s a t}^{i}$ .
17:: Update unnormalized likelihood: ${\tilde{w}}_{t}^{i} = {(s i m_{t}^{i} + 1)}^{γ}$ . {Non-linear amplification}
18:: end for
19:: Normalize particle weights: $w_{t}^{i} \leftarrow {\tilde{w}}_{t}^{i} / \sum_{j = 1}^{N} {\tilde{w}}_{t}^{j}$ .
20:: {Stage 3: Resampling and State Estimation}
21:: Perform Systematic Resampling based on cumulative weight distribution to dynamically maintain particle diversity.
22:: Apply DBSCAN algorithm to cluster the resampled particle set.
23:: ${\hat{x}}_{t} \leftarrow$ Centroid of the cluster with the highest comprehensive evaluation score.
24:: end for
24:: return ${{\hat{x}}_{t}}_{t = 1}^{T}$

3.3. Trajectory Continuity Index (TCI@d)

In UAV self-localization tasks, commonly used evaluation metrics (such as R@1, mean position error, etc.) primarily act on single localization moments, focusing more on whether the instantaneous position error is close enough to the ground truth. Such metrics struggle to characterize whether the localization system can continuously and stably remain within an acceptable error range throughout the entire flight process. However, for autonomous UAV flight scenarios, long-term stable continuous localization capability is often more critical than high accuracy at a few sporadic moments. Therefore, starting from the trajectory level, this paper designs the Trajectory Continuity Index (TCI@d) to measure the continuous localization performance under a given distance threshold d, as shown in Figure 3.

Suppose a flight trajectory contains T localization moments, and the ground truth position and estimated position at time t are denoted as

p_{t} = {[x_{t}, y_{t}]}^{T}

and

{\hat{p}}_{t} = {[{\hat{x}}_{t}, {\hat{y}}_{t}]}^{T}

, respectively, where

x, y

represent planar coordinates. This paper adopts the 2D Planar Euclidean Distance as the position error metric for each moment:

e_{t} = {∥ p_{t} - {\hat{p}}_{t} ∥}_{2} = \sqrt{{(x_{t} - {\hat{x}}_{t})}^{2} + {(y_{t} - {\hat{y}}_{t})}^{2}}

(16)

Given a tolerance distance threshold d, the error at each moment is converted into a binary “success/failure” indicator:

s_{t} = \{\begin{matrix} 1, & e_{t} \leq d \\ 0, & e_{t} > d \end{matrix} t = 1, \dots, T

(17)

where

s_{t} = 1

indicates that the localization error at time t is within the acceptable range, regarded as a “localization success”;

s_{t} = 0

is regarded as a “localization failure.” In this way, each trajectory can be represented as a 0/1 sequence of length T, denoted as

{s_{t}}_{t = 1}^{T}

.

In this sequence, consecutive intervals with a value of 1 correspond to time segments that “continuously remain within the threshold.” Let the lengths of all continuous successful localization segments composed of 1s be

l_{1}, l_{2}, \dots, l_{K}

, where K is the number of continuous successful segments on this trajectory. Based on this, the Trajectory Continuity Index of a single trajectory is defined as:

TCI @ d = \frac{1}{T^{2}} \sum_{i = 1}^{K} l_{i}^{2}

(18)

It can be seen that the value range of TCI@d is

[0, 1]

. When the localization system satisfies

e_{t} \leq d

throughout the entire trajectory, there is only one continuous successful segment of length T; at this time,

TCI @ d = 1

, indicating the most ideal continuous localization performance. Conversely, when the successful moments account for a certain proportion but are fragmented into multiple shorter successful segments, due to the adopted quadratic weighting form of

l_{i}^{2}

, long uninterrupted successful segments are assigned higher weights, while the contribution of scattered, intermittent successes is limited. This significantly penalizes trajectories with “frequent failures and repeated convergences.” Consequently, TCI@d can distinguish between different localization methods with “similar proportions of successful moments but significant differences in continuity,” which better aligns with the practical requirements of continuous localization scenarios.

The threshold d can be set according to specific application scenarios. For example, in UAV localization in urban environments,

d = 10 m

can be chosen to measure the capability of “how long the system can continuously maintain within an acceptable localization error.” Compared with traditional single-moment metrics, TCI@d supplements the characterization of the localization system’s performance from the perspective of trajectory continuity, making the evaluation results closer to the actual demands of UAV continuous self-localization tasks.

4. Experiments

This section evaluates the proposed Cross-view Particle Filter Localization (CPFL) framework to quantify its accuracy, robustness, and computational feasibility under real-world conditions. The primary purpose of these experiments is to systematically validate the theoretical mechanisms established in the methodology—specifically, the capacity of the Dual-Granularity Adaptive Gating (DGAG) network to provide reliable absolute observations and the ability of the probabilistic filter to bound the cumulative drift of visual odometry. By testing the system in a GNSS-denied environment and under various induced disturbances, this section empirically establishes the operational boundaries and practical viability of the proposed vision-only localization paradigm.

The evaluation is structured to progressively analyze the system from macroscopic tracking performance to specific component behaviors and boundary limits. Section 4.1 introduces the MAFS dataset utilized for all subsequent tests. Section 4.2 presents a comparative baseline analysis, benchmarking the CPFL framework against existing retrieval, regression, and sequential localization methods. Section 4.3 isolates the feature extraction module to analyze the impact of different aggregation strategies on observation accuracy. Section 4.4 explicitly verifies the closed-loop drift elimination mechanism by contrasting it with open-loop visual dead reckoning. To assess deployment feasibility, Section 4.5 examines the accuracy–latency trade-off and real-time performance on an embedded edge platform. Finally, Section 4.6 and Section 4.7 stress-test the framework’s resilience against sudden spatial displacements (e.g., wind gusts) and its sensitivity to sub-optimal conditions (e.g., map degradation and scale perturbations), thereby comprehensively defining the system’s physical and algorithmic constraints.

4.1. Dataset

The experiments in this section are conducted on the publicly available MAFS dataset [8], a large-scale UAV localization benchmark explicitly designed to reflect real-world operational conditions. The data was natively captured on-site by a physical UAV (DJI MAVIC III E) flying over a university campus. All video sequences were recorded in 4K resolution at 30 frames per second, with each frame precisely synchronized with onboard geospatial metadata, including latitude, longitude, altitude, yaw angle, and velocity.

This study specifically utilizes the MAFS-10 subset, which features a continuous flight trajectory at a constant altitude of 200 m within a campus environment. The captured scenes encompass diverse topographical features such as buildings, road networks, and vegetation, thereby providing a challenging and representative benchmark for evaluating continuous cross-view localization under authentic GNSS-denied conditions. The ground truth position reference is provided by an onboard RTK-GPS system, while corresponding Google Maps (Google LLC, Mountain View, CA, USA, accessed on 15 February 2025) Level 19 satellite imagery serves as the geolocation basemap.

4.2. Main Localization Performance Comparison

To verify the effectiveness and advancement of the proposed cross-view particle filter localization framework in complex dynamic environments, we conducted comprehensive comparative experiments on the MAFS dataset (200 m constant-altitude scenario). We horizontally compared our method (Ours) with three representative mainstream paradigms in the current UAV cross-view localization field. The comparison baselines include:

DenseUAV [12]: A representative method based on the Image Retrieval paradigm, which localizes by retrieving the most similar reference image and performing geographic tag interpolation.
SSPT [15]: A coordinate regression paradigm based on FPI (Finding Point with Image), employing a Transformer architecture to directly predict the pixel coordinates of the UAV within the satellite image.
SWA-PF [8]: The SOTA method based on Sequential filtering, which utilizes semantic segmentation features for matching and relies on IMU sensors for state prediction.

4.2.1. Experimental Setup and Baseline Implementation

To ensure the fairness and reproducibility of the comparative experiments, we strictly followed the official experimental protocols of each baseline method. It is particularly noted that, for the IMU-dependent SWA-PF method, to eliminate the interference caused by the inconsistent noise characteristics of specific IMU hardware models, we adopted a data generation strategy consistent with the original paper: inter-frame displacements were calculated based on the UAV’s ground-truth GPS/RTK trajectory, and random noise conforming to a Gaussian distribution, along with scale and heading biases simulating cumulative drift, were superimposed to generate standard inertial measurement data. This setup ensures that the baseline method can operate under an ideal and controllable sensor noise level, providing it with an optimal performance reference.

Furthermore, considering the inherent randomness of the particle filter algorithm, all quantitative metrics in Table 1 are the averages of 10 independent repeated experiments to eliminate the random fluctuations of a single run, ensuring the statistical significance of the evaluation results.

4.2.2. Quantitative Performance Analysis

Table 1 presents the detailed experimental results of each method on the MAFS dataset. To account for the randomness of Monte Carlo sampling, quantitative metrics for particle filter-based methods (SWA-PF and Ours) are reported as Mean ± Standard Deviation (SD) across 10 independent runs. The evaluation metrics include Mean Localization Error (Mean Error), Root Mean Square Error (RMSE), localization success rate within a 10-m range (Success Rate @10m), and the trajectory continuity index (TCI@d) proposed in this paper.

Experimental data demonstrate that traditional single-frame matching methods underperform in continuous localization tasks. Limited by the discrete sampling density of the database, DenseUAV exhibits a high mean error of 176.15 m, struggling to meet precise navigation requirements. Although SSPT introduces a coordinate regression mechanism, it lacks temporal constraints during long-trajectory flights, leading to the amplification of single-frame prediction jumps, with its mean error remaining at a high level of 42.08 m.

In contrast, particle filter-based methods incorporating temporal information (SWA-PF and Ours) achieve orders-of-magnitude improvements across all metrics. Notably, our method (Ours) achieves the optimal comprehensive localization accuracy without the assistance of any IMU sensors. Specifically, the mean error of our method is reduced to 5.28 ± 0.38 m, and the RMSE is reduced to 5.92 ± 0.44 m, representing improvements of 8.3% and 10.3%, respectively, compared to the IMU-dependent baseline method SWA-PF. The lower standard deviations also indicate less variance across multiple independent runs. On the 10 m success rate metric, which measures robustness, our method reaches 89.74%, outperforming SWA-PF’s 80.77%. This indicates that compared to the semantic segmentation features used by SWA-PF, our designed Dual-Granularity Adaptive Gating (DGAG) head possesses stronger fine-grained discriminative capabilities, effectively compensating for and even surpassing the motion prior advantages brought by inertial sensors.

4.2.3. Cumulative Error Distribution (CED) Analysis

To more intuitively demonstrate the performance of each method under different error thresholds, Figure 4 plots the Cumulative Error Distribution (CED) curves. This curve selects data from a representative experiment that is closest to the average performance in Table 1.

It can be clearly observed from Figure 4 that:

The curve of our method (Ours, solid red line) has the largest initial slope and rises the fastest. The curve exceeds a 60% recall rate at the 5 m error threshold and approaches 90% at 10 m, demonstrating a faster convergence speed and accuracy.
SWA-PF (dashed blue line), although sharing a similar overall trend with our method, is consistently outperformed by our method in the high-accuracy interval of 3 m–8 m.
The curves for SSPT (dash-dotted green line) and DenseUAV (dotted gray line) grow slowly and exhibit significant long-tail effects, further confirming the limitations of non-sequential localization methods in dynamic environments.

4.2.4. Continuity and Smoothness Trade-Off Analysis

Regarding the distribution of the Trajectory Continuity Index (TCI), we observed that although our method outperforms SWA-PF on TCI@10 (0.319), it is slightly lower than SWA-PF on the stricter TCI@5 and the broader TCI@15 metrics.This difference reflects the physical characteristics of the two motion inference mechanisms: the simulated IMU data used by SWA-PF is generated based on the smooth ground-truth trajectory, and its inherent physical continuity makes the relative displacement of particles extremely stable over short periods. In contrast, the visual motion inference proposed in this paper relies entirely on inter-frame image feature matching. Limited by the image grid resolution and the randomness of feature point matching, purely visual calculations inevitably introduce high-frequency observation jitter.

However, this compromise in microscopic smoothness is exchanged for the complete removal of hardware dependency and a significant improvement in macroscopic localization accuracy. Despite a slightly lower TCI@5, our method surpasses SWA-PF in Mean Error and Success Rate, which determine the success or failure of the task. This proves that: in the absence of expensive inertial navigation hardware support, high-precision and robust continuous localization can be achieved relying solely on visual information. This design reduces the hardware cost and calibration complexity of the UAV system, possessing practical value for practical applications.

4.3. Comparative Analysis of Feature Extraction Strategies

To verify the adaptability of the proposed Dual-Granularity Adaptive Gating (DGAG) head in complex dynamic environments, this section compares three different types of feature aggregation paradigms: (1) Global Baseline; (2) Naive Pooling; (3) Structured Aggregation. The experiment aims to explore the mechanism by which the feature interaction method affects the accuracy of the particle filter observation model.

4.3.1. Quantitative Performance Evaluation

Table 2 summarizes the performance metrics of each feature extraction strategy on the MAFS dataset.

The data reveals a three-tier impact mechanism of feature aggregation methods on localization performance:

First, the introduction of non-salient local features disrupts global semantics. Experiments show that the performance of simple MaxPool and AvgPool strategies (>8.5 m) is significantly inferior to the Global baseline using only the [CLS] Token (6.43 m). This indicates that, in the absence of attentional filtering, naively aggregating all Patch features introduces background noise (e.g., dynamic distractors, cloud occlusions), thereby reducing the signal-to-noise ratio of the features.

Second, rigid spatial partitioning is unsuitable for UAV 3-DoF localization. The significant failure of the LPN method (10.25 m) confirms the sensitivity of the Hard Partition strategy to geometric transformations. The yaw and translation present during UAV flight destroy the spatial correspondence of image patches, leading to feature matching failures.

Finally, adaptive fusion possesses advantages in fine-grained tracking. Comparing the current SOTA method FSRA with our method (Ours), although both converge to 5.28 m in mean error, they exhibit different characteristics in stability and continuity:

Robustness Improvement:

Despite the tied mean error, the RMSE of our method (5.92 ± 0.44 m) is significantly lower than that of FSRA (6.20 ± 0.54 m), and exhibits a smaller standard deviation across multiple runs. Concurrently, the 10 m localization success rate increases by nearly 4 percentage points (89.74% vs. 85.90%). The reduction in both RMSE and its standard deviation indicates that our method generates fewer outliers and possesses a higher system lower-bound.

Trade-off Between Convergence and Retention:

Regarding trajectory continuity metrics, FSRA holds a slight advantage on TCI@20 (0.937 vs. 0.906), while our method overtakes it on the stricter TCI@10 (0.329 vs. 0.306). Analyzing the detailed logs reveals that this phenomenon stems from a minor difference in Initialization Latency: FSRA tends to complete clustering convergence around the 3rd frame of the sequence, whereas our method typically achieves lock-on at the 4th frame, resulting in slightly fewer valid count frames for TCI@20. However, once initialization is complete, relying on stronger feature discriminability, our method can lock the UAV more tightly within a 10 m high-precision corridor, rather than merely maintaining it within a coarse 20 m range.

4.3.2. Error Distribution and Temporal Stability Analysis

Figure 5 and Figure 6 further analyze the aforementioned performance differences through probability density distributions and temporal heatmaps, respectively.

The Ridgeline plot (Figure 5) intuitively visualizes the error concentration of each method. Compared to the divergent distributions of LPN and AvgPool, both Ours (red) and FSRA (orange) exhibit a distinct convergent shape concentrated toward the lower error range (left side). Notably, our method (Ours) possesses the sharpest peak morphology (Highest Kurtosis) and a shorter distribution tail. This error distribution indicates that our algorithm maintains higher confidence during the vast majority of time steps, corroborating the advantage observed in the RMSE metric.

The heatmap (Figure 6) reveals the temporal evolution of localization errors. The large areas of dark color blocks in the row corresponding to LPN reaffirm its unsuitability in dynamic environments. Comparing FSRA and Ours, although both have generally light overall tones, FSRA exhibits intermittent color darkening (error fluctuations) in regions with homogeneous terrain features in the middle of the trajectory. In contrast, the row for our method maintains a uniformly light color distribution throughout. This benefits from the ability of the Dual-Granularity Adaptive Gating (DGAG) head to adaptively adjust the weights of global and local features according to the scene entropy—relying on global semantics to maintain constraints in weakly textured regions, and utilizing GeM features to enhance accuracy in richly textured regions—thereby achieving smoother and more stable state estimation across the entire trajectory segment.

4.3.3. Summary of Feature Extraction Strategies

The comparative experiments prove that the feature fusion strategy proposed in this paper is not a naive module stacking. By overcoming the geometric sensitivity of hard partitioning (outperforming LPN) and optimizing the filtering mechanism of local features (outperforming FSRA), our method significantly enhances the fine-grained tracking capability and robustness of the localization system in complex scenes while ensuring fast convergence.

4.4. Drift Elimination and Global Consistency Verification

In long-endurance flight missions, suppressing accumulated error is a core metric for evaluating the robustness of a localization system. This section aims to verify whether the particle filter framework proposed in this paper can effectively utilize satellite map observation information to eliminate the trajectory drift introduced by the vision-based inter-frame displacement estimation, thereby achieving globally consistent localization.

4.4.1. Experimental Setup and Comparison Schemes

To decouple the contributions of “motion inference” and “observation correction” within the system, we constructed two experimental schemes with different loop characteristics:

Visual Dead Reckoning (Open-Loop):

Relies solely on the inter-frame displacement estimation algorithm proposed in this paper for dead reckoning, without introducing the resampling and observation update steps of the particle filter. This scheme represents the system’s performance when relying exclusively on visual relative motion estimation.

Ours (Closed-Loop):

The integrated localization system proposed in this paper. It utilizes Visual Dead Reckoning to provide the prior state prediction, and combines cross-view matching embedded with the Dual-Granularity Adaptive Gating (DGAG) head for likelihood constraint and resampling (Update & Resample).

4.4.2. Qualitative Analysis of Trajectory Consistency

Figure 7 demonstrates the 2D trajectory reconstruction results in a long-distance flight scenario. Specifically, (a) is the trajectory view overlaid on the satellite orthophoto, and (b) is the trajectory comparison plot on a white background.

From the visualization results, we can clearly observe the essential differences between the open-loop and closed-loop systems:

Drift Accumulation in Open-Loop Dead Reckoning:

The Visual Dead Reckoning baseline (dashed yellow line) can marginally align with the ground truth (solid green line) during the initial phase of the flight. However, due to the inherent scale ambiguity of monocular vision, minor inter-frame estimation biases are continuously amplified as they integrate over time, inevitably deviating from the intended flight path. This confirms that relying solely on relative incremental dead reckoning cannot physically overcome the natural law of long-term error accumulation.

Error Correction Capability of the Closed-Loop System:

Our method (solid red line) demonstrates superior global consistency. Although its underlying prediction model receives the same displacement input as Visual Dead Reckoning, the system did not experience trajectory drift; instead, it was consistently and strictly constrained near the true Trajectory Manifold. Especially at large-curvature turns, the cross-view observation model forcibly corrected the accumulated heading bias by adjusting the particle weights, achieving a high degree of coincidence with the ground truth trajectory.

4.4.3. Quantitative Error Evolution

To rigorously quantify the dynamic effect of drift elimination, we introduce the Absolute Trajectory Error (ATE) curves over time (line chart) and full-cycle error distribution statistics (box plot), as shown in Figure 8.

The quantitative data and statistical features profoundly reveal the closed-loop error correction mechanism of the cross-view particle filter:

Drift Accumulation and High Variance:

As shown in the line chart, the error of Visual Dead Reckoning (orange curve) does not exhibit a simple linear growth; instead, it exhibits cumulative drift characteristics that fluctuate drastically with the trajectory morphology. Particularly in the middle segment of the trajectory, due to the unobservability of the monocular visual scale and the integration effect of the heading error, its open-loop error surges rapidly and approaches a peak of 20 m. Combined with the box plot distribution, it can be seen that the error of the Visual Dead Reckoning baseline presents high uncertainty: its Interquartile Range (IQR, i.e., the box height) is extremely broad, and the Upper Whisker exceeds 20 m. This confirms that a purely visual system lacking absolute observations cannot provide reliable pose confidence.

Bounded Oscillation and Distribution Compression:

In contrast, our method (Ours, red curve) demonstrates a strong suppression capability against accumulated error. In the line chart, even when the Visual Dead Reckoning baseline experiences severe drift in the middle segment, the error curve of Ours maintains a stable bounded oscillation, avoiding the local divergence of the trajectory. From the perspective of the box plot, after introducing the cross-view observation model, the error distribution of the system undergoes a transformation: not only is the Median significantly pulled down to around 6 m, but its Interquartile Range (IQR) is also compressed within the interval of [4 m, 7 m]. Simultaneously, the upper limit of the system’s error is capped at around 10 m.

4.4.4. Summary of Drift Elimination

This experiment demonstrates that: although vision-based inter-frame displacement estimation can capture short-term relative motion, it cannot fundamentally eliminate long-term drift. The Dual-Granularity Adaptive Gating (DGAG) head proposed in this paper can provide an effective absolute position reference. Through the particle filter mechanism, the system transforms the unbounded accumulated error generated solely by the Visual Dead Reckoning baseline into a stable bounded error, achieving long-term, high-precision autonomous localization in outdoor GNSS-denied environments using pre-existing maps.

4.5. Real-Time Performance on Edge Platforms

4.5.1. Edge Hardware Setup and System Optimization

To evaluate the practical deployment feasibility, we migrated the CPFL framework to an embedded edge computing platform. The hardware utilized is an NVIDIA Jetson Xavier NX developer kit, configured with a 6-core NVIDIA Carmel ARMv8 CPU, a 384-core Volta GPU with 48 Tensor Cores, and 8GB LPDDR4x memory, operating at a 20W 6-core power mode. The software environment is based on JetPack 5.1.2 with TensorRT 8.5.2.2. Given the computational and memory bandwidth constraints of actual UAV onboard devices, deploying the original desktop-grade pipeline is physically prohibitive.

Therefore, two necessary, computation-aware engineering optimizations were introduced to the edge system:

Backend Acceleration via TensorRT: The DGAG-ViT cross-view observation model was converted from the native PyTorch(V2.1.0) dynamic graph into an optimized TensorRT static engine. By quantizing the network to FP16 precision, we explicitly exploited the GPU’s Tensor Cores for accelerated matrix multiplication and reduced the memory bandwidth bottleneck.
Frontend Lightweighting: The visual odometry was optimized to ensure continuous relative pose estimation without saturating the ARM CPU. We introduced a $0.25 \times$ image downsampling strategy and replaced computationally intensive feature extractors (e.g., SIFT) with lightweight ORB features. By utilizing binary descriptors and Hamming distance matching, the frontend processing time was compressed to under 100 ms.

It is important to note that these system-level approximations inherently introduce a slight accuracy penalty. Compared to the theoretical desktop baseline (Mean Error: 5.28 m), the optimized edge pipeline exhibits a base error of 6.70 m (under the

N = 550

setting). However, this degradation is acceptable for high-altitude UAV navigation, as it generally remains within the 10 m continuous tracking tolerance (TCI@10). Thus, in the following ablation studies, we use this optimized edge pipeline as our practical baseline to explore the limits of real-time performance.

4.5.2. State-Dependent Particle Decay and Accuracy-Latency Trade-Off

Despite the hardware optimizations, executing the full iterations for a large particle swarm frame-by-frame remains a computational bottleneck. As shown in Table 3, running the baseline setting (

N = 550

tracking particles) on the Jetson NX yields a tracking latency of ∼3.58 s per frame, which is sub-optimal for continuous tracking.

To bridge the gap between global search and real-time edge tracking, we propose a State-Dependent Particle Decay strategy. While the global initialization requires 1000 particles to prevent the “particle deprivation” problem in expansive environments, the tracking phase can operate with fewer particles once the spatial uncertainty is reduced via DBSCAN clustering.

Figure 9 illustrates the accuracy-latency trade-off across different decay thresholds. The results demonstrate a clear trade-off point. Owing to the discriminative capacity of our DGAG-ViT, we can decay the tracking particles down to

N = 50

without causing filter divergence. At

N = 50

, the mean error slightly increases to 8.94 m, but the full iteration latency drops to ∼0.95 s per frame. Conversely, for the divergent

N = 20

setup, the insufficient particle swarm frequently fails to form valid clusters. To properly visualize this tracking loss, a 30 m error penalty was applied to these unclustered frames, resulting in the observed error surge and demonstrating the unsuitability of such extreme decay. Ultimately, the 0.95 s complete iteration cycle of the

N = 50

configuration satisfies the “soft real-time” navigation requirements for high-altitude UAVs without requiring expensive inertial hardware.

4.5.3. Tracking Stability and Resilience Analysis

A common concern with particle reduction is the potential for mid-flight localization failure or divergence. To verify the tracking stability, Figure 10 plots the full time-series absolute positioning error comparing the

N = 550

and

N = 50

configurations.

As observed in Figure 10, while the

N = 550

configuration maintains a smooth error curve, the

N = 50

configuration exhibits occasional transient error spikes (reaching 20–30 m). These spikes typically occur during challenging frames with weak visual textures or rapid viewpoint shifts, where the reduced particle swarm momentarily fails to capture the true pose.

However, rather than leading to divergence, these occurrences demonstrate the resilience and self-recovery capability of the proposed CPFL framework. Driven by the global anchoring of the DGAG-ViT observation model, the system pulls the deviated particles back into the <10 m region within subsequent frames. This proves that

N = 50

particles provide a computationally efficient lower bound that tolerates transient estimation glitches while guaranteeing long-term continuous tracking without cumulative drift.

4.5.4. Summary of Real-Time Edge Performance

In summary, deploying the CPFL framework on resource-constrained edge devices necessitates a practical balance between computational efficiency and localization accuracy. By combining TensorRT acceleration, frontend lightweighting, and the state-dependent particle decay strategy, the system’s computational burden is significantly alleviated. The optimal

N = 50

tracking configuration compresses the full iteration latency to approximately 0.95 s per frame, all while preserving the framework’s core resilience and self-recovery capabilities. These engineering efforts validate the practical feasibility of CPFL for robust, continuous UAV navigation in real-world edge environments.

4.6. Robustness Analysis Under Sudden Wind Gust Disturbances

During the UAV localization process, the intrinsic recovery capability of the proposed framework relies heavily on the probabilistic nature of the particle filter coupled with continuous absolute position observations. Rather than depending on a single deterministic state estimate, the filtering mechanism continuously maintains a non-Gaussian belief of the state space through a diversified sample set. Consequently, even when the entire swarm experiences a severe common-mode translation, the structural combination of motion propagation and observation updates enables the filter to recursively re-evaluate state probabilities and naturally direct the particle evolution back toward high-likelihood regions once informative visual observations are incorporated.

To evaluate the system’s resilience against tracking failures when the UAV is deflected from its standard flight path by sudden external disturbances (e.g., wind gusts), we conducted a series of artificial offset injection experiments. During trajectory-following flight, these wind-induced deflections are simulated by applying instantaneous, uniform spatial shifts directly to the entire particle swarm, thereby verifying the framework’s relocalization and recovery capacities.

4.6.1. Intrinsic Fault Tolerance (50 m Gust Disturbance)

To verify the inherent fault tolerance of the standard CPFL framework, 50 m spatial offsets were injected along the forward direction, the lateral direction, and simultaneously in both the forward and lateral directions. As illustrated in Figure 11, immediately following the disturbance injection, the localization error spiked to approximately 50 to 75 m.

However, without triggering any additional recovery mechanisms, the system absorbed these offsets. Driven by the synergistic combination of the probabilistic filtering mechanism and the high-confidence observation weights generated by the DGAG network, the particles naturally re-converged toward the true pose over successive frames. Shortly after the disturbance, the absolute position errors for all three directions smoothly decreased back to the baseline operational level (below 10 m). This demonstrates that the standard framework possesses strong intrinsic robustness against moderate-to-large spatial disturbances.

4.6.2. Extreme Disturbance and Adaptive Noise Amplification (90 m Gust)

To further explore the system’s limits and prevent particle deprivation under extreme deviations, an extended 90 m gust was injected along the lateral direction. Under such severe displacement, the standard filter struggles to recover promptly, as the true pose may fall outside the standard particle distribution range (as shown by the red curve in Figure 12). To address this, an adaptive noise amplification strategy was introduced as a systemic extension.

As shown in Figure 13, the maximum similarity score (

S_{m a x}

) serves as an effective tracking confidence indicator. Prior to the gust,

S_{m a x}

maintained a stable value of 0.390. Following the 90 m injection,

S_{m a x}

sharply dropped to 0.097, falling well below the predefined safe zone threshold of 0.25. This drop reliably identified a “Loss State” and immediately triggered the adaptive noise amplification mechanism. This mechanism dynamically increased the additive noise superimposed after the particle state transition, which guarantees particle diversity and redistributes the particles over a broader search area.

As depicted by the blue curve in Figure 12, this extended strategy enabled rapid re-convergence. Soon after the activation of the mechanism, the similarity score recovered to 0.280, re-entering the safe zone (Figure 13), and the position error correspondingly dropped back to the normal operational range, effectively preventing long-term trajectory divergence.

4.6.3. Summary of Disturbance Robustness

The disturbance experiments confirm that the CPFL framework exhibits robust anti-interference capabilities against sudden spatial displacements. The inherent framework autonomously resolves typical deviations, while the similarity-triggered adaptive noise amplification serves as an effective structural extension for extreme boundary cases. This resilience is fundamentally supported by the large visual footprint provided by the 200 m flight altitude, combined with the core nature of the particle filter. By continuously maintaining particle diversity, the filter inherently preserves multiple state hypotheses at all times. This probabilistic characteristic ensures that even after severe instantaneous offsets, the system retains the necessary possibilities to recapture the true state, thereby reliably re-anchoring itself and maintaining continuous trajectory tracking.

4.7. System Sensitivity and Boundary Condition Evaluation

To comprehensively assess the generalization boundaries and fault tolerance of the CPFL framework under real-world sub-optimal conditions, this section evaluates its performance across three critical dimensions. Section 4.7.1 investigates system robustness against map degradation, including noise, downsampling, and blur. Section 4.7.2 analyzes the parameter sensitivity of the observation likelihood model (

γ

). Finally, Section 4.7.3 examines the system’s tolerance to scale factor uncertainties, which physically correspond to unmeasured altitude variations. For comparative fairness, all experiments strictly adopt the baseline configuration, utilizing 1000 initial particles and 550 tracking particles.

4.7.1. Robustness Against Map Degradation

To address the practical limitation that pre-existing satellite maps may be imperfect due to sensor noise, outdated imagery, or environmental changes, we evaluate the CPFL framework’s fault tolerance under three synthetic map degradation schemes. As illustrated in Figure 14a–d, the baseline satellite map was subjected to the following transformations:

Low-Res Interpolation: The reference map is downsampled by a factor of 0.25 and subsequently upsampled to its original resolution using linear interpolation, followed by a minor Gaussian blur to remove mosaic artifacts. This simulates the reliance on outdated or low-resolution satellite databases.
Severe Blur: A large Gaussian blur kernel ( $31 \times 31$ , $σ = 15$ ) is applied to simulate severe optical defocus or thick atmospheric interference (e.g., fog or clouds) that obscures ground details.
Gaussian Noise: Heavy additive Gaussian noise is injected into the image to simulate sensor thermal noise or signal corruption under suboptimal environmental conditions.

As quantitatively reported in Figure 14e, while the baseline map yields an average mean error of approximately 5.71 m, the introduction of map degradation inevitably increases the localization error. However, the system avoids total tracking divergence. Under the Low-Res Interpolation scheme, the mean error increases to 6.43 m. Under Severe Blur and Gaussian Noise—which obscure the high-frequency textural features typically required for cross-view matching—the mean errors are bounded at 9.24 m and 13.75 m, respectively.

The relatively short error bars in Figure 14e indicate that the particle filter maintains statistical stability, preventing random trajectory divergence under high noise. The mechanism underlying this bounded error increase is elucidated in Figure 14f. As map quality deteriorates, the observation likelihoods generated by the DGAG network become flatter, making it challenging for particles to consistently form high-confidence clusters. Consequently, the average number of successfully clustered frames drops from 73.3 (Baseline) to 65.3 (Gaussian Noise). Nevertheless, by leveraging the temporal continuity of the visual motion propagation model, the CPFL framework bridges these observation gaps, maintaining continuous localization when map priors are degraded.

4.7.2. Parameter Sensitivity of the Observation Likelihood

The formulation of the observation likelihood incorporates an exponent parameter

γ

, which dictates the sharpness of the particle weight distribution. To evaluate the sensitivity of this parameter, the framework was tested comprehensively across

γ

values ranging from 0 to 10. It is worth noting that extremely low values (

γ \leq 2

) cause the likelihood weights to become excessively flat, making it impossible for particles to form effective clusters and leading to immediate tracking divergence. Consequently, these invalid tracking cases are omitted from the graphical visualization to focus on the operable range.

As shown in Figure 15a, the system maintains stable localization across a relatively wide operable range (

γ \in [4, 10]

), indicating that the localization performance is not rigidly dependent on a singular parameter choice. When

γ

is set to moderately low values (e.g.,

γ = 3

or 4), the likelihood weights remain overly smooth; while this preserves particle diversity, it slows down the filter’s convergence, leading to an absolute mean error of 8.66 m at

γ = 3

. Conversely, excessively large values tend to over-sharpen the likelihood curve, increasing the risk of particle deprivation due to minor observation noise. For the current experimental scenario, the baseline configuration adopts

γ = 7

as it provides a practical balance between localization accuracy and filter convergence speed.

4.7.3. Tolerance to Scale Factor and Altitude Perturbations

While the visual motion propagation model assumes a constant flight altitude, practical UAV operations can encounter temporary altitude fluctuations due to aerodynamic disturbances or control latency. These altitude variations alter the Ground Sample Distance (GSD), resulting in a scaling mismatch between the online UAV view and the pre-existing satellite map. To quantify the boundaries of the framework against these geometric perturbations, systematic scale errors spanning from

- 10 %

to

+ 10 %

were introduced into the visual projection model.

The empirical results in Figure 15b indicate that the absolute positioning error remains within a bounded range under these variations. Specifically, despite a continuous

\pm 10 %

scale mismatch, the absolute mean error increases from 5.71 m (at perfect scale) to a maximum of 7.50 m. This level of tolerance indicates that during nominal constant-altitude missions, the closed-loop particle filtering mechanism can compensate for scale factor inaccuracies. By leveraging cross-view observations to provide absolute positional bounding, the framework mitigates the systematic drift induced by temporary altitude deviations.

4.7.4. Summary of Sensitivity and Boundary Analysis

This section evaluated the sensitivity of the CPFL framework to non-ideal conditions, specifically reference map degradation, observation likelihood parameter variations, and scale factor mismatches. The experimental results indicate that the localization error remains within a bounded range across the tested conditions. These findings clarify the operational boundaries of the proposed framework and demonstrate that the system maintains localization performance in sub-optimal real-world environments where map quality, model hyperparameters, and altitude-induced scale perturbations vary.

5. Conclusions and Future Work

This paper presented a cross-view particle filter localization (CPFL) framework for continuous UAV tracking in outdoor GNSS-denied environments using pre-existing reference maps. By fusing homography-based visual motion propagation with a Dual-Granularity Adaptive Gating (DGAG) cross-view network, the framework bounds the cumulative drift inherent in visual dead reckoning through a probabilistic filtering mechanism. Evaluations on the real-world MAFS dataset demonstrated that the system outperforms mainstream discrete and IMU-dependent baselines in global accuracy and trajectory continuity. Further empirical analyses validated the framework’s operational resilience: it autonomously recovers from sudden spatial disturbances simulating wind gusts, maintains bounded errors under non-ideal conditions including map degradation and localized scale perturbations, and is successfully deployed on embedded edge devices.

Despite the documented tracking capabilities, several geometric and environmental limitations remain. The motion estimation model relies on a planar scene assumption and derives global translation and yaw parameters from homography coefficients. This approximation introduces vulnerability to perspective distortions in complex environments characterized by significant vertical relief, such as urban areas with skyscrapers or rugged natural terrains. Furthermore, while sensitivity analyses demonstrated that the particle filter can absorb temporary scale factor inaccuracies during nominal constant-altitude flights, the framework does not currently accommodate unconstrained, full three-dimensional altitude variations. The evaluation was also bounded by the environmental diversity of the experimental datasets, meaning that claims regarding global generalization must be quantified under broader deployment conditions.

Future research will address these geometric constraints by extending the framework toward full 6-DoF state estimation. To relax the planar and constant-altitude assumptions, future iterations will explore the integration of absolute vertical references, such as barometric altimeters or online depth estimation networks, into the observation likelihood model. This coupling aims to dynamically adjust the Ground Sample Distance (GSD) references and resolve complex perspective distortions. Finally, validating the framework across more diverse datasets encompassing severe illumination shifts, varied weather conditions, and diverse structural topologies constitutes a necessary step to comprehensively map its operational boundaries and generalization limits.

Author Contributions

Conceptualization, C.S. and J.Y.; methodology, C.S.; software, C.S. and J.Y.; validation, C.S., J.Y., Z.L. and W.X.; formal analysis, C.S.; investigation, C.S. and J.Y.; resources, E.Z.; data curation, C.S., Z.L. and W.X.; writing—original draft preparation, C.S. and J.Y.; writing—review and editing, C.S., W.X., J.H. and E.Z.; visualization, C.S. and Z.L.; supervision, E.Z.; project administration, C.S., J.H. and E.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The MAFS dataset used in this study is publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Shakhatreh, H.; Sawalmeh, A.H.; Al-Fuqaha, A.; Dou, Z.; Almaita, E.; Khalil, I.; Othman, N.S.; Khreishah, A.; Guizani, M. Unmanned aerial vehicles (UAVs): A survey on civil applications and key research challenges. IEEE Access 2019, 7, 48572–48634. [Google Scholar] [CrossRef]
Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [Google Scholar] [CrossRef]
Zhou, X.; Zhang, X.; Yang, X.; Zhao, J.; Liu, Z.; Shuang, F. Towards UAV Localization in GNSS-Denied Environments: The SatLoc Dataset and a Hierarchical Adaptive Fusion Framework. Remote Sens. 2025, 17, 3048. [Google Scholar] [CrossRef]
Partsinevelos, P.; Chatziparaschis, D.; Trigkakis, D.; Tripolitsiotis, A. A novel UAV-assisted positioning system for GNSS-denied environments. Remote Sens. 2020, 12, 1080. [Google Scholar] [CrossRef]
Lin, T.Y.; Cui, Y.; Belongie, S.; Hays, J. Learning deep representations for ground-to-aerial geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2015; pp. 5007–5015. [Google Scholar]
Durgam, A.; Paheding, S.; Dhiman, V.; Devabhaktuni, V. Cross-view geo-localization: A survey. IEEE Access 2024, 12, 192028–192050. [Google Scholar] [CrossRef]
Qin, T.; Li, P.; Shen, S. Vins-mono: A robust and versatile monocular visual-inertial state estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
Yuan, J.; Dai, M.; Zheng, E.; Su, C.; Chen, N.; Hu, Q.; Zhu, S.; Cao, Y. SWA-PF: Semantic-Weighted Adaptive Particle Filter for Memory-Efficient 4-DoF UAV Localization in GNSS-Denied Environments. arXiv 2025, arXiv:2509.13795. [Google Scholar]
Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM international Conference on Multimedia; ACM: New York, NY, USA, 2020; pp. 1395–1403. [Google Scholar]
Workman, S.; Souvenir, R.; Jacobs, N. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2015; pp. 3961–3969. [Google Scholar]
Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 5297–5307. [Google Scholar]
Dai, M.; Zheng, E.; Feng, Z.; Qi, L.; Zhuang, J.; Yang, W. Vision-based UAV self-positioning in low-altitude urban environments. IEEE Trans. Image Process. 2023, 33, 493–508. [Google Scholar] [CrossRef]
Li, J.; Sun, Y.; Xiang, Y.; Lei, L. One-to-Many Retrieval Between UAV Images and Satellite Images for UAV Self-Localization in Real-World Scenarios. Remote Sens. 2025, 17, 3045. [Google Scholar] [CrossRef]
Shugaev, M.; Semenov, I.; Ashley, K.; Klaczynski, M.; Cuntoor, N.; Lee, M.W.; Jacobs, N. Arcgeo: Localizing limited field-of-view images using cross-view matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: New York, NY, USA, 2024; pp. 209–218. [Google Scholar]
Fan, J.; Zheng, E.; He, Y.; Yang, J. A cross-view geo-localization algorithm using UAV image and satellite image. Sensors 2024, 24, 3719. [Google Scholar] [CrossRef]
Li, H.; Yang, W.; Xu, F.; Tan, H.; Zhang, H.; Li, S.; Xia, G.S. Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception. arXiv 2026, arXiv:2604.01747. [Google Scholar]
Van Dalen, G.J.; Magree, D.P.; Johnson, E.N. Absolute localization using image alignment and particle filtering. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, San Diego, CA, USA, 4–8 January 2016; p. 0647. [Google Scholar]
Patel, B. Visual Localization for UAVs in Outdoor GPS-Denied Environments; University of Toronto: Toronto, ON, Canada, 2019. [Google Scholar]
Miller, I.D.; Cowley, A.; Konkimalla, R.; Shivakumar, S.S.; Nguyen, T.; Smith, T.; Taylor, C.J.; Kumar, V. Any way you look at it: Semantic crossview localization and mapping with lidar. IEEE Robot. Autom. Lett. 2021, 6, 2397–2404. [Google Scholar] [CrossRef]
Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the European conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
Wang, T.; Zheng, Z.; Yan, C.; Zhang, J.; Sun, Y.; Zheng, B.; Yang, Y. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 867–879. [Google Scholar] [CrossRef]
Shi, Y.; Liu, L.; Yu, X.; Li, H. Spatial-aware feature aggregation for image based cross-view geo-localization. Adv. Neural Inf. Process. Syst. 2019, 32, 10090–10100. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Zhu, S.; Shah, M.; Chen, C. Transgeo: Transformer is all you need for cross-view image geo-localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 1162–1171. [Google Scholar]
Dai, M.; Hu, J.; Zhuang, J.; Zheng, E. A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 4376–4389. [Google Scholar] [CrossRef]
Wang, X.; Zheng, L. UAV Cross-View Geo-Localization Based on Multi-Scale Partitioning and Attention-Enhanced Transformer. In Proceedings of the 2025 International Conference on Signal Processing, Computer Networks and Communications (SPCNC); IEEE: New York, NY, USA, 2025; pp. 512–516. [Google Scholar]
Prutyanov, V.V.; Korolev, N.L.; Romanov, A.Y. UAV Visual Localization System Empowered by Zero-Shot Deep Feature Matching. In Proceedings of the 2025 International Russian Automation Conference (RusAutoCon); IEEE: New York, NY, USA, 2025; pp. 350–354. [Google Scholar]
Nguyen, T.; Chen, S.W.; Shivakumar, S.S.; Taylor, C.J.; Kumar, V. Unsupervised deep homography: A fast and robust homography estimation model. IEEE Robot. Autom. Lett. 2018, 3, 2346–2353. [Google Scholar] [CrossRef]
Amidi, O.; Kanade, T.; Fujita, K. A visual odometer for autonomous helicopter flight. Robot. Auton. Syst. 1999, 28, 185–193. [Google Scholar] [CrossRef]
Caballero, F.; Merino, L.; Ferruz, J.; Ollero, A. Vision-based odometry and SLAM for medium and high altitude flying UAVs. J. Intell. Robot. Syst. 2009, 54, 137–161. [Google Scholar] [CrossRef]
Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 International Conference on Computer Vision; IEEE: New York, NY, USA, 2011; pp. 2564–2571. [Google Scholar]
Ma, J.; Jiang, X.; Fan, A.; Jiang, J.; Yan, J. Image matching from handcrafted to deep features: A survey. Int. J. Comput. Vis. 2021, 129, 23–79. [Google Scholar] [CrossRef]
DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops; IEEE: New York, NY, USA, 2018; pp. 224–236. [Google Scholar]
Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2021; pp. 8922–8931. [Google Scholar]
Xiao, J.; Loianno, G. Uasthn: Uncertainty-aware deep homography estimation for uav satellite-thermal geo-localization. In Proceedings of the 2025 IEEE International Conference on Robotics and Automation (ICRA); IEEE: New York, NY, USA, 2025; pp. 14066–14072. [Google Scholar]
Hermans, A.; Beyer, L.; Leibe, B. In defense of the triplet loss for person re-identification. arXiv 2017, arXiv:1703.07737. [Google Scholar] [CrossRef]
Radenović, F.; Tolias, G.; Chum, O. Fine-tuning CNN image retrieval with no human annotation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 1655–1668. [Google Scholar] [CrossRef]
Arulampalam, M.S.; Maskell, S.; Gordon, N.; Clapp, T. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 2002, 50, 174–188. [Google Scholar] [CrossRef]
Muja, M.; Lowe, D.G. Fast approximate nearest neighbors with automatic algorithm configuration. In Proceedings of the International Conference on Computer Vision Theory and Applications, Scitepress, Lisboa, Portugal, 5–8 February 2009; Volume 1, pp. 331–340. [Google Scholar]
Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1623–1637. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Overview of the Cross-view Particle Filter Localization (CPFL) framework. This system utilizes a real-time Unmanned Aerial Vehicle (UAV) image stream and a pre-stored satellite map to achieve reliable continuous localization. The Core Modules consist of three parts: (1) Motion Inference Path: By computing homography mapping based on inter-frame image features, inter-frame displacement is transformed into a relative motion prior within the global coordinate system; (2) Observation Path: Utilizing the cross-view feature extraction network with a Dual-Granularity Adaptive Gating (DGAG) head, it measures the similarity between the UAV view and candidate satellite patches to generate absolute position observation weights; (3) Particle Filtering & Resampling: Within a probabilistic framework, it fuses relative and absolute information, correcting the cumulative drift of vision-based displacement estimation through iterative resampling, and outputs the globally consistent UAV planar pose (x, y,

θ

). In this diagram, blue arrows indicate the direction of data flow, and red dots represent individual particles evaluated in the state space.

Figure 1. Overview of the Cross-view Particle Filter Localization (CPFL) framework. This system utilizes a real-time Unmanned Aerial Vehicle (UAV) image stream and a pre-stored satellite map to achieve reliable continuous localization. The Core Modules consist of three parts: (1) Motion Inference Path: By computing homography mapping based on inter-frame image features, inter-frame displacement is transformed into a relative motion prior within the global coordinate system; (2) Observation Path: Utilizing the cross-view feature extraction network with a Dual-Granularity Adaptive Gating (DGAG) head, it measures the similarity between the UAV view and candidate satellite patches to generate absolute position observation weights; (3) Particle Filtering & Resampling: Within a probabilistic framework, it fuses relative and absolute information, correcting the cumulative drift of vision-based displacement estimation through iterative resampling, and outputs the globally consistent UAV planar pose (x, y,

θ

). In this diagram, blue arrows indicate the direction of data flow, and red dots represent individual particles evaluated in the state space.

Figure 2. Detailed architecture of the cross-view feature extraction network. The network employs a weight-sharing dual-branch Vision Transformer (ViT) to extract base features. To overcome the perceptual aliasing caused by drastic viewpoint transformations, this paper designs a Dual-Granularity Adaptive Gating (DGAG) head: it extracts local high-frequency details (Local Detail Stream) via GeM pooling, and utilizes a gating network to dynamically compute its fusion weight with the global macroscopic semantics ([CLS] Token), ultimately outputting discriminative composite features for similarity measurement and resampling.

Figure 3. Schematic diagram of the Trajectory Continuity Index (TCI@d). Because TCI adopts a quadratic weighting mechanism

(l_{k}^{2})

based on the length of continuous successful segments, it assigns an high weight to long-term uninterrupted localization. Consequently, the TCI score of Trajectory A (1.0) is significantly higher than that of Trajectory B (0.123), whose successful segments are fragmented due to frequent target loss. From a temporal dimension, this metric explicitly penalizes discrete jumps and unstable state estimations, enabling a more accurate measurement of the overall robustness of the visual system in long-endurance tasks. Here, `1’ and `0’ represent localization success (

e_{t} \leq d

) and failure (

e_{t} > d

) at each time step, respectively, while

l_{k}

denotes the length of the k-th continuous successful segment.

Figure 3. Schematic diagram of the Trajectory Continuity Index (TCI@d). Because TCI adopts a quadratic weighting mechanism

(l_{k}^{2})

based on the length of continuous successful segments, it assigns an high weight to long-term uninterrupted localization. Consequently, the TCI score of Trajectory A (1.0) is significantly higher than that of Trajectory B (0.123), whose successful segments are fragmented due to frequent target loss. From a temporal dimension, this metric explicitly penalizes discrete jumps and unstable state estimations, enabling a more accurate measurement of the overall robustness of the visual system in long-endurance tasks. Here, `1’ and `0’ represent localization success (

e_{t} \leq d

) and failure (

e_{t} > d

) at each time step, respectively, while

l_{k}

denotes the length of the k-th continuous successful segment.

Figure 4. Comparison of the Cumulative Error Distribution (CED) of four different localization methods on the MAFS dataset. The solid red line represents our method, which achieves the fastest convergence and highest localization success rate at tight error thresholds.

Figure 5. Comparison of error probability density distributions among feature extraction strategies (Ridgeline Plot). Different colors are used to visually distinguish the corresponding methods explicitly labeled on the left axis. Our DGAG strategy exhibits the sharpest error distribution concentrated at the lowest values.

Figure 6. Frame-by-frame localization error heatmap during continuous flight trajectories. The blue box highlights the performance of our proposed method. Our method maintains stable state estimation across the entire sequence without severe local fluctuations.

Figure 7. Comparison of trajectory reconstruction under long-endurance flight. (a) Estimated trajectories overlaid on the satellite orthophoto. (b) Global trajectory comparison on a white background. The cyan and gray dashed boxes denote the local magnified regions in (a) and (b), respectively. The zoomed-in boxes highlight the cumulative drift of the open-loop Visual Dead Reckoning baseline and the high global consistency of our proposed closed-loop method.

Figure 8. Quantitative analysis of absolute position error. (a) Error evolution curves over the trajectory sequence. (b) Full-cycle error distribution statistics. The results demonstrate that our closed-loop framework (Ours) transforms the unbounded cumulative drift of the open-loop system (Visual Dead Reckoning baseline) into a stable, bounded error range.

Figure 9. Accuracy-latency trade-off under different state-dependent particle decay strategies. The

N = 50

configuration serves as an optimal trade-off point, reducing inference latency without triggering filter divergence.

Figure 9. Accuracy-latency trade-off under different state-dependent particle decay strategies. The

N = 50

configuration serves as an optimal trade-off point, reducing inference latency without triggering filter divergence.

Figure 10. Tracking stability comparison between the conservative baseline (

N = 550

) and the proposed tracking (

N = 50

) over time. Despite occasional transient spikes in feature-poor regions, the

N = 50

setup rapidly recovers, demonstrating system resilience.

Figure 10. Tracking stability comparison between the conservative baseline (

N = 550

) and the proposed tracking (

N = 50

) over time. Despite occasional transient spikes in feature-poor regions, the

N = 50

setup rapidly recovers, demonstrating system resilience.

Figure 11. Temporal evolution of absolute position error under 50 m gust disturbances. The standard CPFL framework autonomously recovers from significant spatial offsets without external recovery mechanisms.

Figure 12. Error evolution under an extreme 90 m gust. The adaptive noise amplification strategy (blue curve) accelerates the re-convergence compared to the standard baseline (red curve) and pulls the system back to the true pose.

Figure 13. The variation of the maximum similarity score (

S_{m a x}

) during the 90 m gust experiment. In the plot, the blue line with circle markers represents the normal tracking state, while the red line with cross markers indicates the tracking loss state. The green dashed line and the faint green horizontal shadow denote the predefined detection threshold (

s i m = 0.25

) and the resulting loss zone, respectively. The light red vertical shaded area highlights the period when the adaptive noise amplification mechanism is actively engaged. A sharp drop below the threshold accurately triggers the adaptive noise amplification mechanism.

Figure 13. The variation of the maximum similarity score (

S_{m a x}

) during the 90 m gust experiment. In the plot, the blue line with circle markers represents the normal tracking state, while the red line with cross markers indicates the tracking loss state. The green dashed line and the faint green horizontal shadow denote the predefined detection threshold (

s i m = 0.25

) and the resulting loss zone, respectively. The light red vertical shaded area highlights the period when the adaptive noise amplification mechanism is actively engaged. A sharp drop below the threshold accurately triggers the adaptive noise amplification mechanism.

Figure 14. Performance evaluation of the CPFL framework under various map degradation conditions. (a–d) Visualized regions of interest (ROIs) for the baseline map and the three degradation schemes. (e) Absolute mean positioning error and RMSE across different map conditions, with error bars indicating the standard deviation over three independent trials. (f) The ratio of effectively clustered frames to the total sequence (78 frames). In the bar charts, different colors distinguish the evaluated metrics in (e), while the coral bars in (f) represent the clustering success rate. The gray dashed line indicates the 100% theoretical maximum reference level.

Figure 15. Sensitivity and robustness analysis of system parameters. (a) Mean localization error across different observation likelihood exponent (

γ

) values. The broad stable range indicates the framework’s robustness against parameter variations. Note that

γ \leq 2

cases are omitted due to total tracking divergence caused by overly flat likelihoods. (b) System tolerance against continuous scale factor mismatches (

- 10 %

to

+ 10 %

), demonstrating the particle filter’s capability to compensate for scale inaccuracies caused by altitude perturbations during constant-altitude flights. In the plots, the blue line with circle markers and the orange line with square markers represent the mean error trends for their respective experiments. The vertical error bars indicate the standard deviation across multiple independent trials, and the red dashed lines denote the baseline configurations.

Figure 15. Sensitivity and robustness analysis of system parameters. (a) Mean localization error across different observation likelihood exponent (

γ

) values. The broad stable range indicates the framework’s robustness against parameter variations. Note that

γ \leq 2

cases are omitted due to total tracking divergence caused by overly flat likelihoods. (b) System tolerance against continuous scale factor mismatches (

- 10 %

to

+ 10 %

), demonstrating the particle filter’s capability to compensate for scale inaccuracies caused by altitude perturbations during constant-altitude flights. In the plots, the blue line with circle markers and the orange line with square markers represent the mean error trends for their respective experiments. The vertical error bars indicate the standard deviation across multiple independent trials, and the red dashed lines denote the baseline configurations.

Table 1. Performance comparison of different localization methods on the MAFS dataset.

Method	Type	Sensors	Mean Error (m) ↓	RMSE (m) ↓	Success Rate% (10 m) ↑	TCI@5 ↑	TCI@10 ↑	TCI@20 ↑
SSPT	Regression	Vision Only	42.08	62.26	41.25%	0.006	0.035	0.104
DenseUAV	Retrieval	Vision Only	176.15	260.54	5.06%	0.0001	0.001	0.007
SWA-PF	Sequential	Vision + IMU	5.76 ± 0.35	6.60 ± 0.40	80.77%	0.061	0.319	0.924
Ours	Sequential	Vision Only	5.28 ± 0.38	5.92 ± 0.44	89.74%	0.033	0.329	0.906

Note: Particle filter methods (SWA-PF and Ours) report metrics as Mean ± Standard Deviation across 10 independent runs. DenseUAV and SSPT are deterministic algorithms without random sampling variance, hence their results are reported as single values. The arrows ↓ and ↑ indicate that lower and higher values are better, respectively. Bold text indicates the best performance in each metric.

Table 2. Comparison of localization performance with different head architectures.

Head	Mean Error (m) ↓	RMSE (m) ↓	Success Rate%(10 m) ↑	TCI@5 ↑	TCI@10 ↑	TCI@20 ↑
Global (baseline)	6.43 ± 0.59	7.44 ± 0.76	76.92%	0.038	0.192	0.900
MaxPool	8.58 ± 1.54	10.26 ± 1.85	62.82%	0.011	0.082	0.761
AvgPool	9.21 ± 2.03	11.99 ± 2.64	76.92%	0.082	0.217	0.696
AvgMaxPool	5.76 ± 0.40	6.71 ± 0.47	83.33%	0.031	0.277	0.888
GemPool	6.84 ± 1.63	7.72 ± 1.82	73.08%	0.024	0.182	0.876
LPN (block = 2)	10.25 ± 2.63	11.57 ± 3.02	47.44%	0.002	0.0738	0.439
FSRA (block = 2)	5.28 ± 0.41	6.20 ± 0.54	85.90%	0.046	0.306	0.937
DGAG (ours)	5.28 ± 0.38	5.92 ± 0.44	89.74%	0.033	0.329	0.906

Note: The arrows ↓ and ↑ indicate that lower and higher values are better, respectively. Bold text indicates the best performance in each metric.

Table 3. Edge performance and accuracy under different state-dependent particle decay strategies on the Jetson NX platform.

Decay Strategy	Init Particles	Tracking Particles	Mean Error (m) ↓	RMSE (m) ↓	Tracking Latency (s/Frame) ↓
Conservative (Baseline)	1000	550	6.70	7.29	∼3.58
Moderate	1000	200	7.45	8.18	∼1.74
Aggressive	1000	50	8.94	9.34	∼0.95
Divergent	1000	20	28.40	30.56	∼0.80

Note: The arrow ↓ indicates that lower values are better. Bold text highlights the optimal trade-off configuration.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Su, C.; Yuan, J.; Zheng, E.; Xu, W.; Liu, Z.; Hu, J. CPFL: Resilient Continuous UAV Localization via Cross-View Perception and Particle Filtering. Drones 2026, 10, 437. https://doi.org/10.3390/drones10060437

AMA Style

Su C, Yuan J, Zheng E, Xu W, Liu Z, Hu J. CPFL: Resilient Continuous UAV Localization via Cross-View Perception and Particle Filtering. Drones. 2026; 10(6):437. https://doi.org/10.3390/drones10060437

Chicago/Turabian Style

Su, Chao, Jiayu Yuan, Enhui Zheng, Wangpin Xu, Zhanghua Liu, and Jianhong Hu. 2026. "CPFL: Resilient Continuous UAV Localization via Cross-View Perception and Particle Filtering" Drones 10, no. 6: 437. https://doi.org/10.3390/drones10060437

APA Style

Su, C., Yuan, J., Zheng, E., Xu, W., Liu, Z., & Hu, J. (2026). CPFL: Resilient Continuous UAV Localization via Cross-View Perception and Particle Filtering. Drones, 10(6), 437. https://doi.org/10.3390/drones10060437

Article Menu

CPFL: Resilient Continuous UAV Localization via Cross-View Perception and Particle Filtering

Highlights

Abstract

1. Introduction

2. Related Work

2.1. UAV Cross-View Geo-Localization Methods

2.1.1. Image Retrieval

2.1.2. FPI

2.1.3. Particle Filter

2.2. Feature Representation Learning Methods in Cross-View Tasks

2.3. Motion Estimation Based on Inter-Frame Visual Displacement

3. Method

3.1. Cross-View Feature Extraction Network

3.1.1. Baseline Model: Dual-Branch Siamese ViT Architecture

3.1.2. Dual-Granularity Adaptive Gating (DGAG) Head Design

3.1.3. Feature Similarity Calculation

3.2. Particle Filter Localization Framework

3.2.1. Particle Set Construction

3.2.2. State Prediction

3.2.3. Cross-View Coordinate Transformation and Scale Mapping

3.2.4. Observation Update

3.2.5. Resampling and State Estimation

3.3. Trajectory Continuity Index (TCI@d)

4. Experiments

4.1. Dataset

4.2. Main Localization Performance Comparison

4.2.1. Experimental Setup and Baseline Implementation

4.2.2. Quantitative Performance Analysis

4.2.3. Cumulative Error Distribution (CED) Analysis

4.2.4. Continuity and Smoothness Trade-Off Analysis

4.3. Comparative Analysis of Feature Extraction Strategies

4.3.1. Quantitative Performance Evaluation

4.3.2. Error Distribution and Temporal Stability Analysis

4.3.3. Summary of Feature Extraction Strategies

4.4. Drift Elimination and Global Consistency Verification

4.4.1. Experimental Setup and Comparison Schemes

4.4.2. Qualitative Analysis of Trajectory Consistency

4.4.3. Quantitative Error Evolution

4.4.4. Summary of Drift Elimination

4.5. Real-Time Performance on Edge Platforms

4.5.1. Edge Hardware Setup and System Optimization

4.5.2. State-Dependent Particle Decay and Accuracy-Latency Trade-Off

4.5.3. Tracking Stability and Resilience Analysis

4.5.4. Summary of Real-Time Edge Performance

4.6. Robustness Analysis Under Sudden Wind Gust Disturbances

4.6.1. Intrinsic Fault Tolerance (50 m Gust Disturbance)

4.6.2. Extreme Disturbance and Adaptive Noise Amplification (90 m Gust)

4.6.3. Summary of Disturbance Robustness

4.7. System Sensitivity and Boundary Condition Evaluation

4.7.1. Robustness Against Map Degradation

4.7.2. Parameter Sensitivity of the Observation Likelihood

4.7.3. Tolerance to Scale Factor and Altitude Perturbations

4.7.4. Summary of Sensitivity and Boundary Analysis

5. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI