1. Introduction
Global localization is a prerequisite for long-horizon GNSS-denied UAV autonomy. Applications such as environmental monitoring, search-and-rescue, and autonomous delivery increasingly require operation in environments where GNSS signals may be unavailable, jammed, or spoofed. Visual-Inertial Odometry (VIO) provides locally accurate relative motion but drifts without bound, making it insufficient for missions requiring global position awareness.
A natural correction source is aerial-to-satellite Visual Place Recognition (VPR), matching onboard views to geo-referenced satellite tiles. However, as illustrated in Figure 1, the cross-view domain gap between low-altitude UAV imagery and nadir satellite maps produces frequent high-similarity false matches at geographically distant locations—even human observers find it difficult to identify correct correspondences. Per-frame VPR anchoring is therefore unreliable.
NaviLoc addresses this failure mode by estimating a trajectory-level solution rather than committing to individual matches. The method is training-free: it uses off-the-shelf pretrained image descriptors without domain-specific fine-tuning. NaviLoc operates in three stages: Stage 1 (Global Align) searches for a single global SE(2) transform whose implied trajectory yields consistently high local retrieval scores; Stage 2 (Refinement) refines the aligned trajectory through bounded weighted Procrustes updates on overlapping windows; and Stage 3 (Smoothing) computes a closed-form MAP estimate that fuses VIO displacements with VPR anchors while suppressing low-confidence anchors (see Figure 2).
Public benchmarks for this setting remain limited. We were unable to find an open-source dataset that jointly provides low-altitude (50–150 m) UAV imagery, synchronized VIO, and a geo-referenced satellite tile database suitable for trajectory-level evaluation. To study this practically relevant regime, we evaluate NaviLoc on a prepared real-world UAV-to-satellite dataset with VIO and curated satellite tiles.
Contributions. Our main contributions are:
- A novel, training-free, lightweight three-stage trajectory-level localization method with explicit objectives and closed-form solutions, deployable on embedded hardware.
- An evaluation on low-altitude (50–150 m) UAV imagery showing 19.5 m Mean Localization Error and a 16.0× improvement over AnyLoc-VLAD [1].
- An ablation demonstrating robustness to hyperparameter variations, and an empirical study showing that distilled ViT descriptors are more effective for trajectory-level alignment than larger foundation models on our benchmark.
- An embedded implementation achieving 9 FPS end-to-end inference on a Raspberry Pi 5.
2. Related Work
Visual Place Recognition. VPR evolved from handcrafted descriptors [2] through bag-of-words [3,4] to learned representations. NetVLAD [5] introduced end-to-end learning for place recognition. AnyLoc [1] aggregates foundation model features using VLAD or GeM pooling, achieving state-of-the-art results on diverse benchmarks. MixVPR [6] proposes feature mixing for compact descriptors. These methods produce per-image descriptors without trajectory-level consistency constraints.
Cross-Domain and Aerial Localization. The aerial-to-satellite domain gap presents challenges distinct from ground-level VPR. CVM-Net [7] learns cross-view representations. VIGOR [8] and CVUSA [9] established benchmarks for ground-to-aerial matching. Recent UAV-specific methods [10,11] address the UAV-to-satellite gap. Our approach is complementary: we accept that individual matches are unreliable and leverage trajectory-level statistics.
UAV-to-Satellite Geo-Localization. Several recent works study cross-view matching between UAV imagery and satellite maps in the remote sensing literature [11,12,13,14]. These methods improve per-frame matching but still face perceptual aliasing under viewpoint and appearance changes; NaviLoc instead aggregates evidence across time using an explicit trajectory-level objective. For broader context on absolute visual localization pipelines and design choices, we refer to a recent survey in Drones [15]. Recent work on neural network-based recognition [16] and air safety information management [17] provides a conceptual bridge between learning-based perception methods and system-level safety considerations; our trajectory-level, training-free approach complements these by enabling robust localization without domain-specific training.
Point Set Registration. The Iterative Closest Point (ICP) algorithm [18,19] alternates between correspondence assignment and transform estimation for point cloud alignment. Procrustes analysis [20] provides closed-form solutions for rigid alignment given correspondences. Extensions include weighted variants [21,22] and robust formulations. Our Stage 1 adapts ICP-style alternating optimization to the VPR similarity objective with provable monotone improvement over evaluated candidates.
Robust Estimation. Huber’s M-estimators [23] downweight outliers in regression. Pose graph optimization [24,25] fuses odometry and visual constraints. Switchable constraints [26] and graduated non-convexity [27] handle outliers in SLAM. Our Stage 3 performs z-score outlier detection and clamps detected outliers to the VIO prior (by setting the corresponding anchor weight $w_i = 0$), yielding a closed-form strictly convex solve without iterative robust optimization.
Visual-Inertial Odometry. VIO methods—including EKF-based VIO [28], OKVIS [29], VINS-Mono [30], and ORB-SLAM3 [31]—provide accurate relative motion estimation but accumulate unbounded drift over long trajectories. These methods solve a fundamentally different problem than NaviLoc: VIO estimates how the platform moved (relative poses), while NaviLoc estimates where the platform is in the world (global positions). NaviLoc is designed to operate on top of any VIO system, using its relative motion output as a prior while correcting global drift via satellite imagery matching.
3. Method
This section presents the NaviLoc algorithm in three stages, illustrated in Figure 2. We first formalize the problem (Section 3.1) and then describe each stage: Global Align (Section 3.2) finds an initial SE(2) transform via robust trajectory-level optimization; Refinement (Section 3.3) applies windowed geometric corrections; and Smoothing (Section 3.4) fuses VIO constraints with VPR anchors while rejecting outliers. Each stage has a well-defined objective function with a closed-form or provably convergent solution.
3.1. Problem Formulation
Given $N$ aerial query images with descriptors $\{q_i\}_{i=1}^{N}$ and VIO-derived positions $\{v_i\}_{i=1}^{N} \subset \mathbb{R}^2$ in a local frame, along with VIO displacements $\delta_i = v_{i+1} - v_i$, and a geo-referenced satellite map with $M$ reference tiles having descriptors $\{f_j\}_{j=1}^{M}$ and coordinates $\{c_j\}_{j=1}^{M} \subset \mathbb{R}^2$, we seek global positions $\{x_i\}_{i=1}^{N} \subset \mathbb{R}^2$.
3.2. Stage 1: Global Align
We estimate a global SE(2) transform $(\theta^{*}, t^{*})$ by maximizing the trajectory-level similarity objective:

$$J(\theta, t) = \frac{1}{N} \sum_{i=1}^{N} \max_{j \in \mathcal{N}_r(R(\theta)\,v_i + t)} \langle q_i, f_j \rangle, \qquad (1)$$

where $R(\theta)$ is the 2D rotation matrix, $\mathcal{N}_r(x)$ is the set of reference tiles within radius $r$ of $x$, and $\langle \cdot, \cdot \rangle$ denotes cosine similarity.
Algorithm and intuition. Stage 1 treats VPR as a noisy correspondence generator: for a fixed rotation $\theta$, each frame $i$ proposes a translation “measurement” $t_i(\theta) = c_{j^{*}(i)} - R(\theta)\,v_i$ based on the global top-1 match $j^{*}(i) = \arg\max_j \langle q_i, f_j \rangle$. Under perceptual aliasing, many $t_i(\theta)$ are outliers; we therefore aggregate $\{t_i(\theta)\}_{i=1}^{N}$ with a robust location estimator (the component-wise median). The key contribution is casting these noisy retrieval outcomes into a trajectory-level objective. We use coordinate ascent: (1) scan a grid of $K$ rotation angles $\theta_k = 2\pi k / K$, $k = 0, \ldots, K-1$; (2) for each $\theta_k$, compute the L1-optimal translation (component-wise median) [23]:

$$\hat{t}(\theta_k) = \operatorname{median}_{1 \le i \le N} \bigl( c_{j^{*}(i)} - R(\theta_k)\, v_i \bigr), \qquad (2)$$

where $j^{*}(i)$ is the global top-1 match; (3) evaluate $J(\theta_k, \hat{t}(\theta_k))$ and select the best; and (4) refine with alternating maximization using local targets.
We denote the resulting aligned trajectory by $\hat{x}^{(1)}_i = R(\theta^{*})\,v_i + t^{*}$. The following result formalizes why the median aggregator is optimal for robust translation estimation under outlier contamination:
Theorem 1 (L1-Optimal Translation (Median)). Let $r_i = c_{j^{*}(i)} - R(\theta)\,v_i \in \mathbb{R}^2$ be translation residuals for fixed θ. The minimizer of

$$\sum_{i=1}^{N} \lVert r_i - t \rVert_1 \qquad (3)$$

over $t = (t_x, t_y)$ is given by the component-wise median: $\hat{t}_x = \operatorname{median}_i(r_{i,x})$ and $\hat{t}_y = \operatorname{median}_i(r_{i,y})$. Consequently, $\hat{t}$ is robust to outlier residuals: as long as more than half of the residuals are inliers per coordinate, the estimate is controlled by the inlier set rather than the outliers.

Proof. The L1 norm decomposes across coordinates, so the objective separates into two independent 1D problems: $\min_{t_x} \sum_i \lvert r_{i,x} - t_x \rvert$ and $\min_{t_y} \sum_i \lvert r_{i,y} - t_y \rvert$. The minimizer of $\sum_i \lvert r_{i,x} - t_x \rvert$ over $t_x$ is the median of $\{r_{i,x}\}$ [23]. Robustness follows because the median is unaffected by arbitrarily large perturbations to fewer than half of the samples. □
Remark 1 (Monotone acceptance). In our implementation, we only accept candidate updates that improve (or tie) J, so the sequence of evaluated objective values is monotone non-decreasing, bounded above by 1, and hence convergent. This guarantees convergence of the objective sequence, not global optimality (see Algorithm A1).
3.3. Stage 2: Refinement
After global alignment, we refine positions using sliding-window bounded Procrustes. This extends standard weighted Procrustes analysis [20,21] with a novel rotation bound constraint that prevents overcorrection from noisy VPR targets. For each frame index $j$, we compute a local VPR target $y_j$ as the coordinate of the best-matching reference tile within radius $r$ of the current predicted position $p_j$, and record its cosine similarity score $s_j$. For each window $\mathcal{W} = \{j, \ldots, j + W - 1\}$ with targets $\{y_j\}_{j \in \mathcal{W}}$ and weights $w_j = s_j^2$, we solve:

$$\min_{\theta, t \;:\; \lvert \theta \rvert \le \theta_{\max}} \; \sum_{j \in \mathcal{W}} w_j \, \lVert R(\theta)\, p_j + t - y_j \rVert^2. \qquad (4)$$
This weighted Procrustes problem admits a closed-form solution for both rotation and translation. The rotation constraint prevents overcorrection when local VPR targets are noisy:
Theorem 2 (Optimal Bounded Procrustes). For the constrained problem (4) in 2D, let $\bar{p} = \sum_j w_j p_j / \sum_j w_j$ and $\bar{y} = \sum_j w_j y_j / \sum_j w_j$ be the weighted centroids, and let $H = \sum_j w_j (p_j - \bar{p})(y_j - \bar{y})^{\top}$ be the weighted cross-covariance matrix. The optimal rotation angle is $\theta^{*} = \operatorname{clip}(\hat{\theta}, -\theta_{\max}, \theta_{\max})$, where $\hat{\theta} = \operatorname{atan2}(H_{12} - H_{21},\, H_{11} + H_{22})$, and the optimal translation is $t^{*} = \bar{y} - R(\theta^{*})\,\bar{p}$.

Proof. After centering by the weighted centroids, the objective separates into rotation and translation terms. For the rotation, we seek to maximize $\operatorname{tr}(R(\theta) H)$. Writing $R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$, the trace expands to $(H_{11} + H_{22})\cos\theta + (H_{12} - H_{21})\sin\theta$ [20,21]. This sinusoidal function of $\theta$ is maximized at $\hat{\theta} = \operatorname{atan2}(H_{12} - H_{21}, H_{11} + H_{22})$. The constraint $\lvert \theta \rvert \le \theta_{\max}$ clips this to the boundary if $\hat{\theta}$ lies outside. The optimal translation follows as $t^{*} = \bar{y} - R(\theta^{*})\,\bar{p}$. □
Windows overlap (stride $s < W$), so each frame receives corrections from multiple windows; we average these contributions. Multiple passes are performed because each pass updates the trajectory, which changes the local VPR targets (since they depend on the current predicted positions). Empirically, 2–3 passes suffice for convergence (see Algorithm A2).
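A minimal sketch of the closed-form solve of Theorem 2 for a single window follows. The squared-similarity weighting and rotation clipping mirror the description above; the weight normalization and the function name are our own choices.

```python
import numpy as np

def bounded_procrustes(P, Y, scores, theta_max):
    # One window of Stage 2 (Theorem 2). P: (W,2) current positions,
    # Y: (W,2) local VPR targets, scores: (W,) cosine similarities.
    w = scores ** 2                                # squared similarity weights
    w = w / w.sum()
    p_bar, y_bar = w @ P, w @ Y                    # weighted centroids
    Pc, Yc = P - p_bar, Y - y_bar
    H = (Pc * w[:, None]).T @ Yc                   # 2x2 weighted cross-covariance
    theta = np.arctan2(H[0, 1] - H[1, 0], H[0, 0] + H[1, 1])
    theta = np.clip(theta, -theta_max, theta_max)  # rotation bound
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    t = y_bar - R @ p_bar                          # optimal translation
    return P @ R.T + t                             # corrected window positions
```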
3.4. Stage 3: Smoothing
Let $\{a_i\}_{i=1}^{N}$ denote the Stage 2 output. We treat these positions as anchors with weights $w_i \ge 0$ and fuse them with VIO constraints via a standard MAP formulation [32]. Our novel element is the z-score-based outlier detection that automatically clamps unreliable anchors to the VIO prior:

$$\min_{x_1, \ldots, x_N} \; \sum_{i=1}^{N} w_i \lVert x_i - a_i \rVert^2 + \lambda \sum_{i=1}^{N-1} \lVert x_{i+1} - x_i - \delta_i^{g} \rVert^2, \qquad (5)$$

where $\delta_i^{g} = R(\theta^{*})\,\delta_i$ are the VIO displacements rotated into the global frame; in matrix form the second term is $\lambda \lVert D x - \delta^{g} \rVert^2$, where $D \in \mathbb{R}^{(N-1) \times N}$ (with $D_{i,i} = -1$, $D_{i,i+1} = 1$) is the first-difference matrix.
Outlier detection. For each anchor $a_i$, we recompute its local VPR similarity $s_i$ as the best cosine similarity among reference tiles within radius $r$ of $a_i$. We then compute z-scores $z_i = (s_i - \mu_s)/\sigma_s$. Frames with $z_i < -\tau$ are marked as outliers and assigned $w_i = 0$, effectively clamping them to the VIO prior. Inliers use $w_i = 1$. With the anchor weights thus defined, the optimization problem has a unique closed-form solution:
Theorem 3 (Unique Solution). If $w_i > 0$ for at least one $i$, then the objective (5) is strictly convex with unique minimizer given by the linear system:

$$\bigl( W + \lambda D^{\top} D \bigr)\, x = W a + \lambda D^{\top} \delta^{g}, \qquad W = \operatorname{diag}(w_1, \ldots, w_N),$$

solved independently for each coordinate.

Proof. The objective (5) is quadratic in $x$. Setting the gradient to zero yields the normal equations $(W + \lambda D^{\top} D)\, x = W a + \lambda D^{\top} \delta^{g}$. The matrix $D^{\top} D$ is the graph Laplacian of a path, which is positive semidefinite with nullspace spanned by the constant vector $\mathbf{1}$. Adding $W$ with at least one $w_i > 0$ eliminates this nullspace: any $x = c \mathbf{1} \ne 0$ has $x^{\top} W x = c^2 \sum_i w_i > 0$. Thus the combined matrix is positive definite, guaranteeing strict convexity and a unique solution [32]. □
See Algorithm A3 for pseudocode.
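A compact sketch of the Stage 3 solve follows, combining the z-score clamping with the closed-form system of Theorem 3. For clarity it uses a dense `np.linalg.solve`; the matrix is tridiagonal per coordinate, so the Thomas algorithm applies in practice. The function name, the small guard on the standard deviation, and the default values of $\lambda$ and $\tau$ are our assumptions.

```python
import numpy as np

def smooth_map(anchors, delta_g, sims, lam=1.0, tau=1.0):
    # anchors: (N,2) Stage-2 output; delta_g: (N-1,2) VIO displacements
    # rotated into the global frame; sims: (N,) local VPR similarities.
    N = len(anchors)
    z = (sims - sims.mean()) / (sims.std() + 1e-12)  # z-score outlier test
    w = (z >= -tau).astype(float)     # outliers get w_i = 0 (VIO prior wins)
    assert w.any(), "Theorem 3 needs at least one positive anchor weight"
    idx = np.arange(N - 1)
    D = np.zeros((N - 1, N))
    D[idx, idx], D[idx, idx + 1] = -1.0, 1.0         # first-difference matrix
    A = np.diag(w) + lam * D.T @ D                   # positive definite
    X = np.empty_like(anchors, dtype=float)
    for k in range(2):                               # coordinates decouple
        b = w * anchors[:, k] + lam * D.T @ delta_g[:, k]
        X[:, k] = np.linalg.solve(A, b)
    return X
```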
3.5. Implementation Details
Algorithm 1 summarizes the complete pipeline. We use DeiT-Tiny-Distilled [33], a distilled Vision Transformer [34], for feature extraction (192-dim descriptors). Fixed parameters: search radius $r$ (in meters), number of rotation angles $K$, window size $W$, stride $s$, rotation bound $\theta_{\max}$ in radians (≈5°), and z-score threshold $\tau$.
| Algorithm 1 NaviLoc |
Require: Query features $\{q_i\}_{i=1}^{N}$, VIO trajectory $\{v_i\}_{i=1}^{N}$
Require: Reference database with M tiles $\{(f_j, c_j)\}_{j=1}^{M}$
Ensure: Localized positions $\{x_i\}_{i=1}^{N}$
1: $(\hat{x}^{(1)}, \theta^{*}) \leftarrow$ GlobalAlign$(q, v, f, c)$ ▹ Stage 1
2: $\hat{x}^{(2)} \leftarrow$ Refinement$(\hat{x}^{(1)}, q, f, c)$ ▹ Stage 2
3: $x \leftarrow$ Smoothing$(\hat{x}^{(2)}, v, \theta^{*}, q)$ ▹ Stage 3
4: return $x$ |
Detailed pseudocode for each stage is provided in Appendix A.
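For readers reproducing the descriptor pipeline, the sketch below extracts a 192-dim descriptor with the public `timm` checkpoint `deit_tiny_distilled_patch16_224`. The paper does not specify its pooling; averaging the class and distillation tokens and L2-normalizing is our assumption, as is the preprocessing via `timm`'s default transform.

```python
import timm
import torch
import torch.nn.functional as F
from PIL import Image
from timm.data import resolve_data_config, create_transform

# DeiT-Tiny-Distilled: 192-dim token embeddings, as used in the paper.
model = timm.create_model("deit_tiny_distilled_patch16_224", pretrained=True)
model.eval()
transform = create_transform(**resolve_data_config({}, model=model))

@torch.no_grad()
def describe(path: str) -> torch.Tensor:
    x = transform(Image.open(path).convert("RGB")).unsqueeze(0)
    tokens = model.forward_features(x)        # (1, 2 + 196, 192) token sequence
    desc = (tokens[:, 0] + tokens[:, 1]) / 2  # average CLS + distillation tokens
    return F.normalize(desc, dim=-1)          # unit norm: dot product = cosine
```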
4. Experiments
4.1. Dataset
To the best of our knowledge, there is no publicly available challenging dataset for our target setting—low-altitude (50–150 m) UAV imagery with synchronized VIO and a geo-referenced satellite tile database for trajectory-level evaluation—in particular with long real-world trajectories. We therefore evaluate on a prepared real-world UAV-to-satellite benchmark collected over rural terrain in Ukraine. Table 1 summarizes the dataset statistics. The dataset comprises 58 aerial query frames captured at 50–150 m AGL along a 2.3 km trajectory with 40 m inter-frame spacing. Reference imagery consists of 462 geo-referenced satellite tiles at 0.3 m/px resolution covering 1.6 km² with 40 m tile spacing. The benchmark is challenging due to strong perceptual aliasing in rural/village scenes (repetitive texture, limited distinctive landmarks) combined with low-altitude viewpoint and appearance changes relative to the satellite map.
Data collection and preparation. The query stream was recorded from a real multirotor UAV flight over rural/village terrain in Ukraine under summer conditions. The platform was manually piloted in an FPV-style flight mode at 50–150 m AGL, with average speed below 60 m/s, producing a 2.3 km trajectory.
All onboard processing was performed on a Raspberry Pi 5 (8 GB RAM). Images were captured by a rigidly mounted nadir-view Sony IMX219 RGB camera at 30 FPS with approximately 77–80° diagonal field of view, with auto-exposure and auto-white-balance enabled; a separate forward/oblique camera was used only for piloting and is not part of the evaluation. Inertial measurements were provided by the onboard SpeedyBee V3 IMU. Relative motion was estimated by a standard EKF-based monocular Visual-Inertial Odometry (VIO) pipeline fusing the camera and IMU. These components were chosen for availability, low cost, and suitability for lightweight embedded UAV platforms; NaviLoc is designed to work with any standard VIO pipeline and is agnostic to the specific sensor choice. As expected, VIO is subject to drift over extended flights and performs no global map alignment; NaviLoc uses only the resulting relative pose increments and is agnostic to the specific VIO implementation.
Ground truth positions are obtained from onboard GNSS logs and are used only for evaluation (MLE/ATE) and visualization; GNSS data are not available to NaviLoc during inference. The reference database was built by sampling satellite map imagery via the Google Maps API at zoom level 19 on a 40 m grid over the flight area; each tile is associated with its geographic coordinate. We then convert both query frames and reference tiles into embeddings using the backbones in Table 1.
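To illustrate the reference-database layout (not the tile download itself, which used the Google Maps API at zoom level 19), the sketch below generates tile-center coordinates on a 40 m grid with a local equirectangular approximation, which is adequate over an area of roughly 1.6 km².

```python
import math

def tile_grid(lat0, lon0, width_m, height_m, spacing_m=40.0):
    # Tile centers on a uniform metric grid, converted to lat/lon with a
    # local equirectangular approximation anchored at (lat0, lon0).
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(lat0))
    centers = []
    y = 0.0
    while y <= height_m:
        x = 0.0
        while x <= width_m:
            centers.append((lat0 + y / m_per_deg_lat, lon0 + x / m_per_deg_lon))
            x += spacing_m
        y += spacing_m
    return centers
```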
4.2. Baselines
We compare against:
- Raw VIO (IP only): VIO trajectory with initial position aligned to ground truth (no rotation).
- VIO + IP + IR (oracle): VIO with initial position and initial rotation estimated from the first displacement (one-segment oracle) or from the first several displacements (multi-segment oracle).
- VIO SE(2) oracle: VIO with oracle global SE(2) alignment to ground truth (best possible rigid alignment).
- Per-frame VPR (top-k): For each query frame, retrieve the k reference tiles with highest cosine similarity and predict position as their coordinate mean (top-1 uses the single best match; top-3 averages the three best); a minimal sketch follows this list.
- AnyLoc-VLAD [1]: State-of-the-art VPR using DINOv2 [35] with VLAD aggregation, evaluated with top-3 retrieval.
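The per-frame VPR baseline, sketched below assuming L2-normalized descriptors so that the dot product equals cosine similarity:

```python
import numpy as np

def per_frame_vpr_topk(Q, F, C, k=3):
    # Q: (N,d) query descriptors, F: (M,d) tile descriptors, C: (M,2) coords.
    sims = Q @ F.T                            # (N, M) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches
    return C[topk].mean(axis=1)               # (N, 2) predicted positions
```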
4.3. Results
On this challenging real-world benchmark, NaviLoc achieves 19.5 m MLE—a 16.0× improvement over the state-of-the-art AnyLoc-VLAD and a 32.1× improvement over raw VIO drift. Figure 3 and Figure 4 summarize these results.
4.4. Ablation Studies
Stage contributions. Each stage provides substantial improvement (Table 2). Stage 1 alone achieves a 9× improvement by finding the correct global transform. Stage 2 refines local errors, and Stage 3 leverages VIO constraints while excluding outliers.
Outlier threshold. Table 3 shows sensitivity to the z-score threshold.
Hyperparameter sensitivity. Table 4 varies one hyperparameter at a time (others fixed to defaults). Performance is stable across moderate changes, while overly permissive local search radii or underconstrained refinement settings degrade accuracy.
Backbone comparison. Table 5 reveals that DeiT-Tiny-Distilled achieves the best NaviLoc performance despite not having the best per-frame VPR accuracy. Empirically, distillation appears to yield descriptors whose scores are more consistent across time, making trajectory-level alignment easier.
4.5. Computational Efficiency
NaviLoc operates on precomputed descriptors. We benchmark on a Raspberry Pi 5 (ARM Cortex-A76, 8 GB RAM). Table 6 summarizes the computational performance.
Inference strategy. At each timestep, the system samples an aerial image, extracts a 192-dimensional DeiT-Tiny-Distilled embedding (79 ms), and updates the global trajectory estimate using the C++ NaviLoc algorithm (32 ms). The combined end-to-end latency of 111 ms yields 9.0 FPS on Raspberry Pi 5, enabling real-time embedded deployment. Feature extraction becomes the bottleneck; DeiT-Tiny-Distilled requires 1.3 GFLOPs per image versus 21.1 GFLOPs for DINOv2-S, providing 16× faster extraction while achieving superior localization accuracy.
5. Discussion
We now analyze NaviLoc’s performance across multiple dimensions, compare it with alternative approaches, and discuss limitations and future directions.
5.1. Comparison with Existing Methods
Versus per-frame VPR. Per-frame VPR methods treat each query independently, producing 342.1 m MLE (top-3) on our benchmark. NaviLoc achieves 19.5 m—a 17.5× improvement. The key insight is that perceptual aliasing causes individual VPR errors that are spatially inconsistent: incorrect matches scatter across the map, while correct matches cluster near the true trajectory. Trajectory-level optimization exploits this statistical property.
Versus state-of-the-art. AnyLoc [1] represents the state-of-the-art in universal VPR, aggregating DINOv2 [35] features via VLAD pooling. On our benchmark, AnyLoc-VLAD achieves 312.2 m MLE (top-3)—only marginally better than simple per-frame VPR. NaviLoc reduces this to 19.5 m, a 16.0× improvement. This demonstrates that even powerful foundation model features remain susceptible to cross-view aliasing without trajectory-level reasoning.
Versus VIO baselines. Raw VIO accumulates 626.7 m drift over the 2.3 km trajectory. Even with oracle initial rotation alignment, VIO achieves only 90.1 m MLE due to scale and heading drift. NaviLoc’s 19.5 m MLE represents a 32.1× improvement over raw VIO, highlighting the benefit of trajectory-level use of noisy VPR measurements to correct long-horizon drift.
Versus graph-based SLAM. Traditional pose graph optimization [24,25] formulates localization as nonlinear least squares over loop closure constraints. These methods typically require multiple sensors (LiDAR, stereo cameras, IMU), 3D point cloud processing, and GPU acceleration for real-time operation. NaviLoc is fundamentally lighter: it operates on a single monocular camera plus VIO, uses 2D tile descriptors rather than 3D geometry, and requires no GPU. Each stage employs closed-form solutions (median, SVD, linear solve) rather than iterative solvers (Gauss–Newton, Levenberg–Marquardt) and handles outliers via z-score clamping rather than iterative robust optimization [26,27]. This yields deterministic, bounded runtime on CPU-only embedded platforms.
5.2. Computational Efficiency
Real-time embedded deployment. NaviLoc achieves 9 FPS end-to-end on Raspberry Pi 5, combining 79 ms feature extraction (DeiT-Tiny-Distilled) with 32 ms trajectory optimization (C++). This demonstrates that trajectory-level cross-view localization can run in real time on a CPU-only embedded platform.
Comparison with foundation model approaches. While larger backbones can improve standalone VPR on some benchmarks, they are substantially more expensive for embedded CPU deployment. Our ablation (Table 5) shows that DINOv2-ViT-G/14+VLAD achieves 162.0 m MLE versus 19.5 m for DeiT-Tiny-Distilled—larger models did not improve NaviLoc on this cross-view dataset.
Runtime scaling. Let $N$ denote trajectory length and $M$ reference database size. Stage 1 performs $N$ global nearest-neighbor queries, each requiring $O(M)$ comparisons, yielding $O(NM)$; the angle grid size and iteration counts are small constants. Stage 2 runs in $O(N)$ per pass with constant window size. Stage 3 solves a tridiagonal system in $O(N)$ via the Thomas algorithm [36]. The overall time complexity is $O(NM)$; in practice, feature extraction dominates runtime. For larger reference databases, approximate nearest-neighbor structures could reduce this to sublinear in $M$.
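For reference, a textbook Thomas-algorithm solver for the per-coordinate tridiagonal system of Stage 3 is sketched below [36]; the positive definiteness guaranteed by Theorem 3 keeps the elimination stable. This is an illustrative routine, not code from the NaviLoc implementation.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    # Solve a tridiagonal system in O(n): a = sub-diagonal (n-1),
    # b = diagonal (n), c = super-diagonal (n-1), d = right-hand side (n).
    n = len(b)
    cp, dp = np.empty(n - 1), np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                    # forward elimination
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```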
5.3. Portability and Transferability
Training-free operation. Unlike learned cross-view methods [7,8] that require domain-specific training data, NaviLoc operates on pretrained features without fine-tuning. This is critical for GNSS-denied scenarios where ground-truth correspondences are unavailable for training.
Backbone flexibility. NaviLoc accepts any image descriptor; our ablation tested five backbones (Table 5). This modularity allows leveraging future feature extractors without algorithmic changes.
Simplicity. NaviLoc has no learned components beyond the feature extractor, no hyperparameter schedules, and no complex data association. This facilitates debugging and deployment in safety-critical applications.
5.4. Robustness Analysis
Hyperparameter stability. Table 4 demonstrates that NaviLoc maintains sub-25 m accuracy across moderate perturbations of most hyperparameters. The outlier threshold (Table 3) shows graceful degradation outside the optimal range. This robustness simplifies deployment: practitioners can use default settings without extensive tuning.
Adaptive outlier detection. Z-score normalization adapts to per-trajectory statistics: a similarity of 0.25 may indicate an outlier in one flight but a reliable match in another. This eliminates per-dataset threshold tuning.
5.5. Feature Descriptor Analysis
Table 5 shows that DeiT-Tiny-Distilled (5M parameters) achieves 19.5 m MLE, while DINOv2-ViT-G/14 (1.1B parameters) achieves 162.0 m—over 8× worse despite being over 200× larger. This empirical finding suggests that model selection for trajectory-level VPR should consider compatibility with geometric reasoning, not just per-frame retrieval accuracy.
5.6. Practical Extensions
Long-trajectory operation. The current evaluation uses a 2.3 km trajectory processed in batch. For arbitrarily long flights, NaviLoc can be extended via a sliding-window approach: maintain a context window of the most recent 2–3 km of trajectory data and apply NaviLoc to this window for each new position estimate. This bounds memory and computation while providing continuous global localization.
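A sketch of such a wrapper is shown below; `naviloc_batch` stands in for the full three-stage pipeline, and the 75-frame buffer (≈3 km at 40 m inter-frame spacing) is an assumed default, not a value from the paper.

```python
from collections import deque

class SlidingNaviLoc:
    # Bounded-memory wrapper: keep the most recent frames and rerun the
    # batch pipeline on that window for each new position estimate.
    def __init__(self, naviloc_batch, max_frames=75):
        self.run = naviloc_batch
        self.buf = deque(maxlen=max_frames)

    def update(self, query_desc, vio_pos):
        self.buf.append((query_desc, vio_pos))
        Q = [q for q, _ in self.buf]
        V = [v for _, v in self.buf]
        X = self.run(Q, V)          # re-estimate over the current window
        return X[-1]                # latest global position
```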
5.7. Limitations and Future Work
VIO dependency. NaviLoc assumes VIO provides accurate relative motion over short horizons. Significant VIO failures (e.g., from aggressive maneuvers or texture-poor environments) would propagate to the final estimate. Future work could jointly estimate VIO scale and bias.
Trajectory length. Very short trajectories (<10 frames) provide insufficient statistics for robust similarity aggregation. A minimum flight distance of approximately 400 m is recommended. Conversely, NaviLoc handles arbitrarily long trajectories via the sliding-window extension described above.
Seasonal and environmental variation. Our evaluation uses satellite imagery captured under similar conditions (summer, clear weather) to the query flight. Performance under adverse weather conditions (rain, fog, and low visibility), seasonal changes (snow cover, vegetation differences), or significant lighting variations between the reference map and query images remains unevaluated. The training-free design may offer some robustness since the DeiT feature extractor was trained on diverse ImageNet conditions, but systematic evaluation across weather and seasonal variations is an important direction for future work.
Additional trajectories. Due to the complexity of real-world UAV data collection with synchronized VIO and ground truth, our evaluation is limited to a single 2.3 km trajectory. While the ablation studies (Table 3, Table 4, and Table 5) provide statistical validation across hyperparameter variations, evaluation on additional trajectories in diverse environments would further strengthen the generalization claims. Collecting and evaluating datasets spanning multiple environments is planned as future work.
3D extension. The current SE(2) formulation assumes approximately planar flight. Extending to SE(3) would support altitude variations and complex terrain, potentially leveraging 3D point cloud or mesh references.
Online operation. NaviLoc currently processes trajectories in batch. An incremental variant that updates estimates as new frames arrive would enable tighter integration with flight controllers for real-time autonomous navigation.
Energy efficiency. Reducing onboard computation energy extends flight endurance and lowers environmental impact. Systematic energy profiling [37] could guide model selection and duty-cycling to further improve efficiency.
6. Conclusions
We presented NaviLoc, a three-stage trajectory-level localization pipeline with rigorous mathematical foundations. Each stage has a well-defined objective and provable convergence properties. NaviLoc achieves 19.5 m Mean Localization Error on a UAV-to-satellite benchmark, representing a 16× improvement over AnyLoc-VLAD, the state-of-the-art VPR method. End-to-end inference runs at 9 FPS on Raspberry Pi 5, demonstrating practical viability for real-time embedded UAV navigation.
Author Contributions
Conceptualization, methodology, software, experiments, visualization, and writing—original draft: P.S.; supervision, formal analysis, writing—review and editing, and funding acquisition: T.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by Academia Tech, Taras Shevchenko National University of Kyiv, and Hackathon Expert NGO.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The experimental data and code used in this study are available from the corresponding author upon reasonable request for non-commercial academic research purposes. Requests must be submitted from an academic email address; access is granted after signing an NDA. Eligible requesters will receive a link to a private GitHub (https://github.com/) repository and will be added with read-only access.
Acknowledgments
The authors thank Academia Tech for providing datasets and computational infrastructure to support this research.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AGL | Above Ground Level |
| ATE | Absolute Trajectory Error |
| CPU | Central Processing Unit |
| EKF | Extended Kalman Filter |
| FPS | Frames Per Second |
| GNSS | Global Navigation Satellite System |
| GPS | Global Positioning System |
| GPU | Graphics Processing Unit |
| ICP | Iterative Closest Point |
| IMU | Inertial Measurement Unit |
| LiDAR | Light Detection and Ranging |
| MAP | Maximum A Posteriori |
| MLE | Mean Localization Error |
| MSCKF | Multi-State Constraint Kalman Filter |
| OKVIS | Open Keyframe-based Visual-Inertial SLAM |
| RAM | Random Access Memory |
| SE(2) | Special Euclidean Group in 2D |
| SLAM | Simultaneous Localization and Mapping |
| SVD | Singular Value Decomposition |
| UAV | Unmanned Aerial Vehicle |
| VIO | Visual-Inertial Odometry |
| ViT | Vision Transformer |
| VLAD | Vector of Locally Aggregated Descriptors |
| VPR | Visual Place Recognition |
Appendix A. Detailed Algorithm Pseudocode
This appendix provides detailed pseudocode for each stage of NaviLoc with complete mathematical specifications.
Appendix A.1. Stage 1: Global Align
| Algorithm A1 Global Align (Stage 1) |
Require: VIO trajectory $\{v_i\}_{i=1}^{N}$, Query features $\{q_i\}_{i=1}^{N}$
Require: Reference features $\{f_j\}_{j=1}^{M}$, Reference coords $\{c_j\}_{j=1}^{M}$
Require: Number of angles $K$, Search radius $r$, AM iterations $T$
Ensure: Aligned trajectory $\{\hat{x}^{(1)}_i\}$, Rotation angle $\theta^{*}$
▹ Phase 1: Coarse grid search over rotation
1: $j^{*}(i) \leftarrow \arg\max_j \langle q_i, f_j \rangle$ for all $i$
2: $g_i \leftarrow c_{j^{*}(i)}$ ▹ VPR targets
3: $J^{*} \leftarrow -\infty$, $\theta^{*} \leftarrow 0$
4: for $k = 0, \ldots, K - 1$ do
5: $\tilde{v}_i \leftarrow R(2\pi k / K)\, v_i$ ▹ Rotate VIO
6: $t_k \leftarrow \operatorname{median}_i (g_i - \tilde{v}_i)$ ▹ L1-optimal translation
7: $x_i \leftarrow \tilde{v}_i + t_k$
8: $J_k \leftarrow J(2\pi k / K, t_k)$ ▹ Equation (1)
9: if $J_k > J^{*}$ then
10: $J^{*} \leftarrow J_k$, $\theta^{*} \leftarrow 2\pi k / K$, $t^{*} \leftarrow t_k$
11: end if
12: end for
13: ▹ Phase 2: Alternating maximization
14: for $\tau = 1, \ldots, T$ do
15: $\ell_i \leftarrow$ coordinate of best match within radius $r$ of $R(\theta^{*})\, v_i + t^{*}$ ▹ Local targets
16: $G_\tau \leftarrow \{\theta^{*} + m \Delta_\tau\}_m$ with $\Delta_\tau \propto 2^{-\tau}$ ▹ Shrinking grid
17: improved ← false
18: for $\theta' \in G_\tau$ do
19: $t' \leftarrow \operatorname{median}_i (\ell_i - R(\theta')\, v_i)$
20: $J' \leftarrow J(\theta', t')$
21: if $J' > J^{*}$ then
22: $J^{*} \leftarrow J'$, $\theta^{*} \leftarrow \theta'$, $t^{*} \leftarrow t'$
23: improved ← true
24: end if
25: end for
26: if ¬improved then
27: break ▹ Converged
28: end if
29: end for
30: $\hat{x}^{(1)}_i \leftarrow R(\theta^{*})\, v_i + t^{*}$
31: return $\{\hat{x}^{(1)}_i\}$, $\theta^{*}$ |
Appendix A.2. Stage 2: Refinement
| Algorithm A2 Refinement (Stage 2) |
Require: Trajectory $\{p_j\}_{j=1}^{N}$, Features $\{q_j\}$, Reference $\{(f_j, c_j)\}$
Require: Window size $W$, Stride $s$, Max rotation $\theta_{\max}$, Passes $P$
Ensure: Refined trajectory $\{p_j\}$
1: for pass $= 1, \ldots, P$ do
2: $\mathrm{acc}_j \leftarrow 0$, $\mathrm{cnt}_j \leftarrow 0$ ▹ Accumulators
3: for $b = 0, s, 2s, \ldots$ while $b + W \le N$ do
4: $\mathcal{W} \leftarrow \{b + 1, \ldots, b + W\}$ ▹ Window indices
5: $y_j, s_j \leftarrow$ best-match coordinate and score within radius $r$ of $p_j$, for $j \in \mathcal{W}$
6: if no valid targets in $\mathcal{W}$ then continue
7: end if
8: $w_j \leftarrow s_j^2$ ▹ Squared similarity weights
▹ Bounded weighted Procrustes
9: $\bar{p} \leftarrow \sum_j w_j p_j / \sum_j w_j$
10: $\bar{y} \leftarrow \sum_j w_j y_j / \sum_j w_j$
11: $H \leftarrow \sum_j w_j (p_j - \bar{p})(y_j - \bar{y})^{\top}$ ▹ 2 × 2 cross-covariance
12: $\hat{\theta} \leftarrow \operatorname{atan2}(H_{12} - H_{21}, H_{11} + H_{22})$
13: if $\lvert \hat{\theta} \rvert > \theta_{\max}$ then $\hat{\theta} \leftarrow \operatorname{sign}(\hat{\theta})\, \theta_{\max}$
14: end if ▹ Clip rotation to bounded interval
15: $t \leftarrow \bar{y} - R(\hat{\theta})\, \bar{p}$
16: for $j \in \mathcal{W}$ do
17: $\mathrm{acc}_j \leftarrow \mathrm{acc}_j + R(\hat{\theta})\, p_j + t$ ▹ Apply transform
18: $\mathrm{cnt}_j \leftarrow \mathrm{cnt}_j + 1$
19: end for
20: end for
21: for all $j$ with $\mathrm{cnt}_j > 0$ do ▹ Avoid division by zero
22: $p_j \leftarrow \mathrm{acc}_j / \mathrm{cnt}_j$ ▹ Average overlapping estimates
23: end for
24: end for
25: return $\{p_j\}$ |
Appendix A.3. Stage 3: Smoothing
| Algorithm A3 Smoothing (Stage 3) |
Require: Anchor trajectory $\{a_i\}_{i=1}^{N}$, VIO $\{v_i\}$, Rotation $\theta^{*}$, Features $\{q_i\}$
Require: Z-threshold $\tau$, Smoothness weight $\lambda$
Ensure: Fused trajectory $\{x_i\}$
1: $v^{g}_i \leftarrow R(\theta^{*})\, v_i$ ▹ Rotate VIO to global frame
2: $\delta^{g}_i \leftarrow v^{g}_{i+1} - v^{g}_i$ ▹ Displacements
▹ Compute local VPR similarities
3: $s_i \leftarrow \max \{ \langle q_i, f_j \rangle : \lVert c_j - a_i \rVert \le r \}$ ▹ Z-score outlier detection
4: $\mu_s \leftarrow \operatorname{mean}(s)$, $\sigma_s \leftarrow \operatorname{std}(s)$
5: $z_i \leftarrow (s_i - \mu_s) / \sigma_s$
6: $w_i \leftarrow \mathbb{1}[z_i \ge -\tau]$ for all $i$ ▹ Set anchor weights (outliers clamped)
7: $W \leftarrow \operatorname{diag}(w_1, \ldots, w_N)$
▹ Build and solve linear system
8: Construct difference matrix $D$: $D_{i,i} = -1$, $D_{i,i+1} = 1$
9: $A \leftarrow W + \lambda D^{\top} D$ ▹ Positive definite
10: for each coordinate $k \in \{x, y\}$ do
11: $b \leftarrow W a_{\cdot,k} + \lambda D^{\top} \delta^{g}_{\cdot,k}$
12: $x_{\cdot,k} \leftarrow A^{-1} b$ (tridiagonal solve)
13: end for
▹ Final uniform smoothing
14: $A_u \leftarrow I + \lambda D^{\top} D$
15: for each coordinate $k \in \{x, y\}$ do
16: $x_{\cdot,k} \leftarrow A_u^{-1} \bigl( x_{\cdot,k} + \lambda D^{\top} \delta^{g}_{\cdot,k} \bigr)$
17: end for
18: return $\{x_i\}$ |
References
- Keetha, N.; Mishra, A.; Karhade, J.; Jatavallabhula, K.M.; Scherer, S.; Krishna, M.; Garg, S. AnyLoc: Towards Universal Visual Place Recognition. IEEE Robot. Autom. Lett. 2024, 9, 1286–1293. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Sivic, J.; Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the IEEE Conference on International Conference on Computer Vision (ICCV), Nice, France, 13–16 October 2003; pp. 1470–1477. [Google Scholar] [CrossRef]
- Gálvez-López, D.; Tardós, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
- Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar] [CrossRef]
- Ali-bey, A.; Chaib-draa, B.; Giguère, P. MixVPR: Feature Mixing for Visual Place Recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2998–3007. [Google Scholar] [CrossRef]
- Hu, S.; Feng, M.; Nguyen, R.M.; Lee, G.H. CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7258–7267. [Google Scholar] [CrossRef]
- Zhu, S.; Shah, M.; Chen, C. VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3640–3649. [Google Scholar] [CrossRef]
- Workman, S.; Souvenir, R.; Jacobs, N. Wide-Area Image Geolocalization with Aerial Reference Imagery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3961–3969. [Google Scholar] [CrossRef]
- Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization. In Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1395–1403. [Google Scholar] [CrossRef]
- Ding, L.; Zhou, J.; Meng, L.; Long, Z. A Practical Cross-View Image Matching Method between UAV and Satellite for UAV-Based Geo-Localization. Remote Sens. 2021, 13, 47. [Google Scholar] [CrossRef]
- Zhuo, X.; Koch, T.; Kurz, F.; Fraundorfer, F.; Reinartz, P. Automatic UAV Image Geo-Registration by Matching UAV Images to Georeferenced Image Data. Remote Sens. 2017, 9, 376. [Google Scholar] [CrossRef]
- Zhuang, J.; Dai, M.; Chen, X.; Zheng, E. A Faster and More Effective Cross-View Matching Method of UAV and Satellite Images for UAV Geolocalization. Remote Sens. 2021, 13, 3979. [Google Scholar] [CrossRef]
- Cui, Z.; Zhou, P.; Wang, X.; Zhang, Z.; Li, Y.; Li, H.; Zhang, Y. A Novel Geo-Localization Method for UAV and Satellite Images Using Cross-View Consistent Attention. Remote Sens. 2023, 15, 4667. [Google Scholar] [CrossRef]
- Couturier, A.; Akhloufi, M.A. A Review on Deep Learning for UAV Absolute Visual Localization. Drones 2024, 8, 622. [Google Scholar] [CrossRef]
- Zváriková, K.; Lázaro, A.; Durana, P. Neural Network-based Recognition and Virtual Simulation Algorithms, Interactive 3D Geo-Visualization and Neuromorphic Computing Systems, and Tactile Sensing and Cognitive Modeling Technologies in Web3-powered Metaverse Worlds. Rev. Contemp. Philos. 2023, 22, 170–186. [Google Scholar] [CrossRef]
- Kurdel, P.; Lazar, T.; Mrekaj, B.; Hovanec, M.; Novák Sedláčková, A. Air Safety Information Management. In Aviation Technologies, Systems and Research; Springer: Cham, Switzerland, 2023; pp. 1–26. [Google Scholar] [CrossRef]
- Besl, P.J.; McKay, N.D. A Method for Registration of 3-D Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar] [CrossRef]
- Chen, Y.; Medioni, G. Object Modelling by Registration of Multiple Range Images. Image Vis. Comput. 1992, 10, 145–155. [Google Scholar] [CrossRef]
- Schönemann, P.H. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 1966, 31, 1–10. [Google Scholar] [CrossRef]
- Arun, K.S.; Huang, T.S.; Blostein, S.D. Least-Squares Fitting of Two 3-D Point Sets. IEEE Trans. Pattern Anal. Mach. Intell. 1987, 9, 698–700. [Google Scholar] [CrossRef]
- Umeyama, S. Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
- Huber, P.J. Robust Statistics; John Wiley & Sons: New York, NY, USA, 1981; ISBN 978-0-471-41805-4. [Google Scholar]
- Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
- Kümmerle, R.; Grisetti, G.; Strasdat, H.; Konolige, K.; Burgard, W. g2o: A General Framework for Graph Optimization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; pp. 3607–3613. [Google Scholar] [CrossRef]
- Sünderhauf, N.; Protzel, P. Switchable Constraints for Robust Pose Graph SLAM. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Portugal, 7–12 October 2012; pp. 1879–1884. [Google Scholar] [CrossRef]
- Yang, H.; Antonante, P.; Tzoumas, V.; Carlone, L. Graduated Non-Convexity for Robust Spatial Perception. IEEE Robot. Autom. Lett. 2020, 5, 1127–1134. [Google Scholar] [CrossRef]
- Mourikis, A.I.; Roumeliotis, S.I. A Multi-State Constraint Kalman Filter for Vision-Aided Inertial Navigation. In Proceedings of the IEEE International Conference on Robotics & Automation (ICRA), Rome, Italy, 10–14 April 2007; pp. 3565–3572. [Google Scholar] [CrossRef]
- Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-Based Visual-Inertial Odometry Using Nonlinear Optimization. Int. J. Robot. Res. 2015, 34, 314–334. [Google Scholar] [CrossRef]
- Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; ISBN 978-0-521-83378-3. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-efficient Image Transformers & Distillation through Attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021. [Google Scholar]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
- Golub, G.H.; Van Loan, C.F. Matrix Computations, 4th ed.; Johns Hopkins University Press: Baltimore, MD, USA, 2013; ISBN 978-1-4214-0794-4. Available online: https://www.worldcat.org/isbn/9781421407944 (accessed on 20 November 2025).
- Panchenko, T.V.; Piatygorskiy, N.D. Enrichment of the HEPscore Benchmark by Energy Consumption Assessment. Technologies 2025, 13, 362. [Google Scholar] [CrossRef]