1. Introduction
Global localization is a prerequisite for long-horizon GNSS-denied UAV autonomy. Applications such as environmental monitoring, search-and-rescue, and autonomous delivery increasingly require operation in environments where GNSS signals may be unavailable, jammed, or spoofed. Visual-Inertial Odometry (VIO) provides locally accurate relative motion but drifts without bound, making it insufficient for missions requiring global position awareness.
A natural correction source is aerial-to-satellite Visual Place Recognition (VPR), matching onboard views to geo-referenced satellite tiles. However, as illustrated in Figure 1, the cross-view domain gap between low-altitude UAV imagery and nadir satellite maps produces frequent high-similarity false matches at geographically distant locations—even human observers find it difficult to identify correct correspondences. Per-frame VPR anchoring is therefore unreliable.
NaviLoc addresses this failure mode by estimating a trajectory-level solution rather than committing to individual matches. The method is training-free: it uses off-the-shelf pretrained image descriptors without domain-specific fine-tuning. NaviLoc operates in three stages: Stage 1 (Global Align) searches for a single global SE(2) transform whose implied trajectory yields consistently high local retrieval scores; Stage 2 (Refinement) refines the aligned trajectory through bounded weighted Procrustes updates on overlapping windows; and Stage 3 (Smoothing) computes a closed-form MAP estimate that fuses VIO displacements with VPR anchors while suppressing low-confidence anchors (see Figure 2).
Public benchmarks for this setting remain limited. We were unable to find an open-source dataset that jointly provides low-altitude (50–150 m) UAV imagery, synchronized VIO, and a geo-referenced satellite tile database suitable for trajectory-level evaluation. To study this practically relevant regime, we evaluate NaviLoc on a prepared real-world UAV-to-satellite dataset with VIO and curated satellite tiles.
Contributions. Our main contributions are:
- A novel, training-free, lightweight three-stage trajectory-level localization method with explicit objectives and closed-form solutions, deployable on embedded hardware.
- An evaluation on low-altitude (50–150 m) UAV imagery showing 19.5 m Mean Localization Error and a 16.0× improvement over AnyLoc-VLAD [1].
- An ablation demonstrating robustness to hyperparameter variations, and an empirical study showing that distilled ViT descriptors are more effective for trajectory-level alignment than larger foundation models on our benchmark.
- An embedded implementation achieving 9 FPS end-to-end inference on a Raspberry Pi 5.
2. Related Work
Visual Place Recognition. VPR evolved from handcrafted descriptors [2] through bag-of-words [3,4] to learned representations. NetVLAD [5] introduced end-to-end learning for place recognition. AnyLoc [1] aggregates foundation model features using VLAD or GeM pooling, achieving state-of-the-art results on diverse benchmarks. MixVPR [6] proposes feature mixing for compact descriptors. These methods produce per-image descriptors without trajectory-level consistency constraints.
Cross-Domain and Aerial Localization. The aerial-to-satellite domain gap presents challenges distinct from ground-level VPR. CVM-Net [7] learns cross-view representations. VIGOR [8] and CVUSA [9] established benchmarks for ground-to-aerial matching. Recent UAV-specific methods [10,11] address the UAV-to-satellite gap. Our approach is complementary: we accept that individual matches are unreliable and leverage trajectory-level statistics.
UAV-to-Satellite Geo-Localization. Several recent works study cross-view matching between UAV imagery and satellite maps in the remote sensing literature [11,12,13,14]. These methods improve per-frame matching but still face perceptual aliasing under viewpoint and appearance changes; NaviLoc instead aggregates evidence across time using an explicit trajectory-level objective. For broader context on absolute visual localization pipelines and design choices, we refer to a recent survey in Drones [15]. Recent work on neural network-based recognition [16] and air safety information management [17] provides a conceptual bridge between learning-based perception methods and system-level safety considerations; our trajectory-level, training-free approach complements these by enabling robust localization without domain-specific training.
Point Set Registration. The Iterative Closest Point (ICP) algorithm [18,19] alternates between correspondence assignment and transform estimation for point cloud alignment. Procrustes analysis [20] provides closed-form solutions for rigid alignment given correspondences. Extensions include weighted variants [21,22] and robust formulations. Our Stage 1 adapts ICP-style alternating optimization to the VPR similarity objective with provable monotone improvement over evaluated candidates.
Robust Estimation. Huber’s M-estimators [23] downweight outliers in regression. Pose graph optimization [24,25] fuses odometry and visual constraints. Switchable constraints [26] and graduated non-convexity [27] handle outliers in SLAM. Our Stage 3 performs z-score outlier detection and clamps detected outliers to the VIO prior (by setting the corresponding anchor weight $w_i = 0$), yielding a closed-form strictly convex solve without iterative robust optimization.
Visual-Inertial Odometry. VIO methods—including EKF-based VIO [28], OKVIS [29], VINS-Mono [30], and ORB-SLAM3 [31]—provide accurate relative motion estimation but accumulate unbounded drift over long trajectories. These methods solve a fundamentally different problem than NaviLoc: VIO estimates how the platform moved (relative poses), while NaviLoc estimates where the platform is in the world (global positions). NaviLoc is designed to operate on top of any VIO system, using its relative motion output as a prior while correcting global drift via satellite imagery matching.
3. Method
This section presents the NaviLoc algorithm in three stages, illustrated in Figure 2. We first formalize the problem (Section 3.1) and then describe each stage: Global Align (Section 3.2) finds an initial SE(2) transform via robust trajectory-level optimization; Refinement (Section 3.3) applies windowed geometric corrections; and Smoothing (Section 3.4) fuses VIO constraints with VPR anchors while rejecting outliers. Each stage has a well-defined objective function with a closed-form or provably convergent solution.
3.1. Problem Formulation
Given $N$ aerial query images with descriptors $\{q_i\}_{i=1}^{N}$ and VIO-derived positions $\{v_i\}_{i=1}^{N} \subset \mathbb{R}^2$ in a local frame, along with VIO displacements $\delta_i = v_{i+1} - v_i$, and a geo-referenced satellite map with $M$ reference tiles having descriptors $\{f_j\}_{j=1}^{M}$ and coordinates $\{c_j\}_{j=1}^{M} \subset \mathbb{R}^2$, we seek global positions $\{x_i\}_{i=1}^{N} \subset \mathbb{R}^2$.
3.2. Stage 1: Global Align
We estimate a global SE(2) transform $(\theta^{*}, t^{*})$ by maximizing the trajectory-level similarity objective:

$$J(\theta, t) = \frac{1}{N} \sum_{i=1}^{N} \max_{j \in \mathcal{N}_r(R(\theta)\,v_i + t)} \langle q_i, f_j \rangle, \qquad (1)$$

where $R(\theta)$ is the 2D rotation matrix, $\mathcal{N}_r(x)$ is the set of reference tiles within radius $r$ of $x$, and $\langle \cdot, \cdot \rangle$ denotes cosine similarity.
Algorithm and intuition. Stage 1 treats VPR as a noisy correspondence generator: for a fixed rotation $\theta$, each frame $i$ proposes a translation “measurement” $t_i(\theta) = c_{j^{*}(i)} - R(\theta)\,v_i$ based on the global top-1 match $j^{*}(i) = \arg\max_j \langle q_i, f_j \rangle$. Under perceptual aliasing, many $t_i(\theta)$ are outliers; we therefore aggregate $\{t_i(\theta)\}_{i=1}^{N}$ with a robust location estimator (the component-wise median). The key contribution is casting these noisy retrieval outcomes into a trajectory-level objective. We use coordinate ascent: (1) scan a grid of $K$ rotation angles $\theta_k = 2\pi k / K$, $k = 0, \ldots, K-1$; (2) for each $\theta_k$, compute the L1-optimal translation (component-wise median) [23]:

$$\hat{t}(\theta_k) = \operatorname{median}_{1 \le i \le N} \bigl( c_{j^{*}(i)} - R(\theta_k)\, v_i \bigr), \qquad (2)$$

where $j^{*}(i)$ is the global top-1 match; (3) evaluate $J(\theta_k, \hat{t}(\theta_k))$ and select the best; and (4) refine with alternating maximization using local targets.
We denote the resulting aligned trajectory by $\hat{x}^{(1)}_i = R(\theta^{*})\,v_i + t^{*}$. The following result formalizes why the median aggregator is optimal for robust translation estimation under outlier contamination:
Theorem 1 (L1-Optimal Translation (Median)). Let $r_i = c_{j^{*}(i)} - R(\theta)\,v_i \in \mathbb{R}^2$ be translation residuals for fixed θ. The minimizer of

$$\sum_{i=1}^{N} \lVert r_i - t \rVert_1 \qquad (3)$$

over $t = (t_x, t_y)$ is given by the component-wise median: $\hat{t}_x = \operatorname{median}_i(r_{i,x})$ and $\hat{t}_y = \operatorname{median}_i(r_{i,y})$. Consequently, $\hat{t}$ is robust to outlier residuals: as long as more than half of the residuals are inliers per coordinate, the estimate is controlled by the inlier set rather than the outliers.

Proof. The L1 norm decomposes across coordinates, so the objective separates into two independent 1D problems: $\min_{t_x} \sum_i \lvert r_{i,x} - t_x \rvert$ and $\min_{t_y} \sum_i \lvert r_{i,y} - t_y \rvert$. The minimizer of $\sum_i \lvert r_{i,x} - t_x \rvert$ over $t_x$ is the median of $\{r_{i,x}\}$ [23]. Robustness follows because the median is unaffected by arbitrarily large perturbations to fewer than half of the samples. □
Remark 1 (Monotone acceptance). In our implementation, we only accept candidate updates that improve (or tie) J, so the sequence of evaluated objective values is monotone non-decreasing, bounded above by 1, and hence convergent. This guarantees convergence of the objective sequence, not global optimality (see Algorithm A1).
3.3. Stage 2: Refinement
After global alignment, we refine positions using sliding-window bounded Procrustes. This extends standard weighted Procrustes analysis [20,21] with a novel rotation bound constraint that prevents overcorrection from noisy VPR targets. For each frame index $j$, we compute a local VPR target $y_j$ as the coordinate of the best-matching reference tile within radius $r$ of the current predicted position $p_j$, and record its cosine similarity score $s_j$. For each window $\mathcal{W} = \{j, \ldots, j + W - 1\}$ with targets $\{y_j\}_{j \in \mathcal{W}}$ and weights $w_j = s_j^2$, we solve:

$$\min_{\theta, t \;:\; \lvert \theta \rvert \le \theta_{\max}} \; \sum_{j \in \mathcal{W}} w_j \, \lVert R(\theta)\, p_j + t - y_j \rVert^2. \qquad (4)$$
This weighted Procrustes problem admits a closed-form solution for both rotation and translation. The rotation constraint prevents overcorrection when local VPR targets are noisy:
Theorem 2 (Optimal Bounded Procrustes). For the constrained problem (4) in 2D, let $\bar{p} = \sum_j w_j p_j / \sum_j w_j$ and $\bar{y} = \sum_j w_j y_j / \sum_j w_j$ be the weighted centroids, and let $H = \sum_j w_j (p_j - \bar{p})(y_j - \bar{y})^{\top}$ be the weighted cross-covariance matrix. The optimal rotation angle is $\theta^{*} = \operatorname{clip}(\hat{\theta}, -\theta_{\max}, \theta_{\max})$, where $\hat{\theta} = \operatorname{atan2}(H_{12} - H_{21},\, H_{11} + H_{22})$, and the optimal translation is $t^{*} = \bar{y} - R(\theta^{*})\,\bar{p}$.

Proof. After centering by the weighted centroids, the objective separates into rotation and translation terms. For the rotation, we seek to maximize $\operatorname{tr}(R(\theta) H)$. Writing $R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}$, the trace expands to $(H_{11} + H_{22})\cos\theta + (H_{12} - H_{21})\sin\theta$ [20,21]. This sinusoidal function of $\theta$ is maximized at $\hat{\theta} = \operatorname{atan2}(H_{12} - H_{21}, H_{11} + H_{22})$. The constraint $\lvert \theta \rvert \le \theta_{\max}$ clips this to the boundary if $\hat{\theta}$ lies outside. The optimal translation follows as $t^{*} = \bar{y} - R(\theta^{*})\,\bar{p}$. □
Windows overlap (stride $s < W$), so each frame receives corrections from multiple windows; we average these contributions. Multiple passes are performed because each pass updates the trajectory, which changes the local VPR targets (since they depend on the current predicted positions). Empirically, 2–3 passes suffice for convergence (see Algorithm A2).
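A minimal sketch of the closed-form solve of Theorem 2 for a single window follows. The squared-similarity weighting and rotation clipping mirror the description above; the weight normalization and the function name are our own choices.

```python
import numpy as np

def bounded_procrustes(P, Y, scores, theta_max):
    # One window of Stage 2 (Theorem 2). P: (W,2) current positions,
    # Y: (W,2) local VPR targets, scores: (W,) cosine similarities.
    w = scores ** 2                                # squared similarity weights
    w = w / w.sum()
    p_bar, y_bar = w @ P, w @ Y                    # weighted centroids
    Pc, Yc = P - p_bar, Y - y_bar
    H = (Pc * w[:, None]).T @ Yc                   # 2x2 weighted cross-covariance
    theta = np.arctan2(H[0, 1] - H[1, 0], H[0, 0] + H[1, 1])
    theta = np.clip(theta, -theta_max, theta_max)  # rotation bound
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    t = y_bar - R @ p_bar                          # optimal translation
    return P @ R.T + t                             # corrected window positions
```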
3.4. Stage 3: Smoothing
Let $\{a_i\}_{i=1}^{N}$ denote the Stage 2 output. We treat these positions as anchors with weights $w_i \ge 0$ and fuse them with VIO constraints via a standard MAP formulation [32]. Our novel element is the z-score-based outlier detection that automatically clamps unreliable anchors to the VIO prior:

$$\min_{x_1, \ldots, x_N} \; \sum_{i=1}^{N} w_i \lVert x_i - a_i \rVert^2 + \lambda \sum_{i=1}^{N-1} \lVert x_{i+1} - x_i - \delta_i^{g} \rVert^2, \qquad (5)$$

where $\delta_i^{g} = R(\theta^{*})\,\delta_i$ are the VIO displacements rotated into the global frame; in matrix form the second term is $\lambda \lVert D x - \delta^{g} \rVert^2$, where $D \in \mathbb{R}^{(N-1) \times N}$ (with $D_{i,i} = -1$, $D_{i,i+1} = 1$) is the first-difference matrix.
Outlier detection. For each anchor $a_i$, we recompute its local VPR similarity $s_i$ as the best cosine similarity among reference tiles within radius $r$ of $a_i$. We then compute z-scores $z_i = (s_i - \mu_s)/\sigma_s$. Frames with $z_i < -\tau$ are marked as outliers and assigned $w_i = 0$, effectively clamping them to the VIO prior. Inliers use $w_i = 1$. With the anchor weights thus defined, the optimization problem has a unique closed-form solution:
Theorem 3 (Unique Solution). If $w_i > 0$ for at least one $i$, then the objective (5) is strictly convex with unique minimizer given by the linear system:

$$\bigl( W + \lambda D^{\top} D \bigr)\, x = W a + \lambda D^{\top} \delta^{g}, \qquad W = \operatorname{diag}(w_1, \ldots, w_N),$$

solved independently for each coordinate.

Proof. The objective (5) is quadratic in $x$. Setting the gradient to zero yields the normal equations $(W + \lambda D^{\top} D)\, x = W a + \lambda D^{\top} \delta^{g}$. The matrix $D^{\top} D$ is the graph Laplacian of a path, which is positive semidefinite with nullspace spanned by the constant vector $\mathbf{1}$. Adding $W$ with at least one $w_i > 0$ eliminates this nullspace: any $x = c \mathbf{1} \ne 0$ has $x^{\top} W x = c^2 \sum_i w_i > 0$. Thus the combined matrix is positive definite, guaranteeing strict convexity and a unique solution [32]. □
See Algorithm A3 for pseudocode.
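A compact sketch of the Stage 3 solve follows, combining the z-score clamping with the closed-form system of Theorem 3. For clarity it uses a dense `np.linalg.solve`; the matrix is tridiagonal per coordinate, so the Thomas algorithm applies in practice. The function name, the small guard on the standard deviation, and the default values of $\lambda$ and $\tau$ are our assumptions.

```python
import numpy as np

def smooth_map(anchors, delta_g, sims, lam=1.0, tau=1.0):
    # anchors: (N,2) Stage-2 output; delta_g: (N-1,2) VIO displacements
    # rotated into the global frame; sims: (N,) local VPR similarities.
    N = len(anchors)
    z = (sims - sims.mean()) / (sims.std() + 1e-12)  # z-score outlier test
    w = (z >= -tau).astype(float)     # outliers get w_i = 0 (VIO prior wins)
    assert w.any(), "Theorem 3 needs at least one positive anchor weight"
    idx = np.arange(N - 1)
    D = np.zeros((N - 1, N))
    D[idx, idx], D[idx, idx + 1] = -1.0, 1.0         # first-difference matrix
    A = np.diag(w) + lam * D.T @ D                   # positive definite
    X = np.empty_like(anchors, dtype=float)
    for k in range(2):                               # coordinates decouple
        b = w * anchors[:, k] + lam * D.T @ delta_g[:, k]
        X[:, k] = np.linalg.solve(A, b)
    return X
```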
3.5. Implementation Details
Algorithm 1 summarizes the complete pipeline. We use DeiT-Tiny-Distilled [33], a distilled Vision Transformer [34], for feature extraction (192-dim descriptors). Fixed parameters: search radius $r$ (in meters), number of rotation angles $K$, window size $W$, stride $s$, rotation bound $\theta_{\max}$ in radians (≈5°), and z-score threshold $\tau$.
| Algorithm 1 NaviLoc |
Require: Query features $\{q_i\}_{i=1}^{N}$, VIO trajectory $\{v_i\}_{i=1}^{N}$
Require: Reference database with M tiles $\{(f_j, c_j)\}_{j=1}^{M}$
Ensure: Localized positions $\{x_i\}_{i=1}^{N}$
1: $(\hat{x}^{(1)}, \theta^{*}) \leftarrow$ GlobalAlign$(q, v, f, c)$ ▹ Stage 1
2: $\hat{x}^{(2)} \leftarrow$ Refinement$(\hat{x}^{(1)}, q, f, c)$ ▹ Stage 2
3: $x \leftarrow$ Smoothing$(\hat{x}^{(2)}, v, \theta^{*}, q)$ ▹ Stage 3
4: return $x$ |
Detailed pseudocode for each stage is provided in Appendix A.
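For readers reproducing the descriptor pipeline, the sketch below extracts a 192-dim descriptor with the public `timm` checkpoint `deit_tiny_distilled_patch16_224`. The paper does not specify its pooling; averaging the class and distillation tokens and L2-normalizing is our assumption, as is the preprocessing via `timm`'s default transform.

```python
import timm
import torch
import torch.nn.functional as F
from PIL import Image
from timm.data import resolve_data_config, create_transform

# DeiT-Tiny-Distilled: 192-dim token embeddings, as used in the paper.
model = timm.create_model("deit_tiny_distilled_patch16_224", pretrained=True)
model.eval()
transform = create_transform(**resolve_data_config({}, model=model))

@torch.no_grad()
def describe(path: str) -> torch.Tensor:
    x = transform(Image.open(path).convert("RGB")).unsqueeze(0)
    tokens = model.forward_features(x)        # (1, 2 + 196, 192) token sequence
    desc = (tokens[:, 0] + tokens[:, 1]) / 2  # average CLS + distillation tokens
    return F.normalize(desc, dim=-1)          # unit norm: dot product = cosine
```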
4. Experiments
4.1. Dataset
To the best of our knowledge, there is no publicly available challenging dataset for our target setting—low-altitude (50–150 m) UAV imagery with synchronized VIO and a geo-referenced satellite tile database for trajectory-level evaluation—in particular with long real-world trajectories. We therefore evaluate on a prepared real-world UAV-to-satellite benchmark collected over rural terrain in Ukraine. Table 1 summarizes the dataset statistics. The dataset comprises 58 aerial query frames captured at 50–150 m AGL along a 2.3 km trajectory with 40 m inter-frame spacing. Reference imagery consists of 462 geo-referenced satellite tiles at 0.3 m/px resolution covering 1.6 km² with 40 m tile spacing. The benchmark is challenging due to strong perceptual aliasing in rural/village scenes (repetitive texture, limited distinctive landmarks) combined with low-altitude viewpoint and appearance changes relative to the satellite map.
Data collection and preparation. The query stream was recorded from a real multirotor UAV flight over rural/village terrain in Ukraine under summer conditions. The platform was manually piloted in an FPV-style flight mode at 50–150 m AGL, with average speed below 60 m/s, producing a 2.3 km trajectory.
All onboard processing was performed on a Raspberry Pi 5 (8 GB RAM). Images were captured by a rigidly mounted nadir-view Sony IMX219 RGB camera at 30 FPS with approximately 77–80° diagonal field of view, with auto-exposure and auto-white-balance enabled; a separate forward/oblique camera was used only for piloting and is not part of the evaluation. Inertial measurements were provided by the onboard SpeedyBee V3 IMU. Relative motion was estimated by a standard EKF-based monocular Visual-Inertial Odometry (VIO) pipeline fusing the camera and IMU. These components were chosen for availability, low cost, and suitability for lightweight embedded UAV platforms; NaviLoc is designed to work with any standard VIO pipeline and is agnostic to the specific sensor choice. As expected, VIO is subject to drift over extended flights and performs no global map alignment; NaviLoc uses only the resulting relative pose increments and is agnostic to the specific VIO implementation.
Ground truth positions are obtained from onboard GNSS logs and are used only for evaluation (MLE/ATE) and visualization; GNSS data are not available to NaviLoc during inference. The reference database was built by sampling satellite map imagery via the Google Maps API at zoom level 19 on a 40 m grid over the flight area; each tile is associated with its geographic coordinate. We then convert both query frames and reference tiles into embeddings using the backbones in Table 1.
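To illustrate the reference-database layout (not the tile download itself, which used the Google Maps API at zoom level 19), the sketch below generates tile-center coordinates on a 40 m grid with a local equirectangular approximation, which is adequate over an area of roughly 1.6 km².

```python
import math

def tile_grid(lat0, lon0, width_m, height_m, spacing_m=40.0):
    # Tile centers on a uniform metric grid, converted to lat/lon with a
    # local equirectangular approximation anchored at (lat0, lon0).
    m_per_deg_lat = 111_320.0
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(lat0))
    centers = []
    y = 0.0
    while y <= height_m:
        x = 0.0
        while x <= width_m:
            centers.append((lat0 + y / m_per_deg_lat, lon0 + x / m_per_deg_lon))
            x += spacing_m
        y += spacing_m
    return centers
```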
4.2. Baselines
We compare against:
- Raw VIO (IP only): VIO trajectory with initial position aligned to ground truth (no rotation).
- VIO + IP + IR (oracle): VIO with initial position and initial rotation estimated from the first displacement (one-segment oracle) or from the first several displacements (multi-segment oracle).
- VIO SE(2) oracle: VIO with oracle global SE(2) alignment to ground truth (best possible rigid alignment).
- Per-frame VPR (top-k): For each query frame, retrieve the k reference tiles with highest cosine similarity and predict position as their coordinate mean (top-1 uses the single best match; top-3 averages the three best); a minimal sketch follows this list.
- AnyLoc-VLAD [1]: State-of-the-art VPR using DINOv2 [35] with VLAD aggregation, evaluated with top-3 retrieval.
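The per-frame VPR baseline, sketched below assuming L2-normalized descriptors so that the dot product equals cosine similarity:

```python
import numpy as np

def per_frame_vpr_topk(Q, F, C, k=3):
    # Q: (N,d) query descriptors, F: (M,d) tile descriptors, C: (M,2) coords.
    sims = Q @ F.T                            # (N, M) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the k best matches
    return C[topk].mean(axis=1)               # (N, 2) predicted positions
```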
4.3. Results
On this challenging real-world benchmark, NaviLoc achieves 19.5 m MLE—a 16.0× improvement over the state-of-the-art AnyLoc-VLAD and a 32.1× improvement over raw VIO drift. Figure 3 and Figure 4 summarize these results.
4.4. Ablation Studies
Stage contributions. Each stage provides substantial improvement (Table 2). Stage 1 alone achieves a 9× improvement by finding the correct global transform. Stage 2 refines local errors, and Stage 3 leverages VIO constraints while excluding outliers.
Outlier threshold. Table 3 shows sensitivity to the z-score threshold.
Hyperparameter sensitivity. Table 4 varies one hyperparameter at a time (others fixed to defaults). Performance is stable across moderate changes, while overly permissive local search radii or underconstrained refinement settings degrade accuracy.
Backbone comparison. Table 5 reveals that DeiT-Tiny-Distilled achieves the best NaviLoc performance despite not having the best per-frame VPR accuracy. Empirically, distillation appears to yield descriptors whose scores are more consistent across time, making trajectory-level alignment easier.
4.5. Computational Efficiency
NaviLoc operates on precomputed descriptors. We benchmark on a Raspberry Pi 5 (ARM Cortex-A76, 8 GB RAM). Table 6 summarizes the computational performance.
Inference strategy. At each timestep, the system samples an aerial image, extracts a 192-dimensional DeiT-Tiny-Distilled embedding (79 ms), and updates the global trajectory estimate using the C++ NaviLoc algorithm (32 ms). The combined end-to-end latency of 111 ms yields 9.0 FPS on Raspberry Pi 5, enabling real-time embedded deployment. Feature extraction becomes the bottleneck; DeiT-Tiny-Distilled requires 1.3 GFLOPs per image versus 21.1 GFLOPs for DINOv2-S, providing 16× faster extraction while achieving superior localization accuracy.
5. Discussion
We now analyze NaviLoc’s performance across multiple dimensions, compare it with alternative approaches, and discuss limitations and future directions.
5.1. Comparison with Existing Methods
Versus per-frame VPR. Per-frame VPR methods treat each query independently, producing 342.1 m MLE (top-3) on our benchmark. NaviLoc achieves 19.5 m—a 17.5× improvement. The key insight is that perceptual aliasing causes individual VPR errors that are spatially inconsistent: incorrect matches scatter across the map, while correct matches cluster near the true trajectory. Trajectory-level optimization exploits this statistical property.
Versus state-of-the-art. AnyLoc [1] represents the state-of-the-art in universal VPR, aggregating DINOv2 [35] features via VLAD pooling. On our benchmark, AnyLoc-VLAD achieves 312.2 m MLE (top-3)—only marginally better than simple per-frame VPR. NaviLoc reduces this to 19.5 m, a 16.0× improvement. This demonstrates that even powerful foundation model features remain susceptible to cross-view aliasing without trajectory-level reasoning.
Versus VIO baselines. Raw VIO accumulates 626.7 m drift over the 2.3 km trajectory. Even with oracle initial rotation alignment, VIO achieves only 90.1 m MLE due to scale and heading drift. NaviLoc’s 19.5 m MLE represents a 32.1× improvement over raw VIO, highlighting the benefit of trajectory-level use of noisy VPR measurements to correct long-horizon drift.
Versus graph-based SLAM. Traditional pose graph optimization [24,25] formulates localization as nonlinear least squares over loop closure constraints. These methods typically require multiple sensors (LiDAR, stereo cameras, IMU), 3D point cloud processing, and GPU acceleration for real-time operation. NaviLoc is fundamentally lighter: it operates on a single monocular camera plus VIO, uses 2D tile descriptors rather than 3D geometry, and requires no GPU. Each stage employs closed-form solutions (median, SVD, linear solve) rather than iterative solvers (Gauss–Newton, Levenberg–Marquardt) and handles outliers via z-score clamping rather than iterative robust optimization [26,27]. This yields deterministic, bounded runtime on CPU-only embedded platforms.
5.2. Computational Efficiency
Real-time embedded deployment. NaviLoc achieves 9 FPS end-to-end on Raspberry Pi 5, combining 79 ms feature extraction (DeiT-Tiny-Distilled) with 32 ms trajectory optimization (C++). This demonstrates that trajectory-level cross-view localization can run in real time on a CPU-only embedded platform.
Comparison with foundation model approaches. While larger backbones can improve standalone VPR on some benchmarks, they are substantially more expensive for embedded CPU deployment. Our ablation (Table 5) shows that DINOv2-ViT-G/14+VLAD achieves 162.0 m MLE versus 19.5 m for DeiT-Tiny-Distilled—larger models did not improve NaviLoc on this cross-view dataset.
Runtime scaling. Let $N$ denote trajectory length and $M$ reference database size. Stage 1 performs $N$ global nearest-neighbor queries, each requiring $O(M)$ comparisons, yielding $O(NM)$; the angle grid size and iteration counts are small constants. Stage 2 runs in $O(N)$ per pass with constant window size. Stage 3 solves a tridiagonal system in $O(N)$ via the Thomas algorithm [36]. The overall time complexity is $O(NM)$; in practice, feature extraction dominates runtime. For larger reference databases, approximate nearest-neighbor structures could reduce this to sublinear in $M$.
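For reference, a textbook Thomas-algorithm solver for the per-coordinate tridiagonal system of Stage 3 is sketched below [36]; the positive definiteness guaranteed by Theorem 3 keeps the elimination stable. This is an illustrative routine, not code from the NaviLoc implementation.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    # Solve a tridiagonal system in O(n): a = sub-diagonal (n-1),
    # b = diagonal (n), c = super-diagonal (n-1), d = right-hand side (n).
    n = len(b)
    cp, dp = np.empty(n - 1), np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                    # forward elimination
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```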
5.3. Portability and Transferability
Training-free operation. Unlike learned cross-view methods [7,8] that require domain-specific training data, NaviLoc operates on pretrained features without fine-tuning. This is critical for GNSS-denied scenarios where ground-truth correspondences are unavailable for training.
Backbone flexibility. NaviLoc accepts any image descriptor; our ablation tested five backbones (Table 5). This modularity allows leveraging future feature extractors without algorithmic changes.
Simplicity. NaviLoc has no learned components beyond the feature extractor, no hyperparameter schedules, and no complex data association. This facilitates debugging and deployment in safety-critical applications.
5.4. Robustness Analysis
Hyperparameter stability. Table 4 demonstrates that NaviLoc maintains sub-25 m accuracy across moderate perturbations of most hyperparameters. The outlier threshold (Table 3) shows graceful degradation outside the optimal range. This robustness simplifies deployment: practitioners can use default settings without extensive tuning.
Adaptive outlier detection. Z-score normalization adapts to per-trajectory statistics: a similarity of 0.25 may indicate an outlier in one flight but a reliable match in another. This eliminates per-dataset threshold tuning.
5.5. Feature Descriptor Analysis
Table 5 shows that DeiT-Tiny-Distilled (5M parameters) achieves 19.5 m MLE, while DINOv2-ViT-G/14 (1.1B parameters) achieves 162.0 m—over 8× worse despite being over 200× larger. This empirical finding suggests that model selection for trajectory-level VPR should consider compatibility with geometric reasoning, not just per-frame retrieval accuracy.
5.6. Practical Extensions
Long-trajectory operation. The current evaluation uses a 2.3 km trajectory processed in batch. For arbitrarily long flights, NaviLoc can be extended via a sliding-window approach: maintain a context window of the most recent 2–3 km of trajectory data and apply NaviLoc to this window for each new position estimate. This bounds memory and computation while providing continuous global localization.
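A sketch of such a wrapper is shown below; `naviloc_batch` stands in for the full three-stage pipeline, and the 75-frame buffer (≈3 km at 40 m inter-frame spacing) is an assumed default, not a value from the paper.

```python
from collections import deque

class SlidingNaviLoc:
    # Bounded-memory wrapper: keep the most recent frames and rerun the
    # batch pipeline on that window for each new position estimate.
    def __init__(self, naviloc_batch, max_frames=75):
        self.run = naviloc_batch
        self.buf = deque(maxlen=max_frames)

    def update(self, query_desc, vio_pos):
        self.buf.append((query_desc, vio_pos))
        Q = [q for q, _ in self.buf]
        V = [v for _, v in self.buf]
        X = self.run(Q, V)          # re-estimate over the current window
        return X[-1]                # latest global position
```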
5.7. Limitations and Future Work
VIO dependency. NaviLoc assumes VIO provides accurate relative motion over short horizons. Significant VIO failures (e.g., from aggressive maneuvers or texture-poor environments) would propagate to the final estimate. Future work could jointly estimate VIO scale and bias.
Trajectory length. Very short trajectories (<10 frames) provide insufficient statistics for robust similarity aggregation. A minimum flight distance of approximately 400 m is recommended. Conversely, NaviLoc handles arbitrarily long trajectories via the sliding-window extension described above.
Seasonal and environmental variation. Our evaluation uses satellite imagery captured under similar conditions (summer, clear weather) to the query flight. Performance under adverse weather conditions (rain, fog, and low visibility), seasonal changes (snow cover, vegetation differences), or significant lighting variations between the reference map and query images remains unevaluated. The training-free design may offer some robustness since the DeiT feature extractor was trained on diverse ImageNet conditions, but systematic evaluation across weather and seasonal variations is an important direction for future work.
Additional trajectories. Due to the complexity of real-world UAV data collection with synchronized VIO and ground truth, our evaluation is limited to a single 2.3 km trajectory. While the ablation studies (Table 3, Table 4, and Table 5) provide statistical validation across hyperparameter variations, evaluation on additional trajectories in diverse environments would further strengthen the generalization claims. Collecting and evaluating datasets spanning multiple environments is planned as future work.
3D extension. The current SE(2) formulation assumes approximately planar flight. Extending to SE(3) would support altitude variations and complex terrain, potentially leveraging 3D point cloud or mesh references.
Online operation. NaviLoc currently processes trajectories in batch. An incremental variant that updates estimates as new frames arrive would enable tighter integration with flight controllers for real-time autonomous navigation.
Energy efficiency. Reducing onboard computation energy extends flight endurance and lowers environmental impact. Systematic energy profiling [37] could guide model selection and duty-cycling to further improve efficiency.
6. Conclusions
We presented NaviLoc, a three-stage trajectory-level localization pipeline with rigorous mathematical foundations. Each stage has a well-defined objective and provable convergence properties. NaviLoc achieves 19.5 m Mean Localization Error on a UAV-to-satellite benchmark, representing a 16× improvement over AnyLoc-VLAD, the state-of-the-art VPR method. End-to-end inference runs at 9 FPS on Raspberry Pi 5, demonstrating practical viability for real-time embedded UAV navigation.
Author Contributions
Conceptualization, methodology, software, experiments, visualization, and writing—original draft: P.S.; supervision, formal analysis, writing—review and editing, and funding acquisition: T.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported by Academia Tech, Taras Shevchenko National University of Kyiv, and Hackathon Expert NGO.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The experimental data and code used in this study are available from the corresponding author upon reasonable request for non-commercial academic research purposes. Requests must be submitted from an academic email address; access is granted after signing an NDA. Eligible requesters will receive a link to a private GitHub (https://github.com/) repository and will be added with read-only access.
Acknowledgments
The authors thank Academia Tech for providing datasets and computational infrastructure to support this research.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AGL | Above Ground Level |
| ATE | Absolute Trajectory Error |
| CPU | Central Processing Unit |
| EKF | Extended Kalman Filter |
| FPS | Frames Per Second |
| GNSS | Global Navigation Satellite System |
| GPS | Global Positioning System |
| GPU | Graphics Processing Unit |
| ICP | Iterative Closest Point |
| IMU | Inertial Measurement Unit |
| LiDAR | Light Detection and Ranging |
| MAP | Maximum A Posteriori |
| MLE | Mean Localization Error |
| MSCKF | Multi-State Constraint Kalman Filter |
| OKVIS | Open Keyframe-based Visual-Inertial SLAM |
| RAM | Random Access Memory |
| SE(2) | Special Euclidean Group in 2D |
| SLAM | Simultaneous Localization and Mapping |
| SVD | Singular Value Decomposition |
| UAV | Unmanned Aerial Vehicle |
| VIO | Visual-Inertial Odometry |
| ViT | Vision Transformer |
| VLAD | Vector of Locally Aggregated Descriptors |
| VPR | Visual Place Recognition |
Appendix A. Detailed Algorithm Pseudocode
This appendix provides detailed pseudocode for each stage of NaviLoc with complete mathematical specifications.
Appendix A.1. Stage 1: Global Align
| Algorithm A1 Global Align (Stage 1) |
Require: VIO trajectory $\{v_i\}_{i=1}^{N}$, Query features $\{q_i\}_{i=1}^{N}$
Require: Reference features $\{f_j\}_{j=1}^{M}$, Reference coords $\{c_j\}_{j=1}^{M}$
Require: Number of angles $K$, Search radius $r$, AM iterations $T$
Ensure: Aligned trajectory $\{\hat{x}^{(1)}_i\}$, Rotation angle $\theta^{*}$
▹ Phase 1: Coarse grid search over rotation
1: $j^{*}(i) \leftarrow \arg\max_j \langle q_i, f_j \rangle$ for all $i$
2: $g_i \leftarrow c_{j^{*}(i)}$ ▹ VPR targets
3: $J^{*} \leftarrow -\infty$, $\theta^{*} \leftarrow 0$
4: for $k = 0, \ldots, K - 1$ do
5: $\tilde{v}_i \leftarrow R(2\pi k / K)\, v_i$ ▹ Rotate VIO
6: $t_k \leftarrow \operatorname{median}_i (g_i - \tilde{v}_i)$ ▹ L1-optimal translation
7: $x_i \leftarrow \tilde{v}_i + t_k$
8: $J_k \leftarrow J(2\pi k / K, t_k)$ ▹ Equation (1)
9: if $J_k > J^{*}$ then
10: $J^{*} \leftarrow J_k$, $\theta^{*} \leftarrow 2\pi k / K$, $t^{*} \leftarrow t_k$
11: end if
12: end for
13: ▹ Phase 2: Alternating maximization
14: for $\tau = 1, \ldots, T$ do
15: $\ell_i \leftarrow$ coordinate of best match within radius $r$ of $R(\theta^{*})\, v_i + t^{*}$ ▹ Local targets
16: $G_\tau \leftarrow \{\theta^{*} + m \Delta_\tau\}_m$ with $\Delta_\tau \propto 2^{-\tau}$ ▹ Shrinking grid
17: improved ← false
18: for $\theta' \in G_\tau$ do
19: $t' \leftarrow \operatorname{median}_i (\ell_i - R(\theta')\, v_i)$
20: $J' \leftarrow J(\theta', t')$
21: if $J' > J^{*}$ then
22: $J^{*} \leftarrow J'$, $\theta^{*} \leftarrow \theta'$, $t^{*} \leftarrow t'$
23: improved ← true
24: end if
25: end for
26: if ¬improved then
27: break ▹ Converged
28: end if
29: end for
30: $\hat{x}^{(1)}_i \leftarrow R(\theta^{*})\, v_i + t^{*}$
31: return $\{\hat{x}^{(1)}_i\}$, $\theta^{*}$ |
Appendix A.2. Stage 2: Refinement
| Algorithm A2 Refinement (Stage 2) |
Require: Trajectory $\{p_j\}_{j=1}^{N}$, Features $\{q_j\}$, Reference $\{(f_j, c_j)\}$
Require: Window size $W$, Stride $s$, Max rotation $\theta_{\max}$, Passes $P$
Ensure: Refined trajectory $\{p_j\}$
1: for pass $= 1, \ldots, P$ do
2: $\mathrm{acc}_j \leftarrow 0$, $\mathrm{cnt}_j \leftarrow 0$ ▹ Accumulators
3: for $b = 0, s, 2s, \ldots$ while $b + W \le N$ do
4: $\mathcal{W} \leftarrow \{b + 1, \ldots, b + W\}$ ▹ Window indices
5: $y_j, s_j \leftarrow$ best-match coordinate and score within radius $r$ of $p_j$, for $j \in \mathcal{W}$
6: if no valid targets in $\mathcal{W}$ then continue
7: end if
8: $w_j \leftarrow s_j^2$ ▹ Squared similarity weights
▹ Bounded weighted Procrustes
9: $\bar{p} \leftarrow \sum_j w_j p_j / \sum_j w_j$
10: $\bar{y} \leftarrow \sum_j w_j y_j / \sum_j w_j$
11: $H \leftarrow \sum_j w_j (p_j - \bar{p})(y_j - \bar{y})^{\top}$ ▹ 2 × 2 cross-covariance
12: $\hat{\theta} \leftarrow \operatorname{atan2}(H_{12} - H_{21}, H_{11} + H_{22})$
13: if $\lvert \hat{\theta} \rvert > \theta_{\max}$ then $\hat{\theta} \leftarrow \operatorname{sign}(\hat{\theta})\, \theta_{\max}$
14: end if ▹ Clip rotation to bounded interval
15: $t \leftarrow \bar{y} - R(\hat{\theta})\, \bar{p}$
16: for $j \in \mathcal{W}$ do
17: $\mathrm{acc}_j \leftarrow \mathrm{acc}_j + R(\hat{\theta})\, p_j + t$ ▹ Apply transform
18: $\mathrm{cnt}_j \leftarrow \mathrm{cnt}_j + 1$
19: end for
20: end for
21: for all $j$ with $\mathrm{cnt}_j > 0$ do ▹ Avoid division by zero
22: $p_j \leftarrow \mathrm{acc}_j / \mathrm{cnt}_j$ ▹ Average overlapping estimates
23: end for
24: end for
25: return $\{p_j\}$ |
Appendix A.3. Stage 3: Smoothing
| Algorithm A3 Smoothing (Stage 3) |
Require: Anchor trajectory $\{a_i\}_{i=1}^{N}$, VIO $\{v_i\}$, Rotation $\theta^{*}$, Features $\{q_i\}$
Require: Z-threshold $\tau$, Smoothness weight $\lambda$
Ensure: Fused trajectory $\{x_i\}$
1: $v^{g}_i \leftarrow R(\theta^{*})\, v_i$ ▹ Rotate VIO to global frame
2: $\delta^{g}_i \leftarrow v^{g}_{i+1} - v^{g}_i$ ▹ Displacements
▹ Compute local VPR similarities
3: $s_i \leftarrow \max \{ \langle q_i, f_j \rangle : \lVert c_j - a_i \rVert \le r \}$ ▹ Z-score outlier detection
4: $\mu_s \leftarrow \operatorname{mean}(s)$, $\sigma_s \leftarrow \operatorname{std}(s)$
5: $z_i \leftarrow (s_i - \mu_s) / \sigma_s$
6: $w_i \leftarrow \mathbb{1}[z_i \ge -\tau]$ for all $i$ ▹ Set anchor weights (outliers clamped)
7: $W \leftarrow \operatorname{diag}(w_1, \ldots, w_N)$
▹ Build and solve linear system
8: Construct difference matrix $D$: $D_{i,i} = -1$, $D_{i,i+1} = 1$
9: $A \leftarrow W + \lambda D^{\top} D$ ▹ Positive definite
10: for each coordinate $k \in \{x, y\}$ do
11: $b \leftarrow W a_{\cdot,k} + \lambda D^{\top} \delta^{g}_{\cdot,k}$
12: $x_{\cdot,k} \leftarrow A^{-1} b$ (tridiagonal solve)
13: end for
▹ Final uniform smoothing
14: $A_u \leftarrow I + \lambda D^{\top} D$
15: for each coordinate $k \in \{x, y\}$ do
16: $x_{\cdot,k} \leftarrow A_u^{-1} \bigl( x_{\cdot,k} + \lambda D^{\top} \delta^{g}_{\cdot,k} \bigr)$
17: end for
18: return $\{x_i\}$ |
References
- Keetha, N.; Mishra, A.; Karhade, J.; Jatavallabhula, K.M.; Scherer, S.; Krishna, M.; Garg, S. AnyLoc: Towards Universal Visual Place Recognition. IEEE Robot. Autom. Lett. 2024, 9, 1286–1293. [Google Scholar] [CrossRef]
- Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
- Sivic, J.; Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the IEEE Conference on International Conference on Computer Vision (ICCV), Nice, France, 13–16 October 2003; pp. 1470–1477. [Google Scholar] [CrossRef]
- Gálvez-López, D.; Tardós, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197. [Google Scholar] [CrossRef]
- Arandjelović, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN Architecture for Weakly Supervised Place Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 5297–5307. [Google Scholar] [CrossRef]
- Ali-bey, A.; Chaib-draa, B.; Giguère, P. MixVPR: Feature Mixing for Visual Place Recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2998–3007. [Google Scholar] [CrossRef]
- Hu, S.; Feng, M.; Nguyen, R.M.; Lee, G.H. CVM-Net: Cross-View Matching Network for Image-Based Ground-to-Aerial Geo-Localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7258–7267. [Google Scholar] [CrossRef]
- Zhu, S.; Shah, M.; Chen, C. VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 3640–3649. [Google Scholar] [CrossRef]
- Workman, S.; Souvenir, R.; Jacobs, N. Wide-Area Image Geolocalization with Aerial Reference Imagery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3961–3969. [Google Scholar] [CrossRef]
- Zheng, Z.; Wei, Y.; Yang, Y. University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization. In Proceedings of the ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1395–1403. [Google Scholar] [CrossRef]
- Ding, L.; Zhou, J.; Meng, L.; Long, Z. A Practical Cross-View Image Matching Method between UAV and Satellite for UAV-Based Geo-Localization. Remote Sens. 2021, 13, 47. [Google Scholar] [CrossRef]
- Zhuo, X.; Koch, T.; Kurz, F.; Fraundorfer, F.; Reinartz, P. Automatic UAV Image Geo-Registration by Matching UAV Images to Georeferenced Image Data. Remote Sens. 2017, 9, 376. [Google Scholar] [CrossRef]
- Zhuang, J.; Dai, M.; Chen, X.; Zheng, E. A Faster and More Effective Cross-View Matching Method of UAV and Satellite Images for UAV Geolocalization. Remote Sens. 2021, 13, 3979. [Google Scholar] [CrossRef]
- Cui, Z.; Zhou, P.; Wang, X.; Zhang, Z.; Li, Y.; Li, H.; Zhang, Y. A Novel Geo-Localization Method for UAV and Satellite Images Using Cross-View Consistent Attention. Remote Sens. 2023, 15, 4667. [Google Scholar] [CrossRef]
- Couturier, A.; Akhloufi, M.A. A Review on Deep Learning for UAV Absolute Visual Localization. Drones 2024, 8, 622. [Google Scholar] [CrossRef]
- Zváriková, K.; Lázaro, A.; Durana, P. Neural Network-based Recognition and Virtual Simulation Algorithms, Interactive 3D Geo-Visualization and Neuromorphic Computing Systems, and Tactile Sensing and Cognitive Modeling Technologies in Web3-powered Metaverse Worlds. Rev. Contemp. Philos. 2023, 22, 170–186. [Google Scholar] [CrossRef]
- Kurdel, P.; Lazar, T.; Mrekaj, B.; Hovanec, M.; Novák Sedláčková, A. Air Safety Information Management. In Aviation Technologies, Systems and Research; Springer: Cham, Switzerland, 2023; pp. 1–26. [Google Scholar] [CrossRef]
- Besl, P.J.; McKay, N.D. A Method for Registration of 3-D Shapes. IEEE Trans. Pattern Anal. Mach. Intell. 1992, 14, 239–256. [Google Scholar] [CrossRef]
- Chen, Y.; Medioni, G. Object Modelling by Registration of Multiple Range Images. Image Vis. Comput. 1992, 10, 145–155. [Google Scholar] [CrossRef]
- Schönemann, P.H. A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 1966, 31, 1–10. [Google Scholar] [CrossRef]
- Arun, K.S.; Huang, T.S.; Blostein, S.D. Least-Squares Fitting of Two 3-D Point Sets. IEEE Trans. Pattern Anal. Mach. Intell. 1987, 9, 698–700. [Google Scholar] [CrossRef]
- Umeyama, S. Least-Squares Estimation of Transformation Parameters Between Two Point Patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
- Huber, P.J. Robust Statistics; John Wiley & Sons: New York, NY, USA, 1981; ISBN 978-0-471-41805-4. [Google Scholar]
- Cadena, C.; Carlone, L.; Carrillo, H.; Latif, Y.; Scaramuzza, D.; Neira, J.; Reid, I.; Leonard, J.J. Past, Present, and Future of Simultaneous Localization and Mapping. IEEE Trans. Robot. 2016, 32, 1309–1332. [Google Scholar] [CrossRef]
- Kümmerle, R.; Grisetti, G.; Strasdat, H.; Konolige, K.; Burgard, W. g2o: A General Framework for Graph Optimization. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; pp. 3607–3613. [Google Scholar] [CrossRef]
- Sünderhauf, N.; Protzel, P. Switchable Constraints for Robust Pose Graph SLAM. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Portugal, 7–12 October 2012; pp. 1879–1884. [Google Scholar] [CrossRef]
- Yang, H.; Antonante, P.; Tzoumas, V.; Carlone, L. Graduated Non-Convexity for Robust Spatial Perception. IEEE Robot. Autom. Lett. 2020, 5, 1127–1134. [Google Scholar] [CrossRef]
- Mourikis, A.I.; Roumeliotis, S.I. A Multi-State Constraint Kalman Filter for Vision-Aided Inertial Navigation. In Proceedings of the IEEE International Conference on Robotics & Automation (ICRA), Rome, Italy, 10–14 April 2007; pp. 3565–3572. [Google Scholar] [CrossRef]
- Leutenegger, S.; Lynen, S.; Bosse, M.; Siegwart, R.; Furgale, P. Keyframe-Based Visual-Inertial Odometry Using Nonlinear Optimization. Int. J. Robot. Res. 2015, 34, 314–334. [Google Scholar] [CrossRef]
- Qin, T.; Li, P.; Shen, S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator. IEEE Trans. Robot. 2018, 34, 1004–1020. [Google Scholar] [CrossRef]
- Campos, C.; Elvira, R.; Rodríguez, J.J.G.; Montiel, J.M.M.; Tardós, J.D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM. IEEE Trans. Robot. 2021, 37, 1874–1890. [Google Scholar] [CrossRef]
- Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; ISBN 978-0-521-83378-3. [Google Scholar]
- Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training Data-efficient Image Transformers & Distillation through Attention. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021. [Google Scholar]
- Oquab, M.; Darcet, T.; Moutakanni, T.; Vo, H.V.; Szafraniec, M.; Khalidov, V.; Fernandez, P.; Haziza, D.; Massa, F.; El-Nouby, A.; et al. DINOv2: Learning Robust Visual Features without Supervision. arXiv 2023, arXiv:2304.07193. [Google Scholar]
- Golub, G.H.; Van Loan, C.F. Matrix Computations, 4th ed.; Johns Hopkins University Press: Baltimore, MD, USA, 2013; ISBN 978-1-4214-0794-4. Available online: https://www.worldcat.org/isbn/9781421407944 (accessed on 20 November 2025).
- Panchenko, T.V.; Piatygorskiy, N.D. Enrichment of the HEPscore Benchmark by Energy Consumption Assessment. Technologies 2025, 13, 362. [Google Scholar] [CrossRef]