Article

LiteSAM: Lightweight and Robust Feature Matching for Satellite and Aerial Imagery

1
School of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
2
School of Integrated Circuits, Harbin Institute of Technology, Shenzhen 518055, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2025, 17(19), 3349; https://doi.org/10.3390/rs17193349
Submission received: 12 July 2025 / Revised: 24 September 2025 / Accepted: 29 September 2025 / Published: 1 October 2025


Highlights

What are the main findings?
  • LiteSAM, a lightweight satellite–UAV feature matching framework, employs unified feature representation and fine-grained matching strategies to achieve a superior trade-off between accuracy and efficiency in cross-view matching.
  • Achieves state-of-the-art performance on multiple benchmarks while substantially reducing model complexity and inference latency.
What is the implication of the main finding?
  • Enables real-time UAV visual localization in GPS-denied and resource-constrained environments, making it practical for deployment.
  • Demonstrates strong generalization across datasets and scenarios, extending its applicability to remote sensing and natural image matching tasks.

Abstract

We present a (Light)weight (S)atellite–(A)erial feature (M)atching framework (LiteSAM) for robust UAV absolute visual localization (AVL) in GPS-denied environments. Existing satellite–aerial matching methods struggle with large appearance variations, texture-scarce regions, and limited efficiency for real-time UAV applications. LiteSAM integrates three key components to address these issues. First, efficient multi-scale feature extraction optimizes representation, reducing inference latency for edge devices. Second, a Token Aggregation–Interaction Transformer (TAIFormer) with a convolutional token mixer (CTM) models inter- and intra-image correlations, enabling robust global–local feature fusion. Third, a MinGRU-based dynamic subpixel refinement module adaptively learns spatial offsets, enhancing subpixel-level matching accuracy and cross-scenario generalization. The experiments show that LiteSAM achieves competitive performance across multiple datasets. On UAV-VisLoc, LiteSAM attains an RMSE@30 of 17.86 m, outperforming state-of-the-art semi-dense methods such as EfficientLoFTR. Its optimized variant, LiteSAM (opt., without dual softmax), delivers inference times of 61.98 ms on standard GPUs and 497.49 ms on NVIDIA Jetson AGX Orin, which are 22.9% and 19.8% faster than EfficientLoFTR (opt.), respectively. With 6.31M parameters, which is 2.4× fewer than EfficientLoFTR’s 15.05M, LiteSAM proves to be suitable for edge deployment. Extensive evaluations on natural image matching and downstream vision tasks confirm its superior accuracy and efficiency for general feature matching.

1. Introduction

Recent advances in UAV manufacturing and the enhanced computational capabilities of edge devices have rapidly broadened the scope of UAV applications. Unmanned aerial vehicles (UAVs) are now extensively deployed across diverse domains, including urban planning [1], border security [2], disaster management [3], precision agriculture [4], surveying and mapping [5], environmental conservation [6], military operations [7], infrastructure inspection [8], and construction [9]. Precise localization is critical for autonomous navigation and effective task execution. However, Global Navigation Satellite Systems (GNSSs) often suffer from degraded performance in challenging environments—such as urban canyons, dense forests, or regions affected by signal interference and spoofing.
In GNSS-degraded or GNSS-denied environments, absolute visual localization (AVL) provides a reliable alternative by matching aerial imagery obtained by UAVs with pre-stored reference satellite or aerial imagery to estimate global positions. In contrast, Relative Visual Localization (RVL) estimates a UAV’s motion state through changes between consecutive frames, enabling local navigation, but it is susceptible to cumulative drift over time. By anchoring observations directly to a fixed global reference frame, AVL fundamentally eliminates drift errors. Serving as the foundation for environmental perception, precise navigation, and 3D reconstruction, AVL significantly enhances the reliability of UAV operations in complex scenarios by enabling them to comprehensively understand their surroundings and execute autonomous decisions. Consequently, AVL has emerged as a pivotal technology, endowing UAVs with robust autonomous flight capabilities.
Based on matching granularity, AVL can be classified into large-scale matching and fine-grained matching [10]. Large-scale matching focuses on image-level correspondences for initial localization, while fine-grained matching establishes pixel-level correspondences to support downstream tasks such as pose estimation, visual localization, and 3D reconstruction. Typically, fine-grained matching is performed after large-scale matching for high-precision localization. Because it closely resembles natural image matching, classical methods (e.g., template matching and hand-crafted features such as SIFT [11], ORB [12], and SURF [13]) are directly applicable to AVL. In satellite imagery, appearance and scale variations destabilize template matching. Moreover, keypoint-based methods often lack robustness in common AVL scenarios, such as repeated structures, low-texture regions, extreme scale differences, atmospheric distortions, or large viewpoint changes, resulting in diminished localization accuracy. To address these challenges, learning-based feature matching methods have been proposed. These methods learn to extract feature points and generate descriptors from massive-scale datasets, significantly improving the generalization and stability of matching, effectively mitigating complex environmental disturbances, and ultimately enhancing the reliability of AVL in high-precision localization tasks.
Directly applying feature matchers designed for natural images to AVL is challenging due to variations in illumination, seasonality, viewpoint, and scale, as well as motion blur in satellite–aerial images, which complicate matching and degrade performance. Semi-dense large models [14,15] are typically used to establish correspondences; however, their transformer-based processing of full-scale feature maps incurs high computational costs and is unsuitable for real-time edge deployment. Although low-resolution token selection can reduce some of this overhead [16], modeling inter- and intra-image correlations remains a bottleneck, and sparse matchers lack sufficient accuracy [17,18]. To address this, we propose an edge-friendly semi-dense matcher for AVL with reduced model size and computation, following a coarse-to-fine architecture [16].
In this work, we introduce LiteSAM, a lightweight and efficient feature matching framework that achieves a better trade-off between accuracy and inference latency and is specifically designed for AVL applications. As illustrated in Figure 1, LiteSAM achieves state-of-the-art localization accuracy on satellite–aerial benchmarks with significantly fewer parameters and lower computational costs, while also performing competitively in natural image matching tasks. We employ a reparameterized feature extraction backbone to reduce inference latency while maintaining strong representation capability. We further introduce a novel efficient feature fusion module, TAIFormer, which utilizes a convolutional token mixer (CTM) to effectively capture both global and local feature correlations, and we propose a learnable correspondence refinement strategy to improve matching accuracy and generalization. LiteSAM transfers directly from natural image datasets to satellite–aerial image datasets without fine-tuning on the target dataset, maintaining high accuracy across diverse domains while significantly reducing computational complexity, which yields a strong balance between performance and efficiency for resource-constrained UAV visual localization tasks.
Our contributions can be summarized as follows:
(1)
We employ efficient multi-scale feature extraction that optimizes representation capacity and significantly reduces inference latency, enabling robust deployment on resource-constrained edge devices for real-time UAV localization tasks.
(2)
We introduce a Token Aggregation–Interaction Transformer (TAIFormer), a lightweight hybrid module leveraging a convolutional token mixer (CTM), to efficiently model inter- and intra-image correlations. This design achieves robust global–local feature fusion, enhancing matching accuracy across diverse appearance variations.
(3)
We develop a MinGRU-based dynamic subpixel refinement module that adaptively learns spatial offsets through local contextual modeling. This approach significantly improves subpixel-level matching accuracy and cross-scenario generalization for complex satellite–aerial scenes.
(4)
Extensive experiments validate that LiteSAM delivers competitive performance in accuracy, latency, and model size, particularly on the UAV-VisLoc dataset and NVIDIA Jetson edge devices. These results demonstrate its suitability for real-time UAV localization and general feature matching applications.

2. Related Work

2.1. Detector-Based Feature Matching

Detector-based methods perform image matching by identifying sparse keypoints and extracting their corresponding descriptors.
Learning-based approaches (e.g., D2D [25], SuperPoint [23]) leverage CNNs and large datasets to train robust keypoint detectors and feature descriptors, achieving superior matching performance under diverse appearance variations and adverse lighting conditions.
Detector-based methods still fail in extreme AVL conditions due to weak integration between detection and description stages. SuperGlue [17] mitigates this limitation by utilizing attention and graph neural networks (GNNs) for self- and cross-attention matching. However, its high computational and memory requirements hinder deployment on UAV edge devices.
In UAV-based AVL applications, substantial appearance discrepancies between current and reference images caused by variations in flight altitude, illumination, and seasonal conditions are common. Additionally, extensive low-texture regions, such as open terrains and water surfaces, exacerbate keypoint sparsity and descriptor degradation, leading to degraded matching performance.
To overcome these issues, we adopt detector-free matching, which avoids sparse keypoints and directly computes dense correspondences to improve accuracy and robustness in complex environments.

2.2. Detector-Free Feature Matching

Instead of relying on explicit keypoint detection, detector-free approaches establish dense or semi-dense correspondences directly at the pixel or patch level. They utilize convolutional neural networks (CNNs) to extract dense feature representations and construct 4D correlation volumes to model pairwise similarities. Nonetheless, these methods often exhibit limited robustness under severe illumination changes or wide baseline variations, which constrains their effectiveness in highly dynamic scenarios, such as satellite–aerial image matching.
LoFTR [15] proposes a transformer-based feature matching framework that leverages both self- and cross-attention, substantially improving the modeling of global context and long-range dependencies. Building upon this paradigm, several specialized frameworks have emerged. EfficientLoFTR [16] introduces an aggregated attention network for efficient local feature transformation and designs a two-stage correlation refinement module to enhance matching accuracy and subpixel alignment, thereby significantly enhancing overall efficiency and accuracy. ASpanFormer [14] implements an adaptive local attention design that dynamically adjusts the receptive field according to the complexity of local matching, thereby improving robustness in scenes with structural clutter or low texture. MatchFormer [26] utilizes a hierarchical transformer to jointly optimize feature extraction and correspondence estimation, which increases the distinctiveness of the learned features. In a different vein, TopicFM [27] segments features into multiple semantic topics and focuses computation on semantically aligned regions across images, leading to stronger cross-domain generalization. Its improved variant, TopicFM+ [28], further integrates sparse attention for fixed-size topics and fine-grained structures, substantially lowering the computational overhead while preserving matching accuracy. DeepMatcher [29] employs vector-based attention to model the correlations among keypoints. GeoAT [30] utilizes the affine transformation matrix estimated during coarse matching to guide the cross-attention process in the intermediate matching stage, thereby enhancing correspondence accuracy. JamMa [31] replaces the transformer with a Mamba architecture to achieve a well-balanced trade-off between matching precision and computational efficiency.
Despite improved accuracy and robustness, high computational costs hinder resource-constrained deployments, particularly in real-time AVL on UAVs with stringent latency and energy constraints. To overcome this, we propose TAIFormer, which enhances feature discriminability at a low cost, offering an efficient UAV-based AVL solution.
Moreover, recent dense matching frameworks such as DKM [32], RoMa [33], and HomoMatcher [34] incorporate probabilistic modeling and homography estimation to achieve high-precision pixel-level correspondences. These techniques perform well in applications demanding rigorous geometric consistency. However, their computational complexity escalates rapidly with image resolution, hindering their applicability in latency-sensitive environments. In contrast, the semi-dense matching strategy presented in this work maintains competitive accuracy with significantly reduced overhead, achieving a favorable trade-off between performance and efficiency in real-time UAV-based AVL applications.

2.3. Feature Matching in AVL

Early approaches to AVL typically rely on template matching methods (e.g., [35,36,37]) to localize UAVs by identifying corresponding regions in satellite imagery. In recent years, learning-based aerial image matching techniques have advanced significantly and can be broadly categorized into large-scale and fine-grained matching methods. Large-scale matching methods rely on global descriptors for image retrieval, enabling the identification of candidate satellite views from massive databases. For instance, Sample4Geo [38] presents a lightweight contrastive learning framework for cross-view localization, while Deuser et al. [39] propose a compact orientation prediction module to enhance pose estimation. Although these methods offer strong scalability, they often fall short of the precision required for fine-grained localization.
Fine-grained matching methods address this limitation by aligning features at the pixel level to achieve high localization accuracy. WildNav [40] combines SuperPoint [23] and SuperGlue [17] to establish sparse correspondences across UAV and satellite images. VDFT [41] introduces a viewpoint-invariant transformation to mitigate geometric distortions, while MOMA [42] proposes a structure-consistent framework to handle scale variations. Notwithstanding their improved accuracy, these methods often involve bulky models and significant computational cost, making them difficult to deploy on UAV edge devices with strict real-time and resource constraints. SwinMatcher [43] introduces a cross-modal remote-sensing image matching framework that integrates cross-modal feature interaction with multi-scale contextual modeling, thereby enhancing robustness in complex environments. Similarly, ETQ-Matcher [44] employs a multi-transformer with channel attention to capture global representations, achieving superior performance in challenging aerial–ground urban scenarios.
In contrast, the fine-grained matching strategy proposed in this work achieves a more favorable trade-off between accuracy and efficiency, providing a practical solution for real-time AVL under constrained UAV platforms.

3. Methodology

Image matching is used to establish reliable correspondences between a given image pair. Formally, for an image pair $(I^A, I^B)$, where $I^A, I^B \in \mathbb{R}^{H \times W}$, a feature extractor is employed to obtain the corresponding feature sets $F^A$ and $F^B$. By computing the correlation between features, matching relationships $(k_i^A, k_j^B)$ are identified, where $k_i^A \in F^A$ and $k_j^B \in F^B$. Here, i and j denote the indices of individual feature points in the respective feature sets, representing potentially corresponding locations across the two images.
We focus on satellite–aerial image matching, which is particularly challenging due to significant viewpoint variations and low-texture regions. To tackle these challenges with minimal computational overhead, we employ a coarse-to-fine progressive matching strategy to achieve high-precision correspondences at the subpixel level.
LiteSAM is specifically designed to address these challenges in satellite–aerial image matching. By processing input image pairs ( I A , I B ) , the framework extracts multi-scale feature maps and progressively generates correspondence sets { ( k i A , k j B ) } from coarse to subpixel-level precision. This design ensures high-precision matching while maintaining lightweight and efficient computation, suitable for real-time aerial visual localization tasks. As illustrated in Figure 2, the overall architecture consists of the following key modules:
  • Section 3.1 introduces a reparameterizable backbone for compact and efficient feature representation;
  • Section 3.2 presents the Token Aggregation–Interaction Transformer (TAIFormer), which adaptively integrates local and global cues to facilitate rich contextual interaction;
  • Section 3.3 outlines a coarse-level matching strategy for generating initial correspondence hypotheses;
  • Section 3.4 describes a learnable refinement module that progressively improves correspondence accuracy to the pixel and subpixel levels;
  • Section 3.5 formulates the multi-level loss functions used to jointly supervise all components in an end-to-end training scheme.

3.1. Reparameterization-Based Feature Extraction

We adopt MobileOne as the backbone network due to its lightweight architecture and suitability for deployment in mobile and edge computing scenarios. During training, MobileOne integrates multiple linear branches, which are reparameterized into a unified feed-forward structure during inference. This architectural transformation effectively minimizes memory overhead and improves runtime efficiency, enabling real-time performance on resource-limited platforms.
In detail, we employ the MobileOne-S3 variant. After discarding the classification head and retaining only the convolutional feature extraction layers, the parameter count is significantly reduced from 10.1 million to 0.81 million. This substantial model compression lowers both memory footprint and computational overhead. On the ImageNet dataset, the original model achieves a top-1 accuracy of 78.1% and an inference latency of 1.53 ms on mobile devices, outperforming most reparameterized backbones of comparable size in terms of both effectiveness and speed.
For feature extraction, MobileOne produces 1/8-scale feature maps, denoted as $F_c^A, F_c^B \in \mathbb{R}^{C \times \frac{H}{8} \times \frac{W}{8}}$, from the input image pair $(I^A, I^B)$. Additionally, we extract higher-resolution features at 1/4 and 1/2 scales to preserve finer-grained spatial information. This hierarchical feature representation provides essential contextual cues for subsequent refinement stages, enabling subpixel-level correspondence accuracy while maintaining a favorable balance between accuracy and efficiency.
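To make the reparameterization described above concrete, the following minimal sketch (ours, not the released implementation; MobileOne additionally uses over-parameterized depth-wise and point-wise branches) shows how a RepVGG/MobileOne-style block with parallel 3 × 3, 1 × 1, and identity branches at training time can be fused into a single 3 × 3 convolution for inference. The fused form is equivalent in evaluation mode, where batch normalization uses its running statistics.

```python
# Illustrative RepVGG/MobileOne-style reparameterization (a sketch, not the authors' code):
# parallel 3x3, 1x1, and identity branches used at training time are fused into one 3x3
# convolution for inference.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn_id = nn.BatchNorm2d(channels)          # identity (skip) branch
        self.fused = None                              # filled in by reparameterize()

    def forward(self, x):
        if self.fused is not None:                     # inference path: a single convolution
            return torch.relu(self.fused(x))
        return torch.relu(self.bn3(self.conv3(x)) + self.bn1(self.conv1(x)) + self.bn_id(x))

    @torch.no_grad()
    def reparameterize(self):
        def fuse(weight, bn):                          # fold BatchNorm into the preceding conv
            std = (bn.running_var + bn.eps).sqrt()
            return (weight * (bn.weight / std).reshape(-1, 1, 1, 1),
                    bn.bias - bn.running_mean * bn.weight / std)
        c = self.conv3.out_channels
        w3, b3 = fuse(self.conv3.weight, self.bn3)
        w1, b1 = fuse(F.pad(self.conv1.weight, [1, 1, 1, 1]), self.bn1)  # 1x1 -> centered 3x3
        id_kernel = torch.zeros_like(self.conv3.weight)
        for i in range(c):
            id_kernel[i, i, 1, 1] = 1.0                # identity expressed as a 3x3 kernel
        wi, bi = fuse(id_kernel, self.bn_id)
        self.fused = nn.Conv2d(c, c, 3, padding=1)
        self.fused.weight.copy_(w3 + w1 + wi)
        self.fused.bias.copy_(b3 + b1 + bi)
```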

3.2. Feature Information Propagation

While MobileOne efficiently extracts multi-scale features, it lacks the capacity to model inter-image correspondences, which are essential for accurate matching. To address this, we introduce a lightweight context module that enriches both global and local features.
We propose the Token Aggregation–Interaction Transformer (TAIFormer) as shown in Figure 3, a hybrid architecture that combines convolutional and transformer mechanisms to model intra- and inter-image correspondences. TAIFormer enhances feature representations for robust matching. It comprises two components: the convolutional token mixer (CTM), which integrates multi-head attention and convolution to capture global context and spatial details, and the convolutional feed-forward network (ConvFFN), which replaces fully connected layers with convolutional blocks to improve spatial coherence and reduce redundancy.
Through hierarchical attention and convolutional refinement, TAIFormer yields context-aware features for accurate and efficient correspondence refinement under resource constraints.

3.2.1. Convolutional Token Mixer (CTM)

Traditional token mixers effectively capture long-range dependencies and enable context-aware modeling, but their reliance on large matrix operations often incurs high computational costs and latency. To address this limitation, we employ depth-wise convolution (DWConv) and max pooling (MaxPool) for feature downsampling. This design reduces complexity while preserving essential global contextual cues. In particular, DWConv encodes local spatial structures, whereas MaxPool retains dominant activations by compressing feature responses. The query, key, and value matrices, denoted as $Q, K, V \in \mathbb{R}^{C \times \frac{H}{8} \times \frac{W}{8}}$, are initialized as follows:
$$Q = W_Q \otimes \mathrm{DWConv}(f_i),$$
$$K = W_K \otimes \mathrm{MaxPool}(f_j),$$
$$V = W_V \otimes \mathrm{MaxPool}(f_j),$$
where $\otimes$ denotes matrix multiplication; $W_Q$, $W_K$, and $W_V$ are linear projection layers; $f_i, f_j \in \mathbb{R}^{C \times \frac{H}{8} \times \frac{W}{8}}$ represent the 1/8-scale features obtained during feature extraction. For intra-image correlation, $f_i = f_j$, whereas for inter-image correlation, $f_i \neq f_j$.
  • Global Contextual Feature Aggregation
The downsampled features are subsequently processed by multi-head attention (MHA) to enhance global representations, thereby yielding the global feature tensor F g l o b a l as follows:
$$F_{global} = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V,$$
where Q, K, and V denote the query, key, and value matrices, respectively, and d is a scaling factor used to stabilize gradients.
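The following PyTorch sketch illustrates the query/key/value construction and the global aggregation of Equation (4) as described above; the number of heads, the pooling stride, and the projection shapes are illustrative assumptions rather than the released configuration.

```python
# Hedged sketch of the CTM global branch: queries from DWConv(f_i), keys/values from
# MaxPool(f_j), followed by multi-head scaled dot-product attention.
import torch
import torch.nn as nn

class CTMGlobalAttention(nn.Module):
    def __init__(self, dim, num_heads=4, pool_stride=4):
        super().__init__()
        self.num_heads = num_heads
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # DWConv for queries
        self.pool = nn.MaxPool2d(pool_stride)                    # MaxPool for keys/values
        self.proj_q = nn.Conv2d(dim, dim, 1)                     # W_Q
        self.proj_k = nn.Conv2d(dim, dim, 1)                     # W_K
        self.proj_v = nn.Conv2d(dim, dim, 1)                     # W_V

    def forward(self, f_i, f_j):
        # f_i == f_j for intra-image (self) attention; f_i != f_j for inter-image (cross) attention.
        q = self.proj_q(self.dw(f_i))
        k = self.proj_k(self.pool(f_j))
        v = self.proj_v(self.pool(f_j))
        b, c, hq, wq = q.shape
        d = c // self.num_heads

        def to_tokens(t):                                        # (B, C, H, W) -> (B, heads, N, d)
            return t.flatten(2).transpose(1, 2).reshape(b, -1, self.num_heads, d).transpose(1, 2)

        q_t, k_t, v_t = to_tokens(q), to_tokens(k), to_tokens(v)
        attn = torch.softmax(q_t @ k_t.transpose(-2, -1) / d ** 0.5, dim=-1)
        f_global = (attn @ v_t).transpose(1, 2).reshape(b, hq * wq, c)
        return f_global.transpose(1, 2).reshape(b, c, hq, wq)    # F_global of Equation (4)
```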
As shown in Figure 2, intra-image correlation is modeled via self-attention with Rotary Position Embedding (RoPE) [45], which captures dynamic relative dependencies within token sequences. The 2D RoPE is defined as follows:
$$R(x, y) = \begin{pmatrix} \cos\theta_x & -\sin\theta_x & 0 & 0 \\ \sin\theta_x & \cos\theta_x & 0 & 0 \\ 0 & 0 & \cos\theta_y & -\sin\theta_y \\ 0 & 0 & \sin\theta_y & \cos\theta_y \end{pmatrix}, \qquad \theta_x = 10000^{-2c/d}\, x, \quad \theta_y = 10000^{-2c/d}\, y,$$
where c indexes channel pairs, and d is the feature dimension; $(x_i, y_i)$ and $(x_j, y_j)$ are the spatial coordinates of Q and K. Applying the rotations to Q and K, the attention product can be written as
$$Q_{rot} K_{rot}^{\top} = Q R(x_i, y_i)\big(K R(x_j, y_j)\big)^{\top} = Q R(x_j - x_i, y_j - y_i) K^{\top},$$
showing that attention depends only on the relative positions between features. $Q_{rot}$ and $K_{rot}$ are then used in Equation (4) to integrate positional information into self-attention. In contrast, inter-image correlation adopts the same CTM structure with a cross-attention mechanism, omitting RoPE to avoid positional misalignment between different image domains.
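A minimal sketch of 2D RoPE as used above, under the common assumptions that the channel dimension is split in half for the x and y axes (and is divisible by four) and that the base frequency is 10,000. Applying it to flattened query and key tokens before the attention product makes the scores depend only on relative grid offsets.

```python
# Illustrative 2D rotary position embedding; channel split and base frequency are assumptions.
import torch

def rope_2d(tokens, coords, base=10000.0):
    """tokens: (N, C) flattened query or key features; coords: (N, 2) integer (x, y) grid positions."""
    n, c = tokens.shape
    half = c // 2                                  # first half encodes x, second half encodes y
    rotated = []
    for axis in range(2):                          # axis 0 -> x, axis 1 -> y
        tc = tokens[:, axis * half:(axis + 1) * half]
        pos = coords[:, axis].float().unsqueeze(1)                                   # (N, 1)
        freq = base ** (-torch.arange(0, half, 2, device=tokens.device,
                                      dtype=torch.float32) / half)                   # (half/2,)
        angle = pos * freq                         # theta * position, per channel pair
        cos, sin = angle.cos(), angle.sin()
        t_even, t_odd = tc[:, 0::2], tc[:, 1::2]
        rotated.append(torch.stack((t_even * cos - t_odd * sin,
                                    t_even * sin + t_odd * cos), dim=-1).flatten(1))
    return torch.cat(rotated, dim=1)
```

For self-attention, both the query and key tokens are rotated with the same 1/8-scale grid coordinates, e.g., q_rot = rope_2d(q_tokens, coords) and k_rot = rope_2d(k_tokens, coords).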
  • Local Contextual Feature Aggregation
In image matching tasks, point features are often ambiguous or indistinct, especially in low-texture or repetitive regions. Relying solely on global representations may compromise localization and matching accuracy. To address this, local context is incorporated to enhance spatial detail and improve keypoint detection and matching.
The CTM integrates Depthwise Convolution (DWConv) to capture fine-grained local structures and applies a Sigmoid function for adaptive modulation. This design increases sensitivity to subtle local variations and enhances matching reliability. The computation is formulated as follows:
$$F_{local} = \sigma\big(\mathrm{DWConv}(Q)\big) \odot \mathrm{DWConv}(Q),$$
where σ denotes the Sigmoid activation function, and ⊙ represents the Hadamard product.
  • Dynamic Feature Interaction
To facilitate the effective integration of global and local representations, we design a dynamic feature interaction module that adaptively modulates local features using global contextual cues through Sigmoid. Notably, a weighting factor is computed from the global feature representation F g l o b a l via the Sigmoid function σ ( · ) and applied to F l o c a l through element-wise multiplication. The modulated local and global features are then fused and projected using a 1 × 1 convolutional layer to produce the final output of the token mixer. The process is expressed as follows:
$$F_{fuse} = \mathrm{Conv}_{1 \times 1}\big(F_{local} \odot \sigma(F_{global})\big) \oplus F_{global},$$
where F global is derived from Equation (4), and ⊙ represents the Hadamard product.
Finally, $F_{fuse}$ is upsampled to match the original input resolution and concatenated with the initial input feature $f_i$ to form the output of the convolutional token mixer (CTM). This fusion strategy effectively preserves both fine-grained spatial details and holistic semantic information for downstream processing.
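A compact sketch of the local gating and dynamic fusion steps above; it assumes that the local and global branches share the same spatial resolution and reads the fusion operator as element-wise addition, which is our interpretation rather than a confirmed detail of the released code.

```python
# Hedged sketch of Eq. (7) (local gating) and Eq. (8) (dynamic fusion); layer sizes are illustrative.
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dw_gate = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # DWConv inside the sigmoid
        self.dw_feat = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # DWConv outside the sigmoid
        self.fuse = nn.Conv2d(dim, dim, 1)                            # 1x1 projection

    def forward(self, q, f_global):
        # F_local = sigmoid(DWConv(Q)) ⊙ DWConv(Q)
        f_local = torch.sigmoid(self.dw_gate(q)) * self.dw_feat(q)
        # F_fuse = Conv1x1(F_local ⊙ sigmoid(F_global)) ⊕ F_global  (⊕ assumed to be addition)
        return self.fuse(f_local * torch.sigmoid(f_global)) + f_global
```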

3.2.2. Convolutional Feed-Forward Network

We employ a convolutional feed-forward network (ConvFFN) to improve spatial sensitivity and local contextual modeling. ConvFFN combines 1 × 1 point-wise convolutions with DWConv to achieve efficient feature transformation.
The input feature map, $F_{fuse} \in \mathbb{R}^{2C \times \frac{H}{8} \times \frac{W}{8}}$, is first passed through a 1 × 1 convolutional layer to compress the channel dimension to C. A GELU activation and a subsequent DWConv layer are then applied to capture fine-grained spatial dependencies while minimizing computational overhead. A residual connection is adopted to preserve informative gradients and promote feature discriminability during training. The output is normalized via batch normalization and further processed by another 1 × 1 convolution to produce the final representation $F_c \in \mathbb{R}^{C \times \frac{H}{8} \times \frac{W}{8}}$, which constitutes the output of the Feature Information Propagation Module.
To ensure a balance between accuracy and efficiency, the channel compression is performed early using the 1 × 1 convolution, maintaining a lightweight design while preserving the network’s expressive capacity.
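A possible realization of the ConvFFN, assuming the 2C-to-C compression is performed by the first 1 × 1 convolution and that batch normalization precedes the final projection; kernel sizes and the exact ordering are assumptions based on the description above.

```python
# Sketch of the ConvFFN: early channel compression, GELU + depth-wise conv with a residual,
# then BatchNorm and a final 1x1 projection.
import torch.nn as nn

class ConvFFN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.compress = nn.Conv2d(2 * dim, dim, 1)                 # early 2C -> C compression
        self.act = nn.GELU()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)    # local spatial mixing
        self.bn = nn.BatchNorm2d(dim)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, f_fuse):
        x = self.compress(f_fuse)
        x = x + self.dw(self.act(x))                               # residual preserves gradients
        return self.proj(self.bn(x))                               # F_c at 1/8 scale
```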
This feature propagation module is designed to reinforce both intra-image feature distinctiveness and inter-image feature consistency, both of which are vital for robust visual matching. Accordingly, we utilize four stacked aggregation blocks, each composed of an intra-image TAIFormer layer and an inter-image TAIFormer layer.
Figure 4 presents comparative visualizations of feature outputs from LiteSAM and EfficientLoFTR. To validate the robustness of the findings and ensure fair comparison, LiteSAM was evaluated using both MobileOne and RepVGG—the default backbone of EfficientLoFTR—for feature extraction.
The initial feature maps predominantly emphasize salient foreground regions while suppressing irrelevant background areas such as the sky. After transformer-based TAIFormer refinement, LiteSAM’s features exhibit sharper, more localized attention on true correspondence regions, effectively suppressing background responses. This leads to a higher number of inlier matches and fewer outliers, whereas EfficientLoFTR maintains broad foreground attention, producing more outlier matches.
Furthermore, the refined features from LiteSAM show stronger cross-image correlation, with higher spatial coherence and cross-view similarity. This enhanced feature correspondence directly improves matching precision, ensuring that keypoint pairs align more faithfully with the true geometric relationships in challenging satellite–aerial image pairs. LiteSAM with TAIFormer demonstrates superior semantic region representation, higher inlier ratios, and reduced outliers compared to EfficientLoFTR.

3.3. Coarse-Level Matching Module

Following the feature propagation process, the resulting 1/8-scale feature maps $(F_c^A, F_c^B) \in \mathbb{R}^{C \times \frac{H}{8} \times \frac{W}{8}}$ are obtained from the two input images. A coarse matching score matrix $S$ is subsequently computed based on these feature representations, which is defined as
$$S(i, j) = \frac{\left\langle F_c^A(i), F_c^B(j) \right\rangle}{\tau},$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product operation, and $\tau$ is a temperature coefficient that controls the sharpness of the similarity distribution. To obtain a valid matching probability distribution, the dual-softmax technique [15] is applied to normalize $S$ along both row-wise and column-wise dimensions, yielding the coarse matching probability matrix $P_c$:
$$P_c(i, j) = \mathrm{softmax}\big(S(i, \cdot)\big)_j \cdot \mathrm{softmax}\big(S(\cdot, j)\big)_i,$$
where S ( i , · ) denotes the i-th row of S , reflecting the similarity between the i-th feature in image A and all features in image B, while S ( · , j ) represents the j-th column, indicating the reverse.
To ensure robust initial correspondences, a confidence-based thresholding strategy combined with mutual nearest neighbor (MNN) filtering is employed to construct a binary matching mask M c , which is defined as
$$M_c(i, j) = \begin{cases} 1, & \text{if } S(i, j) \geq \theta_c \text{ and } (i, j) \text{ is a row/column maximum}, \\ 0, & \text{otherwise}, \end{cases}$$
where θ c is a predefined confidence threshold used to eliminate low-probability matches. A value of M c ( i , j ) = 1 indicates that the feature pair ( i , j ) satisfies both mutual nearest neighbor and confidence criteria, thereby constituting a valid coarse-level correspondence; otherwise, the pair is discarded.
Importantly, the dual-softmax operation plays a crucial role during training by imposing contextual constraints that emphasize the learning of discriminative features, thereby reinforcing the confidence of correct matches while suppressing ambiguous ones. Once the model attains sufficient discriminative power, this operation becomes optional during inference to alleviate computational overhead.
To speed up inference, especially for high-resolution images, LiteSAM (opt.) skips the dual-softmax step and directly applies the MNN strategy to the raw score matrix S. This adjustment simplifies the matching process while maintaining a balance between accuracy and efficiency.
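The coarse matching stage can be sketched as follows. The temperature, threshold value, and flattened descriptor layout are illustrative, and the use_dual_softmax flag mirrors the distinction between the full model and LiteSAM (opt.) described above; note that the confidence threshold applies to probabilities in the former and to raw scores (with a different value) in the latter.

```python
# Minimal sketch of coarse matching: similarity matrix, optional dual-softmax normalization,
# and mutual-nearest-neighbor plus confidence filtering.
import torch

def coarse_match(feat_a, feat_b, tau=0.1, theta_c=0.2, use_dual_softmax=True):
    """feat_a: (N, C) and feat_b: (M, C) flattened 1/8-scale descriptors of the two images."""
    s = feat_a @ feat_b.t() / tau                                 # similarity scaled by temperature
    if use_dual_softmax:                                          # full model
        p = torch.softmax(s, dim=1) * torch.softmax(s, dim=0)
    else:                                                         # LiteSAM (opt.): raw scores
        p = s                                                     # (theta_c then refers to raw scores)
    # mutual nearest neighbour: keep (i, j) only if it is both the row and the column maximum
    mask = (p == p.max(dim=1, keepdim=True).values) & (p == p.max(dim=0, keepdim=True).values)
    mask &= p > theta_c
    return mask.nonzero(as_tuple=False)                           # (K, 2) coarse match indices
```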

3.4. Learnable Correspondence Refinement Module

To further refine the accuracy of feature correspondences, a learnable refinement module is devised, which systematically optimizes matching precision through a two-stage process: pixel-level refinement followed by subpixel-level adjustment. The first stage focuses on spatial alignment at the pixel scale, while the subsequent stage utilizes a learnable iterative strategy to achieve high-precision subpixel matching.

3.4.1. Pixel-Level Refinement

Building upon the feature pyramid fusion strategy, the propagated feature maps $(F_c^A, F_c^B)$ from the Feature Information Propagation Module are upsampled to the original image resolution. Intermediate-resolution features (at 1/4 and 1/2 scales), obtained during the initial extraction phase, are integrated to reconstruct the pixel-level features $(F_f^A, F_f^B) \in \mathbb{R}^{C/4 \times H \times W}$, which retain rich spatial granularity essential for fine-grained matching.
For each coarse-level correspondence in M c , a localized feature patch is cropped from the fine-resolution feature maps. Within this patch, a localized similarity matrix S f is formulated. Dual-softmax normalization is then conducted to derive a fine-grained matching probability matrix P f . To ensure match reliability, a Global Nearest Neighbor (GNN) filtering strategy is subsequently applied to P f , retaining only the most confident correspondence within each local region.

3.4.2. Subpixel-Level Refinement

To address the resolution limitations imposed by discrete pixel grids, this module models continuous feature distributions to achieve subpixel matching, thereby suppressing spurious correspondences and ensuring geometric consistency. This is in contrast to heatmap-based refinement methods [16], which exhibit limited nonlinear modeling capacity and are prone to super-linear errors in repetitive structures. We propose a learnable refinement module that dynamically captures local contextual dependencies through a stacked MinGRU [46] architecture. As illustrated in Figure 5, this module performs fine-grained corrections of initial matches by iteratively updating subpixel coordinates, thereby attaining high matching precision and robust generalization across diverse visual conditions.
Given an initial correspondence ( k i A , k j B ) , where k i A and k j B denote the i-th and j-th feature points in the reference image I A and target image I B , respectively, and represent potentially corresponding locations across the two images, a 3 × 3 search window is centered at k j B , restricting the candidate region to its local neighborhood. To refine the estimated location, a multi-layer MinGRU network is instantiated to encode the correlations between k i A and each candidate point within the window. Through iterative updates, the position of k j B is progressively adjusted toward sub-pixel accuracy.
Each MinGRU unit receives local correlation features and the hidden representation from the preceding layer, enabling context-aware updates. Unlike conventional GRUs that depend on temporal recurrence via h t 1 , the MinGRU formulation [46] eliminates this dependency, thereby facilitating parallel inferences without sacrificing modeling capacity. The update process is defined as follows:
$$z_t = \sigma\big(\mathrm{Linear}_{d_h}(C_f)\big),$$
$$\tilde{h}_t = \mathrm{Linear}_{d_h}(C_f),$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,$$
where $C_f$ denotes the current correlation input, $\sigma(\cdot)$ is the sigmoid activation function, and $\odot$ represents the Hadamard product. All operations involving $z_t$ and $\tilde{h}_t$ are based solely on $C_f$, supporting the parallel derivation of $h_t$. The final subpixel offset is computed via the following:
$$\Delta = W_{\mathrm{offset}}\, \mathrm{ReLU}(h_t),$$
$$\delta = \tanh(\Delta),$$
where $W_{\mathrm{offset}}$ denotes a learnable projection matrix, and $\tanh(\cdot)$ constrains the predicted offset $\delta$ within $[-1, 1]$ to ensure numerical stability.
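A hedged sketch of the MinGRU-based refinement: each unit updates a hidden state from the current correlation input only, which allows parallel evaluation, and the final state is projected to a bounded two-dimensional offset. The correlation dimensionality (9, for a 3 × 3 window), hidden size, and number of stacked units are assumptions.

```python
# Sketch of the stacked-MinGRU subpixel refiner; dimensions are illustrative.
import torch
import torch.nn as nn

class MinGRUCell(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.to_z = nn.Linear(in_dim, hidden_dim)    # gate depends only on the current input
        self.to_h = nn.Linear(in_dim, hidden_dim)    # candidate state, no h_{t-1} inside

    def forward(self, c_f, h_prev):
        z = torch.sigmoid(self.to_z(c_f))
        h_tilde = self.to_h(c_f)
        return (1.0 - z) * h_prev + z * h_tilde      # h_t

class SubpixelRefiner(nn.Module):
    def __init__(self, corr_dim=9, hidden_dim=64, num_units=4):
        super().__init__()
        self.cells = nn.ModuleList([MinGRUCell(corr_dim, hidden_dim) for _ in range(num_units)])
        self.to_offset = nn.Linear(hidden_dim, 2)    # W_offset -> (dx, dy)
        self.hidden_dim = hidden_dim

    def forward(self, corr):
        """corr: (B, corr_dim) correlations between k_i^A and the 3x3 window around k_j^B."""
        h = corr.new_zeros(corr.shape[0], self.hidden_dim)
        for cell in self.cells:
            h = cell(corr, h)
        delta = self.to_offset(torch.relu(h))
        return torch.tanh(delta)                     # offset constrained to [-1, 1]
```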
This learnable correspondence refinement module independently optimizes both pixel-level and subpixel coordinates, effectively decoupling their estimation processes. By leveraging large-scale end-to-end supervision, it adaptively learns spatial offsets, thereby significantly improving the accuracy, robustness, and generalization capability of feature matching across various scenarios.

3.5. Loss Function

In this work, three auxiliary loss functions are utilized at different stages of the matching process: coarse, pixel-level, and subpixel-level matching. These loss functions serve to guide the model in learning precise matching relationships at various scales, thereby optimizing the overall matching accuracy at each stage.

3.5.1. Coarse-Level Matching Loss Function

During the coarse matching phase, the approach presented in [17] is adopted, where depth maps and the relative camera pose are leveraged to project grid points from the reference image $I^A$ onto the target image $I^B$. This projection facilitates the construction of a ground-truth coarse matching set, denoted as $\{M_c\}_{gt}$, which includes a total of $N_c$ matching points. As the feature maps are downsampled to 1/8 of the original image resolution, the matching points are annotated and supervised at this scale.
Furthermore, the model’s predicted coarse matching probability matrix, denoted as P c , is supervised with respect to the ground-truth coarse matching set. To optimize the coarse matching process, a negative log-likelihood loss function is employed. The objective of this loss function is to minimize the negative log-likelihood at the ground-truth matching positions, { M c } g t , thereby enhancing the model’s ability to effectively capture and learn global matching relationships. The coarse matching loss function is defined as follows:
$$L_c = -\frac{1}{N_c} \sum_{(i, j) \in \{M_c\}_{gt}} \log P_c(i, j).$$
Here, P c ( i , j ) denotes the matching probability at position ( i , j ) during the coarse matching phase, and N c denotes the total number of ground-truth matches at the coarse matching stage.

3.5.2. Pixel-Level Matching Loss Function

In the pixel-level matching phase, feature point pairs obtained from the coarse matching stage undergo further refinement within corresponding 8 × 8 pixel blocks at the original image resolution. Each coarse match is associated with an 8 × 8 pixel region within the reference image I A , and the objective of pixel-level matching is to identify the optimal pixel pair within the corresponding local blocks of both the reference and target images, thereby achieving precise feature alignment.
To facilitate this, projective geometry is employed in combination with camera extrinsics and depth map information [16]. Specifically, pixels within the 8 × 8 blocks of the reference image I A are projected onto the target image I B . This projection process constructs the ground-truth pixel-level matching set, denoted as { M f } g t .
A negative log-likelihood loss function supervises the estimation of matching probabilities corresponding to the ground-truth positions in the probability matrix P f . The pixel-level matching loss function is defined as follows:
$$L_f = -\frac{1}{N_f} \sum_{(i, j) \in \{M_f\}_{gt}} \log P_f(i, j).$$
Here, P f ( i , j ) represents the score at position ( i , j ) within the fine-grained correlation score matrix, computed over a local matching window, while N f represents the number of valid ground-truth matches involved in the fine-grained matching process.

3.5.3. Subpixel-Level Matching Loss Function

The subpixel-level matching stage refines the pixel-level matching results, thereby enhancing the accuracy of the final matching positions to a subpixel resolution. To achieve this, depth maps and camera extrinsics are leveraged to precisely project pixel-level correspondences from the reference image I A onto the target image I B [16]. This projection generates continuous coordinate values for the resulting positions, enabling the construction of the ground-truth subpixel-level matching set, denoted as { M s } g t , which includes not only integer pixel positions but also floating-point values that accurately capture subpixel-level correspondences.
To refine the model’s subpixel prediction accuracy, the Euclidean distance is used as the supervisory signal. This loss function quantifies the deviation between the predicted subpixel matching positions M s and the ground-truth subpixel positions { M s } g t . The subpixel matching loss is defined as follows:
$$L_s = \frac{1}{N_s} \sum_{i=1}^{N_s} \left\| \{M_s\}(i) - \{M_s\}_{gt}(i) \right\|_2^2.$$
Here, $N_s$ represents the total number of matching points, $\{M_s\}(i)$ denotes the i-th predicted subpixel matching point, $\{M_s\}_{gt}(i)$ denotes the corresponding i-th ground-truth subpixel correspondence used for supervision, and $\|\cdot\|_2$ denotes the Euclidean norm.
The final optimization objective is a weighted sum of the three loss functions. The overall loss function is expressed as follows:
$$L = L_c + \alpha L_f + \beta L_s,$$
where α and β are weighting coefficients that regulate the contributions of each matching-level loss function. By jointly optimizing this multi-scale loss function, the model progressively refines its matching accuracy from coarse to pixel-level and subpixel-level precision, yielding robust and high-precision feature matching across various scales.
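The multi-level supervision can be summarized in a few lines. For simplicity, this sketch treats the fine-level probabilities as a single flattened matrix rather than a per-window tensor, and the small epsilon for numerical stability is our addition.

```python
# Sketch of the combined coarse / pixel-level / subpixel losses with weights alpha and beta.
import torch

def matching_loss(p_c, gt_c, p_f, gt_f, pred_sub, gt_sub, alpha=1.0, beta=0.2, eps=1e-6):
    """gt_c / gt_f: (N, 2) index pairs into p_c / p_f; pred_sub, gt_sub: (N, 2) coordinates."""
    l_c = -torch.log(p_c[gt_c[:, 0], gt_c[:, 1]] + eps).mean()     # coarse NLL
    l_f = -torch.log(p_f[gt_f[:, 0], gt_f[:, 1]] + eps).mean()     # pixel-level NLL
    l_s = ((pred_sub - gt_sub) ** 2).sum(dim=1).mean()             # subpixel squared L2
    return l_c + alpha * l_f + beta * l_s
```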

4. Experiments

We conduct a comprehensive evaluation of the proposed model across benchmark tasks involving natural scenes and satellite–aerial image matching. In natural scenes, the model is evaluated on downstream tasks including relative pose estimation, homography estimation, and visual localization. For satellite–aerial imagery, evaluation focuses on AVL. To further validate the effectiveness of each component, detailed ablation studies are carried out. Additionally, the framework is deployed on edge devices to demonstrate its real-time inference performance and localization accuracy under resource-constrained conditions.

4.1. Datasets

For relative pose estimation, we utilize the following datasets:
  • MegaDepth [21] encompasses 196 outdoor landmark scenes, featuring over 130,000 images, along with high-quality depth ground truth derived from Structure-from-Motion and Multi-View Stereo techniques. We evaluate 1500 test image pairs, resized to a maximum dimension of 1200 pixels, to assess robustness to variations in illumination and viewpoint.
  • ScanNet [22] comprises 1513 indoor scenes with over 2.5 million RGB-D frames, which provides precise 3D reconstruction ground truth. We select 1500 test pairs at a resolution of 640 × 480 pixels to evaluate performance in texture-limited indoor environments.
For absolute visual localization (AVL), which involves matching aerial images to satellite imagery, we employ the following:
  • The Aerial Image Dataset [47] includes 500 test pairs of aerial images at 512 × 512 pixels, with ground-truth affine transformations to evaluate matching accuracy.
  • UAV-VisLoc [19] encompasses 6744 unmanned aerial vehicle (UAV) images collected from 11 distinct locations across China, captured at altitudes ranging from 400 to 2000 m. These images span urban, rural, and agricultural settings, with real-time kinematic (RTK) positioning and heading angles provided as ground truth. Following the data augmentation pipeline detailed in Figure 6, the dataset yields 20,232 satellite images, each resized to 1184 pixels along the longer dimension.
  • The Self-made Dataset includes 2284 UAV images acquired from 14 scenes in Harbin and Qiqihar, captured at altitudes between 100 and 500 m. Each image is annotated with RTK-based ground truth and heading angles. After applying the augmentation pipeline illustrated in Figure 6, the dataset generates 6852 satellite images.
For visual localization, which estimates 6 DoF camera poses within the HLoc framework, we carry out evaluations on the following datasets:
  • Aachen Day-Night v1.1 [48] includes 6697 reference images and 1015 query images, featuring variations in viewpoint and illumination.
  • InLoc [49] comprises 9972 reference images and 329 query images from texture-scarce indoor environments; this dataset incorporates depth supervision for evaluation.
For homography estimation, we use the following:
  • HPatches [20] contains 580 image pairs across 116 scenes, with 285 pairs exhibiting illumination variations and 295 pairs with viewpoint changes. Images are resized to 640 × 480 pixels to test the accuracy of feature matching and homography estimation.

4.2. Implementation Details

The architecture, outlined in Section 3, is implemented in PyTorch using PyTorch Lightning for distributed training. MobileOne-S3 [50] extracts multi-scale features at 1/2, 1/4, and 1/8 resolutions. Features at 1/8 scale undergo 4 × 4 convolution or max pooling, followed by four TAIFormer layers for coarse matching, with threshold θ c in Section 3.3 set to 0.2 (20 in opt mode). Subpixel refinement uses four MinGRU units. Training uses 18,400 image pairs from 368 MegaDepth sub-scenes (50 pairs each) resized to 832 × 832 pixels with padding. We carry out training for 30 epochs using a batch size of 14, an AdamW optimizer (learning rate 0.002, weight decay 0.1), and loss weights α = 1.0 and β = 0.2 on two NVIDIA A100 GPUs for 48 h. The model is evaluated on an NVIDIA RTX 3090 GPU.

4.3. Baseline

We evaluated LiteSAM against state-of-the-art feature matching methods on the datasets described in Section 4.1, covering sparse, semi-dense, and dense methods. Sparse methods include D2Net [51], R2D2 [52], DISK [53], and SuperPoint [23] (using nearest-neighbor matching), as well as SuperGlue [17], LightGlue [24], and OmniGlue [18]. Semi-dense methods comprise Sparse-NCNet [54], DRC-Net [55], LoFTR [15], MatchFormer [26], ASpanFormer [14], TopicFM [27], RCM [56], SAM [57], EfficientLoFTR [16], DeepMatcher [29], GeoAT [30], HomoMatcher [34], and JamMa [31]. Dense methods include DKM [32], ROMA [33], and PMatch [58]. Except for the results on the MegaDepth, Aerial Image Dataset, UAV-VisLoc, and Self-made Dataset, and the results of SP+LG and EfficientLoFTR (opt.) on the HPatches dataset, which were reproduced by the authors under a unified experimental environment, all other comparative results reported for different datasets are directly taken from the corresponding original publications to ensure fairness and reproducibility. Moreover, the runtime measurements reported for the MegaDepth dataset were obtained using the standardized testing protocol described in Section 4.2, and the reproduced results on the HPatches dataset are clearly indicated within the relevant table.

4.4. Relative Pose Estimation

4.4.1. Metric

Feature matching performance is evaluated through relative pose estimation, where correspondences are used to estimate the relative pose matrix via RANSAC. Accuracy is evaluated by measuring the angular errors in rotation and translation between the estimated and ground-truth poses. Following standard practice, we adopt the Area Under the Curve (AUC) at angular error thresholds of 5 ° , 10 ° , and 20 ° , using the greater of the two errors as the evaluation criterion.
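One common way to compute this AUC, following the protocol popularized by SuperGlue/LoFTR-style evaluations, is to integrate the cumulative recall of per-pair pose errors (the maximum of the rotation and translation angular errors) up to each threshold, as in the sketch below; it is not taken from the released evaluation code.

```python
# Illustrative pose-error AUC: area under the cumulative recall curve, normalized by the threshold.
import numpy as np

def pose_auc(errors, thresholds=(5, 10, 20)):
    """errors: per-pair max(rotation, translation) angular errors in degrees."""
    errors = np.sort(np.asarray(errors, dtype=np.float64))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate(([0.0], errors))
    recall = np.concatenate(([0.0], recall))
    aucs = []
    for t in thresholds:
        last = np.searchsorted(errors, t)
        r = np.concatenate((recall[:last], [recall[last - 1]]))
        e = np.concatenate((errors[:last], [t]))
        aucs.append(np.trapz(r, x=e) / t)            # normalized area up to the threshold
    return aucs
```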

4.4.2. Results

As shown in Table 1, LiteSAM exhibits robust generalization and competitive performance on MegaDepth and ScanNet. LiteSAM (opt.) outperforms EfficientLoFTR (opt.) while reducing inference time on MegaDepth, and closely matches LiteSAM, demonstrating high stability. On ScanNet, LiteSAM (opt.) surpasses SP+LG among sparse methods, with a moderate increase in latency. In comparison to dense methods, it significantly reduces inference times relative to DKM and ROMA, with modest performance reductions. This highlights LiteSAM’s effective balance of accuracy and efficiency for time-sensitive applications.

4.5. Absolute Visual Localization

4.5.1. Metric

To comprehensively evaluate matching and localization performance, multiple quantitative metrics are employed based on the ground truth from each dataset. Specifically, the Aerial Image Dataset [47] is evaluated using the Percentage of Correct Keypoints (PCK), whereas UAV-VisLoc and self-made datasets adopt a hit rate (HR) and a Root Mean Square Error within 30 m (RMSE@30) as the primary evaluation indicators.
(1)
Percentage of Correct Keypoints (PCK)
For the Aerial Image Dataset, PCK is applied to quantify the matching fidelity under affine transformations estimated through RANSAC. Each image pair is associated with a predicted transformation matrix $T_\theta$ and the corresponding ground-truth matrix $T_{\mathrm{gt}}$, both of which are used to warp 20 predefined keypoints. The proportion of correctly transformed points is computed across various pixel error thresholds $\tau \in \{0.01, 0.03, 0.05\}$ as follows:
$$PCK = \frac{\sum_{i=1}^{n} \sum_{p_i} \mathbb{1}\left[ d\big(T_{\theta}(p_i), T_{\mathrm{gt}}(p_i)\big) < \tau \cdot \max(h, w) \right]}{\sum_{i=1}^{n} |p_i|},$$
where d ( · , · ) denotes the Euclidean distance, and 1 [ · ] is the indicator function.
(2)
Localization Error
For the UAV-VisLoc and self-made datasets, localization accuracy is evaluated by estimating the homography matrix H between UAV and satellite images using RANSAC. The UAV image’s center is projected onto the satellite view, and the localization error is quantified as the distance between predicted and ground-truth coordinates using the Haversine formula:
$$a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos(\varphi_{\mathrm{pred}})\cos(\varphi_{\mathrm{gt}})\sin^2\!\left(\frac{\Delta\lambda}{2}\right),$$
$$d = R \cdot 2\,\mathrm{atan2}\!\left(\sqrt{a}, \sqrt{1-a}\right) \times 1000,$$
where R = 6371 km is Earth’s radius, and d is the distance in meters. The RMSE@30 is calculated as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} d_i^2},$$
where d i is the error for the i-th sample, and N is the number of samples with errors below 30 m.
(3)
Hit Rate (HR)
To assess robustness in weak-texture or low-similarity scenarios, the hit rate (HR) is defined as follows:
$$\mathrm{HR} = \frac{N_{\mathrm{hit}}}{N_{\mathrm{total}}},$$
where N hit is the number of image pairs with localization errors below 30 m, and N total is the total number evaluated. HR reflects the localization system’s reliability and the matching strategy’s robustness in challenging visual environments.
(4)
Efficiency Indicators
In addition to accuracy metrics, we report model complexity, computational load, and inference latency to highlight LiteSAM’s practical advantages. To ensure fair comparisons, FLOPs and runtime are averaged over correctly localized samples, accounting for variations in input resolution and image content.
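For reference, the accuracy metrics defined in this subsection can be computed as in the following sketch; function names and array shapes are illustrative and not taken from the released evaluation code.

```python
# Hedged sketch of the AVL accuracy metrics: PCK under affine warps, Haversine localization
# error, RMSE@30, and hit rate. T_est / T_gt are 2x3 affine matrices; coordinates are in degrees.
import numpy as np

def pck(points, T_est, T_gt, h, w, tau=0.03):
    """points: (N, 2) keypoints; fraction whose warped locations agree within tau * max(h, w) px."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    dist = np.linalg.norm(pts_h @ T_est.T - pts_h @ T_gt.T, axis=1)
    return float(np.mean(dist < tau * max(h, w)))

def haversine_m(lat_pred, lon_pred, lat_gt, lon_gt, radius_km=6371.0):
    """Great-circle distance in meters between predicted and ground-truth positions."""
    phi_p, phi_g = np.radians(lat_pred), np.radians(lat_gt)
    a = np.sin((phi_g - phi_p) / 2) ** 2 + \
        np.cos(phi_p) * np.cos(phi_g) * np.sin(np.radians(lon_gt - lon_pred) / 2) ** 2
    return radius_km * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)) * 1000.0

def rmse30_and_hr(errors_m, threshold=30.0):
    """RMSE over samples with error below the threshold, plus the hit rate."""
    errors_m = np.asarray(errors_m, dtype=np.float64)
    hits = errors_m < threshold
    rmse = float(np.sqrt(np.mean(errors_m[hits] ** 2))) if hits.any() else float("nan")
    return rmse, float(hits.mean())
```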

4.5.2. Results

(1)
Aerial Image Dataset
As shown in Table 2, LiteSAM achieves competitive matching accuracy among semi-dense approaches while offering clear advantages in model efficiency. Specifically, LiteSAM achieves a PCK@3% of 30.2%, closely matching the best-performing EfficientLoFTR but with significantly fewer parameters and lower FLOPs, resulting in faster inference.
Compared to traditional sparse methods such as SP+LG, LiteSAM improves accuracy by 11.3% under the PCK@3% metric while reducing inference latency by 23.9%. These gains highlight the benefit of its lightweight feature extraction and attention-based propagation. While dense models like ROMA yield the highest accuracy, they incur prohibitive computational costs, making them less suitable for UAV scenarios requiring real-time performance.
(2)
UAV-VisLoc and Self-made datasets
As shown in Table 3 and Table 4, LiteSAM consistently outperforms existing methods in both accuracy and efficiency. Unlike methods that degrade in complex scenes, it maintains robustness. On the self-made dataset, it achieves competitive accuracy, with the optimized variant further balancing precision and computational cost. Figure 7 and Figure 8 qualitatively illustrate these advantages. Notably, LiteSAM produces more matched points and more consistent spatial distribution, which is also quantitatively reflected in the tables, demonstrating its superior matching quality.
LiteSAM integrates a reparameterized lightweight backbone, TAIFormer-based feature propagation, and iterative MinGRU refinement to achieve high-precision matching with low latencies and minimal overhead. Pretrained on MegaDepth [21], it generalizes well to cross-view settings. Compared to dense methods, it significantly reduces model size and runtime, even at high resolutions.
While some sparse baselines run faster, LiteSAM offers more reliable correspondences—beneficial for tasks like 3D reconstruction. Overall, it achieves a strong trade-off among accuracy, speed, and efficiency, making it well-suited for UAV localization and real-world deployment.

4.6. Visual Localization

4.6.1. Metric

The primary evaluation metric follows the HLoc protocol [59], which employs the Percentage of Correct Poses (PCP). Pose accuracy is quantified under three thresholds: ( 0.25 m , 2 ° ) , ( 0.5 m , 5 ° ) , and ( 1.0 m , 10 ° ) . By comparing the predicted pose against the ground truth, this metric provides a comprehensive assessment of localization precision and robustness. Moreover, it indirectly reflects the quality of feature correspondences established by LiteSAM.

4.6.2. Results

On the Aachen Day-Night v1.1 benchmark, LiteSAM and LiteSAM (opt.) achieve robust accuracy across varying illumination, including night-time scenes (Table 5). LiteSAM (opt.) offers faster inference with minimal accuracy trade-off. On InLoc, both outperform sparse and semi-dense methods in localization accuracy, leveraging efficient feature matching in complex indoor settings. LiteSAM’s lightweight backbone, TAIFormer propagation, and MinGRU-based refinement strike a balance between accuracy and efficiency, making it suitable for latency-sensitive applications such as UAV navigation and autonomous driving.

4.7. Homography Estimation

4.7.1. Metric

Following the evaluation protocol in [16], the primary metric is Corner Correctness, which measures the average deviation between the warped image corners (via the estimated homography) and the ground-truth projections. The proportion of image pairs with corner errors below thresholds of 3, 5, and 10 pixels is reported. All images are resized such that the shorter side is 480 pixels, ensuring uniform input resolution. Homography matrices are computed from 1000 matched correspondences using a RANSAC-based solver, with identical preprocessing across all evaluated methods.
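A brief sketch of the corner-correctness computation under these settings; cv2.perspectiveTransform is a standard OpenCV routine, and the 3 × 3 homographies and image size are assumed inputs.

```python
# Warp the four image corners with the estimated and ground-truth homographies and compare
# the mean corner displacement against the 3/5/10 px thresholds.
import cv2
import numpy as np

def corner_error(H_est, H_gt, h, w):
    corners = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]]).reshape(-1, 1, 2)
    warped_est = cv2.perspectiveTransform(corners, H_est)
    warped_gt = cv2.perspectiveTransform(corners, H_gt)
    return float(np.linalg.norm(warped_est - warped_gt, axis=-1).mean())

def corner_accuracy(errors, thresholds=(3, 5, 10)):
    """errors: per-pair mean corner errors in pixels."""
    errors = np.asarray(errors, dtype=np.float64)
    return {t: float((errors < t).mean()) for t in thresholds}
```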

4.7.2. Results

On the HPatches dataset, LiteSAM achieves competitive accuracy among semi-dense methods, outperforming others across all corner error thresholds (Table 6). Compared to EfficientLoFTR, it improves homography estimations at 3 px, 5 px, and 10 px with similar efficiency. The optimized variant, LiteSAM (opt.), slightly reduces accuracy but offers faster inference and lower computational cost. While dense methods like DKM and PMatch achieve higher accuracy, they incur significantly greater resource demands. LiteSAM thus offers a favorable balance of precision and efficiency for real-world applications.

4.8. Ablation Study

To evaluate LiteSAM’s components, we conducted ablation studies on self-made and MegaDepth datasets at 1184 × 1184 resolution, as shown in Table 7 and Table 8. LiteSAM comprises a reparameterizable MobileOne extractor, TAIFormer for feature propagation, and MinGRU-based subpixel refinement.
Replacing MobileOne with RepVGG slightly improves the hit rate but increases the RMSE, parameters, and latency, confirming MobileOne’s efficiency and generalization for edge devices. Substituting TAIFormer with EfficientLoFTR’s attention mechanism reduces accuracy and raises computational cost, highlighting TAIFormer’s effective global–local interaction. Omitting or modifying MinGRU yields only marginal latency gains while compromising precision, underscoring its role in fine-grained matching. LiteSAM (opt.), without dual softmax, accelerates inference by up to 39.9% while maintaining competitive accuracy, demonstrating a robust efficiency–accuracy trade-off.

4.9. Image Resolution

Experiments on the MegaDepth dataset using an NVIDIA RTX 3090Ti confirm LiteSAM’s balanced trade-off between accuracy and efficiency across resolutions. At 640 × 640, it delivers robust pose estimation, and the AUC improves further at 1408 × 1408 at the cost of increased latency. LiteSAM (opt.) achieves a similar AUC, with slightly reduced precision but significantly lower inference time, making it well-suited for latency-sensitive applications through efficient feature matching, as detailed in Table 9.

4.10. Performance on Edge Devices

We evaluated LiteSAM on the self-made dataset using an NVIDIA Jetson AGX Orin at 50 W with an input resolution of 1184 × 1184 and AMP acceleration. As shown in Table 10, LiteSAM achieves the lowest RMSE@30 among full models while maintaining high hit rates and competitive latency. LiteSAM (opt.) achieves lower latency than other optimized models while maintaining strong accuracy and reliable matching across all difficulty levels. These results highlight LiteSAM’s suitability for real-time UAV localization on resource-constrained edge devices.

5. Conclusions

We propose LiteSAM, a lightweight feature matching framework for absolute visual localization (AVL) in UAV applications, designed to operate reliably on resource-constrained edge devices under spatiotemporal variations. LiteSAM integrates a reparameterizable multi-scale feature extractor to reduce inference latencies, a Token Aggregation–Interaction Transformer (TAIFormer) to efficiently model cross-view geometric deformations, and a learnable subpixel refinement module to enhance correspondence estimation and generalization.
Experiments on satellite–aerial benchmarks and the self-made dataset, as well as edge hardware, demonstrate that LiteSAM achieves competitive localization accuracies with reduced parameters and computational cost, enabling real-time UAV navigation in GNSS-denied environments. The implementation code is publicly available at: https://github.com/boyagesmile/LiteSAM, accessed on 20 September 2025.
Future research will focus on scaling datasets, leveraging self-supervised learning for cross-domain generalization, and integrating IMU data or temporal consistency to enhance long-term robustness.

6. Discussion

While LiteSAM demonstrates competitive performance and significant efficiency gains for satellite–aerial image matching, it exhibits certain limitations that warrant further exploration. Notably, LiteSAM relies on sufficient image overlap and consistent scales between satellite and aerial inputs, which may not always be guaranteed in real-world absolute visual localization (AVL) scenarios, such as those involving significant viewpoint changes or varying resolutions. This dependency could necessitate additional preprocessing steps, such as image retrieval or coarse alignment, to ensure robust matching. In future work, approaches such as the feature fusion wavelet attention-enhanced network [60] could be incorporated to adaptively enhance multi-scale discriminative features, thereby improving robustness and generalization across diverse AVL conditions, particularly under extreme scale variations or limited image overlap.

Author Contributions

Conceptualization, B.W. and S.W.; methodology, B.W. and S.W.; software, B.W.; investigation, B.W. and Y.H.; data curation, B.W., L.X., and Y.H.; writing—original draft preparation, B.W. and S.W.; writing—review and editing, B.W., S.W., and D.Y.; supervision, D.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The APC was funded by Harbin Institute of Technology.

Data Availability Statement

The datasets used in this study are publicly available and can be accessed at the following sources: (1) MegaDepth: https://www.cs.cornell.edu/projects/megadepth/ (accessed on 28 September 2025); (2) ScanNet: https://github.com/ScanNet/ScanNet (accessed on 28 September 2025); (3) HPatches: https://github.com/hpatches/hpatches-dataset (accessed on 28 September 2025); (4) Aachen v1.1: https://www.visuallocalization.net/datasets/ (accessed on 28 September 2025); (5) InLoc: https://www.visuallocalization.net/datasets/ (accessed on 28 September 2025); (6) UAV-VisLoc: https://github.com/IntelliSensing/UAV-VisLoc (accessed on 28 September 2025); (7) DeepAerial: https://github.com/jaehyunnn/DeepAerialMatching_pytorch (accessed on 28 September 2025). All publicly available datasets were used in accordance with their respective licenses and citation requirements. The self-made dataset used in this study cannot be made publicly available at this time due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
UAVs: Unmanned Aerial Vehicles;
AVL: Absolute Visual Localization;
TAIFormer: Token Aggregation–Interaction Transformer;
CTM: Convolutional Token Mixer;
ConvFFN: Convolutional Feed-Forward Network.

References

  1. Skondras, A.; Karachaliou, E.; Tavantzis, I.; Tokas, N.; Valari, E.; Skalidi, I.; Bouvet, G.A.; Stylianidis, E. UAV Mapping and 3D Modeling as a Tool for Promotion and Management of the Urban Space. Drones 2022, 6, 115. [Google Scholar] [CrossRef]
  2. Koslowski, R. Drones and border control: An examination of state and non-state actor use of UAVs along borders. In Research Handbook on International Migration and Digital Technology; Edward Elgar Publishing: Cheltenham, UK, 2021; pp. 152–165. [Google Scholar]
  3. Nikhil, N.; Shreyas, S.; Vyshnavi, G.; Yadav, S. Unmanned aerial vehicles (UAV) in disaster management applications. In Proceedings of the 2020 IEEE Third International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 20–22 August 2020; pp. 140–148. [Google Scholar]
  4. Velusamy, P.; Rajendran, S.; Mahendran, R.K.; Naseer, S.; Shafiq, M.; Choi, J.G. Unmanned Aerial Vehicles (UAV) in precision agriculture: Applications and challenges. Energies 2021, 15, 217. [Google Scholar] [CrossRef]
  5. Jiang, S.; Jiang, W.; Wang, L. Unmanned Aerial Vehicle-Based Photogrammetric 3D Mapping: A survey of techniques, applications, and challenges. IEEE Geosci. Remote Sens. Mag. 2021, 10, 135–171. [Google Scholar] [CrossRef]
  6. Mohan, M.; Richardson, G.; Gopan, G.; Aghai, M.M.; Bajaj, S.; Galgamuwa, G.P.; Vastaranta, M.; Arachchige, P.S.P.; Amorós, L.; Corte, A.P.D.; et al. UAV-supported forest regeneration: Current trends, challenges and implications. Remote Sens. 2021, 13, 2596. [Google Scholar] [CrossRef]
  7. Wang, H.; Cheng, H.; Hao, H. The use of unmanned aerial vehicle in military operations. In Proceedings of the International Conference on Man-Machine-Environment System Engineering, Zhengzhou, China, 24–26 October 2020; pp. 939–945. [Google Scholar]
  8. Lekidis, A.; Anastasiadis, A.G.; Vokas, G.A. Electricity infrastructure inspection using AI and edge platform-based UAVs. Energy Rep. 2022, 8, 1394–1411. [Google Scholar] [CrossRef]
  9. Rachmawati, T.S.N.; Kim, S. Unmanned aerial vehicles (UAV) integration with digital technologies toward construction 4.0: A systematic literature review. Sustainability 2022, 14, 5708. [Google Scholar] [CrossRef]
  10. Couturier, A.; Akhloufi, M.A. A review on absolute visual localization for UAV. Robot. Auton. Syst. 2021, 135, 103666. [Google Scholar] [CrossRef]
  11. Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 2004, 60, 91–110. [Google Scholar] [CrossRef]
  12. Rublee, E.; Rabaud, V.; Konolige, K.; Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Proceedings of the 2011 IEEE International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2564–2571. [Google Scholar]
  13. Bay, H.; Tuytelaars, T.; Van Gool, L. Surf: Speeded up robust features. In Proceedings of the Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; Proceedings, Part I 9. Springer: Berlin/Heidelberg, Germany, 2006; pp. 404–417. [Google Scholar]
  14. Chen, H.; Luo, Z.; Zhou, L.; Tian, Y.; Zhen, M.; Fang, T.; Mckinnon, D.; Tsin, Y.; Quan, L. Aspanformer: Detector-free image matching with adaptive span transformer. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 20–36. [Google Scholar]
  15. Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 8922–8931. [Google Scholar]
  16. Wang, Y.; He, X.; Peng, S.; Tan, D.; Zhou, X. Efficient LoFTR: Semi-dense local feature matching with sparse-like speed. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 21666–21675. [Google Scholar]
  17. Sarlin, P.E.; DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superglue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4938–4947. [Google Scholar]
  18. Jiang, H.; Karpur, A.; Cao, B.; Huang, Q.; Araujo, A. OmniGlue: Generalizable Feature Matching with Foundation Model Guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19865–19875. [Google Scholar]
  19. Xu, W.; Yao, Y.; Cao, J.; Wei, Z.; Liu, C.; Wang, J.; Peng, M. UAV-VisLoc: A Large-scale Dataset for UAV Visual Localization. arXiv 2024, arXiv:2405.11936. [Google Scholar]
  20. Balntas, V.; Lenc, K.; Vedaldi, A.; Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5173–5182. [Google Scholar]
  21. Li, Z.; Snavely, N. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2041–2050. [Google Scholar]
  22. Dai, A.; Chang, A.X.; Savva, M.; Halber, M.; Funkhouser, T.; Nießner, M. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5828–5839. [Google Scholar]
  23. DeTone, D.; Malisiewicz, T.; Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 224–236. [Google Scholar]
  24. Lindenberger, P.; Sarlin, P.E.; Pollefeys, M. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 17627–17638. [Google Scholar]
  25. Tian, Y.; Balntas, V.; Ng, T.; Barroso-Laguna, A.; Demiris, Y.; Mikolajczyk, K. D2D: Keypoint extraction with describe to detect approach. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  26. Wang, Q.; Zhang, J.; Yang, K.; Peng, K.; Stiefelhagen, R. Matchformer: Interleaving attention in transformers for feature matching. In Proceedings of the Asian Conference on Computer Vision, Macau, China, 4–8 December 2022; pp. 2746–2762. [Google Scholar]
  27. Giang, K.T.; Song, S.; Jo, S. TopicFM: Robust and interpretable topic-assisted feature matching. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 2447–2455. [Google Scholar]
  28. Giang, K.T.; Song, S.; Jo, S. Topicfm+: Boosting accuracy and efficiency of topic-assisted feature matching. IEEE Trans. Image Process. 2024, 33, 6016–6028. [Google Scholar] [CrossRef]
  29. Xie, T.; Dai, K.; Wang, K.; Li, R.; Zhao, L. Deepmatcher: A deep transformer-based network for robust and accurate local feature matching. Expert Syst. Appl. 2024, 237, 121361. [Google Scholar] [CrossRef]
  30. Li, Y.; Wu, Y.; Ming, Y.; Zhang, Y.; Cheng, Z. GeoAT: Geometry-Aware Attention Feature Matching Network. IEEE Access 2025, 13, 127351–127367. [Google Scholar] [CrossRef]
  31. Lu, X.; Du, S. Jamma: Ultra-lightweight local feature matching with joint mamba. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 10–17 June 2025; pp. 14934–14943. [Google Scholar]
  32. Edstedt, J.; Athanasiadis, I.; Wadenbäck, M.; Felsberg, M. DKM: Dense kernelized feature matching for geometry estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 17765–17775. [Google Scholar]
  33. Edstedt, J.; Sun, Q.; Bökman, G.; Wadenbäck, M.; Felsberg, M. RoMa: Robust dense feature matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 19790–19800. [Google Scholar]
  34. Wang, X.; Yu, L.; Zhang, Y.; Lao, J.; Ru, L.; Zhong, L.; Chen, J.; Zhang, Y.; Yang, M. Homomatcher: Achieving dense feature matching with semi-dense efficiency by homography estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 7952–7960. [Google Scholar]
  35. Van Dalen, G.J.; Magree, D.P.; Johnson, E.N. Absolute localization using image alignment and particle filtering. In Proceedings of the AIAA Guidance, Navigation, and Control Conference, San Diego, CA, USA, 4–8 January 2016; p. 0647. [Google Scholar]
  36. Yol, A.; Delabarre, B.; Dame, A.; Dartois, J.E.; Marchand, E. Vision-based absolute localization for unmanned aerial vehicles. In Proceedings of the 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 14–18 September 2014; pp. 3429–3434. [Google Scholar]
  37. Wan, X.; Liu, J.; Yan, H.; Morgan, G.L. Illumination-invariant image matching for autonomous UAV localisation based on optical sensing. ISPRS J. Photogramm. Remote Sens. 2016, 119, 198–213. [Google Scholar] [CrossRef]
  38. Deuser, F.; Habel, K.; Oswald, N. Sample4geo: Hard negative sampling for cross-view geo-localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 4–6 October 2023; pp. 16847–16856. [Google Scholar]
  39. Deuser, F.; Habel, K.; Werner, M.; Oswald, N. Orientation-Guided Contrastive Learning for UAV-View Geo-Localisation. In Proceedings of the 2023 Workshop on UAVs in Multimedia: Capturing the World from a New Perspective, Ottawa, ON, Canada, 2 November 2023; pp. 7–11. [Google Scholar]
  40. Gurgu, M.M.; Queralta, J.P.; Westerlund, T. Vision-based gnss-free localization for uavs in the wild. In Proceedings of the 2022 IEEE 7th International Conference on Mechanical Engineering and Robotics Research (ICMERR), Krakow, Poland, 9–11 December 2022; pp. 7–12. [Google Scholar]
  41. Zhu, B.; Ye, Y.; Dai, J.; Peng, T.; Deng, J.; Zhu, Q. VDFT: Robust feature matching of aerial and ground images using viewpoint-invariant deformable feature transformation. ISPRS J. Photogramm. Remote Sens. 2024, 218, 311–325. [Google Scholar] [CrossRef]
  42. Zhou, Y.; Gao, J.; Liu, X. A unified feature-motion consistency framework for robust image matching. ISPRS J. Photogramm. Remote Sens. 2024, 218, 368–388. [Google Scholar] [CrossRef]
  43. Li, W.; Weng, D.; Gao, C.; Du, Q. SwinMatcher: Universal Cross-Modal Remote Sensing Image Matching with Interactive Swin Transformer. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4705916. [Google Scholar] [CrossRef]
  44. Xu, C.; Wang, B.; Ye, Z.; Mei, L. ETQ-Matcher: Efficient Quadtree-Attention-Guided Transformer for Detector-Free Aerial–Ground Image Matching. Remote Sens. 2025, 17, 1300. [Google Scholar] [CrossRef]
  45. Su, J.; Ahmed, M.; Lu, Y.; Pan, S.; Bo, W.; Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing 2024, 568, 127063. [Google Scholar] [CrossRef]
  46. Feng, L.; Tung, F.; Ahmed, M.O.; Bengio, Y.; Hajimirsadeghi, H. Were RNNs all we needed? arXiv 2024, arXiv:2410.01201. [Google Scholar] [CrossRef]
  47. Park, J.H.; Nam, W.J.; Lee, S.W. A two-stream symmetric network with bidirectional ensemble for aerial image matching. Remote Sens. 2020, 12, 465. [Google Scholar] [CrossRef]
  48. Sattler, T.; Maddern, W.; Toft, C.; Torii, A.; Hammarstrand, L.; Stenborg, E.; Safari, D.; Okutomi, M.; Pollefeys, M.; Sivic, J.; et al. Benchmarking 6dof outdoor visual localization in changing conditions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8601–8610. [Google Scholar]
  49. Taira, H.; Okutomi, M.; Sattler, T.; Cimpoi, M.; Pollefeys, M.; Sivic, J.; Pajdla, T.; Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7199–7209. [Google Scholar]
  50. Vasu, P.K.A.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. Mobileone: An improved one millisecond mobile backbone. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7907–7917. [Google Scholar]
  51. Dusmanu, M.; Rocco, I.; Pajdla, T.; Pollefeys, M.; Sivic, J.; Torii, A.; Sattler, T. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 8092–8101. [Google Scholar]
  52. Revaud, J.; De Souza, C.; Humenberger, M.; Weinzaepfel, P. R2d2: Reliable and repeatable detector and descriptor. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  53. Tyszkiewicz, M.; Fua, P.; Trulls, E. DISK: Learning local features with policy gradient. Adv. Neural Inf. Process. Syst. 2020, 33, 14254–14265. [Google Scholar]
  54. Rocco, I.; Arandjelović, R.; Sivic, J. Efficient neighbourhood consensus networks via submanifold sparse convolutions. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part IX 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 605–621. [Google Scholar]
  55. Li, X.; Han, K.; Li, S.; Prisacariu, V. Dual-resolution correspondence networks. Adv. Neural Inf. Process. Syst. 2020, 33, 17346–17357. [Google Scholar]
  56. Lu, X.; Du, S. Raising the ceiling: Conflict-free local feature matching with dynamic view switching. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 256–273. [Google Scholar]
  57. Kelenyi, B.; Domsa, V.; Tamas, L. SAM-Net: Self-attention based feature matching with spatial transformers and knowledge distillation. Expert Syst. Appl. 2024, 242, 122804. [Google Scholar] [CrossRef]
  58. Zhu, S.; Liu, X. Pmatch: Paired masked image modeling for dense geometric matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 21909–21918. [Google Scholar]
  59. Sarlin, P.E.; Cadena, C.; Siegwart, R.; Dymczyk, M. From Coarse to Fine: Robust Hierarchical Localization at Large Scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  60. Wei, R.; Wei, X.; Xia, S.; Chang, K.; Ling, M.; Nong, J.; Xu, L. Multi-scale wavelet feature fusion network for low-light image enhancement. Comput. Graph. 2025, 127, 104182. [Google Scholar] [CrossRef]
Figure 1. Comprehensive performance comparison of LiteSAM against other methods. LiteSAM exhibits a favorable accuracy–efficiency trade-off across both satellite–aerial datasets (self-made and UAV-VisLoc [19]) and natural image datasets (HPatches [20], MegaDepth [21], and ScanNet [22]) while maintaining significant advantages in computational efficiency and parameter size. SP+LG [23,24] achieves the fastest inference speed but at the expense of accuracy, whereas LoFTR [15] and EfficientLoFTR [16] attain higher accuracy with substantially increased computational costs.
Figure 2. The framework of the proposed LiteSAM: (1) an image pair (I^A, I^B) is provided, and coarse features (F_e^A, F_e^B) are extracted by the shared reparameterized network; (2) feature interaction is then performed to enhance the discriminability of these features, yielding (F_c^A, F_c^B); (3) coarse matching relationships P_c are established to obtain the initial matches M_c; (4) a refinement process is applied to obtain pixel-level matches M_f; (5) finally, further refinement is conducted to achieve subpixel-level matches M_s.
Figure 3. Token Aggregation–Interaction Transformer (TAIFormer). Input features f_i and f_j are aggregated via convolution or max pooling to simplify computation. After this, local and global features (F_local, F_global) are extracted and fused by the convolutional token mixer (CTM). Finally, the convolutional feed-forward network (ConvFFN) refines spatial structure and local context.
Figure 4. Feature visualization on the MegaDepth dataset. Feature maps are aggregated via L2 normalization across the channel dimension and rendered as heatmaps, where warmer hues (e.g., red) indicate higher activation magnitudes, reflecting greater model attention to corresponding regions. In the matching results, green lines denote inliers validated by an epipolar error threshold, while red lines indicate outliers.
Figure 5. Learnable subpixel refinement module. A 3 × 3 search window is centered at the matched point k_j^B on the pixel-level feature map F_s^B of the target image I^B. A multi-layer MinGRU is utilized to iteratively refine the initial correspondence C_s, achieving subpixel-level accuracy.
Figure 6. Dataset construction pipeline. UAV images are geolocated using high-precision RTK positioning, with coordinates serving as reference centers. Synthetic perturbations (easy: 0–150 m; moderate: 150–300 m; hard: 300–600 m) are applied to simulate varying localization difficulties. Images are cropped at multiple scales with a consistent 2000-pixel short edge to ensure scale uniformity.
Figure 7. Comparison of feature matching results on the UAV-VisLoc dataset. This figure presents the feature matching results of SuperPoint+LightGlue [23,24], EfficientLoFTR [16], and LiteSAM on the UAV-VisLoc dataset [19]. From left to right, the results correspond to SuperPoint+LightGlue, EfficientLoFTR, and LiteSAM, respectively, across varying difficulty levels. The figure is organized into three groups from top to bottom: the top two rows represent Easy Mode, the middle two rows represent Moderate Mode, and the bottom two rows represent Hard Mode.
Figure 8. Comparison of feature matching results on the self-made dataset. This figure presents the feature matching results of SuperPoint+LightGlue [23,24], EfficientLoFTR [16], and LiteSAM on the self-made dataset. From left to right, the results correspond to SuperPoint+LightGlue, EfficientLoFTR, and LiteSAM, respectively, across varying difficulty levels. The figure is organized into three groups from top to bottom: the top two rows represent Easy Mode, the middle two rows represent Moderate Mode, and the bottom two rows represent Hard Mode.
Table 1. Relative pose estimation results on MegaDepth and ScanNet. All models were trained solely on the MegaDepth dataset and evaluated across all benchmarks to assess generalization performance. Reported metrics include pose estimation errors at multiple thresholds and inference speeds on the MegaDepth dataset at an input resolution of 1184 × 1184 pixels. In the semi-dense category, the best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Category | Method | MegaDepth AUC@5°/@10°/@20° ↑ | ScanNet AUC@5°/@10°/@20° ↑ | Time (ms)
Sparse | SP+SG | 42.2 / 61.2 / 76.0 | 10.4 / 22.9 / 37.2 | 99.4
Sparse | SP+LG | 47.6 / 64.8 / 77.9 | 15.1 / 32.6 / 61.0 | 247.1
Sparse | SP+OG | 47.4 / 65.0 / 77.8 | 14.0 / 28.9 / 44.3 | 414.9
Semi-Dense | LoFTR | 52.8 / 69.2 / 81.2 | 16.9 / 33.6 / 50.6 | 314.82
Semi-Dense | MatchFormer | 53.3 / 69.7 / 81.8 | 15.8 / 32.0 / 48.0 | 691.0
Semi-Dense | ASpanFormer | 55.3 / 71.5 / 83.1 | 19.6 / 37.7 / 53.3 | 352.07
Semi-Dense | TopicFM | 54.1 / 70.1 / 81.6 | 17.3 / 35.5 / 50.9 | 270.69
Semi-Dense | TopicFM+ | 58.2 / 72.8 / 83.2 | 20.4 / 38.5 / 54.5 | 220.37
Semi-Dense | RCM | 53.2 / 69.4 / 81.5 | 17.3 / 34.6 / 52.1 | -
Semi-Dense | DeepMatcher | 55.7 / 72.2 / 83.4 | - / - / - | -
Semi-Dense | EfficientLoFTR | 56.4 / 72.2 / 83.5 | 19.2 / 37.0 / 53.6 | 154.2
Semi-Dense | EfficientLoFTR (opt.) | 55.4 / 71.4 / 82.9 | 17.4 / 34.4 / 51.2 | 98.8
Semi-Dense | GeoAT | 53.6 / 70.3 / 82.2 | - / - / - | -
Semi-Dense | ASpan_Homo | 57.1 / 73.0 / 84.1 | 22.0 / 40.5 / 57.2 | >352
Semi-Dense | LiteSAM | 56.1 / 72.0 / 83.4 | 21.0 / 39.3 / 55.9 | 133.0
Semi-Dense | LiteSAM (opt.) | 56.3 / 72.1 / 83.4 | 19.5 / 37.8 / 54.7 | 79.9
Dense | DKM | 60.4 / 74.9 / 85.1 | 26.64 / 47.07 / 64.17 | 528.21
Dense | ROMA | 62.6 / 76.7 / 86.3 | 28.9 / 50.4 / 68.3 | 3336.22
Table 2. Absolute visual localization results on the Aerial Image Dataset. All models were tested on 512 × 512 images. The Percentage of Correct Keypoints (PCK), inference speed, FLOPs, and model parameters are shown. In the Semi-Dense category, the best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Category | Method | DeepAerial PCK@1%/@3%/@5% ↑ | Time (ms) | FLOPs (G) | Params (M)
Sparse | SP+NN | 11.5 / 19.0 / 23.2 | 9.53 | 44.46 | 1.3
Sparse | SP+SG | 8.4 / 16.5 / 19.8 | 66.85 | 108.13 | 13.32
Sparse | SP+LG | 21.4 / 27.0 / 29.3 | 43.74 | 91.62 | 11.38
Semi-Dense | LoFTR | 18.5 / 24.8 / 28.1 | 52.74 | 308.73 | 11.56
Semi-Dense | AspanFormer | 21.5 / 26.1 / 28.5 | 135.52 | 551.76 | 15.75
Semi-Dense | EfficientLoFTR | 23.3 / 30.3 / 33.2 | 36.55 | 189.40 | 15.05
Semi-Dense | EfficientLoFTR (opt.) | 21.4 / 29.9 / 32.8 | 32.08 | 189.28 | 15.05
Semi-Dense | LiteSAM | 23.7 / 30.2 / 32.7 | 33.29 | 92.39 | 6.31
Semi-Dense | LiteSAM (opt.) | 22.2 / 29.2 / 31.9 | 29.64 | 92.27 | 6.31
Dense | DKM | 30.6 / 36.1 / 38.9 | 291.62 | 1543.86 | 70.21
Dense | ROMA | 38.0 / 50.8 / 56.4 | 243.88 | 1864.42 | 111.29
Table 3. Absolute visual localization results on the UAV-VisLoc dataset. Metrics include RMSE under a 30 m threshold, average hit rate, number of matched points, and per-frame inference time across Easy, Moderate, and Hard settings. In the Semi-Dense category, the best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Category | Method | RMSE@30 ↓ | Avg.HR (%) ↑ (Easy/Moderate/Hard) | Num of Points ↑ (Easy/Moderate/Hard) | Time (ms) ↓
Sparse | SP+NN | 18.71 | 37.15 / 29.54 / 19.89 | 819 / 820 / 804 | 19.82
Sparse | SP+SG | 17.88 | 58.38 / 56.64 / 51.03 | 454 / 378 / 251 | 138.27
Sparse | SP+LG | 17.81 | 60.34 / 59.57 / 54.32 | 473 / 402 / 276 | 44.15
Semi-Dense | LoFTR | 17.99 | 59.96 / 59.96 / 50.59 | 548 / 422 / 254 | 135.09
Semi-Dense | AspanFormer | 17.87 | 62.92 / 59.10 / 49.96 | 1368 / 887 / 556 | 159.16
Semi-Dense | EfficientLoFTR | 17.87 | 65.78 / 63.62 / 57.65 | 1849 / 1483 / 971 | 112.60
Semi-Dense | EfficientLoFTR (opt.) | 17.94 | 63.85 / 57.78 / 42.30 | 3873 / 3311 / 2639 | 77.53
Semi-Dense | LiteSAM | 17.86 | 66.66 / 65.37 / 61.65 | 2096 / 1662 / 1065 | 83.79
Semi-Dense | LiteSAM (opt.) | 17.98 | 65.09 / 61.34 / 46.16 | 4227 / 3587 / 2826 | 60.97
Dense | DKM | 17.87 | 59.11 / 57.29 / 52.52 | 10,000 / 10,000 / 10,000 | 498.88
Dense | ROMA | 17.81 | 64.13 / 60.67 / 52.70 | 10,000 / 10,000 / 10,000 | 688.32
Table 4. Absolute visual localization results on the self-made dataset. Results include RMSE within 30 m, average hit rate, number of correspondences, inference time, and FLOPs across varying difficulty levels. In the Semi-Dense category, the best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Category | Method | RMSE@30 ↓ | Avg.HR (%) ↑ (Easy/Moderate/Hard) | Num of Points (Easy/Moderate/Hard) | Time (ms) | FLOPs (G)
Sparse | SP+NN | 13.26 | 30.27 / 21.37 / 14.08 | 1123 / 1103 / 1071 | 20.64 | 148.7
Sparse | SP+SG | 5.70 | 71.28 / 64.96 / 54.78 | 492 / 371 / 233 | 142.28 | 967.9
Sparse | SP+LG | 6.76 | 78.85 / 70.03 / 58.31 | 622 / 482 / 322 | 49.49 | 942.16
Semi-Dense | LoFTR | 8.74 | 63.03 / 51.71 / 41.09 | 288 / 147 / 87 | 145.61 | 1570.92
Semi-Dense | AspanFormer | 7.93 | 73.52 / 59.85 / 42.03 | 863 / 515 / 272 | 163.45 | 1738.81
Semi-Dense | EfficientLoFTR | 7.28 | 90.03 / 79.79 / 61.84 | 1540 / 1114 / 728 | 120.72 | 1036.61
Semi-Dense | EfficientLoFTR (opt.) | 8.43 | 88.95 / 75.84 / 54.17 | 4526 / 3953 / 3382 | 80.34 | 1033.54
Semi-Dense | LiteSAM | 6.12 | 92.09 / 87.88 / 77.30 | 1905 / 1406 / 888 | 85.31 | 588.51
Semi-Dense | LiteSAM (opt.) | 6.89 | 92.34 / 86.93 / 72.10 | 4356 / 3743 / 3128 | 61.98 | 586.06
Dense | DKM | 5.86 | 77.09 / 75.15 / 71.06 | 10,000 / 10,000 / 10,000 | 512.71 | 3022.67
Dense | ROMA | 5.60 | 92.00 / 86.11 / 76.10 | 10,000 / 10,000 / 10,000 | 691.48 | 6750.08
Table 5. Visual localization results on the Aachen Day-Night v1.1 and InLoc datasets. The visual localization performance is evaluated on the Aachen Day-Night v1.1 [48] and InLoc [49] datasets using the HLoc framework [59]. In the Semi-Dense category, the best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Category | Method | Aachen v1.1 Day | Aachen v1.1 Night | InLoc DUC1 | InLoc DUC2
(each cell: (0.25 m, 2°) / (0.5 m, 5°) / (1.0 m, 10°))
Sparse | SP+SG | 89.8 / 96.1 / 99.4 | 77.0 / 90.6 / 100.0 | 49.0 / 68.7 / 80.8 | 53.4 / 77.1 / 82.4
Sparse | SP+LG | 89.2 / 95.4 / 98.5 | 87.8 / 93.9 / 100.0 | 49.0 / 68.2 / 79.3 | 55.0 / 74.8 / 79.4
Semi-Dense | LoFTR | 88.7 / 95.6 / 99.0 | 78.5 / 91.1 / 99.0 | 47.5 / 72.2 / 85.5 | 54.2 / 74.8 / 85.5
Semi-Dense | MatchFormer | - | - | 46.5 / 73.2 / 85.9 | 55.7 / 71.8 / 85.5
Semi-Dense | ASpanFormer | 89.4 / 95.6 / 99.0 | 77.5 / 91.6 / 99.5 | 51.5 / 73.7 / 86.4 | 55.0 / 74.0 / 85.5
Semi-Dense | TopicFM | 90.2 / 95.9 / 98.9 | 77.5 / 91.1 / 99.5 | 52.0 / 74.7 / 87.4 | 53.4 / 74.8 / 83.2
Semi-Dense | RCM | 89.7 / 96.0 / 98.7 | 72.8 / 91.6 / 99.0 | - | -
Semi-Dense | SAM | 89.7 / 95.8 / 99.0 | 78.6 / 91.8 / 100.0 | 51.8 / 73.9 / 87.8 | 56.0 / 75.8 / 83.1
Semi-Dense | EfficientLoFTR | 89.6 / 96.2 / 99.0 | 77.0 / 91.1 / 99.5 | 52.0 / 74.7 / 86.9 | 58.0 / 80.9 / 89.3
Semi-Dense | LiteSAM | 88.6 / 96.0 / 98.8 | 76.4 / 91.1 / 99.5 | 52.0 / 73.7 / 86.4 | 60.3 / 85.5 / 89.3
Semi-Dense | LiteSAM (opt.) | 88.6 / 95.5 / 98.8 | 76.4 / 90.6 / 99.5 | 54.0 / 74.2 / 86.9 | 55.7 / 80.9 / 85.5
Dense | DKM | - | - | 51.5 / 75.3 / 86.9 | 63.4 / 82.4 / 87.8
Dense | ROMA | - | - | 60.6 / 79.3 / 89.9 | 66.4 / 83.2 / 87.8
Table 6. Homography estimation results on the HPatches dataset. Performance comparison of feature matching methods based on the percentage of image pairs with average corner error below predefined pixel thresholds. All images are resized to 480 pixels (short side), and homographies are estimated from 1000 correspondences using a RANSAC-based solver. In the Semi-Dense category, the best result is in red, the second-best is in orange, and the third-best is in blue. Methods marked with † are tested under the same experimental conditions as LiteSAM for a fair comparison. The direction indicated by the arrows denotes superior performance.
Category | Method | Homography est. AUC @3 px / @5 px / @10 px ↑
Sparse | D2Net+NN | 23.2 / 35.9 / 53.6
Sparse | R2D2+NN | 50.6 / 63.9 / 76.8
Sparse | DISK+NN | 52.3 / 64.9 / 78.9
Sparse | SP+SG | 53.9 / 68.3 / 81.7
Sparse | SP+LG † | 60.8 / 72.3 / 84.0
Semi-Dense | Sparse-NCNet | 48.9 / 54.2 / 67.1
Semi-Dense | DRC-Net | 50.6 / 56.2 / 68.3
Semi-Dense | LoFTR | 65.9 / 75.6 / 84.6
Semi-Dense | EfficientLoFTR | 66.5 / 76.4 / 85.5
Semi-Dense | EfficientLoFTR (opt.) † | 65.1 / 75.2 / 84.8
Semi-Dense | AspanFormer | 67.4 / 76.9 / 85.6
Semi-Dense | TopicFM | 67.3 / 77.0 / 85.7
Semi-Dense | GeoAT | 69.1 / 78.2 / 87.1
Semi-Dense | ASpan_Homo | 70.2 / 79.6 / 87.8
Semi-Dense | JamMa | 68.1 / 77.0 / 85.4
Semi-Dense | LiteSAM | 67.3 / 77.2 / 86.1
Semi-Dense | LiteSAM (opt.) | 65.8 / 76.2 / 85.6
Dense | DKM | 71.3 / 80.6 / 88.5
Dense | Pmatch | 71.9 / 80.7 / 88.5
Table 7. Ablation study on the self-made dataset. AVL performance on the self-made dataset, reported in terms of RMSE@30 m, average hit rate, number of correspondences, inference time, and computational cost (FLOPs) across different architectural variants of LiteSAM. The best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Method | RMSE@30 ↓ | Avg.HR (%) ↑ (Easy/Moderate/Hard) | Num of Points (Easy/Moderate/Hard) | FLOPs (G) | Time (ms)
Full | 6.12 | 92.09 / 87.88 / 77.30 | 1905 / 1406 / 888 | 588.51 | 85.31
(1) Replace MobileOne with RepVGG | 6.46 | 92.95 / 89.81 / 79.53 | 1520 / 1132 / 756 | 1066.24 | 100.77
(2) Replace TAIFormer with Agg. Attention | 6.94 | 89.68 / 81.38 / 63.99 | 1477 / 1091 / 727 | 697.11 | 99.72
(3) Replace MinGRU with heatmap refinement | 6.39 | 91.75 / 86.67 / 73.45 | 1759 / 1317 / 859 | 588.49 | 86.49
(4) Remove MinGRU refinement | 6.47 | 90.14 / 83.71 / 70.49 | 1864 / 1385 / 867 | 588.49 | 83.94
(5) Our Optimal (w/o dual softmax) | 6.89 | 92.34 / 86.93 / 72.10 | 4356 / 3743 / 3128 | 586.06 | 61.98
Table 8. Ablation study on the MegaDepth dataset. Relative pose estimation performance of LiteSAM variants, trained and tested on MegaDepth at 1184 × 1184 resolution, reporting pose estimation AUC and inference times. The best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Method | Pose Estimation AUC@5°/@10°/@20° ↑ | Time (ms)
Full | 56.1 / 72.0 / 83.4 | 133.0
(1) Replace MobileOne with RepVGG | 55.7 / 71.8 / 83.3 | 147.8
(2) Replace MobileOne with ResNet | 55.4 / 71.7 / 83.2 | 162.4
(3) Replace TAIFormer with Agg. Attention | 55.0 / 71.1 / 82.7 | 141.4
(4) Replace MinGRU with heatmap refinement | 55.6 / 71.9 / 83.4 | 133.4
(5) Remove MinGRU refinement | 55.4 / 71.6 / 83.2 | 131.4
(6) Our Optimal (w/o dual softmax) | 56.3 / 72.1 / 83.4 | 79.9
Table 9. Impact of image resolution on localization performance on the MegaDepth dataset. AUC of pose estimation and inference time for LiteSAM (original) with dual softmax and LiteSAM (opt.) without dual softmax. The best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Resolution | Dual Softmax | Pose Estimation AUC@5°/@10°/@20° ↑ | Prec ↑ | Num of Points | Time (ms)
640 × 640 | ✓ | 50.7 / 67.5 / 79.9 | 97.8 | 1068.6 | 38.9
640 × 640 | - | 50.7 / 67.5 / 79.9 | 96.6 | 1085.6 | 30.7
800 × 800 | ✓ | 53.2 / 69.9 / 81.9 | 98.0 | 1639.2 | 55.7
800 × 800 | - | 53.5 / 69.8 / 81.7 | 96.7 | 1682.3 | 38.2
960 × 960 | ✓ | 55.0 / 71.1 / 82.9 | 98.2 | 2320.3 | 77.5
960 × 960 | - | 55.3 / 71.3 / 82.9 | 96.7 | 2403.7 | 50.1
1184 × 1184 | ✓ | 56.1 / 72.0 / 83.4 | 98.0 | 3449.3 | 133.0
1184 × 1184 | - | 56.3 / 72.1 / 83.4 | 96.1 | 3619.0 | 79.9
1408 × 1408 | ✓ | 56.3 / 72.4 / 83.9 | 98.0 | 4765.3 | 227.5
1408 × 1408 | - | 56.8 / 72.6 / 83.9 | 95.5 | 5065.1 | 132.5
Table 10. Localization performance on the self-made dataset using NVIDIA Jetson AGX Orin. Results include RMSE within 30 m, average hit rate, number of correspondences, and inference time across difficulty levels. The best result is in red, the second-best is in orange, and the third-best is in blue. The direction indicated by the arrows denotes superior performance.
Method | RMSE@30 ↓ | Avg.HR (%) ↑ (Easy/Moderate/Hard) | Num of Points (Easy/Moderate/Hard) | Time (ms)
EfficientLoFTR | 7.53 | 90.76 / 78.63 / 59.24 | 2443 / 2011 / 1620 | 758.84
EfficientLoFTR (opt.) | 8.31 | 88.87 / 75.67 / 53.44 | 4572 / 4011 / 3492 | 620.67
LiteSAM | 6.17 | 92.69 / 89.08 / 77.21 | 2691 / 2113 / 1578 | 600.37
LiteSAM (opt.) | 6.72 | 92.48 / 86.20 / 72.23 | 4388 / 3821 / 3220 | 497.49