Article

A Cross-Mamba Interaction Network for UAV-to-Satellite Geolocalization

1 College of Mechatronical Engineering, Beijing Institute of Technology, Beijing 100081, China
2 School of Artificial Intelligence, Dalian Maritime University, Dalian 116024, China
3 Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
4 Computational Intelligence Center (CIC), School of Computer Science and Technology, Shandong Jianzhu University, Jinan 250102, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(6), 427; https://doi.org/10.3390/drones9060427
Submission received: 28 March 2025 / Revised: 31 May 2025 / Accepted: 11 June 2025 / Published: 12 June 2025

Abstract

The geolocalization of unmanned aerial vehicles (UAVs) in satellite-denied environments has emerged as a key research focus. Recent advancements in this area have been largely driven by learning-based frameworks that utilize convolutional neural networks (CNNs) and Transformers. However, both CNNs and Transformers face challenges in capturing global feature dependencies due to their restricted receptive fields. Inspired by state-space models (SSMs), which have demonstrated efficacy in modeling long sequences, we propose a pure Mamba-based method called the Cross-Mamba Interaction Network (CMIN) for UAV geolocalization. CMIN consists of three key components: feature extraction, information interaction, and feature fusion. It leverages Mamba’s strengths in global information modeling to effectively capture feature correlations between UAV and satellite images over a larger receptive field. For feature extraction, we design a Siamese Feature Extraction Module (SFEM) based on two basic vision Mamba blocks, enabling the model to capture the correlation between UAV and satellite image features. In terms of information interaction, we introduce a Local Cross-Attention Module (LCAM) to fuse cross-Mamba features, providing a solution for feature matching via deep learning. By aggregating features from various layers of SFEMs, we generate heatmaps for the satellite image that help determine the UAV’s geographical coordinates. Additionally, we propose a Center Masking strategy for data augmentation, which promotes the model’s ability to learn richer contextual information from UAV images. Experimental results on benchmark datasets show that our method achieves state-of-the-art performance. Ablation studies further validate the effectiveness of each component of CMIN.

1. Introduction

With the continuous advancement of drone technology, unmanned aerial vehicles (UAVs) have found widespread applications across various fields, including agriculture [1,2], equipment inspection [3], and search and rescue [4,5]. UAVs typically rely on Global Positioning System (GPS) data provided by satellite signals for navigation and positioning. However, in real-world scenarios, GPS signals can be weakened or lost entirely due to factors like terrain or signal obstruction, resulting in satellite-denied environments. In these situations, UAVs face a heightened risk of operational failure or loss of control. As a result, enabling reliable positioning and navigation in challenging environments has become a critical area of research.
In practice, even in GPS-denied environments, it is often feasible to obtain satellite imagery of the UAV’s operating area from publicly available sources. Many regions, especially those involved in urban monitoring, precision agriculture, or emergency response, have existing high-resolution satellite coverage. Alternatively, satellite images can be retrieved in near real-time from online services such as Google Earth, NASA WorldView, Bing Maps, or Tianditu. These satellite views, when combined with UAV-captured images, enable cross-view image-based localization, offering a promising substitute for GPS in challenging scenarios. Many studies leverage visual information to perform UAV geolocation and navigation tasks [6], among which the keypoint matching-based strategy [7,8,9] is a well-established method, where feature points are extracted from images for matching. However, most hand-crafted keypoints, such as SIFT [10], and their feature descriptors exhibit limited robustness when image quality varies significantly across domains. In recent years, deep learning has emerged as a powerful alternative and has been successfully applied to UAV localization tasks. Current cross-view geolocalization technology for UAVs is mainly realized through two approaches: image retrieval [11,12,13,14,15] and finding points with images [16,17,18].
Some methods [11,12,13,14,15] perform UAV localization via image retrieval by dividing satellite images into patches and matching them with UAV images to estimate geographic position. While effective on benchmark datasets, these methods face challenges such as limited pixel-level positioning precision due to block-level segmentation and increased data redundancy from overlapping regions, which raise storage and computational demands. An alternative approach focuses on directly locating points through image matching. Dai et al. [16] first apply this to UAV localization. Wang et al. [17] extend this with a multi-feature fusion network, though its dual-stream design and complex relational modeling substantially increase model complexity. To mitigate this, Chen et al. [18] propose a streamlined one-stream framework integrating feature extraction and relational modeling, inspired by object tracking techniques that generate heatmaps to predict UAV locations. Current UAV localization methods predominantly rely on CNNs [11,13,15] and Transformers [12,14,16,17,18] as feature extractors. CNNs suffer from limited receptive fields and local biases due to weight sharing, reducing adaptability. Transformers, while capturing global context, incur high computational costs from quadratic attention complexity. Nonetheless, capturing global image context remains crucial for accurate heatmap-based UAV localization.
In recent years, Mamba [19,20,21] has garnered significant attention in the field of computer vision. These models, built on state-space models (SSMs), have shown great promise in global information modeling with linear complexity. In this paper, we explore leveraging the advantages of Mamba to construct a Cross-Mamba Interaction Network (CMIN) for UAV geolocalization. The proposed CMIN features a pyramid structure with three components: feature extraction, information interaction, and feature fusion. For feature extraction, we present a Siamese Feature Extraction Module (SFEM) using two basic Mamba blocks with shared parameters. This module helps identify the correlation of features between UAV and satellite images. For information interaction, we construct a Local Cross-Attention Module (LCAM) to fuse cross-Mamba features. Finally, we combine features from different levels of the pyramid to generate a heatmap for the satellite image, which helps determine the UAV’s geographical coordinates. Furthermore, we propose a Center Masking (CM) augmentation strategy, which encourages the model to learn richer contextual clues from UAV images. The contributions of this paper are summarized as follows:
(1)
We propose a simple and effective baseline pipeline, named CMIN, for UAV geolocalization. To the best of our knowledge, this is the first work to utilize a state-space model to capture the global correlation between UAV and satellite images over a larger receptive field.
(2)
We present an SFEM to extract shared features from both UAV and satellite images and employ an LCAM to fuse the cross-Mamba features. Moreover, a CM method is introduced for data augmentation, enabling the model to capture more detailed contextual clues.
(3)
Extensive experiments show that our method achieves state-of-the-art performance on the benchmark dataset. Specifically, our CMIN achieves 77.52% w.r.t. Relative Distance Score (RDS) metric on the UL14 dataset, outperforming a well-established method [17] by 12.19%. Ablation studies further demonstrate the effectiveness of CMIN for UAV localization.
The rest of the paper is organized as follows: Section 2 briefly reviews the related works. Section 3 describes the proposed method in detail. Model analysis and comparisons with the state-of-the-art methods are presented in Section 4. Finally, concluding remarks are made in Section 5.

2. Related Work

2.1. Cross-View Geolocalization

Early methods for image geolocalization primarily rely on image retrieval techniques, focusing on the matching between two ground images [22,23], as well as ground-to-aerial localization [24,25]. These methods localize the query image by finding its most similar match in datasets containing location information. However, significant feature discrepancies often arise between images captured from different viewpoints, making it difficult to learn consistent features across viewpoints in cross-view geolocalization. To tackle this challenge, Shi et al. [26] propose to align aerial images with satellite images using polar coordinate transformation, effectively bridging the differences between them. Zhu et al. [27] introduce a model that leverages a spatial hierarchical structure to learn viewpoint-invariant features in cross-view images. Zhu et al. [28] leverage the strengths of the Transformer in explicit position encoding, avoiding the reliance on polar transform for localization.
With the continuous advancement of UAV technology and satellite remote sensing, several methods have been proposed to address the challenges of UAV-view geolocalization. One line of works follows the image retrieval pipeline. Ding et al. [11] propose a cross-view image matching method based on location classification, considering the similarity between UAV and satellite views. Gong et al. [12] present a cross-view image geolocalization method that leverages multi-scale information and a dual-channel attention mechanism. Tian et al. [13] introduce an end-to-end cross-view matching approach that converts oblique-view UAV images into satellite images through perspective transformation and conditional generative adversarial networks. Dai et al. [14] propose a Transformer-based framework to understand contextual information and the distribution of instances. Wang et al. [15] offer a coarse-to-fine sequence-matching solution, which improves geolocalization accuracy by matching UAV images with a small set of relevant reference image patches rather than using the entire image database. However, cross-view geolocalization based on image retrieval heavily depends on the assumption that the database contains images aligned with the query image, a condition that does not always hold in real-world scenarios.
To address this issue, some works transform UAV geolocalization into the task of identifying specific points within images. Dai et al. [16] develop a two-stream network to extract features from both UAV and satellite images, where the point with the highest response value in the response map indicates the predicted position of the UAV image. Building on this approach, Wang et al. [17] propose a weight-adaptive multi-feature fusion module, which introduces a weighting mechanism to combine different features. Chen et al. [18] present a coarse-to-fine one-stream network that establishes an information transfer bridge between UAV and satellite images during the feature extraction process. However, existing methods, whether based on CNNs [11,13,15] or Transformers [12,14,16,17,18] as their network backbone, struggle to capture global correlations between UAV and satellite images due to their limited receptive fields.

2.2. Mamba in Computer Vision

State-space models (SSMs) [29,30,31,32,33] have recently gained prominence as efficient frameworks for modeling long-range dependencies with linear complexity. Among them, S4 [29] stands out as the first structured SSM explicitly designed for this purpose. Building on S4, S5 [30] proposes a diagonal approximation to the SSM, enabling recurrent computation through a parallel scan. Ma et al. [31] refine S4 by converting it into a real-valued representation, offering a new perspective as an exponential moving average. Expanding on these developments, subsequent methods [32,33] emphasize the convolutional interpretation of S4, crafting global or long convolution kernels with diverse parameterization strategies.
Mamba [34], an enhanced variant of state-space models (SSMs), incorporates learnable parameters into a selective scan mechanism, enabling it to adaptively extract relevant information in a data-dependent manner. With its exceptional computational efficiency and strong ability to model long-range dependencies, Mamba has gained significant attention in computer vision applications. Building on the principles of SSMs, Liu et al. [19] introduce vision Mamba, which employs a 2D selective scan scheme to effectively capture spatial information across cross-directions. This work represents the first attempt to harness Mamba’s strengths for computer vision tasks. Following this breakthrough, various SSMs with diverse structural innovations have emerged, showcasing remarkable performance in specific applications, such as image classification [35,36], segmentation [37,38], and restoration [20,21]. Although Mamba has achieved significant success in various vision tasks, its potential to capture global correlations between UAV and satellite images remains unexplored. In this study, we propose a novel Mamba-based framework for UAV geolocalization. Specifically, we employ one of the simplest Mamba blocks [19] as the fundamental building unit for feature extraction. As more sophisticated enhancements to the Mamba block continue to emerge, integrating any of these improvements into our framework is anticipated to yield further performance gains.

3. Method

In this paper, we propose a Cross-Mamba Interaction Network (CMIN) for UAV geolocalization. In Section 3.1, we provide an overview of the proposed method. In Section 3.2 and Section 3.3, we describe the two key modules of CMIN: the Siamese Feature Extraction Module (SFEM) and the Local Cross-Attention Module (LCAM), respectively. In Section 3.4, we introduce the data augmentation strategies and optimization objectives used to train CMIN.

3.1. Overview

In this paper, we propose a CMIN, with its network structure shown in Figure 1. The backbone of CMIN follows a pyramid structure, consisting of four stages. Before feeding the input image into the first stage, we use a stem module to divide it into patches. In each subsequent stage, SFEM and LCAM are alternately stacked, with SFEM extracting features and LCAM facilitating feature interaction. In the neck, a simple fusion module is employed to progressively integrate features from different stages. Finally, the fused features are passed to the head, where a linear fully connected layer regresses a heatmap.
Let $I_s$ and $I_u$ represent the satellite image and UAV image, respectively; we begin by partitioning them into patches using a stem module, preserving their 2D structure and generating feature maps with $C_1$ channels. These feature maps are then fed into the first stage, which comprises $n_1$ SFEMs and LCAMs stacked alternately. In each SFEM, two parameter-shared Mamba blocks are used to independently extract view-specific features from $I_s$ and $I_u$. In each LCAM, a local window attention mechanism captures fused cross-Mamba features, which are connected to the output of the next LCAM through a residual connection. The output features of the first stage can then be expressed as:
$(M_u^{1,n_1}, M_s^{1,n_1}) = S_1^{n_1}\big(\cdots S_1^{2}\big(S_1^{1}(I_u, I_s)\big)\big),$
where $S_1^{j}(\cdot)$ denotes the operation of the $j$-th SFEM in the first stage, and the cross-Mamba features can be represented as:
$F_1^{n_1} = L_1^{1}(M_u^{1,1}, M_s^{1,1}) + \cdots + L_1^{n_1}(M_u^{1,n_1}, M_s^{1,n_1}),$
where $L_1^{j}(\cdot)$ denotes the operation of the $j$-th LCAM in the first stage, “+” denotes the element-wise addition operation, and $M_u^{1,j}$ and $M_s^{1,j}$ represent the features of $I_u$ and $I_s$ from the $j$-th SFEM, respectively. Subsequently, we feed $M_u^{1,n_1}$ and $M_s^{1,n_1}$ into the following stage to extract deeper features. In the neck, we use three fusion modules to further fuse the cross-stage features from the four stages. In each fusion module, the low-resolution features are first upsampled by a factor of 2 and then concatenated channel-wise with the high-resolution features. The features produced by the neck can be represented as:
$F_{out} = H_3\big(\big[\uparrow H_2\big(\big[\uparrow H_1\big(\big[\uparrow F_4^{n_4}, F_3^{n_3}\big]\big), F_2^{n_2}\big]\big), F_1^{n_1}\big]\big),$
where ↑ denotes the upsampling operation, $H_i(\cdot)$ indicates the $i$-th $1 \times 1$ convolution operation, $[\cdot]$ denotes channel-wise concatenation, $F_z^{n_z}$ denotes the cross-stage features obtained from the $z$-th stage, and $n_z$ indicates the number of LCAMs in the $z$-th stage. Finally, we pass $F_{out}$ through a linear layer in the head to regress the heatmap, denoted as $y$, which has the same resolution as $I_s$. The coordinates of $I_s$ corresponding to the maximum value in $y$ are then used as the center point coordinates for $I_u$, enabling UAV geolocalization.
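To make this data flow concrete, the sketch below re-implements the neck fusion and heatmap readout in PyTorch. It is an illustrative reconstruction rather than the authors’ code: the layer names, channel counts, bilinear upsampling, and the $1 \times 1$ convolution standing in for the linear head are all assumptions.

```python
# Illustrative sketch of the CMIN neck and head (not the released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNeck(nn.Module):
    def __init__(self, channels=(96, 192, 384, 768), out_dim=64):
        super().__init__()
        # H_i: 1x1 convolutions applied after concatenating upsampled low-res features
        self.h1 = nn.Conv2d(channels[3] + channels[2], channels[2], 1)
        self.h2 = nn.Conv2d(channels[2] + channels[1], channels[1], 1)
        self.h3 = nn.Conv2d(channels[1] + channels[0], out_dim, 1)
        self.head = nn.Conv2d(out_dim, 1, 1)  # stands in for the linear heatmap head

    def forward(self, f1, f2, f3, f4, sat_size):
        up = lambda x: F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = self.h1(torch.cat([up(f4), f3], dim=1))          # fuse stage 4 into stage 3
        x = self.h2(torch.cat([up(x), f2], dim=1))           # fuse into stage 2
        x = self.h3(torch.cat([up(x), f1], dim=1))           # fuse into stage 1
        y = F.interpolate(self.head(x), size=sat_size, mode="bilinear", align_corners=False)
        return y                                             # heatmap at satellite resolution

def heatmap_to_coords(y):
    """Take the argmax of the heatmap as the predicted UAV center in satellite pixels."""
    b, _, h, w = y.shape
    idx = y.view(b, -1).argmax(dim=1)
    return torch.stack([idx % w, idx // w], dim=1)           # (x, y) pixel coordinates
```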
From the above process, it is clear that SFEM and LCAM are the core modules of our method. Additionally, during the model training phase, we introduce a new data augmentation strategy that aids the model in capturing richer contextual cues, leading to more accurate heatmap regression. We will now provide a detailed explanation of these components.

3.2. Siamese Feature Extraction Module

Existing UAV geolocalization methods primarily rely on CNNs [11,13,15] and Transformers [12,14,16,17,18] to build models. While CNN-based methods are limited by their inability to capture long-range dependencies and have a restricted receptive field, Transformer-based methods suffer from unacceptable quadratic computational complexity when used directly. To address these challenges, Mamba blocks have been introduced. They offer linear computational complexity while effectively capturing a larger receptive field. Leveraging this advantage, we use Mamba blocks (i.e., Vanilla VSS Blocks [19]) as the building blocks of our feature extraction network, enabling the extraction of richer contextual clues from the pair of UAV and satellite images.
Due to significant differences in viewpoint, resolution, and lighting between UAV and satellite images, preserving feature similarity across views is critical during feature extraction. Key features to maintain include local structures, textures, and semantic information that correspond to the same geographic regions. To achieve this, we adopt the Siamese Feature Extraction Module (SFEM), shown in Figure 2a, which consists of two parameter-shared Mamba blocks encoding UAV and satellite images separately. The shared parameters enforce consistent feature representations and reduce discrepancies caused by viewpoint and scale differences. By structurally constraining feature extraction, SFEM enhances alignment of corresponding regions in the feature space, focusing on view-invariant structure and semantics. This design guides the model to learn stable and similar features across views, providing a reliable foundation for subsequent matching and localization.
In Figure 2b, we illustrate the detailed structure of each Mamba block, inspired by the approach in [19]. The input feature first undergoes Layer Normalization (LN) for initialization. The normalized output is then split into two branches to process different aspects of the feature representation: Branch A passes through a single linear transformation, serving as a reference or gating signal. Branch B undergoes a linear transformation, followed by a 3 × 3 depth-wise convolution [39], and then the core SS2D module, which models spatial dependencies in a structured state-space manner. The output of SS2D is normalized via LN and element-wise multiplied with the output from Branch A, enabling feature gating and modulation. Finally, a residual connection is added to preserve the original input features, producing the final output of the Mamba block.
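A simplified PyTorch sketch of this block and the Siamese wrapper of Figure 2a is given below. It is an assumption-laden reconstruction: the SS2D stub only fixes the interface of the 2D selective scan (the real operator follows [19]; see the S6 recurrence sketch later in this subsection), and the SiLU activations and expansion ratio are borrowed from the VMamba design rather than stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SS2DStub(nn.Module):
    """Placeholder for the 2D selective scan (SS2D) of VMamba."""
    def forward(self, x):                      # x: (B, H, W, C)
        return x

class VSSBlock(nn.Module):
    def __init__(self, dim, expand=2):
        super().__init__()
        hidden = dim * expand
        self.norm_in = nn.LayerNorm(dim)
        self.proj_a = nn.Linear(dim, hidden)   # Branch A: gating / reference signal
        self.proj_b = nn.Linear(dim, hidden)   # Branch B: main spatial path
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depth-wise 3x3
        self.ss2d = SS2DStub()
        self.norm_out = nn.LayerNorm(hidden)
        self.proj_out = nn.Linear(hidden, dim)

    def forward(self, x):                      # x: (B, H, W, C)
        shortcut = x
        x = self.norm_in(x)
        a = F.silu(self.proj_a(x))                               # Branch A
        b = self.proj_b(x).permute(0, 3, 1, 2)                   # to (B, C, H, W) for conv
        b = F.silu(self.dwconv(b)).permute(0, 2, 3, 1)           # back to (B, H, W, C)
        b = self.norm_out(self.ss2d(b))                          # SS2D + LayerNorm
        return shortcut + self.proj_out(a * b)                   # gating, then residual

class SFEM(nn.Module):
    """Siamese Feature Extraction Module: one weight-shared VSS block for both views."""
    def __init__(self, dim):
        super().__init__()
        self.block = VSSBlock(dim)             # shared parameters enforce consistent features

    def forward(self, f_uav, f_sat):
        return self.block(f_uav), self.block(f_sat)
```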
In Figure 2c, we present the operation flow of 2D-Selective Scan (SS2D). The Selective Scan State Space Sequential Model (S6) [34] processes the input data causally, which limits it to capturing information only within the scanned portion. This makes S6 unsuitable for handling images, which have non-causal properties. SS2D [19] addresses this issue by incorporating the Cross-Scan Module (CSM) strategy. First, image patches are expanded into sequences along the rows and columns (i.e., scan expand) and then scanned in four directions: top-left to bottom-right, bottom-right to top-left, top-right to bottom-left, and bottom-left to top-right. In this way, each pixel integrates information from all other pixels in different directions. Assume the one-dimensional input is denoted as $x_t \in \mathbb{R}^{L}$; the formulation of the S6 block can be expressed as:
$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t,$
where $A \in \mathbb{R}^{N \times N}$ is the state matrix, while $B \in \mathbb{R}^{N \times 1}$ and $C \in \mathbb{R}^{N \times 1}$ denote the projection parameters. Finally, the processed sequences from the four directions are aggregated through scan merge to reconstruct the 2D feature map. The integration of S6 and CSM, referred to as the S6 block, serves as the core component of the Mamba block. The reconstructed feature map effectively models long-range dependencies and captures rich global information.
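The following is a didactic sketch of the recurrence above, run over a single scan direction with fixed (non-selective) parameters; real Mamba implementations make the parameters input-dependent and use a hardware-efficient parallel scan, both of which are omitted here.

```python
import torch

def s6_scan(x, A, B, C):
    """x: (L,) 1-D input sequence; A: (N, N) state matrix; B, C: (N, 1) projections."""
    N = A.shape[0]
    h = torch.zeros(N, 1)
    ys = []
    for x_t in x:
        h = A @ h + B * x_t              # h_t = A h_{t-1} + B x_t
        ys.append((C.T @ h).squeeze())   # y_t = C h_t (read as a projection back to a scalar)
    return torch.stack(ys)

# SS2D runs this kind of scan along four directions (rows/columns, forward/backward)
# and merges the four outputs back into a 2D feature map (scan merge).
y = s6_scan(torch.randn(16), 0.9 * torch.eye(4), torch.randn(4, 1), torch.randn(4, 1))
```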

3.3. Local Cross-Attention Module

The Transformer model [40,41], with its self-attention mechanism, effectively captures global long-range dependencies in word sequences. This mechanism allows for matching between words in different positions within a sequence by leveraging query-key correlations. In the self-attention, query-key similarity is computed through dot products, demonstrating its ability to capture spatial relationships. Therefore, it is crucial to explore the application of the attention mechanism to UAV localization, aiming to achieve explicit feature matching.
In this paper, we employ a Local Cross-Attention Module (LCAM) to enable cross-Mamba feature interaction. As described in Equation (2), we first extract Mamba features from UAV and satellite images using SFEM, then feed them into LCAM to extract cross-Mamba features. As shown in Figure 1, at each stage, we combine the features from LCAM using residual connections to produce the output for that stage. This approach allows the cross-view similar information at the same scale to accumulate and become increasingly richer as the stages progress. Features from different stages capture cross-view information at various scales, which are progressively fused in the neck. The attention mechanism in LCAM enables explicit feature matching across views at different depths and scales.
As shown in Figure 3, the query Q is derived from satellite image features, while the key K and value V come from UAV image features, forming a cross-attention mechanism. In the geolocalization context, the satellite image acts as the moving image to be aligned, and the UAV image serves as the fixed reference. This asymmetric design enables the model to query satellite features within a stable UAV feature space, effectively capturing location-sensitive feature correspondences. For example, consider a satellite feature patch $P_s$ and its corresponding UAV patch $P_u$ at the same spatial position. The query associated with $P_s$ attends only to the key and value patches within a local neighborhood around $P_u$. Specifically, an attention weight matrix is computed using Q and K to measure cross-view feature similarity across spatial locations. This weight matrix is then used to aggregate V, producing enhanced similarity-aware feature representations. The attention computation in LCAM is defined as:
$\mathrm{LCAM}(Q_s, K_u, V_u) = \mathrm{softmax}\!\left(\frac{Q_s K_u^{T}}{\sqrt{d}}\right) V_u,$
where $Q_s, K_u, V_u \in \mathbb{R}^{P_x P_y P_z \times d}$ denote the query, key, and value matrices, respectively; $d$ is the feature dimension, and $P_x P_y P_z$ represents the number of tokens within a local 3D window. By treating the satellite and UAV images as moving and fixed views, respectively, LCAM effectively captures cross-Mamba features that enhance spatial correspondence and substantially improve the performance of cross-view matching.
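A minimal sketch of this cross-attention is shown below: the satellite features provide the query and the UAV features provide the key and value, computed inside pre-partitioned local windows. The window partitioning itself is omitted and the shapes are illustrative assumptions.

```python
import torch

def local_cross_attention(q_sat, k_uav, v_uav):
    """q_sat, k_uav, v_uav: (num_windows, tokens_per_window, d) tensors whose local
    windows are already spatially aligned across the two views."""
    d = q_sat.shape[-1]
    attn = torch.softmax(q_sat @ k_uav.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v_uav   # similarity-aware satellite features

# toy usage: 4 local windows, 49 tokens per window, 96-dim features
out = local_cross_attention(torch.randn(4, 49, 96),
                            torch.randn(4, 49, 96),
                            torch.randn(4, 49, 96))
```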

3.4. Training Details

During training, we adopt the Random Scale Crop (RSC) strategy from [16] to enhance the diversity of satellite images by varying their pixel resolutions and spatial offsets. However, this strategy alone is insufficient. The learning objective of our CMIN model is to search the satellite image for the region corresponding to the center of the UAV image. As training progresses, the model increasingly focuses on the central region of the UAV image, gradually overlooking a large portion of the contextual information. To address this issue, we propose a simple yet effective data augmentation technique called Center Masking (CM) to improve the diversity of UAV images.
For satellite images, the process of RSC augmentation is shown on the left side of Figure 4. The RSC method involves two hyperparameters: C and S. Suppose the height and width of the satellite image $I_s$ are $H_s$ and $W_s$, respectively. First, a local image $I_s^1$ is cropped from $I_s$, centered around the UAV position, with dimensions $H_s \times C$ and $W_s \times C$. Then, random candidate points are sampled from $I_s^1$. For each, a new image $I_s^2$ of size $S \times S$ is cropped around the point, where S is randomly selected between 512 and 1000 pixels to simulate variations in spatial resolution. Any pixels outside the boundaries of $I_s$ are padded with 0. In simple terms, C defines the size of the red region where the candidate points can be distributed, and different values of S are used to generate candidate images at various scales. The RSC augmentation strategy is designed to improve the model’s robustness to scale and offset variations. It is worth noting that geometric scaling is not applied, as abrupt variations in object scale between UAV and satellite images are unlikely to occur in real-world scenarios.
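An illustrative re-implementation of this procedure is given below: crop a C-scaled region around the UAV position, sample a candidate center inside it, and crop an S × S window around that center with zero padding outside the image. The sampling details only loosely follow [16] and should be read as assumptions.

```python
import random
import numpy as np

def random_scale_crop(sat, uav_xy, C=0.85, s_range=(512, 1000)):
    """sat: (H, W, 3) satellite image; uav_xy: (x, y) UAV position in satellite pixels."""
    H, W = sat.shape[:2]
    x0, y0 = uav_xy
    # sample a candidate center inside the C-scaled region centered on the UAV position
    cx = int(np.clip(x0 + random.randint(-int(W * C) // 2, int(W * C) // 2), 0, W - 1))
    cy = int(np.clip(y0 + random.randint(-int(H * C) // 2, int(H * C) // 2), 0, H - 1))
    S = random.randint(*s_range)                       # random spatial resolution
    out = np.zeros((S, S, 3), dtype=sat.dtype)         # pixels outside the image stay 0
    ys, ye = max(cy - S // 2, 0), min(cy + S // 2, H)
    xs, xe = max(cx - S // 2, 0), min(cx + S // 2, W)
    oy, ox = ys - (cy - S // 2), xs - (cx - S // 2)
    out[oy:oy + (ye - ys), ox:ox + (xe - xs)] = sat[ys:ye, xs:xe]
    return out
```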
As shown on the right side of Figure 4, the core idea behind the proposed CM strategy is to reduce the network’s over-reliance on the central region of the UAV image. To achieve this, we first generate a circular mask centered at the UAV’s position with a radius of R. The central region of the UAV image is then masked by setting the pixel values within the mask to zero, producing the augmented UAV image $I_u^1$. By varying R, we control the mask size, which in turn affects localization accuracy. This data augmentation method for UAV images offers two main advantages: First, by always covering the center, it simulates the loss of central information, encouraging the model to focus more on the surrounding background for UAV localization. Second, the circular shape of the mask provides symmetry and invariance, improving the robustness of the augmentation.
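A short sketch of this masking step follows: zero out a circular region of radius R (expressed as a fraction of the image width, as in Section 4.3) around the UAV image center, which coincides with the UAV’s position.

```python
import numpy as np

def center_mask(uav_img, radius_ratio=0.20):
    """uav_img: (H, W, 3) array; radius_ratio: mask radius as a fraction of the width."""
    H, W = uav_img.shape[:2]
    r = radius_ratio * W
    yy, xx = np.ogrid[:H, :W]
    mask = (xx - W / 2) ** 2 + (yy - H / 2) ** 2 <= r ** 2   # circular region at the center
    out = uav_img.copy()
    out[mask] = 0
    return out
```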
During the model training phase, we use a mixed loss function to optimize the model: (1) We input $I_s^2$ and $I_u$ into CMIN to obtain the predicted heatmap $y_1$. Following [16], we optimize the parameters of CMIN by minimizing the Weighted Balance Loss (WBL) $\mathcal{L}_{wbl}$ between the ground truth $\hat{y}$ and $y_1$. (2) We input $I_s^2$ and $I_u^1$ into CMIN to obtain the predicted heatmap $y_2$. Since the central region of $I_u^1$ is masked, we expect $y_2$ to be similar to $y_1$, thereby enhancing the model’s ability to infer content about the central region from context cues. To achieve this, we use the Mean Squared Error (MSE) loss $\mathcal{L}_{mse}$ to optimize CMIN. As shown in Figure 5, we optimize the parameters of CMIN in each iteration by combining $\mathcal{L}_{wbl}$ and $\mathcal{L}_{mse}$, where $\mathcal{L}_{mse}$ is assigned a weight of $\alpha$.
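The sketch below expresses this mixed objective in PyTorch. It is hedged: the WBL stand-in and the stop-gradient on $y_1$ are assumptions, not the exact formulation of [16].

```python
import torch
import torch.nn.functional as F

def weighted_balance_loss(pred, target):
    # Stand-in only: the actual Weighted Balance Loss follows [16]; a binary
    # cross-entropy with logits is used here purely to keep the sketch runnable.
    return F.binary_cross_entropy_with_logits(pred, target)

def cmin_loss(model, sat, uav, uav_masked, gt_heatmap, alpha=0.3):
    y1 = model(sat, uav)                    # prediction from the original UAV image
    y2 = model(sat, uav_masked)             # prediction from the center-masked UAV image
    l_wbl = weighted_balance_loss(y1, gt_heatmap)
    l_mse = F.mse_loss(y2, y1.detach())     # pull y2 toward y1 (assumed stop-gradient)
    return l_wbl + alpha * l_mse
```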

4. Experiment

In this section, we first provide a detailed description of the experimental setup. We then present a comparison of the proposed CMIN with other state-of-the-art methods. Finally, we perform ablation studies and model analysis to evaluate the contribution of each component in our CMIN.

4.1. Experimental Settings

Dataset. We use the UL14 dataset [16] and UAV-VisLoc dataset [42] for training and evaluation. The UL14 dataset contains paired UAV and satellite images. UAV images are captured at three altitudes (80 m, 90 m, and 100 m), and are center-cropped and resized to 512 × 512 × 3 . Corresponding satellite patches are extracted based on the UAV’s GPS coordinates and resized to 1280 × 1280 × 3 , ensuring central alignment. The training set includes 6768 image pairs from 10 universities, while the test set contains 2331 pairs from 4 universities. The UAV-VisLoc dataset includes 6742 UAV images and 11 satellite images covering the respective flight areas. UAV images were captured at altitudes ranging from 400 m to 850 m, with a spatial resolution of 0.1–0.2 m/pixel. Satellite images, at 0.3 m/pixel, span diverse terrains such as urban areas, towns, farmland, and rivers. From the UAV images, 4500 are randomly selected, center-cropped, and resized to 512 × 512 for training. Corresponding 1280 × 1280 satellite patches centered at the UAV’s location are extracted to form the training pairs. The remaining UAV images comprise the test set. For each test pair, a satellite patch is generated by applying a random offset (0–512 pixels horizontally and vertically) around the ground truth location before cropping a 1280 × 1280 reference image.
Implementation Details. We implement the proposed CMIN using PyTorch 2.2.0 and Python 3.10. All experiments are conducted on an NVIDIA RTX A5000 GPU. We resize the drone images and satellite images to 128 × 128 × 3 and 384 × 384 × 3 , respectively, and then use them as inputs. Following [16], in the RSC, we set C = 0.85 and randomly select S from the range of 512 to 1000. All initial parameters of backbones are pre-trained on ImageNet [43]. The AdamW [44] optimizer is employed with a weight decay of 0.0005. The initial learning rate is set to 0.00015, and it is gradually reduced following a cosine annealing learning rate decay schedule. The batch size is set to 8, and the model is trained for 30 epochs until convergence.
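For reference, these optimization settings can be written as the following PyTorch setup sketch; `model`, `train_loader`, and `cmin_loss` are placeholders assumed from the sketches in Section 3 rather than parts of the released code.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1.5e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)

for epoch in range(30):                             # 30 epochs, batch size 8 in the loader
    for sat, uav, uav_masked, gt in train_loader:
        loss = cmin_loss(model, sat, uav, uav_masked, gt, alpha=0.3)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()                                # cosine annealing of the learning rate
```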
Evaluation Indicators. Following prior works [16,17,18], we adopt two commonly used metrics, meter-level accuracy (MA) and relative distance score (RDS), to evaluate model performance in UAV geolocalization. The MA metric is denoted MA@K and indicates the proportion of samples whose positioning error is less than K meters, as follows:
$\mathrm{MA@K} = \frac{\sum_{i=1}^{N} \mathbb{1}_{SD < K\,\mathrm{m}}}{N},$
$\mathbb{1}_{SD < K\,\mathrm{m}} = \begin{cases} 1, & SD < K\,\mathrm{m} \\ 0, & SD \geq K\,\mathrm{m} \end{cases},$
where K denotes an adjustable parameter, and SD indicates the real spatial distance in meters. Specifically, MA@K represents the proportion of samples with positioning errors within K meters relative to the total number of samples, N. The expression of SD is as follows:
$SD = \sqrt{(\Delta x)^2 + (\Delta y)^2},$
where $\Delta x$ represents the meter-level error in the longitude direction between the prediction and the ground truth, while $\Delta y$ represents the meter-level error in the latitude direction.
For RDS, a smaller pixel distance between the actual and predicted positions results in a score closer to 1, while a larger distance leads to a score closer to 0. Its formula is as follows:
$\mathrm{RDS} = e^{-10 \times \sqrt{\frac{\left(\frac{d_x}{w}\right)^2 + \left(\frac{d_y}{h}\right)^2}{2}}},$
where $d_x$ and $d_y$ denote the pixel distances between the actual and predicted positions along the horizontal and vertical axes, respectively, and $w$ and $h$ are the width and height of the satellite image.
In practice, MA reflects the gap between predicted and actual positions in the real world, while RDS measures the pixel distance error between them in the image.
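For completeness, a direct implementation of the two metrics as reconstructed above is sketched below; inputs are per-sample errors in meters for MA@K and in pixels for RDS.

```python
import numpy as np

def ma_at_k(dx_m, dy_m, K):
    """dx_m, dy_m: arrays of longitude/latitude errors in meters; returns MA@K."""
    sd = np.sqrt(np.asarray(dx_m) ** 2 + np.asarray(dy_m) ** 2)
    return float(np.mean(sd < K))

def rds(dx_px, dy_px, w, h):
    """dx_px, dy_px: arrays of pixel errors; w, h: satellite image width and height."""
    dx = np.asarray(dx_px, dtype=float)
    dy = np.asarray(dy_px, dtype=float)
    score = np.exp(-10.0 * np.sqrt(((dx / w) ** 2 + (dy / h) ** 2) / 2.0))
    return float(np.mean(score))
```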

4.2. Comparison with the State-of-the-Art Methods

We compare the proposed CMIN with previous methods, including image retrieval-based methods [7,11,14,45,46] and heatmap regression-based methods [16,17,18]. Following [19], we construct CMIN with three different architecture settings: (1) CMIN-T setup: $C_1 = 96$, $C_2 = 192$, $C_3 = 384$, $C_4 = 768$, $n_1 = 2$, $n_2 = 2$, $n_3 = 9$, and $n_4 = 2$; (2) CMIN-S setup: $C_1 = 96$, $C_2 = 192$, $C_3 = 384$, $C_4 = 768$, $n_1 = 2$, $n_2 = 2$, $n_3 = 27$, and $n_4 = 2$; and (3) CMIN-B setup: $C_1 = 128$, $C_2 = 256$, $C_3 = 512$, $C_4 = 1024$, $n_1 = 2$, $n_2 = 2$, $n_3 = 27$, and $n_4 = 2$.
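For reference, the three settings can be written as configuration dictionaries (field names are illustrative only):

```python
# Channel widths (C1..C4) and stage depths (n1..n4) for the three CMIN variants.
CMIN_VARIANTS = {
    "CMIN-T": {"channels": (96, 192, 384, 768),   "depths": (2, 2, 9, 2)},
    "CMIN-S": {"channels": (96, 192, 384, 768),   "depths": (2, 2, 27, 2)},
    "CMIN-B": {"channels": (128, 256, 512, 1024), "depths": (2, 2, 27, 2)},
}
```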
Table 1 presents the localization accuracy results on the UL14 dataset, from which we have the following conclusions: (1) Based on the MA metric, heatmap regression-based methods generally outperform image retrieval-based methods, indicating that block-level localization tends to introduce more errors compared to direct point-level localization. (2) OS-FPI [18] achieves higher accuracy than the other two heatmap regression-based methods [16,17], demonstrating that using a single-stream backbone network extracts more discriminative features for localization compared to two parallel, unrelated networks. (3) Both the proposed CMIN and OS-FPI provide models with different sizes for performance evaluation. In the lightweight setting, CMIN-T consistently outperforms OS-FPI* across all metrics. In the larger model settings, CMIN-S and CMIN-B achieve better results than OS-FPI in RDS and MA@20, which reflect large-range localization accuracy, but slightly underperform in MA@3 and MA@10. This difference mainly results from CMIN’s use of the Mamba backbone, which effectively models global feature dependencies, while OS-FPI focuses more on optimizing short-range localization accuracy. Figure 6 shows the visual localization results of different methods on four representative scenarios from the UL14 dataset. Red circles mark the ground-truth UAV positions in the satellite images, while the red regions in the heatmaps indicate the predicted areas by each method. Enlarged views highlight localization accuracy. Compared with WAMF [17] and FPI [16], our CMIN method achieves more accurate UAV localization.
Table 2 reports the localization accuracy on the UAV-VisLoc dataset. Compared with UL14, UAV-VisLoc presents more challenging and complex scenarios, where the substantial viewpoint variations between UAV and satellite images significantly hinder accurate localization across all methods. As a result, all approaches exhibit a notable performance drop across various metrics. Despite this, the proposed CMIN consistently delivers competitive localization performance across all evaluation metrics, surpassing existing state-of-the-art methods. These results demonstrate the effectiveness of CMIN for UAV-based localization and its strong generalization capability in diverse environments.

4.3. Ablation Study

Impact of network depth on performance. Following [19], we design CMIN with three architectural variants, detailed in Section 4.2. During training, CMIN-T, CMIN-S, and CMIN-B require approximately 2.7 h, 7 h, and 8.5 h to converge, respectively. Table 3 provides a comprehensive comparison of these variants. For computational complexity and inference speed, we use GFLOPs and FPS (Frames Per Second), respectively. A higher GFLOPs value indicates greater computational complexity, while a higher FPS value indicates faster processing speed. As network depth increases, positioning accuracy improves; however, this comes at the cost of a significantly larger number of parameters and reduced computational efficiency. In the following ablation experiments, we adopt the lightweight CMIN-T as the default configuration.
Impact of different network architectures on performance. As shown in Figure 1, we alternately connect SFEM and LCAM in each stage, with the former used for feature extraction and the latter for feature fusion. There is also an alternative way to combine the two. Specifically, we first connect several SFEMs in each stage and then use a single LCAM for cross-Mamba feature fusion. In Table 4, we compare the positioning accuracy of the two network architectures. The alternating approach we selected achieves higher positioning accuracy, indicating that frequent feature fusion at each stage helps facilitate more effective feature interaction.
Impact of different feature extraction blocks on localization performance. In this study, we adopt the vanilla VSS block [19] as the core component of SFEM, rather than introducing more complex Mamba variants. To provide a comprehensive analysis of the impact of different feature extraction blocks on localization performance, we evaluate SFEM using a variety of backbones, including CNN-based (ResNet [47]), Transformer-based (ViT-S and ViT-B [48]), and the more advanced Mamba-based LocalVim [49]. The results are summarized in Table 5. As shown in the table, the CMIN model equipped with the vanilla VSS block significantly outperforms ResNet, ViT-S, and ViT-B across all evaluation metrics (RDS, MA@3, MA@10, MA@20). This improvement is largely due to Mamba’s ability to capture long-range dependencies with linear computational complexity, enabling the model to extract richer and more informative global features from both UAV and satellite imagery. Moreover, replacing the VSS block with the more expressive LocalVim module further boosts performance, achieving an RDS of 78.67% and an MA@20 of 0.863. It is worth noting, however, that designing new Mamba variants is not the primary focus of this work. Rather, our aim is to demonstrate the modeling capabilities of Mamba in UAV geolocalization tasks, beyond simply improving metrics through stronger backbones. Nonetheless, these results underscore the strong potential of Mamba-based architectures for enhancing localization accuracy in complex visual scenarios.
Impact of satellite image size on performance. In Figure 7, we conduct experiments to investigate the localization performance of our proposed CMIN-T model on satellite images of varying sizes, evaluated using the meter-level accuracy (MA) metric at 10 m, 20 m, and 30 m levels. The x-axis represents the pixel dimensions of the satellite images, while the y-axis shows the positioning accuracy based on the MA metric. Contrary to intuition, the relationship between satellite image size and localization accuracy is non-monotonic. The proposed CMIN-T achieves the highest positioning accuracy around 900–1100 pixels (corresponding to a spatial dimension of approximately 230–280 m at the dataset’s resolution), with accuracy gradually declining as the pixel size increases beyond this range or decreases below it. Despite this sensitivity to image size, CMIN-T significantly outperforms other methods [16,17] across all tested sizes.
This non-monotonic behavior can be explained by analyzing the impact of satellite image size on feature representation and model performance. At smaller image sizes (e.g., below 900 pixels), the model captures finer details due to higher pixel density, but when the UAV’s center point is located near the edge of the satellite image, critical contextual information is cropped out. This loss of relevant spatial context leads to reduced localization accuracy, as the model struggles to correlate features between the UAV and satellite images effectively. Conversely, at larger image sizes (e.g., above 1100 pixels), the inclusion of excessive spatial information introduces irrelevant or noisy features, such as distant landmarks or unrelated terrain, which can dilute the model’s focus on the target region and impair localization precision. The optimal range of 900–1100 pixels strikes a balance, providing sufficient contextual information to capture relevant features while minimizing the inclusion of extraneous data that could confuse the model. For practical algorithm deployment, these findings suggest a clear criterion for selecting satellite image size. For UAVs operating at altitudes of 80–100 m, as in the UL14 dataset, we recommend using satellite images with a spatial dimension of approximately 230–280 m (equivalent to 900–1100 pixels at the dataset’s resolution). This range maximizes localization accuracy by ensuring that the model can leverage sufficient contextual information without being overwhelmed by irrelevant features.
Impact of Center Masking on performance. In this paper, we propose a Center Masking (CM) strategy for UAV image augmentation. As shown in Figure 4b, we use a mask with a radius of R to cover the central region of the UAV image. From Table 6, it is evident that applying CM augmentation to the UAV image significantly improves positioning accuracy, mainly because the model is encouraged to focus more on contextual content. We then explore the impact of different values of R on performance, where R represents the proportion of the UAV image’s width. When $R = 20\%$, our method achieves the highest positioning accuracy in both the RDS and MA metrics.
Impact of different loss weights on performance. In this paper, we optimize the model using a combined loss function comprising $\mathcal{L}_{wbl}$ and $\mathcal{L}_{mse}$, where $\mathcal{L}_{mse}$ is assigned a weight of $\alpha$. Table 7 presents the effect of varying $\alpha$ on the RDS and MA metrics. The results indicate that the highest positioning accuracy is achieved when $\alpha = 0.3$, suggesting that $\mathcal{L}_{wbl}$ should still play the dominant role in the joint optimization. In Figure 8, we visualize the positioning results. The model trained with $\mathcal{L}_{wbl}$ alone tends to focus on the central region, while training with $\mathcal{L}_{mse}$ encourages the model to infer the central region from background information. When both losses are used together, the model achieves more accurate positioning than with either loss alone.

5. Conclusions

In this paper, we present a simple and efficient end-to-end framework, called CMIN, designed for UAV localization. Within the CMIN framework, we leverage a vanilla VSS block as the foundation to build the SFEM, enabling the extraction of shared features from UAV and satellite images. Additionally, we design an LCAM with a local attention mechanism to enhance cross-layer feature interaction. To further improve performance, we introduce a Center Masking method for UAV image augmentation, allowing the model to capture richer contextual details. Extensive experiments validate the superior performance of CMIN. Despite its effectiveness, this study is limited by the dataset, which primarily includes data from urban areas with top-down views, lacking coverage of oblique perspectives, partial occlusions, mountainous terrain, and suburban regions. In future work, we aim to expand the dataset to encompass a broader range of viewing conditions and environmental scenarios, enabling a more comprehensive evaluation of model robustness under complex real-world conditions. Additionally, we will further optimize the structure of the Mamba model to achieve a better balance between localization accuracy and computational efficiency. We are confident that, through these ongoing efforts, we can make a valuable contribution to UAV localization.

Author Contributions

Conceptualization, L.T. and Q.S.; Methodology, L.T.; Software, L.T. and Y.G.; Validation, Y.L.; Writing—original draft, L.T.; Writing—review & editing, Z.D.; Visualization, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation under Grants 62406051, 62403055, and 62372077, in part by the Open Foundation of Key Laboratory of Computing Power Network and Information Security under Grant SKLCN-2023-08, and in part by the National Key Laboratory Fund Project under Grant 614260124030209.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Ibraiwish, H.; Eltokhey, M.W.; Alouini, M. UAV-Assisted VLC Using LED-Based Grow Lights in Precision Agriculture Systems. IEEE Internet Things Mag. 2024, 7, 100–105. [Google Scholar] [CrossRef]
  2. Hu, Z.; Fan, S.; Li, Y.; Tang, Q.; Bao, L.; Zhang, S.; Sarsen, G.; Guo, R.; Wang, L.; Zhang, N.; et al. Estimating Stratified Biomass in Cotton Fields Using UAV Multispectral Remote Sensing and Machine Learning. Drones 2025, 9, 186. [Google Scholar] [CrossRef]
  3. Huang, D.; Wang, Y.; Li, H. Study on UAV Inspection Safety Distance of Substation High-Voltage and Current-Carrying Equipment Based on Power-Frequency Magnetic Field. IEEE Trans. Instrum. Meas. 2024, 73, 3539008. [Google Scholar] [CrossRef]
  4. Xu, J.; Fan, X.; Jian, H.; Xu, C.; Bei, W.; Ge, Q.; Zhao, T. YoloOW: A Spatial Scale Adaptive Real-Time Object Detection Neural Network for Open Water Search and Rescue From UAV Aerial Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5623115. [Google Scholar] [CrossRef]
  5. Gaigalas, J.; Perkauskas, L.; Gricius, H.; Kanapickas, T.; Kriščiūnas, A. A Framework for Autonomous UAV Navigation Based on Monocular Depth Estimation. Drones 2025, 9, 236. [Google Scholar] [CrossRef]
  6. Kramarić, L.; Jelušić, N.; Radišić, T.; Muštra, M. A Comprehensive Survey on Short-Distance Localization of UAVs. Drones 2025, 9, 188. [Google Scholar] [CrossRef]
  7. Lin, J.; Zheng, Z.; Zhong, Z.; Luo, Z.; Li, S.; Yang, Y.; Sebe, N. Joint Representation Learning and Keypoint Detection for Cross-View Geo-Localization. IEEE Trans. Image Process. 2022, 31, 3780–3792. [Google Scholar] [CrossRef]
  8. Liang, Y.; Wu, X. Do Keypoints Contain Crucial Information? Mining Keypoint Information to Enhance Cross-View Geo-Localization. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Niagara Falls, ON, Canada, 15–19 July 2024; pp. 1–6. [Google Scholar]
  9. Li, Q.; Yang, X.; Fan, J.; Lu, R.; Tang, B.; Wang, S.; Su, S. GeoFormer: An Effective Transformer-Based Siamese Network for UAV Geolocalization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 9470–9491. [Google Scholar] [CrossRef]
  10. Bellavia, F. SIFT Matching by Context Exposed. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 2445–2457. [Google Scholar] [CrossRef]
  11. Ding, L.; Zhou, J.; Meng, L.; Long, Z. A Practical Cross-View Image Matching Method between UAV and Satellite for UAV-Based Geo-Localization. Remote Sens. 2021, 13, 47. [Google Scholar] [CrossRef]
  12. Gong, N.; Li, L.; Sha, J.; Sun, X.; Huang, Q. A Satellite-Drone Image Cross-View Geolocalization Method Based on Multi-Scale Information and Dual-Channel Attention Mechanism. Remote Sens. 2024, 16, 941. [Google Scholar] [CrossRef]
  13. Tian, X.; Shao, J.; Ouyang, D.; Shen, H.T. UAV-Satellite View Synthesis for Cross-View Geo-Localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4804–4815. [Google Scholar] [CrossRef]
  14. Dai, M.; Hu, J.; Zhuang, J.; Zheng, E. A Transformer-Based Feature Segmentation and Region Alignment Method for UAV-View Geo-Localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 4376–4389. [Google Scholar] [CrossRef]
  15. Wang, Z.; Shi, D.; Qiu, C.; Jin, S.; Li, T.; Shi, Y.; Liu, Z.; Qiao, Z. Sequence Matching for Image-Based UAV-to-Satellite Geolocalization. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5607815. [Google Scholar] [CrossRef]
  16. Dai, M.; Chen, J.; Lu, Y.; Hao, W.; Zheng, E. Finding Point with Image: An End-to-End Benchmark for Vision-based UAV Localization. arXiv 2022, arXiv:2208.06561. [Google Scholar]
  17. Wang, G.; Chen, J.; Dai, M.; Zheng, E. WAMF-FPI: A Weight-Adaptive Multi-Feature Fusion Network for UAV Localization. Remote Sens. 2023, 15, 910. [Google Scholar] [CrossRef]
  18. Chen, J.; Zheng, E.; Dai, M.; Chen, Y.; Lu, Y. OS-FPI: A Coarse-to-Fine One-Stream Network for UAV Geolocalization. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 7852–7866. [Google Scholar] [CrossRef]
  19. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  20. Zou, Z.; Yu, H.; Huang, J.; Zhao, F. FreqMamba: Viewing Mamba from a Frequency Perspective for Image Deraining. In Proceedings of the MM’24: The 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 1905–1914. [Google Scholar]
  21. Guo, H.; Li, J.; Dai, T.; Ouyang, Z.; Ren, X.; Xia, S. MambaIR: A Simple Baseline for Image Restoration with State-Space Model. In Proceedings of the ECCV, Milan, Italy, 29 September–4 October 2024; pp. 222–241. [Google Scholar]
  22. Zamir, A.R.; Shah, M. Image Geo-Localization Based on Multiple Nearest Neighbor Feature Matching Using Generalized Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1546–1558. [Google Scholar] [CrossRef]
  23. Kim, H.J.; Dunn, E.; Frahm, J. Learned Contextual Feature Reweighting for Image Geo-Localization. In Proceedings of the CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 3251–3260. [Google Scholar]
  24. Cai, S.; Guo, Y.; Khan, S.H.; Hu, J.; Wen, G. Ground-to-Aerial Image Geo-Localization With a Hard Exemplar Reweighting Triplet Loss. In Proceedings of the ICCV, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8390–8399. [Google Scholar]
  25. Zeng, Z.; Wang, Z.; Yang, F.; Satoh, S. Geo-Localization via Ground-to-Satellite Cross-View Image Retrieval. IEEE Trans. Multim. 2023, 25, 2176–2188. [Google Scholar] [CrossRef]
  26. Shi, Y.; Liu, L.; Yu, X.; Li, H. Spatial-Aware Feature Aggregation for Image based Cross-View Geo-Localization. In Proceedings of the NeurIPS, Vancouver, BC, Canada, 8–14 December 2019; pp. 10090–10100. [Google Scholar]
  27. Zhu, Y.; Sun, B.; Lu, X.; Jia, S. Geographic Semantic Network for Cross-View Image Geo-Localization. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4704315. [Google Scholar] [CrossRef]
  28. Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. In Proceedings of the CVPR, New Orleans, LA, USA, 18–24 June 2022; pp. 1152–1161. [Google Scholar]
  29. Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers. In Proceedings of the NeurIPS, Virtual, 6–14 December 2021; pp. 572–585. [Google Scholar]
  30. Smith, J.T.H.; Warrington, A.; Linderman, S.W. Simplified State Space Layers for Sequence Modeling. In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  31. Ma, X.; Zhou, C.; Kong, X.; He, J.; Gui, L.; Neubig, G.; May, J.; Zettlemoyer, L. Mega: Moving Average Equipped Gated Attention. In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  32. Fu, D.Y.; Epstein, E.L.; Nguyen, E.; Thomas, A.W.; Zhang, M.; Dao, T.; Rudra, A.; Ré, C. Simple Hardware-Efficient Long Convolutions for Sequence Modeling. In Proceedings of the ICML, Honolulu, HI, USA, 23–29 July 2023; pp. 10373–10391. [Google Scholar]
  33. Li, Y.; Cai, T.; Zhang, Y.; Chen, D.; Dey, D. What Makes Convolutional Models Great on Long Sequence Modeling? In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  34. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar]
  35. He, Y.; Tu, B.; Jiang, P.; Liu, B.; Li, J.; Plaza, A. IGroupSS-Mamba: Interval Group Spatial-Spectral Mamba for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5538817. [Google Scholar] [CrossRef]
  36. Yang, A.; Li, M.; Ding, Y.; Fang, L.; Cai, Y.; He, Y. GraphMamba: An Efficient Graph Structure Learning Vision Mamba for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5537414. [Google Scholar] [CrossRef]
  37. Ding, H.; Xia, B.; Liu, W.; Zhang, Z.; Zhang, J.; Wang, X.; Xu, S. A Novel Mamba Architecture with a Semantic Transformer for Efficient Real-Time Remote Sensing Semantic Segmentation. Remote Sens. 2024, 16, 2620. [Google Scholar] [CrossRef]
  38. Yang, Y.; Ma, C.; Yao, J.; Zhong, Z.; Zhang, Y.; Wang, Y. ReMamber: Referring Image Segmentation with Mamba Twister. In Proceedings of the ECCV, MiCo Milano, Italy, 29 September–4 October 2024; pp. 108–126. [Google Scholar]
  39. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  40. Zhao, Y.; Luo, C.; Zha, Z.; Zeng, W. Multi-Scale Group Transformer for Long Sequence Modeling in Speech Separation. In Proceedings of the IJCAI, Yokohama, Japan, 11–17 July 2020; pp. 3251–3257. [Google Scholar]
  41. Li, Z.; Liu, X.; Drenkow, N.; Ding, A.S.; Creighton, F.X.; Taylor, R.H.; Unberath, M. Revisiting Stereo Depth Estimation From a Sequence-to-Sequence Perspective with Transformers. In Proceedings of the ICCV, Virtual, 11–17 October 2021; pp. 6177–6186. [Google Scholar]
  42. Xu, W.; Yao, Y.; Cao, J.; Wei, Z.; Liu, C.; Wang, J.; Peng, M. UAV-VisLoc: A Large-scale Dataset for UAV Visual Localization. arXiv 2024, arXiv:2405.11936. [Google Scholar]
  43. Deng, J.; Dong, W.; Socher, R.; Li, L.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the CVPR, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  44. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the ICLR, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  45. Wang, T.; Zheng, Z.; Yan, C.; Zhang, J.; Sun, Y.; Zheng, B.; Yang, Y. Each Part Matters: Local Patterns Facilitate Cross-View Geo-Localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 867–879. [Google Scholar] [CrossRef]
  46. Shen, T.; Wei, Y.; Kang, L.; Wan, S.; Yang, Y. MCCG: A ConvNeXt-Based Multiple-Classifier Method for Cross-View Geo-Localization. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 1456–1468. [Google Scholar] [CrossRef]
  47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Washington, DC, USA, 2016; pp. 770–778. [Google Scholar] [CrossRef]
  48. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, 3–7 May 2021; OpenReview.net: Alameda, CA, USA, 2021. [Google Scholar]
  49. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. arXiv 2024, arXiv:2403.09338. [Google Scholar]
Figure 1. Overview of the proposed CMIN, which consists of the backbone, neck, and head. The backbone uses a pyramid structure containing four stages, where in each stage, the Siamese Feature Extraction Module (SFEM) and the Local Cross-Attention Module (LCAM) are alternately stacked for feature extraction and interaction, respectively. In the neck, we use a fusion module (FM) to progressively aggregate features, which are then used to regress the heatmap in the head.
Figure 2. The structure of SFEM. (a) Each SFEM contains two parameter-shared Mamba blocks. (b) Structure of the Mamba block, following the visual state-space (VSS) design [19]. The input splits into two branches: Branch A generates a modulation signal via a linear layer, while Branch B models spatial dependencies through a linear layer, depth-wise 3 × 3 convolution, and the SS2D module. Outputs are fused by element-wise multiplication and a residual connection. (c) In the SS2D operation, the input image is first divided into patches and flattened along four scanning paths. These sequences are processed separately by distinct S6 blocks, and the outputs are merged to form the final 2D feature map.
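As a rough illustration of the two-branch design in Figure 2b, here is a minimal PyTorch sketch. The SS2D selective scan is replaced by an identity placeholder (the four-path S6 scan of Figure 2c is beyond this snippet), and the layer sizes and activation choices are assumptions rather than the paper's exact configuration.

    import torch
    import torch.nn as nn

    class VSSBlockSketch(nn.Module):
        """Two-branch Mamba block in the spirit of Figure 2b (illustrative only)."""
        def __init__(self, dim: int):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            self.proj_a = nn.Linear(dim, dim)   # Branch A: modulation signal
            self.proj_b = nn.Linear(dim, dim)   # Branch B: spatial modelling
            self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depth-wise 3x3
            self.act = nn.SiLU()
            self.proj_out = nn.Linear(dim, dim)

        def ss2d(self, x):
            # Placeholder for the SS2D module (four scanning paths + S6 blocks, Figure 2c).
            return x

        def forward(self, x):                       # x: (B, H, W, C)
            residual = x
            x = self.norm(x)
            a = self.act(self.proj_a(x))            # modulation branch
            b = self.proj_b(x).permute(0, 3, 1, 2)  # to (B, C, H, W) for the conv
            b = self.act(self.dwconv(b)).permute(0, 2, 3, 1)
            b = self.ss2d(b)                        # spatial dependencies via selective scan
            return residual + self.proj_out(a * b)  # element-wise gating + residual connection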
Figure 3. The flow of LCAM. Satellite features are treated as the moving input to be aligned, while UAV features serve as the fixed reference. This asymmetric design allows the model to query satellite features within a stable UAV feature space, effectively capturing location-sensitive correspondences. Cross-attention is computed using the satellite query (Q_s) and the UAV key (K_u) and value (V_u) to enable Mamba-based feature interaction.
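The asymmetric query/key/value assignment of Figure 3 can be written compactly as standard cross-attention. The sketch below uses a plain (global) scaled dot-product attention and omits the local windowing and Mamba-specific parts of LCAM, so the projections and shapes are illustrative assumptions, not the paper's module.

    import torch.nn as nn
    import torch.nn.functional as F

    class CrossAttentionSketch(nn.Module):
        """Satellite tokens query UAV tokens: Q from satellite, K/V from UAV."""
        def __init__(self, dim: int):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, 2 * dim)
            self.scale = dim ** -0.5

        def forward(self, sat_tokens, uav_tokens):       # (B, Ns, C), (B, Nu, C)
            q = self.q(sat_tokens)                        # Q_s
            k, v = self.kv(uav_tokens).chunk(2, dim=-1)   # K_u, V_u
            attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
            return attn @ v                               # satellite features aligned to the UAV reference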
Figure 4. The processes of the RSC and CM data augmentation methods. (a) In RSC, a rectangular region I_s1 with width W_s × C and height H_s × C is created, centered on a green pentagram. Then, a blue circle is selected as the center, and a rectangular region I_s2 with both width and height equal to S is created. The pixels in the red region are set to 0 to fill in the missing content. (b) In CM, a circular mask with a radius of R is applied to cover the central region, resulting in the augmented image I_u1.
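The CM augmentation in Figure 4b amounts to zeroing a central disc of the UAV image. A minimal sketch follows, where the mask value (0) and the radius convention (a fraction of the image width, cf. Table 6) come from the caption, while everything else is an assumption.

    import torch

    def center_mask(img: torch.Tensor, r_ratio: float = 0.2) -> torch.Tensor:
        """Zero out a central circular region of a float UAV image tensor (C, H, W)."""
        _, h, w = img.shape
        radius = r_ratio * w                                   # R as a proportion of image width
        ys = torch.arange(h).float().view(-1, 1)
        xs = torch.arange(w).float().view(1, -1)
        dist = ((ys - h / 2) ** 2 + (xs - w / 2) ** 2).sqrt()  # distance to the image center
        mask = (dist > radius).to(img.dtype)                   # 0 inside the circle, 1 outside
        return img * mask.unsqueeze(0)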
Figure 5. The optimization objective of the proposed CMIN. We generate y_1 using I_s2 and I_u, and generate y_2 using I_s2 and I_u1. In each iteration, we calculate L_wbl using ŷ and y_1, and calculate L_mse using y_1 and y_2. By minimizing the combination of L_wbl and L_mse, the parameters of CMIN are optimized during each iteration.
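One possible reading of the objective in Figure 5 is a weighted sum of a supervised heatmap loss on y_1 and an MSE consistency term between y_1 and y_2. The sketch below uses binary cross-entropy as a stand-in for the paper's L_wbl and assumes the combination L_wbl + α·L_mse (with α as in Table 7); both choices are assumptions rather than the exact formulation.

    import torch
    import torch.nn.functional as F

    def training_objective(y1, y2, y_true, alpha: float = 0.3):
        """Supervised term on y1 plus a consistency term between y1 and y2.
        BCE stands in for the paper's weighted balance loss (L_wbl)."""
        l_wbl = F.binary_cross_entropy_with_logits(y1, y_true)       # heatmap supervision
        l_mse = F.mse_loss(torch.sigmoid(y1), torch.sigmoid(y2))     # masked/unmasked consistency
        return l_wbl + alpha * l_mse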
Figure 6. Visualization of localization results by different methods on the UL14 dataset. WAMF [17] and FPI [16] are included as reference methods for comparison. The center of the red circle represents the true position of the UAV image.
Figure 7. The impact of satellite image size on localization accuracy. FPI [16] and WAMF [17] are used as references.
Figure 8. Visualization of localization results from different trained models. From top to bottom: model optimized with L_wbl only, model optimized with L_mse only, and model optimized with both L_wbl and L_mse. The center of the red circle represents the true position of the UAV image.
Table 1. Comparison of the proposed method with state-of-the-art methods on the UL14 dataset. Bold/underline represent the best/second-best performance, respectively.
Category             Method          RDS       MA@3     MA@10    MA@20
Image retrieval      LCM [11]        -         0.014    0.112    0.250
                     LPN [45]        -         0.015    0.158    0.273
                     RKNet [7]       -         0.021    0.177    0.317
                     MCCG [46]       -         0.078    0.574    0.752
                     FSRA [14]       -         0.079    0.580    0.743
Heatmap regression   FPI [16]        57.22%    -        0.384    0.577
                     WAMF [17]       65.33%    0.125    0.526    0.697
                     OS-FPI* [18]    66.22%    0.157    0.576    0.706
                     OS-FPI [18]     76.25%    0.228    0.723    0.825
                     Our CMIN-T      75.23%    0.177    0.661    0.818
                     Our CMIN-S      76.74%    0.199    0.701    0.837
                     Our CMIN-B      77.52%    0.204    0.713    0.850
Table 2. Comparison with state-of-the-art methods on the UAV-VisLoc dataset.
Method          RDS       MA@3      MA@10     MA@20
LPN [45]        -         0.0009    0.0116    0.0318
FSRA [14]       -         0.0011    0.0135    0.0358
FPI [16]        15.82%    0.0012    0.0152    0.0447
WAMF [17]       16.23%    0.0011    0.0157    0.0464
OS-FPI* [18]    17.14%    0.0019    0.0194    0.0457
OS-FPI [18]     18.21%    0.0025    0.0230    0.0829
Our CMIN-T      20.01%    0.0072    0.0548    0.0938
Our CMIN-S      31.09%    0.0860    0.2234    0.2652
Our CMIN-B      35.45%    0.0921    0.2280    0.3053
Table 3. Comparison of different CMIN variants in terms of computational cost, number of parameters, and positioning accuracy.
Method    GFLOPs    Speed        Parameters    RDS       MA@5     MA@30
CMIN-T    20.9 G    95.48 FPS    34.8 M        75.23%    0.352    0.847
CMIN-S    34.5 G    49.24 FPS    50.2 M        76.74%    0.388    0.860
CMIN-B    57.8 G    40.59 FPS    88.5 M        77.52%    0.396    0.872
Table 4. Comparison of different network architectures. (a) represents the case where several SFEMs are cascaded in each stage, followed by an LCAM. (b) represents the alternating connection of SFEM and LCAM used in our CMIN.
Method    RDS       MA@3     MA@5     MA@20    MA@30
(a)       70.07%    0.117    0.258    0.796    0.798
(b)       75.23%    0.177    0.352    0.818    0.847
Table 5. Comparison of different backbones.
Method           RDS       MA@3     MA@10    MA@20
ResNet [47]      65.15%    0.112    0.512    0.662
ViT-S [48]       70.42%    0.155    0.605    0.754
ViT-B [48]       72.19%    0.163    0.648    0.799
VSS [19]         75.23%    0.177    0.661    0.818
LocalVim [49]    78.67%    0.182    0.716    0.863
Table 6. The impact of the mask radius on positioning accuracy. R is the radius of the circular mask, defined as the proportion of the UAV image’s width.
R      RDS       MA@3     MA@5     MA@10    MA@20
0%     72.49%    0.147    0.309    0.602    0.786
10%    74.24%    0.167    0.339    0.647    0.807
20%    75.23%    0.177    0.352    0.661    0.818
30%    74.50%    0.171    0.344    0.644    0.809
Table 7. The impact of the loss weight α on positioning accuracy. Bold denotes the best performance.
α      RDS       MA@3     MA@5     MA@10    MA@20
0.2    75.03%    0.173    0.347    0.648    0.821
0.3    75.23%    0.177    0.352    0.661    0.818
0.4    74.83%    0.172    0.344    0.648    0.814
0.5    74.64%    0.172    0.342    0.644    0.812
0.6    74.24%    0.166    0.329    0.632    0.808
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
