Article

Balancing Precision and Efficiency: Cross-View Geo-Localization with Efficient State Space Models

1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
2 University of Chinese Academy of Sciences, Beijing 101408, China
* Author to whom correspondence should be addressed.
AI 2026, 7(4), 118; https://doi.org/10.3390/ai7040118
Submission received: 30 January 2026 / Revised: 15 March 2026 / Accepted: 18 March 2026 / Published: 30 March 2026
(This article belongs to the Special Issue Recent Advances in Deep Learning and Emerging Applications)

Abstract

Cross-view geo-localization aims to match a ground-level query image to its corresponding location in large-scale satellite or aerial imagery, with applications in autonomous driving, drone navigation, and augmented reality in urban scenes. However, traditional hybrid CNN-Transformer architectures and complex geometric submodules incur a large computational overhead, making real-time deployment on resource-constrained devices difficult. To obtain a model that is light, fast, and accurate, this paper proposes an efficient state-space model for cross-view geo-localization. The model replaces the conventional self-attention structure with a state-space vision backbone, lowering the sequence-modeling complexity from quadratic to linear and greatly accelerating inference; it devises a channel-group aggregation strategy without any learnable parameters, producing a comprehensive yet lightweight representation; and it introduces a dynamic difficulty-aware loss function that assigns varying weights to all negative samples within a batch according to their similarities, greatly improving the efficiency of hard-negative mining and the quality of convergence. Results on the widely used public datasets CVUSA and CVACT show that our method achieves high accuracy with low computational complexity, providing a feasible approach for the lightweight design of more powerful cross-view geo-localization models in the future.

1. Introduction

Cross-view image geolocation determines the location of a query image with unknown coordinates by matching it against reference images with known coordinates captured from a different viewpoint. It has broad application prospects in civilian and military fields, is a core task for autonomous driving [1] and outdoor robotics [2], and can also locate targets in uninhabited areas, saving resources.
The development of deep learning has provided important data, model, and algorithm support for remote sensing image analysis and has driven significant progress in cross-view geolocation methods [3]. CNNs are widely used in object detection and recognition and can also estimate the distance and speed of moving targets [4]. For example, GeoNet, which builds on CNN-extracted features, introduces capsule networks to model the spatial hierarchy of scene objects and thereby enhance the robustness of cross-view image matching [5]. Most deep learning-based geolocation methods use convolutional neural networks (CNNs) to extract image features and then estimate the location by matching the visual features of the query image against those of satellite images. However, CNN-based methods have shortcomings. CNNs are relatively weak at capturing contextual information, which can leave global relationships undermodeled in cross-view geolocation tasks. At the same time, pooling and convolution operations reduce image resolution and can destroy the discriminative fine-grained information within it [6].
In recent years, Transformers have been successfully applied to many computer vision tasks. Their strong context-modeling ability compensates for the shortcomings of CNNs. The prevailing Transformer-based approach to cross-view geo-localization uses a Transformer encoder as the feature-extraction backbone, improving the ability to capture contextual features. Some methods use ViT as the backbone to extract context-aware information that better adapts to image structure [7].
Existing cross-view localization models have achieved relatively good results on public datasets, and GeoDTR is one of the representative works in the field [8]. It addresses drastic viewpoint changes by explicitly decoupling geometric information, employing a hybrid ResNet-Transformer architecture, and designing a complex geometric layout extractor along with a counterfactual learning strategy. However, in-depth analysis reveals that GeoDTR and contemporary Transformer-CNN methods still face significant limitations in efficiency and optimization strategy [9].
Specifically, to achieve extreme matching accuracy, existing methods often rely on complex geometric decoupling modules and dense feature interaction mechanisms. While effective, these designs introduce substantial computational overhead, making real-time deployment on resource-constrained edge devices challenging. As the field moves toward practical applications, lightweight design and efficient deployment have become central to cross-view geolocation research. The goal is no longer limited to maintaining high accuracy, but also to achieve an optimal balance among inference speed, memory footprint, and energy consumption. Although general lightweight techniques such as pruning, quantization, and knowledge distillation have proven successful in many computer vision tasks, their direct application to cross-view geolocation often fails to reconcile two conflicting demands: preserving the geometric alignment capacity required for large viewpoint differences, and retaining fine-grained feature details. Consequently, designing a dedicated architecture that simultaneously supports strong cross-view representation and meets real-time lightweight constraints has become a pressing research problem [10,11].
In this paper, we draw on recent developments in NLP, believing that a strong architectural paradigm can transfer to vision tasks. Image patches exhibit sequence dependencies similar to those of text tokens, and the feature maps produced by convolutional kernels show semantic correlation: adjacent or grouped channels often represent similar visual features. In this feature space, negative samples near the decision boundary that are highly similar to positive samples are harder to separate, indicating that they should be given more importance during training.
Motivated by these observations, we propose a deep learning framework tailored for cross-view geolocation that balances accuracy and efficiency. Our contributions are as follows:
We introduce Vision Mamba into cross-view geolocation, offering a novel and efficient architectural solution for this task.
We design a grouped pooling strategy for feature aggregation, which avoids the information loss caused by global average pooling, preserves rich semantic information across channels, and maintains a lightweight structure.
We design a dynamically weighted batch-level loss function that exploits the information in all negative samples within each batch, accelerating convergence and improving the model's stability in fine-grained matching.
This article is organized as follows: Section 2 reviews the development of cross-view geolocation models. Section 3 presents the encoding framework and our proposed components. Section 4 describes the experimental setup and results. Section 5 gives the conclusion and a summary. Section 6 is the Discussion, which analyzes the findings in the context of this study and related research.

2. Related Work

Early research on cross-view geolocation relied mainly on hand-crafted features and geometric constraints. For example, descriptors such as SIFT and HOG were used to establish cross-view correspondences via orientation cues or spatial relations. Although effective under controlled conditions, these hand-crafted representations often prove unstable under the substantial viewpoint and scale changes between ground, drone, and satellite images. This fragility fundamentally limits their precision and generalizability in complex real-world environments.
Deep learning transformed cross-view geolocation, with researchers turning to CNNs for feature learning. Workman et al. were among the first to apply CNNs to ground-to-satellite image matching, showing that deep features work for this purpose [12]. Subsequent public benchmarks such as CVUSA and CVACT accelerated progress. Since then, Siamese or dual-branch architectures, usually built on CNN backbones with shared or unshared weights and trained end-to-end with contrastive or triplet losses, have become the de facto standard [13,14].
To address the spatial misalignment caused by viewpoint changes, researchers have explored architectural improvements and geometric modeling. One line of work reduces cross-view differences by transforming ground panoramas into approximate bird's-eye views using polar coordinate transforms or explicit viewpoint alignment [15]. Another direction incorporates orientation-aware mechanisms and spatial attention to focus on discriminative regions. SAFA (Spatial-Aware Feature Aggregation) is a representative work that learns spatial weight maps to adaptively aggregate CNN features, improving localization without strong geometric assumptions [16]. Multi-scale fusion and attention mechanisms have also been adopted to enrich features. Even with these improvements, however, most CNN-based methods remain fundamentally limited by small local receptive fields, which prevent them from capturing global context across views [17].
Transformers have made significant progress in computer vision in recent years. Built on self-attention, they have a natural advantage in modeling long-range dependencies and global context. Inspired by this, researchers began incorporating Transformers into cross-view geo-localization [18,19,20,21]. Some use ViT as the backbone, splitting images into patches for global modeling, while others use hybrid CNN-Transformer models to combine local texture encoding with contextual reasoning. Experimental results show that Transformer-based methods improve localization accuracy on several benchmarks, confirming their effectiveness for this task [7,22]. However, Transformer models usually have many parameters, demand substantial computing power, and require large amounts of training data, making them hard to deploy in practice [23].
Building on these advances, GeoDTR proposes a framework that explicitly separates geometric and semantic information, marking considerable progress in this area. It uses a hybrid ResNet-Transformer network together with a dedicated module that models spatial layout across views, which helps reduce view-specific bias [8]. GeoDTR scored well on many benchmarks and advanced geometric modeling in the field, but its network design and training procedure are complicated, leading to high computational cost. Moreover, its performance depends heavily on the choice of feature aggregation and optimization approaches, limiting its efficiency and scalability [24].
Vision Mamba is a new direction in visual recognition. The main idea is to apply state space models (SSMs), especially the Mamba model that has performed well in NLP, to vision tasks, addressing the computational and memory problems that current models face with high-resolution images or long sequences. Vim (Vision Mamba) was the first to show that SSMs can work for vision, but its simple row- and column-wise scanning can break the natural 2D adjacency of pixels, so it struggles to represent fine-grained local detail [11,25].
In summary, cross-view geolocation has progressed from traditional feature-based techniques to CNN-based deep learning and, more recently, to Transformer-integrated approaches with explicit geometric reasoning, yielding consistent gains in accuracy. Yet problems remain: models are overly complex and too expensive for edge deployment, fine details are lost during feature aggregation, and most loss functions underutilize the hard negatives within a batch [7,8,16,26,27,28]. Balancing accuracy, efficiency, and optimization stability is also difficult; here, the careful adoption of Vision Mamba offers a potential solution [3,27,29].

3. Methods

To address the aforementioned bottlenecks, this paper introduces GeoSSM, a unified framework designed to jointly tackle issues of architectural efficiency, feature aggregation, and loss formulation. Specifically, GeoSSM integrates three key components: (1) a state-space visual backbone that replaces heavy self-attention with more efficient sequence modeling; (2) a parameter-free channel group aggregation (CGA) module for lightweight global feature representation; and (3) a dynamic difficulty-aware loss (DDAL) that enhances the utilization of hard-negative samples at the batch level.
Figure 1 presents the overall feature-encoding network structure of the proposed GeoSSM cross-view geolocation framework, summarizing the pipeline from task input to grouped pooling. The encoder first passes the input RGB images (ground images of [B,3,128,512] and satellite images of [B,3,256,256]) through the Patch Embedding layer, which segments each image into patches and projects each patch into a 384-dimensional space, yielding a patch-feature sequence of [B,256,384]. The patch features then undergo a first normalization (Norm) to standardize the feature distribution and stabilize subsequent training. A learnable CLS token (shape [B,1,384]) is inserted in the middle of the sequence via a torch.cat concatenation, extending the sequence length from 256 to 257 and producing a feature sequence of [B,257,384]. Absolute positional encoding is then added to provide positional information for each patch and the CLS token, enabling the model to capture spatial relationships between patches. The sequence is next fed into the state-space vision backbone for deep feature extraction, which includes normalization, the core Mamba operations, and residual connections, followed by a final normalization of the [B,257,384] sequence. Finally, the grouped-pooling readout module splits the 384-dimensional features into 24 groups (16 dimensions per group), computes the mean of each group, and flattens the result, compressing the [B,257,384] sequence into a [B,6144] global feature vector that serves as the final output of the visual backbone for subsequent cross-view alignment.

3.1. Overall Framework

GeoSSM employs a dual-branch design that encodes satellite-view and ground-view images separately. Each branch has three parts: (i) a state-space visual backbone that produces multi-level feature maps; (ii) a channel-group aggregation module that compresses the feature maps into fixed-length global descriptors; and (iii) L2 normalization for retrieval and metric learning. During training, DDAL adjusts the weights of positive and negative samples within each batch as part of the optimization.
The data flow starts from the input of ground images and satellite images. The images are divided into 256 patches through Patch Embedding and projected to 384 dimensions to obtain [B,256,384]. Then a CLS Token is inserted in the middle of the sequence, and position encoding is added to form [B,257,384]. Next, the sequence goes through the State Space Model (SSM) visual backbone module for deep feature extraction, and, after normalization, outputs sequence features of [B,257,384]. Subsequently, the features enter the Channel Group Aggregation (CGA) module, where the 384-dimensional channels are divided into 24 groups with 16 dimensions each, reshaped to [B,257,24,16]. Each group undergoes average pooling to obtain [B,257,24], which is finally flattened to a compact global feature of [B,6144]. Finally, the features enter the dynamic difficulty-aware loss (DDAL) module. The [B,6144] features of two views are L2-normalized and used to calculate a [B,B] cosine similarity matrix. Positive samples on the diagonal and negative off-diagonal samples are separated. Softmax is applied to dynamically assign weights to negative sample similarities (samples with higher similarity get higher weights) to compute a weighted contrastive loss. Gradients are backpropagated from DDAL through CGA to SSM, enabling end-to-end parameter updates.
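The input stage described above can be traced with a minimal shape-level sketch. The Conv2d-based patch embedding and the 16 × 16 patch size are assumptions (consistent with the stated 256-patch count for a 128 × 512 input); the mid-sequence CLS insertion follows the description:

```python
import torch
import torch.nn as nn

B, D = 2, 384
ground = torch.randn(B, 3, 128, 512)                      # ground panorama batch

# Patch embedding: 16x16 patches projected to D dims -> (128/16)*(512/16) = 256 patches
patch_embed = nn.Conv2d(3, D, kernel_size=16, stride=16)
tokens = patch_embed(ground).flatten(2).transpose(1, 2)   # [B, 256, 384]

# Insert a learnable CLS token in the middle of the sequence
cls_token = nn.Parameter(torch.zeros(1, 1, D))
mid = tokens.shape[1] // 2
seq = torch.cat([tokens[:, :mid], cls_token.expand(B, -1, -1), tokens[:, mid:]], dim=1)

# Add absolute positional encoding for all 257 positions
pos_embed = nn.Parameter(torch.zeros(1, seq.shape[1], D))
seq = seq + pos_embed
print(seq.shape)  # torch.Size([2, 257, 384])
```

The same pipeline applies to the satellite branch, with its own patch grid producing the reference-side sequence.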

3.2. State Space Vision Backbone (SSM)

To mitigate the scalability limitations imposed by the quadratic complexity, O(N²), of self-attention, we adopt state space models (SSMs) for sequence modeling. The core intuition behind SSMs is to compress and propagate information through recursive states, which enables efficient modeling of long-range dependencies while reducing both computational and memory costs to approximately linear, O(N). In our implementation, we build the visual encoder on the Vision Mamba architecture: 2D feature maps are first flattened into a sequence for state-space updates, then remapped back to spatial representations. Compared to hybrid ResNet-Transformer designs, this backbone eliminates explicit self-attention computations, making it particularly suitable for high-resolution inputs and large-scale retrieval scenarios.
From a complexity standpoint, GeoSSM achieves O(N) linear scaling, as opposed to the O(N²) complexity of GeoDTR's self-attention, roughly doubling inference speed at the same input resolution. Furthermore, SSM-based encoders capture wide-range global geometric context more readily than CNNs, addressing one of GeoDTR's main issues: its reliance on an explicit geometric extractor to connect local CNN features with overall spatial awareness.
Figure 2 displays the complete data flow of the Mamba SSM module from input to output.
The data flow of Mamba SSM starts with input features, which first go through an input projection to expand the dimensions and then through the feature extraction stage. This stage includes a split operation that divides the features into the SSM branch x and the gated branch z. The gated branch is used to control the flow of information and selectively modulate the final output. Next, the SSM branch undergoes causal convolution and activation to obtain features, which then enter the dynamic parameter generation stage. In this stage, features are expanded through linear projection and split into time-step parameters, input matrices, and output matrices. Parameter transformations are then performed: the time-step parameters undergo linear projection and activation to ensure positivity, and the learnable state matrix is exponentiated to obtain a negative state matrix. After that, the process enters the Selective Scan core computation, where the output is calculated using the state-space modeling formula by combining the input, time-step parameters, state matrix, input matrix, output matrix, skip connection parameters, and gating signal. Finally, the output of the Selective Scan is element-wise multiplied with the gated branch to achieve selective information flow. The features are then projected back to the original dimensions through the output projection, completing the entire computational process of Mamba SSM.
Unlike Transformers, which model the interaction between every pair of positions through global self-attention and thus incur O(N²) complexity, SSMs propagate context along the sequence dimension via recursive state updates. Computational and memory costs therefore grow linearly with sequence length, i.e., O(N), which is far more efficient for high-resolution inputs. In short, the SSM-based backbone attains a large receptive field at low computational cost, extracting salient image content without exhaustive pairwise comparisons. This efficient backbone thus strikes a good balance between efficiency and accuracy, retaining the capacity to model cross-view global semantics and providing high-quality feature representations for subsequent aggregation and matching.
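The linear-time recurrence behind this claim can be illustrated with a toy, non-selective SSM over a scalar input sequence. This is a didactic sketch only: Mamba's actual selective scan uses input-dependent parameters and a hardware-aware parallel kernel, neither of which is reproduced here.

```python
import torch

def ssm_scan(x, A, B, C):
    """Minimal (non-selective) state-space recurrence:
    h_t = A * h_{t-1} + B * x_t,   y_t = C . h_t.
    One fixed-size state update per step -> O(N) in sequence length N,
    versus O(N^2) pairwise interactions for self-attention.
    """
    L, d_state = x.shape[0], A.shape[0]
    h = torch.zeros(d_state)
    ys = []
    for t in range(L):
        h = A * h + B * x[t]          # compress history into a fixed-size state
        ys.append((C * h).sum())      # read out a scalar from the state
    return torch.stack(ys)

x = torch.randn(64)                   # toy 1-D input sequence
A = torch.full((16,), 0.9)            # per-dimension state decay (illustrative)
B = torch.randn(16)
C = torch.randn(16)
y = ssm_scan(x, A, B, C)
print(y.shape)  # torch.Size([64])
```

Because the state h has fixed size, memory stays constant as the sequence grows, which is the property that makes long patch sequences affordable.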

3.3. Channel Group Aggregation (CGA)

To address the issue that 'aggregation heads are either too complex (fully connected) or too simple (plain pooling),' GeoSSM designs Channel Group Aggregation (CGA). Given the backbone output feature map F ∈ ℝ^{C×H×W}, where C is the channel dimension and H and W are the spatial dimensions of the feature map, we divide it into G groups along the channel dimension, F = {F_1, …, F_G}, with each group having C/G channels. Spatial average pooling is performed on each group separately to obtain a vector, as shown in (1):
v_g = \mathrm{AvgPool}(F_g) \in \mathbb{R}^{C/G}
Then concatenate all groups to obtain the final descriptor, as in expression (2):
v = [v_1; \ldots; v_G]
CGA does not introduce trainable parameters, resulting in very low computational overhead; at the same time, it avoids excessive compression from a single global pooling by using grouping, preserving richer channel semantics.
As shown in Figure 3, Channel Group Aggregation (CGA) efficiently compresses the feature maps output by the backbone network into a global descriptor. Specifically, the input feature F ∈ ℝ^{H×W×C} is divided into G groups along the channel dimension, each corresponding to a subset of channel semantics; spatial average pooling is then applied to each group separately (group-wise average pooling) to obtain G low-dimensional vectors, which are concatenated in order to form the final flattened global descriptor. The process introduces no additional learnable parameters, so the computational overhead is minimal, while the grouping mechanism preserves the intra-group semantic structure and avoids the excessive compression caused by direct global pooling. Traditional aggregation heads (such as fully connected layers or Transformer heads) usually have many parameters and high computational cost, whereas CGA markedly improves the efficiency and deployability of feature aggregation while maintaining effective representational capability.
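A minimal sketch of the grouped readout follows, using the token-wise variant from Section 3.1 (384 channels split into 24 groups of 16, averaged per group, then flattened). Using 256 patch tokens reproduces the stated [B,6144] output (256 × 24 = 6144); how the CLS token is handled before flattening is an assumption of this sketch.

```python
import torch

def channel_group_aggregation(feats, groups=24):
    """Parameter-free CGA sketch: split the channel dimension into groups,
    average within each group, then flatten the per-token group means."""
    B, N, C = feats.shape
    grouped = feats.reshape(B, N, groups, C // groups)  # [B, N, 24, 16]
    pooled = grouped.mean(dim=-1)                       # [B, N, 24]
    return pooled.flatten(1)                            # [B, N * 24]

feats = torch.randn(2, 256, 384)   # 256 patch tokens (CLS handling omitted here)
desc = channel_group_aggregation(feats)
print(desc.shape)  # torch.Size([2, 6144])
```

Since the module has no trainable parameters, it adds essentially nothing to the parameter count reported later in Table 4.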

3.4. Dynamic Difficulty Awareness Loss (DDAL)

Anchor points are ground image feature vectors, positive samples are satellite image feature vectors that match the anchor points, and negative samples are satellite image feature vectors that do not match the anchor points.
To fully utilize the negative samples within a batch and focus on hard negatives, we propose DDAL. Suppose that, in a batch, the positive sample of an anchor a is p and the set of negative samples is N = {n_i}. We first compute the cosine similarity s_i = sim(a, n_i), then apply Softmax normalization over each anchor's negative-sample similarities to obtain dynamic weights, as shown in (3):
w_i = \frac{\exp(\gamma \cdot s_i)}{\sum_j \exp(\gamma \cdot s_j)}
The more similar a negative sample is, the more 'difficult' it is and the greater its weight. We then construct the batch-level contrastive objective, given in (4):
\mathcal{L}(a, p, N) = \log\left(1 + \sum_i w_i \cdot \exp\left(s_i - \mathrm{sim}(a, p) + m\right)\right)
This formulation can be viewed as a smooth generalization of the standard triplet loss: instead of depending on only the single hardest negative, it adaptively weights and accumulates the contributions of every negative in the batch, making training more stable and faster to converge. Early in training, DDAL automatically damps the gradient interference from the many easy negatives; as training progresses, it amplifies the influence of hard negatives, i.e., near-positive examples, improving the model's ability to distinguish small differences.
Figure 4 illustrates the proposed dynamic difficulty-aware loss (DDAL), which is designed to address the limitations of static weighting strategies in hard sample mining. Here, the anchor refers to the feature vector of a ground image, while positive samples correspond to satellite image features from the same geographic location and negative samples to those from different locations.
Given a set of query images and a set of reference images, DDAL builds an N × N similarity matrix. Unlike methods that treat all negatives equally, it introduces a dynamic weighting scheme: by analyzing the distribution of similarities between negatives and the anchor, it assigns larger gradient weights to hard negatives (those highly similar to the anchor) and smaller weights to easy ones. This ensures that GeoSSM keeps focusing on the hardest matching pairs during training, sharpening its ability to tell small differences apart.
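The batch-level computation above can be sketched directly from Equations (3) and (4). The values of the temperature γ and margin m below are illustrative assumptions, not the paper's tuned settings:

```python
import torch
import torch.nn.functional as F

def ddal(ground_feats, sat_feats, gamma=10.0, margin=0.1):
    """Dynamic difficulty-aware loss sketch: pair (i, i) is the positive,
    off-diagonal entries are negatives; harder negatives (higher similarity
    to the anchor) receive larger softmax weights."""
    g = F.normalize(ground_feats, dim=1)
    s = F.normalize(sat_feats, dim=1)
    sim = g @ s.t()                              # [B, B] cosine similarity matrix
    B = sim.shape[0]
    pos = sim.diagonal()                         # matching ground-satellite pairs
    diag = torch.eye(B, dtype=torch.bool)
    neg = sim.masked_fill(diag, float('-inf'))   # mask out positives
    w = torch.softmax(gamma * neg, dim=1)        # Eq. (3): dynamic difficulty weights
    # Eq. (4): log(1 + sum_i w_i * exp(s_i - sim(a, p) + m)); masked entries vanish
    inner = (w * torch.exp(neg - pos.unsqueeze(1) + margin)).sum(dim=1)
    return torch.log1p(inner).mean()

loss = ddal(torch.randn(8, 6144), torch.randn(8, 6144))
print(torch.isfinite(loss).item())  # True
```

Because the weights come from a softmax over the negatives' similarities, gradients concentrate on the near-positive negatives exactly as Figure 4 describes.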

4. Experiments and Results

4.1. Datasets

We evaluated GeoSSM on two commonly used cross-view geolocation benchmarks, CVUSA and CVACT.
The CVUSA (Cross-View USA) dataset is one of the most representative public benchmarks in this field. It consists of Google Street View panoramas and matching satellite images collected across the United States. Each sample pairs a full 360-degree ground-level panorama with a satellite image of the same location, cropped to a consistent size and covering the surroundings. The dataset is mainly used to study cross-modal matching under large changes in viewpoint, scale, and geometric structure.
The CVACT dataset covers another area: Canberra, Australia. It follows a similar construction pipeline, using Google Street View panoramas as ground images and corresponding satellite imagery from the Google Maps API. The dataset provides 35,532 image pairs for training and 8884 for validation, with Recall@K as the main evaluation metric. In addition to the validation set, CVACT contains a much larger test set of 92,802 image pairs for assessing generalization.

4.2. Experimental Design and Implementation Details

To verify the effectiveness of GeoSSM, we evaluated it on the mainstream CVGL datasets (CVUSA, CVACT) using retrieval metrics such as Recall@K, Top-1 accuracy, and generalization performance under cross-region settings. In addition, we conducted the following ablation experiments: (i) replacing the SSM backbone with a Transformer-CNN backbone; (ii) replacing CGA with global average pooling or a fully connected head; and (iii) replacing DDAL with traditional triplet or InfoNCE losses, comparing convergence speed and the distribution of hard negatives. Finally, we compared efficiency metrics (GFLOPs, params, inference time).
We implemented our model in PyTorch 2.1.0 with Python 3.10 and trained on four NVIDIA RTX 4090 GPUs. The learning rate was set to 0.0001, the batch size to 32, the optimizer was ASAM with weight decay 0.03, the feature dimension was 6144, and a cosine annealing learning-rate schedule was used.
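The optimizer and schedule settings above can be sketched as follows. ASAM requires a separate sharpness-aware implementation not bundled with PyTorch, so AdamW is substituted here purely to illustrate the stated learning rate, weight decay, and cosine annealing; the Linear module and T_max value are placeholders:

```python
import torch

model = torch.nn.Linear(384, 6144)        # stand-in for the GeoSSM encoder

# Paper settings: lr = 1e-4, weight decay = 0.03, cosine annealing schedule.
# AdamW stands in for ASAM, which needs a third-party implementation.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.03)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for step in range(3):                      # toy training steps
    optimizer.zero_grad()
    loss = model(torch.randn(32, 384)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()                       # anneal the learning rate
print(scheduler.get_last_lr()[0] < 1e-4)   # True: lr has decayed from its peak
```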

4.3. Evaluation Metrics

We use R@1, R@5, R@10, and R@1% to measure the model's performance on the datasets. In the cross-view image geolocation task, R@K (Recall@K) is the most commonly used retrieval metric, measuring the model's ability to retrieve the correct geographic location from the candidate database. The definition of Recall@K is given in expression (5),
R@K = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\mathrm{rank}_i \le K\right]
which is defined as the proportion of queries whose correct match appears in the top K retrieval results. Here, N is the total number of query images, rank_i is the rank of the correct match of the i-th query in the retrieval results (starting from 1), and 1[·] is the indicator function, returning 1 if the condition is true and 0 otherwise.
The definition of Recall@1% is given in expression (6),
R@1\% = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\mathrm{rank}_i \le \lceil 0.01 \times M \rceil\right]
which is defined as the proportion of queries whose correct match appears within the top 1% of the reference images, where M is the total number of images in the reference library, N is the total number of query images, rank_i is the rank of the correct match of the i-th query in the retrieval results (starting from 1), and 1[·] is the indicator function, returning 1 if the condition is true and 0 otherwise.
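Expression (5) translates directly into code. The sketch below assumes a square similarity matrix in which query i's ground-truth reference is at index i, as in the DDAL setup:

```python
import torch

def recall_at_k(sim, k):
    """R@K: fraction of queries whose ground-truth reference (index i for
    query i) appears among the top-k most similar references."""
    order = sim.argsort(dim=1, descending=True)   # reference ids, best first
    truth = torch.arange(sim.shape[0]).unsqueeze(1)
    hits = (order[:, :k] == truth).any(dim=1)     # 1[rank_i <= K]
    return hits.float().mean().item()

sim = torch.tensor([[0.9, 0.1, 0.3],
                    [0.2, 0.8, 0.4],
                    [0.7, 0.6, 0.5]])   # query i should match reference i
print(recall_at_k(sim, 1))              # 2 of 3 correct matches rank first
print(recall_at_k(sim, 3))              # 1.0
```

R@1% follows the same logic with k set to ⌈0.01 × M⌉ for a reference library of size M.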

4.4. Results Analysis

Table 1 gives a comparison of the performance of GeoSSM against the existing mainstream cross-view geolocation methods (SAFA, CDE, L2LTR, TransGeo, SEH, and GeoDTR) on the CVUSA dataset († denotes the results obtained using polar transformation). CVUSA is a standard suburban scene dataset, which is mainly used for evaluating the model’s ability to retrieve the panoramic and satellite images that are well aligned.
According to Table 1, GeoSSM achieves the best, or at least comparable, performance on all Recall@K metrics. In particular, GeoSSM's R@1 accuracy reaches 96.02%, with R@5 and R@10 at 98.95% and 99.26%, respectively. GeoSSM outperforms GeoDTR, which is based on a ResNet-Transformer hybrid architecture, improving R@1 by 0.59%. This shows that, with Vision Mamba as an efficient backbone and the CGA feature aggregation method, GeoSSM better captures global geometric features.
To evaluate robustness under heavy interference, we examined performance on the val and test splits of the CVACT dataset. Table 2 shows the comparison between GeoSSM, GeoDTR, FRGeo, and Sample4Geo.
Under the CVACT val setting, GeoSSM obtained an R@1 of 87.53%, slightly below Sample4Geo but above GeoDTR (86.21%), indicating good retrieval performance.
More importantly, on the large-scale CVACT test set with many more samples, GeoSSM shows strong generalization. Existing methods such as GeoDTR suffer a considerable performance drop due to the very large amount of data and complicated urban interference in the test set (R@1 falls to 64.52%). In contrast, GeoSSM maintains an R@1 of 76.35% on the test set, about 11.83 percentage points higher than GeoDTR. This shows that, with Vision Mamba as an efficient backbone and the CGA feature-fusion method, GeoSSM better captures overall geometric structure, allowing the model to perform well even with substantial interfering data.

4.5. Ablation Experiment

To quantify how much each new component contributes to GeoSSM's performance, we performed progressive ablation tests on the CVUSA dataset. Results are presented in Table 3. The baseline model uses only Vision Mamba as the backbone network and a standard loss function.
Effectiveness of CGA: Adding CGA to the baseline improves R@1 from 94.45% to 95.52%. This indicates that CGA outperforms global pooling or a fully connected layer by retaining more channel-wise semantic information and reducing information loss during feature aggregation.
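The parameter-free idea behind CGA can be sketched as follows: split the channels into groups, pool each group over the token dimension, and concatenate the group descriptors. The group count, pooling choice (mean plus max), and shapes here are our assumptions for illustration; the paper's exact grouping may differ.

```python
import torch

def channel_group_aggregate(tokens: torch.Tensor, groups: int = 8) -> torch.Tensor:
    """Parameter-free channel-group aggregation (illustrative sketch).

    tokens: (B, N, C) sequence of token features.
    Splits the C channels into `groups` groups, pools each group over the
    token dimension, and concatenates the group descriptors. No learnable
    parameters are involved, keeping the head lightweight.
    """
    B, N, C = tokens.shape
    assert C % groups == 0, "channels must divide evenly into groups"
    g = tokens.view(B, N, groups, C // groups)
    mean_part = g.mean(dim=1)                       # (B, groups, C // groups)
    max_part = g.amax(dim=1)                        # (B, groups, C // groups)
    out = torch.cat([mean_part, max_part], dim=-1)  # (B, groups, 2*C // groups)
    return out.flatten(1)                           # (B, 2*C) global descriptor

desc = channel_group_aggregate(torch.randn(2, 196, 256), groups=8)
print(desc.shape)  # torch.Size([2, 512])
```

Because each group is pooled separately, coarse semantic differences between channel groups survive aggregation instead of being averaged away as in a single global pool.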
Effectiveness of DDAL: Introducing only DDAL raises R@1 to 94.81%. Although the gain is smaller than that of CGA, it shows that dynamically weighting hard negative samples improves the feature-space distribution and the model's discriminative boundary.
Synergistic effect: Combining CGA and DDAL into the full GeoSSM system yields the best performance, with an R@1 of 96.02%. This shows good complementarity between the efficient feature aggregation structure (CGA) and the robust optimization method (DDAL); they jointly improve the model's performance.
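DDAL's reweighting of in-batch negatives by similarity (temperature γ, as in Figure 4) can be sketched as below. The hinge form, margin value, and default γ are our assumptions for illustration, not the paper's exact formulation; the key mechanism shown is that harder (more similar) negatives receive larger softmax weights and therefore larger gradients.

```python
import torch
import torch.nn.functional as F

def ddal(ground: torch.Tensor, sat: torch.Tensor,
         gamma: float = 10.0, margin: float = 0.3) -> torch.Tensor:
    """Dynamic difficulty-aware loss (illustrative sketch).

    ground, sat: (B, D) embeddings; row i of each forms a matching pair.
    Every off-diagonal pair is a negative. Negatives are reweighted by a
    softmax over their similarities (temperature gamma), so harder
    negatives dominate the loss instead of being averaged uniformly.
    """
    sims = ground @ sat.t()                          # (B, B) similarity matrix
    pos = sims.diag().unsqueeze(1)                   # (B, 1) positive scores
    diag = torch.eye(sims.size(0), dtype=torch.bool)
    neg_logits = sims.masked_fill(diag, float("-inf"))
    weights = F.softmax(gamma * neg_logits, dim=1)   # zero weight on diagonal
    hinge = F.relu(sims - pos + margin)              # margin violation per pair
    return (weights * hinge).sum(dim=1).mean()

# Well-separated embeddings incur zero loss.
e = torch.eye(4)
print(ddal(e, e).item())  # 0.0
```

As γ grows, the weighting approaches hardest-negative mining; as γ → 0, it degrades to uniform averaging over all negatives.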

4.6. Analysis of Computational Complexity and Comparison of Parameters

In addition to retrieval accuracy, we also compared the model parameter counts (Params) of GeoSSM with the baseline method GeoDTR. As shown in Table 4, GeoDTR, due to its hybrid architecture combining ResNet and Transformer, has a parameter count of 48.51 M. In contrast, GeoSSM benefits from the linear complexity characteristics of Vision Mamba and the parameter-free CGA module, significantly reducing the parameter count to 35.24 M, a reduction of approximately 27% in model size.
Furthermore, the different methods are analyzed from the perspectives of computational complexity (FLOPs) and inference efficiency (Inference Time). As shown in the table, GeoDTR, due to its hybrid architecture combining ResNet and Transformer, introduces a large amount of global self-attention computation. Its FLOPs during the testing phase reach 39.9 G, and the inference time for a single image is 420.64 ms, resulting in considerable computational and latency costs. In contrast, TransGeo uses the small-DeiT structure, which reduces the parameter scale and significantly compresses FLOPs to 11.32 G, while the inference time is reduced to 99 ms. However, it still relies on the self-attention mechanism, making it difficult to further improve linear scalability. L2LTR, combining ResNet and ViT, achieves a certain balance in modeling local and global features, with a computational complexity of 18.7 G and an inference time of 156 ms, placing its overall efficiency between GeoDTR and TransGeo.
Compared to the methods mentioned above, GeoSSM, based on the Vision Mamba state-space modeling framework, fully leverages its linear time complexity advantage, significantly reducing computational overhead while maintaining effective long-range dependency modeling capabilities. Specifically, GeoSSM’s FLOPs are only 11.0 G, roughly on par with TransGeo, but it achieves higher computational efficiency with a smaller number of parameters; its inference time is 114 ms, significantly better than GeoDTR and L2LTR and close to lightweight Transformer methods. Overall, GeoSSM demonstrates a better trade-off among model parameters, computational complexity, and inference speed, proving the potential of state-space model-based cross-view geolocation methods to balance efficiency and performance, which is particularly suitable for deployment in applications with limited computing resources or high real-time requirements.
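Parameter counts and single-image latency of the kind reported in Table 4 can be measured as follows. The stand-in model, warmup, and iteration counts are illustrative (not the actual GeoSSM network); GPU timing would additionally require torch.cuda.synchronize around the timed loop, and absolute numbers depend on hardware (the paper used RTX 4090 GPUs).

```python
import time
import torch

def profile(model: torch.nn.Module, inputs: torch.Tensor,
            warmup: int = 10, iters: int = 50):
    """Report parameter count (millions) and average CPU inference latency (ms)."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm up caches / lazy init
            model(inputs)
        start = time.perf_counter()
        for _ in range(iters):
            model(inputs)
        ms = (time.perf_counter() - start) / iters * 1e3
    return params_m, ms

# Example with a tiny stand-in backbone at the paper's ground-image resolution.
net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 8),
)
p, ms = profile(net, torch.randn(1, 3, 128, 512))
print(f"{p:.6f} M params, {ms:.2f} ms / image")
```

FLOPs, by contrast, are usually estimated with a tracing tool (e.g., a FLOP-counting library) rather than measured directly.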

4.7. Lipschitz Stability Statement

This section analyzes the Lipschitz continuity of the mapping defined by the Mamba state space model (Selective State Space Model, SSM). We first clarify the input and output spaces and their norms, then analyze the Lipschitz constants of each functional component one by one, and finally provide an upper bound estimate of the overall Lipschitz constant, as well as discuss the necessary constraints to ensure stability.
Consider the Mamba module as a function F : X → Y. The input x ∈ X ⊆ ℝ^(L×d) is a sequence of length L with feature dimension d, and the output y ∈ Y ⊆ ℝ^(L×d) is a sequence of the same dimension. The Frobenius norm is used in both the input and output spaces (equivalent to the ℓ2 norm after flattening the sequence into a vector), denoted ∥ ⋅ ∥F. To simplify the analysis, we assume that the input satisfies the normalization condition ∥x∥F ≤ 1 (achievable through preprocessing such as LayerNorm) and that the norms of all intermediate variables are bounded. At the same time, all learnable parameters in the model are appropriately initialized and regularized so that their norms remain controllable during training.
The Mamba module mainly consists of normalization layers, linear projection layers, causal convolution layers, activation functions, selective scan, and gating mechanisms. The following formulae analyze their Lipschitz properties one by one.
1. LayerNorm
The calculation formula of the normalization layer (LayerNorm) is given in expression (7):
$$\mathrm{LN}(z) = \frac{z - \mu}{\sigma + \epsilon}\,\gamma + \beta \tag{7}$$
Among them, μ and σ are the mean and standard deviation, ϵ > 0 is a small constant to prevent division by zero, and γ and β are learnable affine parameters. This mapping is Lipschitz-continuous in the region where σ > 0, and its Lipschitz constant is controlled by 1/(σ_min + ϵ), where σ_min is the smallest possible standard deviation. In practical implementation, ϵ = 10⁻⁵, and since the input features are constrained (previously normalized), the standard deviation will not approach zero; therefore, the Lipschitz constant of the normalization layer can be considered finite. A rigorous proof requires assuming the existence of a constant σ0 > 0 such that σ ≥ σ0, in which case we obtain expression (8):
$$L_{\mathrm{LN}} \le \frac{\lVert\gamma\rVert_\infty}{\sigma_0 + \epsilon} \tag{8}$$
In the Mamba implementation, γ is initialized to 1, so L_LN is of the order O(1).
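The finiteness of this constant can be checked empirically with finite differences (scalar γ = 1 as initialized; the setup is purely illustrative). For unit-variance inputs, the local Lipschitz constant stays near 1/σ ≈ 1, consistent with the bound in Eq. (8).

```python
import numpy as np

def layer_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    # Per-feature normalization matching Eq. (7) with scalar gamma/beta.
    return gamma * (z - z.mean()) / (z.std() + eps) + beta

# Empirical local Lipschitz estimate via small random perturbations.
rng = np.random.default_rng(0)
worst = 0.0
for _ in range(1000):
    z = rng.normal(size=64)
    dz = rng.normal(size=64) * 1e-4
    ratio = np.linalg.norm(layer_norm(z + dz) - layer_norm(z)) / np.linalg.norm(dz)
    worst = max(worst, ratio)
print(worst)  # stays near 1/sigma ~ 1 for unit-variance inputs
```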
2. Linear layer
The linear layer is f_lin(x) = Wx + b. Its Lipschitz constant equals the spectral norm of the weight matrix W (the largest singular value), as in Equation (9):
$$L_{\mathrm{lin}} = \lVert W\rVert_2 = \sigma_{\max}(W) \tag{9}$$
In Mamba, the input projection self.in_proj and output projection self.out_proj are both fully connected layers, and their weight matrices may have large spectral norms with default initialization.
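The spectral norm σ_max(W), and the spectral normalization later proposed for these projections, can be computed directly; the matrix below is a random stand-in for an in/out projection weight.

```python
import numpy as np

# Lipschitz constant of a linear layer = largest singular value of W.
# Spectral normalization divides W by this value, pinning the constant to 1.
rng = np.random.default_rng(0)
W = rng.normal(size=(128, 64)) / np.sqrt(64)       # stand-in projection weight
L_lin = np.linalg.svd(W, compute_uv=False)[0]      # sigma_max(W), descending order
W_sn = W / L_lin                                   # spectrally normalized weight
print(L_lin, np.linalg.svd(W_sn, compute_uv=False)[0])  # second value ~ 1.0
```

In a training loop, the division would be re-applied (or tracked with a power-iteration estimate) after each update, since the spectral norm drifts as W is trained.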
3. Causal Convolution Layer
Causal convolution is a linear time-invariant filter and can be represented as a matrix multiplication f_conv(x) = Cx, where C is a Toeplitz matrix constructed from the convolution kernel. Its Lipschitz constant equals the spectral norm of C, ∥C∥₂. For depthwise separable convolution (groups = d_inner), C is a block diagonal matrix and the spectral norm equals the largest singular value of the convolution kernel matrix of each channel. Mamba uses lecun_normal_ initialization, making the expected norm of the convolution kernel controllable, but it may still grow after training.
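The Toeplitz view can be made concrete as below; the kernel is an arbitrary example of ours. For a nonnegative kernel, ∥C∥₂ is bounded by the kernel's ℓ1 norm (here 1.0), which the computed value approaches from below due to boundary truncation.

```python
import numpy as np

def causal_toeplitz(w, L):
    """Lower-triangular Toeplitz matrix of the causal filter
    y[t] = sum_k w[k] * x[t - k], acting on length-L sequences."""
    C = np.zeros((L, L))
    for k, wk in enumerate(w):
        C += wk * np.eye(L, k=-k)   # place w[k] on the k-th subdiagonal
    return C

w = np.array([0.5, 0.3, 0.2])       # example kernel, sum = 1.0
C = causal_toeplitz(w, L=32)
L_conv = np.linalg.svd(C, compute_uv=False)[0]   # spectral norm of the filter
print(L_conv)  # <= sum(|w|) = 1.0
```

For the depthwise case, this bound applies per channel, and the block-diagonal structure makes the overall norm the maximum over channels.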
4. Activation Function SiLU
SiLU (Sigmoid Linear Unit) is defined as σ_SiLU(x) = x ⋅ sigmoid(x). Its derivative is shown in Equation (10):
$$\sigma'_{\mathrm{SiLU}}(x) = \mathrm{sigmoid}(x) + x\,\mathrm{sigmoid}(x)\bigl(1 - \mathrm{sigmoid}(x)\bigr) \tag{10}$$
Through numerical calculation, we can obtain |σ′_SiLU(x)| ≤ 1.1 (the maximum is attained at approximately x ≈ 2.4); therefore, SiLU is Lipschitz-continuous.
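The stated bound can be verified numerically by evaluating Eq. (10) on a dense grid:

```python
import numpy as np

def silu_grad(x):
    # Derivative of SiLU from Eq. (10).
    s = 1.0 / (1.0 + np.exp(-x))
    return s + x * s * (1.0 - s)

x = np.linspace(-10.0, 10.0, 200001)   # step 1e-4 over the interesting range
g = np.abs(silu_grad(x))
print(x[g.argmax()], g.max())          # maximum ~1.0998 near x ~ 2.4
```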
5. Selective Scanning (Core SSM)
The selective scanning module implements a discrete state space model. The expressions are shown in Equations (11) and (12):
$$h_t = (I - \Delta_t A)^{-1} h_{t-1} + \Delta_t B_t x_t \tag{11}$$
$$y_t = C_t h_t + D_t x_t \tag{12}$$
Among them, Δ_t ∈ ℝ^d is dynamically generated from the input, A ∈ ℝ^(d×d) is a learnable matrix (usually diagonal), and B_t and C_t are also generated from the input.
The mapping from input x_t to output y_t involves the generation of B_t and C_t. B_t and C_t are usually obtained from the input through a linear layer and softmax normalization, so their norms are bounded. For example, B_t = softmax(W_B x_t) satisfies ∥B_t∥₂ ≤ 1. Δ_t is generated by softplus, and its value has an upper bound (depending on the input range). Therefore, the selective scanning module as a whole is Lipschitz-continuous, and its Lipschitz constant can be derived by composing the constants of its components.
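A minimal reference implementation of the recurrence in Eqs. (11) and (12), with a diagonal A, is sketched below. Shapes are simplified to a single channel with a d-dimensional state, and all inputs are random stand-ins; the real kernel fuses this loop into a parallel scan for efficiency.

```python
import numpy as np

def selective_scan(x, A, B_t, C_t, D, dt):
    """Per-step selective-scan recurrence (illustrative, diagonal A).

    h_t = (I - dt_t A)^{-1} h_{t-1} + dt_t B_t x_t   (Eq. 11)
    y_t = C_t h_t + D x_t                            (Eq. 12)
    x: (L,) scalar input channel; A: (d,) diagonal entries (negative for
    stability); B_t, C_t: (L, d) input-dependent projections; dt: (L,).
    """
    L, d = B_t.shape
    h = np.zeros(d)
    y = np.empty(L)
    for t in range(L):
        decay = 1.0 / (1.0 - dt[t] * A)       # (I - dt A)^{-1} for diagonal A
        h = decay * h + dt[t] * B_t[t] * x[t]
        y[t] = C_t[t] @ h + D * x[t]
    return y

rng = np.random.default_rng(0)
L, d = 16, 4
y = selective_scan(rng.normal(size=L), -np.abs(rng.normal(size=d)),
                   rng.normal(size=(L, d)), rng.normal(size=(L, d)),
                   1.0, np.full(L, 0.1))
print(y.shape)  # (16,)
```

With A negative and Δ_t > 0, the per-step decay factor is strictly below 1, which is what keeps the recurrence (and hence its Lipschitz constant) bounded over long sequences.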
6. Gating mechanism
The gated operation is out = out ⋅ σ_SiLU(z), the element-wise product of two branch outputs. Since the Lipschitz constant of SiLU is 1.1 and z has a controlled, normalized range, the Lipschitz constant of this gated mapping satisfies Equation (13):
$$L_{\mathrm{gate}} \le \lVert\sigma_{\mathrm{SiLU}}(z)\rVert_\infty L_{\mathrm{out}} + \lVert \mathrm{out}\rVert_\infty L_{\mathrm{SiLU}} L_z \tag{13}$$
Here, L_out and L_z are the Lipschitz constants of the branches producing out and z, respectively. If the intermediate feature norms are bounded (for example, ensured through normalization), then L_gate is bounded. In Mamba, z comes from another branch and is normalized, so ∥z∥ is finite, and thus L_gate = O(1).
7. Residual connection
The residual connection has the form output = residual + drop_path(hidden). For any two input pairs (a, b) and (a′, b′), the additive mapping satisfies Equation (14):
$$\lVert (a+b) - (a'+b')\rVert \le \lVert a - a'\rVert + \lVert b - b'\rVert \tag{14}$$
Therefore, if the Lipschitz constant of the residual branch is L_res and that of the hidden branch is L_hid, the overall Lipschitz constant after addition does not exceed L_res + L_hid. In a typical residual block, the residual branch is an identity mapping (i.e., L_res = 1), while the hidden branch is a nonlinear transformation. Thus, the overall Lipschitz constant satisfies L ≤ 1 + L_hid. DropPath randomly drops the hidden branch with probability p during training (output set to zero); its expected effect is equivalent to scaling, but during inference it acts as an identity mapping and therefore does not affect the upper-bound estimate.
According to the composition property of Lipschitz functions, the overall Lipschitz constant satisfies Equation (15):
$$L_F \le L_{\mathrm{out}} \, L_{\mathrm{gate}} \, L_{\mathrm{ssm}} \, L_{\mathrm{act}} \, L_{\mathrm{conv}} \, L_{\mathrm{in}} \, L_{\mathrm{norm}} \tag{15}$$
According to the analysis, L_in and L_out may be large if unconstrained. Therefore, in subsequent improvements, spectral norm constraints will be applied to the linear projection layers to enhance the model's robustness.

5. Conclusions

Overall, the model proposed in this paper improves the balance between accuracy and computational efficiency relative to existing models, providing a feasible approach for the lightweight redesign of other high-precision visual-task models in the future.

6. Discussion

To address the efficiency and optimization-strategy bottlenecks of GeoDTR and similar methods in the cross-view geo-localization domain, the GeoSSM framework proposed in this paper builds an end-to-end solution by integrating three novel modules: Vision Mamba, channel-group aggregation (CGA), and the dynamic difficulty-aware loss (DDAL). Vision Mamba escapes both the local-receptive-field limitation of traditional CNNs and the quadratic complexity of Transformers, achieving long-range dependency modeling with linear complexity through a state-space model; its dynamic parameter generation mechanism can adapt the feature extraction strategy as needed. CGA addresses the over-compression of conventional global pooling by preserving the semantic differences among channel groups, fitting well with the multi-scale semantic features produced by Vision Mamba. DDAL directs the model toward difficult samples through its adaptive weighting scheme, forming a self-adjusting learning loop together with Vision Mamba's dynamic mechanism. Compared with other methods, GeoSSM has clear advantages in computational efficiency, feature representation ability, and optimization strategy; the three modules form a complete technical chain that cooperates across feature extraction, aggregation, and optimization. The limitations of this research are as follows: compared with the parallel computation of Transformers, Mamba's sequential scanning restricts parallelism, so it may be less efficient for ultra-long sequences; there is no information exchange between different channel groups, which may limit the model's ability to learn complex cross-channel features; and if all negative samples in a batch are easy, the loss provides insufficient gradients, so convergence may slow.
In addition, this paper mainly discusses improvements in balancing model accuracy and efficiency, but robustness is also an important aspect of deep learning.
Given these limitations, future research may proceed along the following lines. First, to alleviate the parallel-efficiency problems caused by Mamba's sequential scanning, later studies could investigate hybrid-architecture optimization schemes. Second, to resolve the inadequate feature interaction caused by the lack of communication among channel groups, subsequent work could design dynamic cross-channel fusion modules. Third, to address the weak-gradient issue that arises when a batch contains only easy negatives, future work could incorporate adaptive hard-example mining. It would also be useful to compare inference throughput at different batch sizes, since real-world scenarios have varying speed requirements. On this basis, we can further explore multi-scale state-space modeling, cross-domain self-supervised pre-training, and stronger cross-regional generalization evaluation, combined with a robustness perspective, such as studying adversarial robustness or distribution generalization, so as to better promote the application of cross-view localization technology in complex real-world situations.

Author Contributions

Conceptualization, H.T. and Z.W. (Zhenqing Wang); methodology, H.T.; software, H.T. and Z.W. (Zhaowei Wang); validation, H.T.; formal analysis, H.T. and T.W.; investigation, H.T.; resources, H.T. and F.W.; data curation, H.T.; writing—original draft preparation, H.T.; writing—review and editing, S.W., L.W. and Z.W. (Zhaowei Wang); visualization, H.T.; supervision, C.X. and Z.N.; project administration, H.T.; funding acquisition, F.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key R&D Program of China (2021YFB3901201).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The publicly available datasets used in the experiment can be obtained through the following channels: CVUSA: Crossview USA (CVUSA) @ MVRL (accessed on 7 April 2025); CVACT: https://github.com/Liumouliu/OriCNN (accessed on 7 April 2025).

Acknowledgments

In the preparation of this work, the authors used DeepSeek-V3.2 to assist in writing some of the summary and commentary on the experimental results (Lines 475–492, Page 13; Lines 520–530, Page 14; Lines 635–652, Pages 16–17) and to help interpret some model structures (Lines 207–225, Page 5; Lines 269–287, Page 7). After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the final published content.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Häne, C.; Heng, L.; Lee, G.H.; Fraundorfer, F.; Furgale, P.; Sattler, T.; Pollefeys, M. 3D Visual Perception for Self-Driving Cars using a Multi-Camera System: Calibration, Mapping, Localization, and Obstacle Detection. arXiv 2017, arXiv:1708.09839. [Google Scholar] [CrossRef]
  2. McManus, C.; Churchill, W.; Maddern, W.; Stewart, A.D.; Newman, P. Shady dealings: Robust, long-term visual localisation using illumination invariance. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 901–906. [Google Scholar] [CrossRef]
  3. Durgam, A.; Paheding, S.; Dhiman, V.; Devabhaktuni, V. Cross-view geo-localization: A survey. IEEE Access 2024, 12, 192028–192050. [Google Scholar] [CrossRef]
  4. Delamou, M.; Bazzi, A.; Chafii, M.; Amhoud, E.M. Deep Learning-Based Estimation for Multitarget Radar Detection. arXiv 2023, arXiv:2305.05621. [Google Scholar] [CrossRef]
  5. Zhu, Y.; Sun, B.; Lu, X.; Jia, S. Geographic Semantic Network for Cross-View Image Geo-Localization. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4704315. [Google Scholar] [CrossRef]
  6. Ding, L.; Zhou, J.; Meng, L.; Long, Z. A Practical Cross-View Image Matching Method between UAV and Satellite for UAV-Based Geo-Localization. Remote Sens. 2020, 13, 47. [Google Scholar] [CrossRef]
  7. Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. arXiv 2022, arXiv:2204.00097. [Google Scholar] [CrossRef]
  8. Zhang, X.; Li, X.; Sultani, W.; Zhou, Y.; Wshah, S. Cross-view Geo-localization via Learning Disentangled Geometric Layout Correspondence. arXiv 2023, arXiv:2212.04074. [Google Scholar] [CrossRef]
  9. Ye, J.; Lv, Z.; Li, W.; Yu, J.; Yang, H.; Zhong, H.; He, C. Cross-View Image Geo-Localization with Panorama-BEV Co-Retrieval Network. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2025. [Google Scholar] [CrossRef]
  10. Fervers, F.; Bullinger, S.; Bodensteiner, C.; Arens, M.; Stiefelhagen, R. Statewide Visual Geolocalization in the Wild. arXiv 2024, arXiv:2409.16763. [Google Scholar] [CrossRef]
  11. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
  12. Workman, S.; Souvenir, R.; Jacobs, N. Wide-Area Image Geolocalization with Aerial Reference Imagery. arXiv 2015, arXiv:1510.03743. [Google Scholar] [CrossRef]
  13. Tian, Y.; Chen, C.; Shah, M. Cross-View Image Matching for Geo-Localization in Urban Environments. arXiv 2017, arXiv:1703.07815. [Google Scholar] [CrossRef]
  14. Zhu, S.; Yang, T.; Chen, C. VIGOR: Cross-View Image Geo-localization Beyond One-to-One Retrieval. arXiv 2021, arXiv:2011.12172. [Google Scholar] [CrossRef]
  15. Lin, T.-Y.; Belongie, S.; Hays, J. Cross-View Image Geolocalization. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 891–898. [Google Scholar] [CrossRef]
  16. Shi, Y.; Liu, L.; Yu, X.; Li, H. Spatial-Aware Feature Aggregation for Image Based Cross-View Geo-Localization. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Available online: https://proceedings.neurips.cc/paper/2019/hash/ba2f0015122a5955f8b3a50240fb91b2-Abstract.html (accessed on 20 January 2026).
  17. Castaldo, F.; Zamir, A.; Angst, R.; Palmieri, F.; Savarese, S. Semantic Cross-View Matching. arXiv 2015, arXiv:1511.00098. [Google Scholar] [CrossRef]
  18. Montrezol, J.; Oliveira, H.S.; Oliveira, H.P. Decoding vision transformer variations for image classification: A guide to performance and usability. Mach. Learn. Appl. 2026, 23, 100844. [Google Scholar] [CrossRef]
  19. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do Vision Transformers See like Convolutional Neural Networks? arXiv 2022, arXiv:2108.08810. [Google Scholar] [CrossRef]
  20. Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521. [Google Scholar] [CrossRef]
  21. Ye, P.; Lin, J.; Kang, Y.; Kaya, T.; Yildirim, K.; Baig, A.H.; Aydemir, E.; Dogan, S.; Tuncer, T. MobileTransNeXt: Integrating CNN, transformer, and BiLSTM for image classification. Alex. Eng. J. 2025, 123, 460–470. [Google Scholar] [CrossRef]
  22. Yang, H.; Lu, X.; Zhu, Y. Cross-View Geo-Localization with Evolving Transformer. arXiv 2021, arXiv:2107.00842. [Google Scholar] [CrossRef]
  23. Hatamizadeh, A.; Kautz, J. MambaVision: A Hybrid Mamba-Transformer Vision Backbone. arXiv 2025, arXiv:2407.08083. [Google Scholar] [CrossRef]
  24. Ye, J.; Lin, H.; Ou, L.; Chen, D.; Wang, Z.; Zhu, Q.; He, C.; Li, W. Where am I? Cross-View Geo-Localization with Natural Language Descriptions. arXiv 2025, arXiv:2412.17007. [Google Scholar] [CrossRef]
  25. Rahman, M.M.; Tutul, A.A.; Nath, A.; Laishram, L.; Jung, S.K.; Hammond, T. Mamba in Vision: A Comprehensive Survey of Techniques and Applications. arXiv 2024, arXiv:2410.03105. [Google Scholar] [CrossRef]
  26. Yang, H.; Lu, X.; Zhu, Y. Cross-View Geo-Localization with Layer-to-Layer Transformer. In Advances in Neural Information Processing Systems; 2021; Available online: https://openreview.net/forum?id=tQgj7CDTfKB (accessed on 20 January 2026).
  27. Ju, C.; Xu, W.; Chen, N.; Zheng, E. An Efficient Pyramid Transformer Network for Cross-View Geo-Localization in Complex Terrains. Drones 2025, 9, 379. [Google Scholar] [CrossRef]
  28. Guo, Y.; Choi, M.; Li, K.; Boussaid, F.; Bennamoun, M. Soft Exemplar Highlighting for Cross-View Image-Based Geo-Localization. IEEE Trans. Image Process. 2022, 31, 2094–2105. [Google Scholar] [CrossRef]
  29. Bao, M.; Lyu, S.; Xu, Z.; Zhou, H.; Ren, J.; Xiang, S.; Li, X.; Cheng, G. Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook. arXiv 2025, arXiv:2505.00630. [Google Scholar] [CrossRef]
  30. Zhang, Q.; Zhu, Y. Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination. Proc. AAAI Conf. Artif. Intell. 2024, 38, 7251–7259. [Google Scholar] [CrossRef]
  31. Deuser, F.; Habel, K.; Oswald, N. Sample4Geo: Hard Negative Sampling for Cross-View Geo-Localisation. arXiv 2023, arXiv:2303.11851. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram of the GeoSSM framework. CLS: Classification. SSM: Core Mamba Block. C, H, and W represent Batch size, Number of Tokens, and Channel Dimension, respectively.
Figure 2. SSM module diagram. X and Z represent the SSM data branch and the gated data branch, respectively.
Figure 3. Schematic diagram of CGA structure. Different colors represent different channel groups, with each color representing a group of channels that have similar semantic features. Channels of the same color capture similar visual patterns or semantic information.
Figure 4. Schematic diagram of DDAL principle. A: Anchor; P: positive samples; N: negative samples. γ: temperature parameter, used to adjust the sharpness of the negative sample weight distribution; sim: similarity score between negative sample and query sample.
Table 1. Comparison of GeoSSM with existing mainstream methods on the CVUSA dataset.
| Method | R@1 | R@5 | R@10 | R@1% |
|---|---|---|---|---|
| SAFA [16] | 89.84 | 96.93 | 98.14 | 99.64 |
| CDE [8] | 92.56 | 97.55 | 98.33 | 99.57 |
| L2LTR [26] | 94.05 | 98.27 | 98.99 | 99.67 |
| TransGeo [7] | 94.08 | 98.36 | 99.04 | 99.77 |
| SEH [28] | 95.11 | 98.45 | 99.00 | 99.78 |
| GeoDTR [8] | 95.43 | 98.86 | 99.34 | 99.86 |
| GeoSSM | 96.02 | 98.95 | 99.26 | 99.91 |
† Results obtained using polar transformation.
Table 2. Performance comparison of GeoSSM with existing mainstream methods on the CVACT dataset.
| Method | R@1 | R@5 | R@10 | R@1% | Settings |
|---|---|---|---|---|---|
| GeoDTR [8] | 86.21 | 95.44 | 96.72 | 98.77 | val |
| FRGeo [30] | 90.35 | 96.45 | 97.25 | 98.74 | val |
| Sample4Geo [31] | 90.81 | 96.74 | 97.48 | 98.77 | val |
| GeoSSM | 87.53 | 96.05 | 96.81 | 98.86 | val |
| GeoDTR [8] | 64.52 | 88.59 | 91.96 | 98.74 | test |
| GeoSSM | 76.35 | 90.72 | 93.12 | 98.03 | test |
† Results obtained using polar transformation.
Table 3. Ablation experiment results.
| Configuration | R@1 | R@5 | R@10 | R@1% |
|---|---|---|---|---|
| Baseline | 94.45% | 98.21% | 98.96% | 99.42% |
| Baseline + CGA | 95.52% | 98.76% | 99.13% | 99.65% |
| Baseline + DDAL | 94.81% | 98.35% | 98.95% | 99.64% |
| GeoSSM (Base + CGA + DDAL) | 96.02% | 98.95% | 99.26% | 99.81% |
Table 4. Analysis of computational complexity and comparison of parameters. In the CVUSA dataset, the results were obtained using four NVIDIA GeForce RTX 4090 GPUs. The ground input image resolution is 128 × 512, the satellite input image resolution is 256 × 256, and the batch size is 32.
| Method | Backbone | Params (M) | Test FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|
| GeoDTR | ResNet + Trans | 48.51 | 39.9 | 420.64 |
| TransGeo | Pure Transformer | 44.43 | 11.32 | 99 |
| L2LTR | ResNet + ViT | 76.35 | 18.7 | 156 |
| GeoSSM (ours) | Vision Mamba (SSM) | 35.24 | 11.0 | 114 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tao, H.; Wang, S.; Wang, F.; Wang, L.; Wang, Z.; Wang, Z.; Wang, T.; Xiong, C.; Nie, Z. Balancing Precision and Efficiency: Cross-View Geo-Localization with Efficient State Space Models. AI 2026, 7, 118. https://doi.org/10.3390/ai7040118
