Balancing Precision and Efficiency: Cross-View Geo-Localization with Efficient State Space Models
Abstract
1. Introduction
2. Related Work
3. Methods
3.1. Overall Framework
3.2. State Space Vision Backbone (SSM)
3.3. Channel Group Aggregation (CGA)
3.4. Dynamic Difficulty Awareness Loss (DDAL)
4. Experiments and Results
4.1. Datasets
4.2. Experimental Design and Implementation Details
4.3. Evaluation Metrics
4.4. Results Analysis
4.5. Ablation Experiment
4.6. Analysis of Computational Complexity and Comparison of Parameters
4.7. Lipschitz Stability Statement
- 1. LayerNorm
- 2. Linear layer
- 3. Causal Convolution Layer
- 4. Activation Function SiLU
- 5. Selective Scanning (Core SSM)
- 6. Gating mechanism
- 7. Residual connection
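
The component-by-component listing above suggests the usual way a Lipschitz statement for a residual block is assembled: bound each stage separately, multiply the bounds along the branch, and add 1 for the skip connection. The sketch below illustrates only that general composition rule; it is not the paper's derivation. The function names and the placeholder constants are hypothetical, and a product-style gate only admits a finite constant when the activations are bounded, which is assumed here.

```python
import math


def silu_derivative(x):
    """Derivative of SiLU(x) = x * sigmoid(x)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s + x * s * (1.0 - s)


def silu_lipschitz_estimate(lo=-20.0, hi=20.0, steps=40001):
    """Grid estimate of sup |SiLU'(x)|, i.e. the Lipschitz constant of SiLU."""
    step = (hi - lo) / (steps - 1)
    return max(abs(silu_derivative(lo + i * step)) for i in range(steps))


def residual_block_bound(branch_constants):
    """Upper bound for x -> x + F(x), where F is a composition of stages.

    Lip(f o g) <= Lip(f) * Lip(g), and the identity path contributes 1.
    """
    return 1.0 + math.prod(branch_constants)


silu_l = silu_lipschitz_estimate()

# Placeholder per-stage constants (assumptions, NOT values from the paper):
# LayerNorm, linear, causal conv, SiLU, selective scan, gate.
hypothetical_constants = [1.0, 1.0, 1.0, silu_l, 1.0, 1.0]

print(f"SiLU Lipschitz estimate: {silu_l:.4f}")
print(f"Residual block bound:    {residual_block_bound(hypothetical_constants):.4f}")
```

Only the SiLU constant is computed here; the other entries would have to come from the trained weights (for example, spectral norms of the linear and convolution layers) and from whatever boundedness assumptions the stability argument makes.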
5. Conclusions
6. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- [1] Häne, C.; Heng, L.; Lee, G.H.; Fraundorfer, F.; Furgale, P.; Sattler, T.; Pollefeys, M. 3D Visual Perception for Self-Driving Cars using a Multi-Camera System: Calibration, Mapping, Localization, and Obstacle Detection. arXiv 2017, arXiv:1708.09839.
- [2] McManus, C.; Churchill, W.; Maddern, W.; Stewart, A.D.; Newman, P. Shady dealings: Robust, long-term visual localisation using illumination invariance. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), Hong Kong, China, 31 May–7 June 2014; pp. 901–906.
- [3] Durgam, A.; Paheding, S.; Dhiman, V.; Devabhaktuni, V. Cross-view geo-localization: A survey. IEEE Access 2024, 12, 192028–192050.
- [4] Delamou, M.; Bazzi, A.; Chafii, M.; Amhoud, E.M. Deep Learning-Based Estimation for Multitarget Radar Detection. arXiv 2023, arXiv:2305.05621.
- [5] Zhu, Y.; Sun, B.; Lu, X.; Jia, S. Geographic Semantic Network for Cross-View Image Geo-Localization. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4704315.
- [6] Ding, L.; Zhou, J.; Meng, L.; Long, Z. A Practical Cross-View Image Matching Method between UAV and Satellite for UAV-Based Geo-Localization. Remote Sens. 2020, 13, 47.
- [7] Zhu, S.; Shah, M.; Chen, C. TransGeo: Transformer Is All You Need for Cross-view Image Geo-localization. arXiv 2022, arXiv:2204.00097.
- [8] Zhang, X.; Li, X.; Sultani, W.; Zhou, Y.; Wshah, S. Cross-view Geo-localization via Learning Disentangled Geometric Layout Correspondence. arXiv 2023, arXiv:2212.04074.
- [9] Ye, J.; Lv, Z.; Li, W.; Yu, J.; Yang, H.; Zhong, H.; He, C. Cross-View Image Geo-Localization with Panorama-BEV Co-Retrieval Network. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2025.
- [10] Fervers, F.; Bullinger, S.; Bodensteiner, C.; Arens, M.; Stiefelhagen, R. Statewide Visual Geolocalization in the Wild. arXiv 2024, arXiv:2409.16763.
- [11] Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model. arXiv 2024, arXiv:2401.09417.
- [12] Workman, S.; Souvenir, R.; Jacobs, N. Wide-Area Image Geolocalization with Aerial Reference Imagery. arXiv 2015, arXiv:1510.03743.
- [13] Tian, Y.; Chen, C.; Shah, M. Cross-View Image Matching for Geo-Localization in Urban Environments. arXiv 2017, arXiv:1703.07815.
- [14] Zhu, S.; Yang, T.; Chen, C. VIGOR: Cross-View Image Geo-localization Beyond One-to-One Retrieval. arXiv 2021, arXiv:2011.12172.
- [15] Lin, T.-Y.; Belongie, S.; Hays, J. Cross-View Image Geolocalization. In Proceedings of the 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, 23–28 June 2013; pp. 891–898.
- [16] Shi, Y.; Liu, L.; Yu, X.; Li, H. Spatial-Aware Feature Aggregation for Image Based Cross-View Geo-Localization. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Available online: https://proceedings.neurips.cc/paper/2019/hash/ba2f0015122a5955f8b3a50240fb91b2-Abstract.html (accessed on 20 January 2026).
- [17] Castaldo, F.; Zamir, A.; Angst, R.; Palmieri, F.; Savarese, S. Semantic Cross-View Matching. arXiv 2015, arXiv:1511.00098.
- [18] Montrezol, J.; Oliveira, H.S.; Oliveira, H.P. Decoding vision transformer variations for image classification: A guide to performance and usability. Mach. Learn. Appl. 2026, 23, 100844.
- [19] Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do Vision Transformers See like Convolutional Neural Networks? arXiv 2022, arXiv:2108.08810.
- [20] Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review. Appl. Sci. 2023, 13, 5521.
- [21] Ye, P.; Lin, J.; Kang, Y.; Kaya, T.; Yildirim, K.; Baig, A.H.; Aydemir, E.; Dogan, S.; Tuncer, T. MobileTransNeXt: Integrating CNN, transformer, and BiLSTM for image classification. Alex. Eng. J. 2025, 123, 460–470.
- [22] Yang, H.; Lu, X.; Zhu, Y. Cross-View Geo-Localization with Evolving Transformer. arXiv 2021, arXiv:2107.00842.
- [23] Hatamizadeh, A.; Kautz, J. MambaVision: A Hybrid Mamba-Transformer Vision Backbone. arXiv 2025, arXiv:2407.08083.
- [24] Ye, J.; Lin, H.; Ou, L.; Chen, D.; Wang, Z.; Zhu, Q.; He, C.; Li, W. Where am I? Cross-View Geo-Localization with Natural Language Descriptions. arXiv 2025, arXiv:2412.17007.
- [25] Rahman, M.M.; Tutul, A.A.; Nath, A.; Laishram, L.; Jung, S.K.; Hammond, T. Mamba in Vision: A Comprehensive Survey of Techniques and Applications. arXiv 2024, arXiv:2410.03105.
- [26] Yang, H.; Lu, X.; Zhu, Y. Cross-View Geo-Localization with Layer-to-Layer Transformer. In Advances in Neural Information Processing Systems; 2021; Available online: https://openreview.net/forum?id=tQgj7CDTfKB (accessed on 20 January 2026).
- [27] Ju, C.; Xu, W.; Chen, N.; Zheng, E. An Efficient Pyramid Transformer Network for Cross-View Geo-Localization in Complex Terrains. Drones 2025, 9, 379.
- [28] Guo, Y.; Choi, M.; Li, K.; Boussaid, F.; Bennamoun, M. Soft Exemplar Highlighting for Cross-View Image-Based Geo-Localization. IEEE Trans. Image Process. 2022, 31, 2094–2105.
- [29] Bao, M.; Lyu, S.; Xu, Z.; Zhou, H.; Ren, J.; Xiang, S.; Li, X.; Cheng, G. Vision Mamba in Remote Sensing: A Comprehensive Survey of Techniques, Applications and Outlook. arXiv 2025, arXiv:2505.00630.
- [30] Zhang, Q.; Zhu, Y. Aligning Geometric Spatial Layout in Cross-View Geo-Localization via Feature Recombination. Proc. AAAI Conf. Artif. Intell. 2024, 38, 7251–7259.
- [31] Deuser, F.; Habel, K.; Oswald, N. Sample4Geo: Hard Negative Sampling for Cross-View Geo-Localisation. arXiv 2023, arXiv:2303.11851.




| Method | R@1 | R@5 | R@10 | R@1% |
|---|---|---|---|---|
| SAFA † [16] | 89.84 | 96.93 | 98.14 | 99.64 |
| CDE † [8] | 92.56 | 97.55 | 98.33 | 99.57 |
| L2LTR † [26] | 94.05 | 98.27 | 98.99 | 99.67 |
| TransGeo [7] | 94.08 | 98.36 | 99.04 | 99.77 |
| SHE [28] | 95.11 | 98.45 | 99.00 | 99.78 |
| GeoDTR † [8] | 95.43 | 98.86 | 99.34 | 99.86 |
| GeoSSM | 96.02 | 98.95 | 99.26 | 99.91 |

| Method | R@1 | R@5 | R@10 | R@1% | Settings |
|---|---|---|---|---|---|
| GeoDTR † [8] | 86.21 | 95.44 | 96.72 | 98.77 | val |
| FRGeo [30] | 90.35 | 96.45 | 97.25 | 98.74 | val |
| Sample4Geo [31] | 90.81 | 96.74 | 97.48 | 98.77 | val |
| GeoSSM | 87.53 | 96.05 | 96.81 | 98.86 | val |
| GeoDTR † [8] | 64.52 | 88.59 | 91.96 | 98.74 | test |
| GeoSSM | 76.35 | 90.72 | 93.12 | 98.03 | test |

| Configuration | R@1 | R@5 | R@10 | R@1% |
|---|---|---|---|---|
| Baseline | 94.45% | 98.21% | 98.96% | 99.42% |
| Baseline + CGA | 95.52% | 98.76% | 99.13% | 99.65% |
| Baseline + DDAL | 94.81% | 98.35% | 98.95% | 99.64% |
| GeoSSM (Base + CGA + DDAL) | 96.02% | 98.95% | 99.26% | 99.81% |
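
The R@1, R@5, R@10, and R@1% columns in the retrieval tables above are recall-at-K figures: the percentage of queries whose true match appears among the K most similar reference images. The sketch below shows one common way to compute them from a query-by-reference similarity matrix. The function names, the toy similarity values, and the rounding used for the 1% cutoff are assumptions made for illustration; evaluation scripts differ in how they round that cutoff, and the paper's own protocol is not reproduced here.

```python
import math


def recall_at_k(similarity, gt_index, k):
    """Percentage of queries whose ground-truth reference ranks in the top k.

    similarity[q][r] is the similarity between query q and reference r;
    gt_index[q] is the index of the correct reference for query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        true_score = scores[gt_index[q]]
        # Rank = number of references scoring strictly higher than the match.
        rank = sum(1 for s in scores if s > true_score)
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


def recall_at_one_percent(similarity, gt_index):
    """R@1%: K is 1% of the reference set size (rounded up in this sketch)."""
    n_refs = len(similarity[0])
    k = max(1, math.ceil(0.01 * n_refs))
    return recall_at_k(similarity, gt_index, k)


# Toy example (made-up numbers): 3 queries, 4 references.
sim = [
    [0.9, 0.1, 0.2, 0.3],  # true match (index 0) is ranked first
    [0.4, 0.3, 0.8, 0.1],  # true match (index 1) is ranked third
    [0.2, 0.6, 0.5, 0.7],  # true match (index 2) is ranked third
]
gt = [0, 1, 2]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(sim, gt, k):.2f}")
print(f"R@1%: {recall_at_one_percent(sim, gt):.2f}")
```

With a gallery of only four references the 1% cutoff collapses to K = 1; on benchmark-sized galleries it covers many more candidates, which is why R@1% values sit close to 100 in the tables.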

| Method | Backbone | Params (M) | Test FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|
| GeoDTR | ResNet + Trans | 48.51 | 39.9 | 420.64 |
| TransGeo | Pure Transformer | 44.43 | 11.32 | 99 |
| L2LTR | ResNet + ViT | 76.35 | 18.7 | 156 |
| GeoSSM (ours) | Vision Mamba (SSM) | 35.24 | 11.0 | 114 |
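
The trade-off in the complexity table above is easier to read as relative differences against GeoSSM. The short script below does that arithmetic with the values copied from the table; the helper name `relative_reduction` is hypothetical, and a negative result means GeoSSM is higher on that column than the baseline.

```python
# Values copied from the complexity table:
# (parameters in M, test FLOPs in G, inference time in ms).
table = {
    "GeoDTR": (48.51, 39.9, 420.64),
    "TransGeo": (44.43, 11.32, 99.0),
    "L2LTR": (76.35, 18.7, 156.0),
    "GeoSSM": (35.24, 11.0, 114.0),
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline` (negative if higher)."""
    return 100.0 * (baseline - ours) / baseline


ours = table["GeoSSM"]
for name, values in table.items():
    if name == "GeoSSM":
        continue
    params, flops, time_ms = (
        relative_reduction(b, o) for b, o in zip(values, ours)
    )
    print(f"vs {name:8s} params {params:6.1f}%  FLOPs {flops:6.1f}%  time {time_ms:6.1f}%")
```

On these numbers GeoSSM has the fewest parameters and the lowest FLOPs of the four methods, while its inference time is above TransGeo's and below the other two.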
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Tao, H.; Wang, S.; Wang, F.; Wang, L.; Wang, Z.; Wang, Z.; Wang, T.; Xiong, C.; Nie, Z. Balancing Precision and Efficiency: Cross-View Geo-Localization with Efficient State Space Models. AI 2026, 7, 118. https://doi.org/10.3390/ai7040118

