MSMamba: A Multi-Semantic Mamba Framework for Referring Remote Sensing Image Segmentation
Highlights
- MSMamba introduces a Mamba-driven multi-semantic framework for referring remote sensing image segmentation.
- The proposed Visual–Text Fine-Grained Block (VTFB) and Multi-Scale Fusion Decoder (MSFD) improve fine-grained language grounding and scale-aware boundary refinement.
- State-space modeling provides an efficient, scalable alternative to heavy attention mechanisms for long-range context modeling in large-scale remote sensing imagery.
- Experiments on four public benchmarks show consistent improvements in segmentation accuracy, especially with long-text and cluttered scene descriptions.
Abstract
1. Introduction
- We explore a Mamba-driven framework for referring remote sensing image segmentation and design a Visual–Text Fine-Grained Block (VTFB), based on state space modeling, to improve fine-grained vision–language alignment under long and complex descriptions.
- We design an enhanced Multi-Scale Fusion Decoder (MSFD) that strengthens cross-scale feature interaction and improves the segmentation of small objects and boundary regions.
- Extensive experiments on four public benchmarks (RRSIS-D, RRSIS-HR, RefSegRS, and RISBench) demonstrate the effectiveness of MSMamba, which achieves state-of-the-art performance. Further analysis shows that its advantage is more pronounced for long text descriptions.
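
The state space modeling referred to in the contributions above can be illustrated with a minimal sketch of a discretized linear state space recurrence. This is not the VTFB itself: it assumes a single input channel, a diagonal state matrix with nonzero entries, a fixed step size, and zero-order-hold discretization, and it is time-invariant, whereas Mamba-style models make the step size and projections input-dependent. The function name `ssm_scan` and all parameter names are hypothetical.

```python
import math

def ssm_scan(x, a, b, c, delta):
    """Scan a 1-D sequence x with a diagonal linear SSM (illustrative only).

    a, b, c: lists of length N (diagonal state matrix, input and output
             projections); every entry of a is assumed to be nonzero.
    delta:   positive discretization step (fixed here, an assumption).
    """
    # Zero-order-hold discretization for a diagonal A:
    #   a_bar = exp(delta * a),  b_bar = (a_bar - 1) / a * b
    a_bar = [math.exp(delta * ai) for ai in a]
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b)]

    h = [0.0] * len(a)          # hidden state, initialised to zero
    ys = []
    for xt in x:                # one pass over the sequence: O(L * N)
        h = [ab * hi + bb * xt for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

# Impulse response of a one-dimensional state with decay rate -1.
ys = ssm_scan([1.0, 0.0, 0.0], a=[-1.0], b=[1.0], c=[1.0], delta=1.0)
```

The loop touches each sequence element once, so the cost grows linearly with sequence length, in contrast to the quadratic pairwise interactions of full self-attention.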
2. Related Works
2.1. Referring Image Segmentation
2.2. Referring Remote Sensing Image Segmentation
3. Proposed Methodology
3.1. Preliminaries: State Space Model (SSM)
3.2. Architecture Overview
3.3. Visual–Text Fine-Grained Block (VTFB)
3.3.1. Word-Level Semantic Processor
3.3.2. Global–Local Cross-Modal Alignment
3.4. Multi-Scale Fusion Decoder (MSFD)
3.4.1. Multi-Branch Cross-Scale Parsing
3.4.2. Selective Kernel Scale-Aware Gate
4. Experiments
4.1. Metrics and Datasets
4.1.1. RefSegRS
4.1.2. RRSIS-D
4.1.3. RISBench
4.1.4. RRSIS-HR
4.2. Implementation Details
4.3. Performance Comparison
4.3.1. Quantitative Evaluations on RefSegRS
4.3.2. Quantitative Evaluations on RRSIS-D
4.3.3. Quantitative Evaluations on RISBench
4.3.4. Quantitative Evaluations on RRSIS-HR
4.3.5. Computational Efficiency Analysis
4.3.6. Qualitative Comparison
4.4. Ablation Study
4.4.1. Evaluation of GL-CA and Bridge Feature Design
4.4.2. Evaluation of Multi-Scale Fusion Decoder
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest









| Method | Visual Encoder | Text Encoder | Pr@0.5 (Val) | Pr@0.5 (Test) | Pr@0.6 (Val) | Pr@0.6 (Test) | Pr@0.7 (Val) | Pr@0.7 (Test) | Pr@0.8 (Val) | Pr@0.8 (Test) | Pr@0.9 (Val) | Pr@0.9 (Test) | oIoU (Val) | oIoU (Test) | mIoU (Val) | mIoU (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BRINet † | R-101 | LSTM | 36.86 | 20.72 | 35.53 | 14.26 | 19.93 | 9.87 | 10.66 | 2.98 | 2.84 | 1.14 | 61.59 | 58.22 | 38.73 | 31.51 |
| LSCM † | R-101 | LSTM | 56.82 | 31.54 | 41.24 | 20.41 | 21.85 | 9.51 | 12.11 | 5.29 | 2.51 | 0.84 | 62.82 | 61.27 | 40.59 | 35.54 |
| RRN † | R-101 | LSTM | 55.43 | 30.26 | 42.98 | 23.01 | 23.11 | 14.87 | 13.72 | 7.17 | 2.64 | 0.98 | 69.24 | 65.06 | 50.81 | 41.88 |
| ETRIS † | R-101 | CLIP | 54.99 | 35.77 | 35.03 | 23.00 | 25.06 | 13.98 | 12.53 | 6.44 | 1.62 | 1.10 | 72.89 | 65.96 | 54.03 | 43.11 |
| CRIS † | R-101 | CLIP | 53.13 | 35.77 | 36.19 | 24.11 | 24.36 | 14.36 | 11.83 | 6.38 | 2.55 | 1.21 | 72.14 | 65.87 | 53.74 | 43.26 |
| CrossVLT † | Swin-B | BERT | 67.52 | 41.94 | 43.85 | 25.43 | 25.99 | 15.19 | 14.62 | 3.71 | 1.87 | 1.76 | 76.12 | 69.73 | 55.27 | 42.81 |
| LAVT † | Swin-B | BERT | 80.97 | 51.84 | 58.70 | 30.27 | 31.09 | 17.34 | 15.55 | 9.52 | 4.64 | 2.09 | 78.50 | 71.86 | 61.53 | 47.40 |
| RIS-DMMI † | Swin-B | BERT | 86.17 | 63.89 | 74.71 | 44.30 | 38.05 | 19.81 | 18.10 | 6.49 | 3.25 | 1.00 | 74.02 | 68.58 | 65.72 | 52.15 |
| MAFN | Swin-B | BERT | 93.74 | 70.72 | 86.54 | 55.37 | 71.00 | 31.37 | 28.77 | 11.45 | 7.42 | 2.31 | 80.17 | 72.03 | 72.88 | 57.56 |
| LGCE | Swin-B | BERT | 92.58 | 74.79 | 90.49 | 62.30 | 80.05 | 38.69 | 41.07 | 16.68 | 12.76 | 4.73 | 85.80 | 77.28 | 75.67 | 60.10 |
| CADFormer | Swin-B | BERT | 89.79 | 75.34 | 81.67 | 63.18 | 62.65 | 38.97 | 23.43 | 15.63 | 5.57 | 2.97 | 79.11 | 74.08 | 70.46 | 60.88 |
| RMSIN | Swin-B | BERT | 90.72 | 77.44 | 86.31 | 65.44 | 71.93 | 42.43 | 29.70 | 17.39 | 7.19 | 3.08 | 79.30 | 74.53 | 72.26 | 62.04 |
| FIANet | Swin-B | BERT | 95.82 | 82.61 | 92.34 | 73.75 | 87.24 | 54.71 | 54.52 | 25.21 | 10.67 | 4.62 | 83.94 | 76.62 | 78.28 | 66.12 |
| MSMamba (ours) | VMamba | BERT | 95.36 | 83.65 | 94.90 | 78.48 | 90.49 | 68.74 | 83.06 | 50.74 | 45.01 | 18.66 | 86.85 | 79.45 | 85.11 | 72.77 |

| Method | Visual Encoder | Text Encoder | Pr@0.5 (Val) | Pr@0.5 (Test) | Pr@0.6 (Val) | Pr@0.6 (Test) | Pr@0.7 (Val) | Pr@0.7 (Test) | Pr@0.8 (Val) | Pr@0.8 (Test) | Pr@0.9 (Val) | Pr@0.9 (Test) | oIoU (Val) | oIoU (Test) | mIoU (Val) | mIoU (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RRN † | R-101 | LSTM | 51.09 | 51.07 | 42.47 | 42.11 | 33.04 | 32.77 | 20.80 | 21.57 | 6.14 | 6.37 | 66.53 | 66.43 | 46.06 | 45.64 |
| BRINet † | R-101 | LSTM | 58.79 | 56.90 | 49.54 | 48.77 | 39.65 | 38.61 | 28.21 | 27.03 | 9.19 | 8.93 | 70.73 | 69.68 | 51.41 | 49.45 |
| LSCM † | R-101 | LSTM | 57.12 | 56.02 | 48.04 | 46.25 | 37.87 | 37.70 | 26.35 | 25.28 | 7.93 | 7.86 | 69.28 | 69.10 | 50.36 | 49.92 |
| CRIS † | R-101 | CLIP | 56.44 | 54.84 | 47.87 | 46.77 | 39.77 | 38.06 | 29.31 | 28.15 | 11.84 | 11.52 | 70.98 | 70.46 | 50.75 | 49.69 |
| ETRIS † | R-101 | CLIP | 62.10 | 61.07 | 53.73 | 50.99 | 43.12 | 40.94 | 30.79 | 29.30 | 12.90 | 11.43 | 72.75 | 71.06 | 55.21 | 54.21 |
| LAVT † | Swin-B | BERT | 65.23 | 63.98 | 58.79 | 57.57 | 50.29 | 49.30 | 40.11 | 38.06 | 23.05 | 22.29 | 76.27 | 76.16 | 57.72 | 56.82 |
| CrossVLT † | Swin-B | BERT | 67.07 | 66.42 | 59.54 | 59.41 | 50.80 | 49.76 | 40.57 | 38.67 | 23.51 | 23.30 | 76.25 | 75.48 | 59.78 | 58.48 |
| RIS-DMMI † | Swin-B | BERT | 70.40 | 68.74 | 63.05 | 60.96 | 54.14 | 50.33 | 41.95 | 38.38 | 23.85 | 21.63 | 77.01 | 76.20 | 61.70 | 60.25 |
| LGCE | Swin-B | BERT | 71.32 | 69.69 | 64.54 | 63.49 | 55.06 | 53.49 | 43.51 | 41.11 | 25.40 | 23.96 | 77.46 | 76.84 | 62.29 | 60.60 |
| FIANet | Swin-B | BERT | 73.33 | 73.43 | 66.21 | 66.96 | 55.46 | 54.64 | 42.47 | 41.20 | 23.68 | 23.01 | 77.04 | 76.16 | 63.35 | 62.99 |
| RMSIN | Swin-B | BERT | 73.39 | 72.16 | 66.26 | 65.96 | 56.84 | 54.75 | 43.45 | 41.54 | 24.60 | 23.96 | 77.03 | 76.32 | 64.28 | 63.06 |
| CADFormer | Swin-B | BERT | 75.57 | 74.72 | 67.87 | 67.74 | 55.92 | 56.10 | 43.39 | 41.71 | 24.31 | 23.67 | 77.47 | 77.24 | 65.12 | 64.23 |
| MAFN * | Swin-B | BERT | 76.32 | 75.27 | 69.31 | 68.14 | 58.33 | 56.79 | 44.54 | 43.49 | 24.71 | 23.76 | 78.33 | 77.41 | 66.03 | 64.76 |
| MSMamba (ours) | VMamba | BERT | 77.24 | 77.68 | 71.72 | 71.76 | 59.94 | 60.01 | 46.55 | 45.16 | 28.16 | 26.26 | 77.86 | 77.54 | 66.93 | 66.22 |

| Method | Visual Encoder | Text Encoder | Pr@0.5 (Val) | Pr@0.5 (Test) | Pr@0.6 (Val) | Pr@0.6 (Test) | Pr@0.7 (Val) | Pr@0.7 (Test) | Pr@0.8 (Val) | Pr@0.8 (Test) | Pr@0.9 (Val) | Pr@0.9 (Test) | oIoU (Val) | oIoU (Test) | mIoU (Val) | mIoU (Test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BRINet † | R-101 | LSTM | 52.11 | 52.87 | 45.17 | 45.39 | 37.98 | 38.64 | 30.88 | 30.79 | 10.28 | 11.86 | 46.27 | 48.73 | 41.54 | 42.91 |
| RRN † | R-101 | LSTM | 54.62 | 55.04 | 46.88 | 47.31 | 39.57 | 39.86 | 32.64 | 32.58 | 11.57 | 13.24 | 47.28 | 49.67 | 42.65 | 43.18 |
| LSCM † | R-101 | LSTM | 55.87 | 55.26 | 47.24 | 47.14 | 40.22 | 40.10 | 33.55 | 33.29 | 12.78 | 13.91 | 47.99 | 50.08 | 43.21 | 43.69 |
| ETRIS † | R-101 | CLIP | 59.87 | 60.98 | 49.91 | 51.88 | 35.88 | 39.87 | 20.10 | 24.49 | 8.54 | 11.18 | 64.09 | 67.61 | 51.13 | 53.06 |
| CRIS † | R-101 | CLIP | 63.42 | 63.67 | 54.32 | 55.73 | 41.15 | 44.42 | 24.66 | 28.80 | 10.27 | 13.27 | 66.26 | 69.11 | 53.64 | 55.18 |
| LAVT † | Swin-B | BERT | 68.27 | 69.40 | 62.71 | 63.66 | 54.46 | 56.10 | 43.13 | 44.95 | 21.61 | 25.21 | 69.39 | 74.15 | 60.45 | 61.93 |
| CrossVLT † | Swin-B | BERT | 70.05 | 70.62 | 64.29 | 65.05 | 56.97 | 57.40 | 44.49 | 45.80 | 21.47 | 26.10 | 69.77 | 74.33 | 61.54 | 62.84 |
| LGCE | Swin-B | BERT | 71.01 | 71.45 | 65.62 | 66.30 | 58.66 | 58.77 | 47.05 | 47.43 | 24.41 | 27.84 | 69.39 | 73.50 | 62.86 | 63.67 |
| RIS-DMMI † | Swin-B | BERT | 71.27 | 72.05 | 66.02 | 66.48 | 58.22 | 59.07 | 45.57 | 47.16 | 22.43 | 26.57 | 70.58 | 74.82 | 62.62 | 63.93 |
| FIANet | Swin-B | BERT | 75.70 | 75.51 | 70.89 | 70.73 | 64.37 | 63.59 | 53.22 | 52.52 | 30.12 | 31.96 | 69.86 | 74.30 | 67.19 | 67.44 |
| CADFormer | Swin-B | BERT | 75.94 | 76.27 | 70.90 | 71.35 | 64.17 | 64.21 | 52.62 | 52.91 | 29.07 | 31.59 | 70.23 | 74.34 | 66.96 | 67.80 |
| RMSIN | Swin-B | BERT | 75.86 | 76.37 | 70.88 | 71.17 | 64.32 | 64.17 | 53.34 | 53.22 | 30.44 | 32.50 | 70.81 | 75.51 | 67.44 | 68.32 |
| MAFN | Swin-B | BERT | 76.87 | 76.98 | 72.32 | 72.46 | 65.57 | 65.73 | 54.47 | 54.77 | 31.11 | 33.09 | 70.90 | 74.90 | 67.95 | 68.79 |
| MSMamba (ours) | VMamba | BERT | 77.34 | 77.56 | 73.30 | 73.38 | 68.00 | 67.60 | 59.48 | 58.78 | 40.16 | 40.53 | 71.93 | 75.29 | 69.66 | 70.22 |
| Method | Visual Encoder | Text Encoder | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 | oIoU | mIoU |
|---|---|---|---|---|---|---|---|---|---|
| LAVT † | Swin-B | BERT | 23.11 | 20.08 | 13.64 | 5.30 | 0.38 | 27.94 | 22.78 |
| FIANet † | Swin-B | BERT | 31.06 | 27.65 | 22.35 | 15.15 | 1.89 | 28.89 | 27.13 |
| LGCE † | Swin-B | BERT | 35.98 | 31.06 | 23.86 | 15.15 | 3.79 | 38.20 | 33.48 |
| RMSIN † | Swin-B | BERT | 50.00 | 46.97 | 39.77 | 29.92 | 6.44 | 45.97 | 43.70 |
| MAFN | Swin-B | BERT | 46.21 | 44.32 | 40.53 | 32.95 | 17.05 | 46.18 | 44.34 |
| CADFormer | Swin-B | BERT | 63.26 | 57.95 | 46.21 | 34.09 | 9.09 | 53.64 | 54.88 |
| MSMamba (ours) | VMamba | BERT | 66.67 | 61.74 | 57.58 | 43.18 | 20.83 | 59.05 | 57.90 |
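
The comparison tables above report Pr@X, oIoU, and mIoU. As a minimal sketch, assuming the definitions commonly used in referring segmentation work (Pr@X as the percentage of samples whose IoU reaches threshold X, oIoU as total intersection over total union across all samples, and mIoU as the mean of per-sample IoU), the metrics could be computed as follows. The helper names `iou_stats` and `evaluate`, the flat 0/1 mask representation, the non-strict threshold comparison, and the handling of empty unions are assumptions made for illustration, not details taken from the paper.

```python
def iou_stats(pred, gt):
    """Return (intersection, union) pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute Pr@X, oIoU and mIoU (in percent) over a list of mask pairs."""
    total_inter = total_union = 0
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = iou_stats(pred, gt)
        total_inter += inter
        total_union += union
        # Assumption: an empty union counts as IoU 0 for that sample.
        ious.append(inter / union if union else 0.0)

    n = len(ious)
    results = {
        # mIoU averages per-sample IoU, so small objects weigh as much as large ones.
        "mIoU": 100.0 * sum(ious) / n,
        # oIoU pools pixels over the whole set, so large objects dominate.
        "oIoU": 100.0 * total_inter / total_union if total_union else 0.0,
    }
    for t in thresholds:
        # Assumption: a sample counts as correct when IoU >= threshold.
        results[f"Pr@{t}"] = 100.0 * sum(iou >= t for iou in ious) / n
    return results

# Two toy samples with four pixels each.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 1, 1, 0], [0, 1, 0, 0]]
scores = evaluate(preds, gts)
```

Because oIoU pools pixels while mIoU averages per sample, a method can rank differently on the two metrics.
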
| Method | Params (M) | FLOPs (G) | GPU Memory (GB) | Inference Time (ms) |
|---|---|---|---|---|
| RMSIN | 240.04 | 433.02 | 14.08 | 37.0 |
| FIANet | 256.17 | 435.87 | 12.97 | 44.7 |
| MAFN | 350.06 | 450.66 | 22.74 | 85.4 |
| CADFormer | 359.25 | 466.28 | 17.81 | 41.8 |
| MSMamba (Ours) | 264.77 | 416.80 | 19.54 | 59.1 |
| Global | Local | Bridge Feature | mIoU | oIoU |
|---|---|---|---|---|
| ✓ | × | without | 65.78 | 76.67 |
| × | ✓ | without | 65.58 | 76.56 |
| ✓ | ✓ | without | 66.02 | 77.78 |
| ✓ | ✓ | with | 66.93 | 77.86 |
| Attribute Level | Entity Level | mIoU | oIoU | Pr@0.8 | Pr@0.9 |
|---|---|---|---|---|---|
| × | × | 66.02 | 77.78 | 45.80 | 26.60 |
| ✓ | × | 66.93 | 77.86 | 46.55 | 28.16 |
| × | ✓ | 66.06 | 77.84 | 45.34 | 27.07 |
| ✓ | ✓ | 66.23 | 77.26 | 46.09 | 26.67 |

| Position 1 | Position 2 | Position 3 | Position 4 | mIoU | oIoU |
|---|---|---|---|---|---|
| ✓ | | | | 66.21 | 77.44 |
| | ✓ | | | 66.93 | 77.86 |
| | | ✓ | | 65.48 | 76.71 |
| | | | ✓ | 65.46 | 77.58 |
| Config | mIoU | oIoU | Pr@0.8 | Pr@0.9 |
|---|---|---|---|---|
| NONE | 66.24 | 77.70 | 45.80 | 28.22 |
| FPN | 65.61 | 77.22 | 46.49 | 28.39 |
| CIM | 66.43 | 77.58 | 46.55 | 27.87 |
| MSFD (Ours) | 66.93 | 77.86 | 46.55 | 28.16 |
| Fusion Strategy | Pr@0.8 | Pr@0.9 | mIoU | oIoU |
|---|---|---|---|---|
| Add | 45.86 | 27.36 | 65.24 | 76.21 |
| Concat | 45.34 | 27.59 | 65.14 | 76.16 |
| SKSAG (Ours) | 46.55 | 28.16 | 66.93 | 77.86 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Zhang, T.; Li, J.; Feng, Y.; Wen, Z.; Liu, L.; Li, J. MSMamba: A Multi-Semantic Mamba Framework for Referring Remote Sensing Image Segmentation. Remote Sens. 2026, 18, 1949. https://doi.org/10.3390/rs18121949

