OpenMamba: Introducing State Space Models to Open-Vocabulary Semantic Segmentation
Abstract
1. Introduction
- We introduce OpenMamba, a novel method for extracting high-level guidance maps for CLIP feature pooling to tackle open-vocabulary semantic segmentation. Its U-Net-like architecture is built entirely on State Space Models, improving accuracy while reducing memory usage during training. To our knowledge, this is the first use of State Space Models for open-vocabulary semantic segmentation (a minimal state-space recurrence sketch follows this list).
- We adopt InfoNCE as our contrastive loss to encourage similar proposal masks to obtain similar CLIP embeddings. InfoNCE provides a stronger contrastive learning signal by simultaneously pulling positive pairs together and pushing hard negatives apart (see the loss sketch after this list).
- We integrate an ATSS matcher that adaptively selects high-quality proposals using a dynamic IoU threshold (see the matcher sketch after this list).
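The first contribution rests on the linear-time state-space recurrence that underlies Mamba-style blocks. The sketch below is a minimal, illustrative discretized SSM scan in PyTorch; it is not OpenMamba's actual block (which follows VSSD and adds selectivity, gating, and 2D scanning), and all tensor names and shapes are assumptions made for illustration.

```python
import torch

def ssm_scan(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Run the discretized recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (L, D_in) input sequence; A: (N, N); B: (N, D_in); C: (D_out, N).
    The loop touches each token once, so compute and memory grow linearly with L,
    unlike the quadratic attention matrix of a Transformer block.
    """
    h = x.new_zeros(A.size(0))
    outputs = []
    for x_t in x:                    # sequential scan over the sequence
        h = A @ h + B @ x_t          # state update
        outputs.append(C @ h)        # readout
    return torch.stack(outputs)      # (L, D_out)
```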
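For the second contribution, a minimal sketch of an InfoNCE-style objective over mask-level embeddings is given below. It assumes matched embedding pairs arrive index-aligned and treats every other row in the batch as a negative; the temperature default and this pairing scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(queries: torch.Tensor, positives: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """queries, positives: (N, D) embeddings; row i of `positives` is the positive for row i of `queries`.

    All other rows act as negatives, so the loss pulls matched pairs together
    while pushing apart the hardest (most similar) non-matching pairs.
    """
    q = F.normalize(queries, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = q @ p.t() / temperature                    # (N, N) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```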
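For the third contribution, the dynamic-threshold idea of ATSS can be sketched for mask proposals as follows: for each ground-truth mask, take the top-k proposals by IoU, set the threshold to the mean plus the standard deviation of their IoUs, and keep only candidates above it. This is an illustrative adaptation of the ATSS rule, not the paper's exact matcher; the default k and the IoU-based candidate selection are assumptions.

```python
import torch

def atss_positives(ious: torch.Tensor, top_k: int = 9) -> torch.Tensor:
    """ious: (num_proposals,) IoU of every proposal mask with one ground-truth mask.

    Returns a boolean mask marking the proposals accepted as positives for that
    ground truth, using the ATSS rule: threshold = mean + std over the top-k candidates.
    """
    k = min(top_k, ious.numel())
    cand_ious, cand_idx = ious.topk(k)                    # candidate set: k highest-IoU proposals
    threshold = cand_ious.mean() + cand_ious.std(unbiased=False)
    keep = cand_ious >= threshold                         # adaptive, per-ground-truth threshold
    positives = torch.zeros_like(ious, dtype=torch.bool)
    positives[cand_idx[keep]] = True
    return positives
```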
2. Related Work
2.1. Vision–Language Pre-Training
2.2. Open-Vocabulary Semantic Segmentation
2.3. State Space Models
3. Method
3.1. Problem Definition
3.2. Method Overview
3.3. OpenMamba
3.4. Mamba Cross-Attention
3.5. Contrastive Loss
3.6. Dynamic IoU Matcher
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Standard Evaluation
- Evaluation on standard open-vocabulary benchmarks
- Memory usage and time evaluation
4.4. Ablation Study
- Ablation study on the individual proposed components
- Ablation study on top-k selection in the ATSS matcher
- Ablation study on temperature parameter for contrastive loss
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Fedushko, S.; Shumyliak, L.; Cibák, L.; Sierova, M.-O. Image Processing Application Development: A New Approach and Its Economic Profitability. In Data-Centric Business and Applications; Lecture Notes on Data Engineering and Communications Technologies; Štarchoň, P., Fedushko, S., Gubíniová, K., Eds.; Springer Nature: Cham, Switzerland, 2024; Volume 208. [Google Scholar]
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. Segformer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
- Cho, S.; Shin, H.; Hong, S.; Arnab, A.; Seo, P.H.; Kim, S. Catseg: Cost aggregation for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4113–4123. [Google Scholar]
- Zhou, Z.; Lei, Y.; Zhang, B.; Liu, L.; Liu, Y. Zegclip: Towards adapting clip for zero-shot semantic segmentation. arXiv 2022, arXiv:2212.03588. [Google Scholar]
- Xu, M.; Zhang, Z.; Wei, F.; Lin, Y.; Cao, Y.; Hu, H.; Bai, X. A simple baseline for open vocabulary semantic segmentation with pre-trained vision language model. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 736–753. [Google Scholar]
- Jiao, S.; Zhu, H.; Huang, J.; Zhao, Y.; Wei, Y.; Shi, H. Collaborative vision-text representation optimizing for open-vocabulary segmentation. arXiv 2024, arXiv:2408.00744. [Google Scholar]
- Shan, X.; Wu, D.; Zhu, G.; Shao, Y.; Sang, N.; Gao, C. Open-vocabulary semantic segmentation with image embedding balancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28412–28421. [Google Scholar]
- Ding, J.; Xue, N.; Xia, G.S.; Dai, D. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11583–11592. [Google Scholar]
- Han, C.; Zhong, Y.; Li, D.; Han, K.; Ma, L. Open-vocabulary semantic segmentation with decoupled one-pass network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1086–1096. [Google Scholar]
- Li, Y.; Cheng, T.; Feng, B.; Liu, W.; Wang, X. Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation. arXiv 2024, arXiv:2412.04533. [Google Scholar]
- Shi, Y.; Dong, M.; Li, M.; Xu, C. VSSD: Vision Mamba with Non-Causal State Space Duality. arXiv 2024, arXiv:2407.18559. [Google Scholar]
- Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- Zhang, X.; Chi, C.; Jinnai, Y.; Li, Y.; Zhang, X.; Wei, Y.; Sun, J. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 11–15 June 2020; pp. 9759–9768. [Google Scholar]
- Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic understanding of scenes through the ade20k dataset. Int. J. Comput. Vis. 2019, 127, 302–321. [Google Scholar] [CrossRef]
- Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.G.; Lee, S.W.; Fidler, S.; Urtasun, R.; Yuille, A. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 891–898. [Google Scholar]
- Chen, Y.-C.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: UNiversal Image-TExt Representation Learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 104–120. [Google Scholar]
- Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv 2019, arXiv:1908.08530. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 121–137. [Google Scholar]
- Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16793–16803. [Google Scholar]
- Bucher, M.; Vu, T.H.; Cord, M.; Perez, P. Zero-shot semantic segmentation. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Xu, M.; Zhang, Z.; Wei, F.; Hu, H.; Bai, X. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2945–2954. [Google Scholar]
- Yu, Q.; He, J.; Deng, X.; Shen, X.; Chen, L.C. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2085–2094. [Google Scholar]
- Ghiasi, G.; Gu, X.; Cui, Y.; Lin, T.Y. Scaling open-vocabulary image segmentation with image-level labels. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 540–557. [Google Scholar]
- Liang, F.; Wu, B.; Dai, X.; Li, K.; Zhao, Y.; Zhang, H.; Zhang, P.; Vajda, P.; Marculescu, D. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7061–7070. [Google Scholar]
- Xu, J.; Liu, S.; Vahdat, A.; Byeon, W.; Wang, X.; Mello, S.D. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2955–2966. [Google Scholar]
- Wang, X.; He, W.; Xuan, X.; Sebastian, C.; Ono, J.P.; Li, X.; Behpour, S.; Doan, T.; Gou, L.; Shen, H.W.; et al. USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation. arXiv 2024, arXiv:2406.05271. [Google Scholar] [CrossRef]
- Chng, Y.X.; Qiu, X.; Han, Y.; Ding, K.; Ding, W.; Huang, G. Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation. arXiv 2024, arXiv:2409.16278. [Google Scholar]
- Ding, Z.; Wang, J.; Tu, Z. Open vocabulary universal image segmentation with maskclip. arXiv 2022, arXiv:2208.08984. [Google Scholar]
- Xie, B.; Cao, J.; Xie, J.; Khan, F.S.; Pang, Y. Sed: A simple encoder-decoder for open vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3426–3436. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
- Zhu, Y.; Zhu, B.; Chen, Z.; Xu, H.; Tang, M.; Wang, J. Mrovseg: Breaking the resolution curse of vision-language models in open-vocabulary semantic segmentation. arXiv 2024, arXiv:2408.14776. [Google Scholar]
- Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Online, 6–14 December 2021. [Google Scholar]
- Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. HiPPO: Recurrent Memory with Optimal Polynomial Projections. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
- Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
- Liang, D.; Zhou, X.; Wang, X.; Zhu, X.; Xu, W.; Zou, Z.; Ye, X.; Bai, X. Pointmamba: A simple state space model for point cloud analysis. arXiv 2024, arXiv:2402.10739. [Google Scholar] [CrossRef]
- Zhang, T.; Li, X.; Yuan, H.; Ji, S.; Yan, S. Point Cloud Mamba: Point cloud learning via state space model. arXiv 2024, arXiv:2403.00762. [Google Scholar] [CrossRef]
- Lin, B.; Jiang, W.; Chen, P.; Zhang, Y.; Liu, S.; Chen, Y.-C. MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 290–307. [Google Scholar]
- Dao, T.; Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv 2024, arXiv:2405.21060. [Google Scholar] [CrossRef]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
- Caesar, H.; Uijlings, J.; Ferrari, V. Cocostuff: Thing and stuff classes in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Xu, J.; Mello, S.D.; Liu, S.; Byeon, W.; Breuel, T.; Kautz, J.; Wang, X. GroupViT: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18134–18144. [Google Scholar]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
- Thomee, B.; Shamma, D.A.; Friedland, G.; Elizalde, B.; Ni, K.; Poland, D.; Borth, D.; Li, L.J. YFCC100M: The new data in multimedia research. Commun. ACM 2016, 59, 64–73. [Google Scholar] [CrossRef]
- Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollar, P. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9404–9413. [Google Scholar]
- Liu, Y.; Bai, S.; Li, G.; Wang, Y.; Tang, Y. Open-Vocabulary Segmentation with Semantic-Assisted Calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 3491–3500. [Google Scholar]
- Peng, Z.; Xu, Z.; Zeng, Z.; Wang, Y.; Shen, W. Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation. arXiv 2024, arXiv:2405.18840. [Google Scholar]
mIoU (%) on standard open-vocabulary semantic segmentation benchmarks:

Method | VLM | Training Dataset | ADE-847 | PC-459 | ADE-150 | PC-59 | VOC-20 |
---|---|---|---|---|---|---|---|
LSeg+ [31] | ALIGN RN101 | COCO Stuff | 2.5 | 5.2 | 13.0 | 36.0 | 59.0 |
GroupViT [53] | CLIP ViT-S/16 | GCC [54] + YFCC [55] | 4.3 | 4.9 | 10.6 | 25.9 | 50.7 |
MaskCLIP [36] | CLIP ViT-L/14 | COCO Panoptic [56] | 8.2 | 10.0 | 23.7 | 45.9 | - |
OpenSeg [31] | ALIGN RN101 | COCO Stuff | 4.4 | 7.9 | 17.5 | 40.1 | 63.8 |
OVSeg [32] | CLIP ViT-B/16 | COCO Stuff | 7.1 | 11.0 | 24.8 | 53.3 | 92.6 |
ZegFormer [13] | CLIP ViT-B/16 | COCO Stuff | 5.6 | 10.4 | 18.0 | 45.5 | 89.5 |
SAN [28] | CLIP ViT-B/16 | COCO Stuff | 10.7 | 13.7 | 28.9 | 55.4 | 94.6 |
ODISE [33] | Stable Diffusion | COCO Panoptic | 11.1 | 14.5 | 29.9 | 55.3 | - |
CAT-Seg [8] | CLIP ViT-B/16 | COCO Stuff | 12.0 | 19.0 | 31.8 | 57.5 | 94.6 |
SCAN [57] | CLIP ViT-B/16 | COCO Stuff | 10.8 | 13.2 | 30.8 | 58.4 | 97.0
SED [37] | CLIP ConvNeXt-B | COCO Stuff | 11.4 | 18.7 | 31.6 | 57.3 | 94.4 |
HCLIP [58] | CLIP ViT-B/16 | COCO Stuff | 12.5 | 19.4 | 32.4 | 57.9 | 95.2 |
MROVSeg [39] | CLIP ViT-B/16 | COCO Stuff | 12.1 | 19.6 | 32.0 | 58.5 | 95.5 |
MaskAdapter [15] | CLIP ConvNeXt-B | COCO Stuff | 14.2 | 17.9 | 35.6 | 58.4 | 95.1 |
OpenMamba & MAFT+ | CLIP ConvNeXt-B | COCO Stuff | 14.2 | 18.6 | 36.1 | 58.5 | 95.3 |
Method | VLM Size | Batch Size | Memory (MB) | ADE20K (mIoU) |
---|---|---|---|---|
Mask-Adapter (Baseline) | ConvNeXt Base | 2 | 6760 | 35.6 |
OpenMamba (Ours) | ConvNeXt Base | 2 | 4592 | 36.1 |
Method | Training Time (h) | Inference Time (ADE-150) (ms) | Inference Time (PC-59) (ms) |
---|---|---|---|
Mask-Adapter (Baseline) | 39 | 207 | 178.3 |
OpenMamba (Ours) | 46.5 | 172.5 | 136.7 |
Method | ADE20K (mIoU) |
---|---|
Mask-Adapter (Baseline) | 35.60
VSSD Backbone (Ours) | 35.59 (−0.01)
+ MCA | 35.70 (+0.10)
+ InfoNCE Loss | 35.81 (+0.21)
+ ATSS Matcher | 36.14 (+0.54)
Different Top-k Values | ADE20K (mIoU) |
---|---|
 | 35.7
 | 36.0
 | 36.1
 | 35.9
Different Temperature Values | ADE20K (mIoU) |
---|---|
 | 35.8
 | 35.5
 | 35.4
 | 34.9