OpenMamba: Introducing State Space Models to Open-Vocabulary Semantic Segmentation
Abstract
1. Introduction
- We introduce OpenMamba, a novel method for extracting high-level guidance maps for CLIP feature pooling to tackle open-vocabulary semantic segmentation. Its U-Net-like architecture is built entirely on State Space Models, improving accuracy while reducing memory usage during training. To our knowledge, this is the first use of State Space Models for open-vocabulary semantic segmentation (a minimal state-space recurrence sketch follows this list).
- We adopt InfoNCE as our contrastive loss to encourage similar proposal masks to obtain similar CLIP embeddings. InfoNCE provides a stronger contrastive learning signal by simultaneously pulling positive pairs together and pushing hard negatives apart (see the loss sketch after this list).
- We integrate an ATSS matcher that adaptively selects high-quality proposals using a dynamic IoU threshold (see the matcher sketch after this list).
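The first contribution rests on the linear-time state-space recurrence that underlies Mamba-style blocks. The sketch below is a minimal, illustrative discretized SSM scan in PyTorch; it is not OpenMamba's actual block (which follows VSSD and adds selectivity, gating, and 2D scanning), and all tensor names and shapes are assumptions made for illustration.

```python
import torch

def ssm_scan(x: torch.Tensor, A: torch.Tensor, B: torch.Tensor, C: torch.Tensor) -> torch.Tensor:
    """Run the discretized recurrence h_t = A h_{t-1} + B x_t, y_t = C h_t.

    x: (L, D_in) input sequence; A: (N, N); B: (N, D_in); C: (D_out, N).
    The loop touches each token once, so compute and memory grow linearly with L,
    unlike the quadratic attention matrix of a Transformer block.
    """
    h = x.new_zeros(A.size(0))
    outputs = []
    for x_t in x:                    # sequential scan over the sequence
        h = A @ h + B @ x_t          # state update
        outputs.append(C @ h)        # readout
    return torch.stack(outputs)      # (L, D_out)
```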
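For the second contribution, a minimal sketch of an InfoNCE-style objective over mask-level embeddings is given below. It assumes matched embedding pairs arrive index-aligned and treats every other row in the batch as a negative; the temperature default and this pairing scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(queries: torch.Tensor, positives: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """queries, positives: (N, D) embeddings; row i of `positives` is the positive for row i of `queries`.

    All other rows act as negatives, so the loss pulls matched pairs together
    while pushing apart the hardest (most similar) non-matching pairs.
    """
    q = F.normalize(queries, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = q @ p.t() / temperature                    # (N, N) cosine similarities
    targets = torch.arange(q.size(0), device=q.device)  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```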
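For the third contribution, the dynamic-threshold idea of ATSS can be sketched for mask proposals as follows: for each ground-truth mask, take the top-k proposals by IoU, set the threshold to the mean plus the standard deviation of their IoUs, and keep only candidates above it. This is an illustrative adaptation of the ATSS rule, not the paper's exact matcher; the default k and the IoU-based candidate selection are assumptions.

```python
import torch

def atss_positives(ious: torch.Tensor, top_k: int = 9) -> torch.Tensor:
    """ious: (num_proposals,) IoU of every proposal mask with one ground-truth mask.

    Returns a boolean mask marking the proposals accepted as positives for that
    ground truth, using the ATSS rule: threshold = mean + std over the top-k candidates.
    """
    k = min(top_k, ious.numel())
    cand_ious, cand_idx = ious.topk(k)                    # candidate set: k highest-IoU proposals
    threshold = cand_ious.mean() + cand_ious.std(unbiased=False)
    keep = cand_ious >= threshold                         # adaptive, per-ground-truth threshold
    positives = torch.zeros_like(ious, dtype=torch.bool)
    positives[cand_idx[keep]] = True
    return positives
```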
2. Related Work
2.1. Vision–Language Pre-Training
2.2. Open-Vocabulary Semantic Segmentation
2.3. State Space Models
3. Method
3.1. Problem Definition
3.2. Method Overview
3.3. OpenMamba
3.4. Mamba Cross-Attention
3.5. Contrastive Loss
3.6. Dynamic IoU Matcher
4. Experiments
4.1. Datasets
4.2. Implementation Details
4.3. Standard Evaluation
- Evaluation on standard open-vocabulary benchmarks
- Memory usage and time evaluation
4.4. Ablation Study
- Ablation study on the individual proposed components
- Ablation study on top-k selection in the ATSS matcher
- Ablation study on temperature parameter for contrastive loss
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Fedushko, S.; Shumyliak, L.; Cibák, L.; Sierova, M.-O. Image Processing Application Development: A New Approach and Its Economic Profitability. In Data-Centric Business and Applications; Lecture Notes on Data Engineering and Communications Technologies; Štarchoň, P., Fedushko, S., Gubíniová, K., Eds.; Springer Nature: Cham, Switzerland, 2024; Volume 208. [Google Scholar]
- Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef]
- Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 7262–7272. [Google Scholar]
- Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. Segformer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
- Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.S.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 6881–6890. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.-T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.-H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
- Cho, S.; Shin, H.; Hong, S.; Arnab, A.; Seo, P.H.; Kim, S. Catseg: Cost aggregation for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4113–4123. [Google Scholar]
- Zhou, Z.; Lei, Y.; Zhang, B.; Liu, L.; Liu, Y. Zegclip: Towards adapting clip for zero-shot semantic segmentation. arXiv 2022, arXiv:2212.03588. [Google Scholar]
- Xu, M.; Zhang, Z.; Wei, F.; Lin, Y.; Cao, Y.; Hu, H.; Bai, X. A simple baseline for open vocabulary semantic segmentation with pre-trained vision language model. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 736–753. [Google Scholar]
- Jiao, S.; Zhu, H.; Huang, J.; Zhao, Y.; Wei, Y.; Shi, H. Collaborative vision-text representation optimizing for open-vocabulary segmentation. arXiv 2024, arXiv:2408.00744. [Google Scholar]
- Shan, X.; Wu, D.; Zhu, G.; Shao, Y.; Sang, N.; Gao, C. Open-vocabulary semantic segmentation with image embedding balancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 28412–28421. [Google Scholar]
- Ding, J.; Xue, N.; Xia, G.S.; Dai, D. Decoupling zero-shot semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11583–11592. [Google Scholar]
- Han, C.; Zhong, Y.; Li, D.; Han, K.; Ma, L. Open-vocabulary semantic segmentation with decoupled one-pass network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 1086–1096. [Google Scholar]
- Li, Y.; Cheng, T.; Feng, B.; Liu, W.; Wang, X. Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation. arXiv 2024, arXiv:2412.04533. [Google Scholar]
- Shi, Y.; Dong, M.; Li, M.; Xu, C. VSSD: Vision Mamba with Non-Causal State Space Duality. arXiv 2024, arXiv:2407.18559. [Google Scholar]
- Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- Zhang, X.; Chi, C.; Jinnai, Y.; Li, Y.; Zhang, X.; Wei, Y.; Sun, J. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive Training Sample Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 11–15 June 2020; pp. 9759–9768. [Google Scholar]
- Zhou, B.; Zhao, H.; Puig, X.; Xiao, T.; Fidler, S.; Barriuso, A.; Torralba, A. Semantic understanding of scenes through the ade20k dataset. Int. J. Comput. Vis. 2019, 127, 302–321. [Google Scholar] [CrossRef]
- Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Mottaghi, R.; Chen, X.; Liu, X.; Cho, N.G.; Lee, S.W.; Fidler, S.; Urtasun, R.; Yuille, A. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 891–898. [Google Scholar]
- Chen, Y.-C.; Li, L.; Yu, L.; Kholy, A.E.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: UNiversal Image-TExt Representation Learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 104–120. [Google Scholar]
- Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. Vl-bert: Pre-training of generic visual-linguistic representations. arXiv 2019, arXiv:1908.08530. [Google Scholar]
- Lu, J.; Batra, D.; Parikh, D.; Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXX; Springer: Berlin/Heidelberg, Germany, 2020; pp. 121–137. [Google Scholar]
- Zhong, Y.; Yang, J.; Zhang, P.; Li, C.; Codella, N.; Li, L.H.; Zhou, L.; Dai, X.; Yuan, L.; Li, Y.; et al. Regionclip: Region-based language-image pretraining. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 16793–16803. [Google Scholar]
- Bucher, M.; Vu, T.H.; Cord, M.; Perez, P. Zero-shot semantic segmentation. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
- Xu, M.; Zhang, Z.; Wei, F.; Hu, H.; Bai, X. Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2945–2954. [Google Scholar]
- Yu, Q.; He, J.; Deng, X.; Shen, X.; Chen, L.C. Convolutions die hard: Open-vocabulary segmentation with single frozen convolutional clip. In Proceedings of the Advances in Neural Information Processing Systems 36 (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 2085–2094. [Google Scholar]
- Ghiasi, G.; Gu, X.; Cui, Y.; Lin, T.Y. Scaling open-vocabulary image segmentation with image-level labels. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 540–557. [Google Scholar]
- Liang, F.; Wu, B.; Dai, X.; Li, K.; Zhao, Y.; Zhang, H.; Zhang, P.; Vajda, P.; Marculescu, D. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7061–7070. [Google Scholar]
- Xu, J.; Liu, S.; Vahdat, A.; Byeon, W.; Wang, X.; Mello, S.D. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2955–2966. [Google Scholar]
- Wang, X.; He, W.; Xuan, X.; Sebastian, C.; Ono, J.P.; Li, X.; Behpour, S.; Doan, T.; Gou, L.; Shen, H.W.; et al. USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation. arXiv 2024, arXiv:2406.05271. [Google Scholar] [CrossRef]
- Chng, Y.X.; Qiu, X.; Han, Y.; Ding, K.; Ding, W.; Huang, G. Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation. arXiv 2024, arXiv:2409.16278. [Google Scholar]
- Ding, Z.; Wang, J.; Tu, Z. Open vocabulary universal image segmentation with maskclip. arXiv 2022, arXiv:2208.08984. [Google Scholar]
- Xie, B.; Cao, J.; Xie, J.; Khan, F.S.; Pang, Y. Sed: A simple encoder-decoder for open vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 3426–3436. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026. [Google Scholar]
- Zhu, Y.; Zhu, B.; Chen, Z.; Xu, H.; Tang, M.; Wang, J. Mrovseg: Breaking the resolution curse of vision-language models in open-vocabulary semantic segmentation. arXiv 2024, arXiv:2408.14776. [Google Scholar]
- Gu, A.; Johnson, I.; Goel, K.; Saab, K.; Dao, T.; Rudra, A.; Ré, C. Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers. In Proceedings of the Advances in Neural Information Processing Systems 34 (NeurIPS 2021), Online, 6–14 December 2021. [Google Scholar]
- Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. HiPPO: Recurrent Memory with Optimal Polynomial Projections. In Proceedings of the Advances in Neural Information Processing Systems 33 (NeurIPS 2020), Virtual, 6–12 December 2020. [Google Scholar]
- Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
- Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar] [CrossRef]
- Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Liu, Y. Vmamba: Visual state space model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
- Ma, J.; Li, F.; Wang, B. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv 2024, arXiv:2401.04722. [Google Scholar]
- Liang, D.; Zhou, X.; Wang, X.; Zhu, X.; Xu, W.; Zou, Z.; Ye, X.; Bai, X. Pointmamba: A simple state space model for point cloud analysis. arXiv 2024, arXiv:2402.10739. [Google Scholar] [CrossRef]
- Zhang, T.; Li, X.; Yuan, H.; Ji, S.; Yan, S. Point Cloud Mamba: Point cloud learning via state space model. arXiv 2024, arXiv:2403.00762. [Google Scholar] [CrossRef]
- Lin, B.; Jiang, W.; Chen, P.; Zhang, Y.; Liu, S.; Chen, Y.-C. MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 290–307. [Google Scholar]
- Dao, T.; Gu, A. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv 2024, arXiv:2405.21060. [Google Scholar] [CrossRef]
- Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
- Caesar, H.; Uijlings, J.; Ferrari, V. Cocostuff: Thing and stuff classes in context. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1209–1218. [Google Scholar]
- Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
- Xu, J.; Mello, S.D.; Liu, S.; Byeon, W.; Breuel, T.; Kautz, J.; Wang, X. GroupViT: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18134–18144. [Google Scholar]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2556–2565. [Google Scholar]
- Thomee, B.; Shamma, D.A.; Friedland, G.; Elizalde, B.; Ni, K.; Poland, D.; Borth, D.; Li, L.J. YFCC100M: The new data in multimedia research. Commun. ACM 2016, 59, 64–73. [Google Scholar] [CrossRef]
- Kirillov, A.; He, K.; Girshick, R.; Rother, C.; Dollar, P. Panoptic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9404–9413. [Google Scholar]
- Liu, Y.; Bai, S.; Li, G.; Wang, Y.; Tang, Y. Open-Vocabulary Segmentation with Semantic-Assisted Calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 3491–3500. [Google Scholar]
- Peng, Z.; Xu, Z.; Zeng, Z.; Wang, Y.; Shen, W. Parameter-efficient Fine-tuning in Hyperspherical Space for Open-vocabulary Semantic Segmentation. arXiv 2024, arXiv:2405.18840. [Google Scholar]
mIoU (%) on standard open-vocabulary semantic segmentation benchmarks:

Method | VLM | Training Dataset | ADE-847 | PC-459 | ADE-150 | PC-59 | VOC-20 |
---|---|---|---|---|---|---|---|
LSeg+ [31] | ALIGN RN101 | COCO Stuff | 2.5 | 5.2 | 13.0 | 36.0 | 59.0 |
GroupViT [53] | CLIP ViT-S/16 | GCC [54] + YFCC [55] | 4.3 | 4.9 | 10.6 | 25.9 | 50.7 |
MaskCLIP [36] | CLIP ViT-L/14 | COCO Panoptic [56] | 8.2 | 10.0 | 23.7 | 45.9 | - |
OpenSeg [31] | ALIGN RN101 | COCO Stuff | 4.4 | 7.9 | 17.5 | 40.1 | 63.8 |
OVSeg [32] | CLIP ViT-B/16 | COCO Stuff | 7.1 | 11.0 | 24.8 | 53.3 | 92.6 |
ZegFormer [13] | CLIP ViT-B/16 | COCO Stuff | 5.6 | 10.4 | 18.0 | 45.5 | 89.5 |
SAN [28] | CLIP ViT-B/16 | COCO Stuff | 10.7 | 13.7 | 28.9 | 55.4 | 94.6 |
ODISE [33] | Stable Diffusion | COCO Panoptic | 11.1 | 14.5 | 29.9 | 55.3 | - |
CAT-Seg [8] | CLIP ViT-B/16 | COCO Stuff | 12.0 | 19.0 | 31.8 | 57.5 | 94.6 |
SCAN [57] | CLIP ViT-B/16 | COCO Stuff | 10.8 | 13.2 | 30.8 | 58.4 | 97.0
SED [37] | CLIP ConvNeXt-B | COCO Stuff | 11.4 | 18.7 | 31.6 | 57.3 | 94.4 |
HCLIP [58] | CLIP ViT-B/16 | COCO Stuff | 12.5 | 19.4 | 32.4 | 57.9 | 95.2 |
MROVSeg [39] | CLIP ViT-B/16 | COCO Stuff | 12.1 | 19.6 | 32.0 | 58.5 | 95.5 |
MaskAdapter [15] | CLIP ConvNeXt-B | COCO Stuff | 14.2 | 17.9 | 35.6 | 58.4 | 95.1 |
OpenMamba & MAFT+ | CLIP ConvNeXt-B | COCO Stuff | 14.2 | 18.6 | 36.1 | 58.5 | 95.3 |
Method | VLM Size | Batch Size | Memory (MB) | ADE20K (mIoU) |
---|---|---|---|---|
Mask-Adapter (Baseline) | ConvNeXt Base | 2 | 6760 | 35.6 |
OpenMamba (Ours) | ConvNeXt Base | 2 | 4592 | 36.1 |
Method | Training Time (h) | Inference Time (ADE-150) (ms) | Inference Time (PC-59) (ms) |
---|---|---|---|
Mask-Adapter (Baseline) | 39 | 207 | 178.3 |
OpenMamba (Ours) | 46.5 | 172.5 | 136.7 |
Method | ADE20K (mIoU) |
---|---|
Mask-Adapter (Baseline) | 35.60
VSSD Backbone (Ours) | 35.59 (−0.01)
+ MCA | 35.70 (+0.10)
+ InfoNCE Loss | 35.81 (+0.21)
+ ATSS Matcher | 36.14 (+0.54)
Different Top-k Values | ADE20K (mIoU) |
---|---|
 | 35.7
 | 36.0
 | 36.1
 | 35.9
Different Temperature Values | ADE20K (mIoU) |
---|---|
 | 35.8
 | 35.5
 | 35.4
 | 34.9