1. Introduction
Semantic segmentation serves as a cornerstone of computer vision, focusing on assigning a categorical label to each individual pixel. While conventional approaches [1,2,3,4] have achieved significant progress, they are typically constrained to a predefined and static label set, limiting their generalization to novel classes. To overcome this limitation, open-vocabulary semantic segmentation (OVSS) has gained increasing attention. Specifically, OVSS aims to transcend the constraints of predefined training categories, allowing models to generalize to arbitrary concepts defined by textual descriptions. The core objective lies in reconciling the inherent modality gap between dense, pixel-level visual features and discrete, high-level textual semantics by establishing an effective cross-modal connection, which enables the model to leverage linguistic knowledge to assign an accurate semantic label to each pixel.
Early research [5,6] in OVSS primarily focused on bridging the gap between visual features and textual semantics through vision–semantic space mapping. These approaches generally followed one of two strategies. The first, characterized by structural substitution [5], replaces the fixed classifiers of traditional segmentation models with pre-trained word embeddings while incorporating class-agnostic localization branches to enhance generalization to novel categories. The second focuses on designing specialized mapping modules and optimization objectives [6], such as ranking or cross-modal alignment losses, to facilitate efficient visual-to-semantic transformation. Although these pioneering methods struggled to fully bridge the modality gap between pixel-level features and high-level semantics, they established the methodological foundation for subsequent advancements in the field.
The emergence of powerful vision–language models (VLMs) such as CLIP [7] and ALIGN [8] has brought OVSS into a new stage centered on pre-trained foundation models. Leveraging the cross-modal alignment capabilities of CLIP, numerous methodologies have emerged, among which the two-stage framework proposed by MaskCLIP [9] has become a representative baseline. In this classic two-stage paradigm, a class-agnostic generator first produces mask proposals, which are then classified by a frozen CLIP encoder. To further improve segmentation accuracy, researchers have introduced several optimization strategies: OVSeg [10] introduces mask-prompt tuning to adapt the CLIP model to the specific task of mask proposal classification; FreeSeg [11] expands the text space by employing unified text representations to handle diverse concepts; and recent works such as ODISE [12] and DiffSegmenter [13] incorporate diffusion models to extract high-quality proposals and localization information. Despite their strong performance, these two-stage methods face notable bottlenecks: (1) the separation of mask generation and classification hinders end-to-end optimization; (2) cropping the masked regions for independent classification inevitably disrupts the global contextual information of the image, which limits the model's discriminative performance; and (3) classifying multiple mask proposals with a VLM incurs substantial computational overhead.
To overcome the limitations of two-stage methods, one-stage methods strive to enable end-to-end training by fine-tuning the VLM. LSeg [14] represents one of the early explorations, achieving cross-category segmentation by aligning pixel-level visual features with pre-trained text embeddings. To preserve CLIP's open-vocabulary capability while improving its adaptability to the segmentation task, SAN [15] introduced a side-adapter network to efficiently fine-tune CLIP, while ZegCLIP [16] employed a deep prompt tuning strategy to enhance the alignment between text embeddings and pixel-level features. In recent years, cost volume-based methods have gained significant attention due to their superior open-vocabulary generalization. CAT-Seg [17] is the first to introduce this approach, where cost volumes are aggregated separately along the spatial and categorical dimensions to enhance segmentation results. Building upon this, SED [18] introduces a simple encoder–decoder structure that fuses the encoder's features to better recover image details. Furthermore, several studies have begun integrating additional foundation models to compensate for CLIP's deficiency in capturing spatial information. For instance, EBSeg [19] and Trident [20] both leverage the Segment Anything Model (SAM) [21], either its features or its high-resolution encoder, to provide the essential spatial information required for precise segmentation. In addition, some studies focus on cross-modal interaction. BBN [22] proposes a bidirectional bridging network that employs optimal transport to purify text embeddings before guiding visual–semantic aggregation. Similarly, ITA [23] emphasizes the role of text features, utilizing class mining and detail enhancement modules to introduce image–textual correlations.
Despite these advances, most one-stage OVSS methods devote the majority of their design effort to harnessing the cross-modal alignment capabilities of ViT-based VLMs [24], while offering comparatively little architectural support for the decoding process itself. This oversight manifests in two interrelated limitations. First, the decoding stage typically operates on the single-scale output of the ViT-based VLM, which lacks the multi-scale spatial granularity essential for recovering fine details. Second, the decoding stage lacks explicit boundary constraints to regulate feature aggregation, resulting in blurred object contours. Together, these deficiencies lead to missed small objects and inaccurate boundary localization, as illustrated by the concrete example in Figure 1.
To address these issues, we propose CLIP-HBD, a hierarchical boundary-constrained decoding network. Our approach reconstructs multi-scale features by leveraging the rich semantic priors of VLM and introduces a novel boundary-constrained decoding strategy to recover intricate edge details. Specifically, CLIP-HBD employs a ConvNeXt-based backbone together with a hierarchical adaptation mechanism that fuses multi-layer VLM features, producing a comprehensive multi-scale representation. To overcome the challenge of inaccurate boundaries, we perform explicit boundary prediction from the multi-scale features and transform the resulting boundary maps into structural constraints that guide the decoder to focus on boundary regions. By integrating these structural constraints with hierarchical features, the decoding process preserves semantic consistency while restoring precise object boundaries. Extensive experiments on multiple benchmarks demonstrate that CLIP-HBD achieves superior performance in both segmentation accuracy and boundary quality. The main contributions of this paper are summarized as follows:
- 1. CLIP-HBD overcomes the single-scale limitation of ViT and develops an effective approach to construct multi-scale features enriched with spatial priors, compensating for the weak spatial detail perception of vision–language models and providing a solid feature foundation for the subsequent decoding process.
- 2. CLIP-HBD introduces a boundary-constrained decoding strategy that converts predicted boundary maps into spatial attention priors, steering semantic feature propagation and boundary reconstruction to enforce geometric constraints and significantly improve mask boundary quality.
- 3. Extensive experiments on multiple benchmarks demonstrate that CLIP-HBD achieves superior performance in both segmentation accuracy and boundary quality, validating the effectiveness of our method.
The remainder of this paper is organized as follows: Section 2 reviews related work on vision–language models, boundary-aware segmentation, and the evolution of open-vocabulary semantic segmentation. Section 3 details the proposed CLIP-HBD framework, covering hierarchical feature construction, boundary and cost volume generation, boundary-constrained decoding, and the multi-task hybrid loss. Section 4 presents the experimental results, comparative analysis with state-of-the-art methods, and extensive ablation studies across multiple benchmarks, including ADE20K [25], Pascal Context [26], and Pascal VOC [27]. Finally, Section 5 concludes the paper and discusses the implications of our findings for future research in open-vocabulary semantic segmentation.
3. Methods
Figure 2 illustrates the overall architecture of our open-vocabulary semantic segmentation model CLIP-HBD. In this framework, we first extract fundamental image features using a ConvNeXt [41] backbone, while integrating semantic priors via the deep reuse of CLIP's features. This design effectively synergizes CLIP's rich semantics with the structured spatial representations of ConvNeXt, fully leveraging the latent capabilities of pre-trained CLIP models. Building on this, CLIP-HBD introduces an explicit boundary prediction branch paired with a boundary-constrained decoding strategy. By utilizing predicted boundary maps as spatial attention priors, the network dynamically steers the decoding process to enhance both semantic consistency and boundary sharpness, ultimately yielding the final high-resolution semantic segmentation result. During training, a multi-task hybrid loss function ensures joint optimization for semantic mask accuracy and geometric boundary precision.
3.1. Multi-Scale Hierarchical Feature Construction
A persistent challenge in semantic segmentation is that single-scale feature representations often fail to capture global structures and local details at the same time. This is because deep, low-resolution features excel at capturing overall contours via rich semantics, while shallow, high-resolution features are indispensable for recovering fine-grained boundary details. Leveraging the intrinsic complementarity between these features through multi-scale modeling allows the network to capture both global context and local details simultaneously, leading to superior object boundary reconstruction and enhanced overall segmentation accuracy. Driven by this motivation, we select ConvNeXt as the backbone network for multi-scale feature extraction. Through its hierarchical convolutional architecture, ConvNeXt generates features with progressively decreasing resolutions, providing a solid foundation for capturing rich geometric information from holistic contours to local boundaries. To further empower the model's semantic understanding in open-vocabulary scenarios, we integrate the corresponding deep semantic features from the CLIP encoder at each feature extraction level via a custom-designed adaptation mechanism.
Given an input image $I \in \mathbb{R}^{H \times W \times 3}$ with height $H$, width $W$, and three color channels (where $H$ and $W$ are fixed in our default setting), the ConvNeXt encoder generates hierarchical features $\{F_l\}_{l=1}^{4}$ with progressively decreasing spatial resolutions, a process that naturally preserves rich multi-scale spatial details and local geometric information essential for dense prediction. Here, $l \in \{1, 2, 3, 4\}$ denotes the hierarchical stage index, with the spatial dimensions scaled as $\frac{H}{2^{l+1}} \times \frac{W}{2^{l+1}}$ relative to the original image dimensions $H \times W$, while $C_l$ represents the channel dimension at stage $l$, which doubles progressively at each successive level. Simultaneously, given the CLIP image Transformer encoder, we extract intermediate semantic representations from three distinct layers, denoted as $\{V_k\}$ with $V_k \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times D}$. Here, the spatial resolution is fixed at $\frac{H}{16} \times \frac{W}{16}$, corresponding to the number of image patches defined by the ViT-B architecture, while $D$ represents the CLIP latent dimension. Here, $k$ refers to the index of the Transformer blocks from shallow to deep levels. These representations serve as robust semantic priors, empowering the multi-scale feature reconstruction process to better handle the complexities of open-vocabulary scenarios. To more clearly illustrate these dimensional relationships, we detail the feature dimensions of the encoder architectures in Table 1.
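For concreteness, the following sketch enumerates the tensor shapes implied by this design, assuming ConvNeXt stage strides of 4/8/16/32 and a ViT-B/16 CLIP encoder with $D = 768$; the input resolution and channel widths below are illustrative placeholders, not necessarily the paper's exact configuration:

```python
# Illustrative shape bookkeeping for the two encoders (hypothetical values).
H = W = 384                                   # assumed input resolution
convnext_channels = [128, 256, 512, 1024]     # e.g., ConvNeXt-B; variant-dependent

# ConvNeXt stage l yields F_l with shape (C_l, H / 2^(l+1), W / 2^(l+1)).
for l, c_l in enumerate(convnext_channels, start=1):
    print(f"F_{l}: ({c_l}, {H // 2 ** (l + 1)}, {W // 2 ** (l + 1)})")

# Each selected CLIP ViT-B/16 layer yields V_k fixed at (768, H/16, W/16).
print(f"V_k: (768, {H // 16}, {W // 16})")
```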
In order to fuse the complementary feature representations from both the CLIP and ConvNeXt encoders, we introduce a Feature Fusion Module at each ConvNeXt stage. This module first performs a two-step alignment process on the CLIP features to ensure architectural compatibility: bi-linear upsampling is applied to expand the spatial resolution, while a $1 \times 1$ convolution is utilized to project the CLIP latent channels into the corresponding ConvNeXt stage's dimension. These aligned features are then integrated using a learnable spatial-adaptive weighting mechanism to generate the enhanced multi-scale features, formulated as:
$$\hat{F}_l = (1 - \alpha_l) \cdot F_l + \alpha_l \cdot \phi_l\big(\mathrm{Up}(V_k)\big),$$

where $\hat{F}_l$ denotes the enhanced multi-scale feature for the $l$-th stage of the ConvNeXt encoder, and $\alpha_l$ is a learnable weight for the $l$-th stage's fusion that dynamically adjusts the relative contribution of the structural features from ConvNeXt and the semantic priors from CLIP. $\mathrm{Up}(\cdot)$ denotes the upsampling operation. Different upsampling factors are applied at different stages of the ConvNeXt: at the $l$-th stage, it upsamples a feature map of spatial size $\frac{H}{16} \times \frac{W}{16}$ to the target resolution $\frac{H}{2^{l+1}} \times \frac{W}{2^{l+1}}$. $V_k$ denotes the original feature map extracted from the $k$-th selected layer of the CLIP Transformer encoder. Specifically, we pair the selected CLIP layers with the ConvNeXt stages in order from shallow to deep, establishing a direct layer-to-stage correspondence between the $k$-th selected layer of CLIP and the $l$-th stage of the ConvNeXt backbone. This symmetrical mapping ensures that both the localized structural details and global semantic information are aligned and fused at homologous levels of their respective hierarchies. Notably, in certain stages where the spatial resolution or channel dimension already aligns between the two encoders, the corresponding adjustment operations are bypassed, allowing for direct weighted fusion to preserve the original feature integrity. Additionally, $\phi_l$ denotes a $1 \times 1$ convolution layer employed for channel dimension projection.
Finally, the resulting enhanced multi-scale features are propagated as inputs to the subsequent ConvNeXt stage. Through this iterative refinement of representations, the network continuously enriches the feature hierarchy, culminating in high-quality multi-scale feature maps that are well-optimized for dense prediction.
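To make the fusion concrete, the following PyTorch sketch shows one plausible realization of the Feature Fusion Module under the notation above; the module structure, the sigmoid squashing of the learnable weight $\alpha_l$, and its initialization are our assumptions rather than the exact released implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionModule(nn.Module):
    """Sketch: align a CLIP feature map V_k to ConvNeXt stage l, then blend."""

    def __init__(self, clip_dim: int, stage_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(clip_dim, stage_dim, kernel_size=1)  # channel projection
        self.alpha = nn.Parameter(torch.tensor(0.0))               # learnable fusion weight

    def forward(self, f_l: torch.Tensor, v_k: torch.Tensor) -> torch.Tensor:
        # Spatial alignment: upsample the CLIP map (H/16) to the stage resolution;
        # skipped automatically when the resolutions already match.
        if v_k.shape[-2:] != f_l.shape[-2:]:
            v_k = F.interpolate(v_k, size=f_l.shape[-2:],
                                mode="bilinear", align_corners=False)
        v_k = self.proj(v_k)
        a = torch.sigmoid(self.alpha)   # keep the blend weight in (0, 1)
        # Weighted fusion of structural (ConvNeXt) and semantic (CLIP) features.
        return (1.0 - a) * f_l + a * v_k
```

Because the enhanced feature is fed into the next ConvNeXt stage, each stage operates on a representation already enriched with CLIP's semantic priors.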
3.2. Boundary and Cost Volume Generation
In this section, we detail the generation of explicit boundary maps from multi-scale features, as well as the construction of cost volumes from CLIP features, both of which serve as foundational primitives to guide the subsequent constrained decoding process.
To begin with, we employ a lightweight boundary prediction head to predict the corresponding boundary map from the extracted multi-scale features. Specifically, to maintain a lightweight design, $1 \times 1$ convolutions are utilized to project all hierarchical features to a uniform channel dimension $C_b$, significantly cutting down the floating-point operations. The projected features $P_l$ at the $l$-th stage are formulated as follows:

$$P_l = \phi_{1 \times 1}^{\,C_l \rightarrow C_b}\big(\hat{F}_l\big),$$

where $\phi_{1 \times 1}^{\,C_l \rightarrow C_b}$ denotes the $1 \times 1$ convolution operation with input channels $C_l$ (the original channel depth of the $l$-th stage) and output channels $C_b$ (the unified dimension for the boundary head).
Subsequently, we perform a hierarchical top-down fusion starting from the deepest stage (Stage 4) towards the highest-resolution level. To provide a clearer visualization of the architectural flow, the details of this fusion process are illustrated in Figure 3. Let $U_4$ be the initial fusion feature, denoted as $U_4 = P_4$. For each subsequent layer $l \in \{3, 2, 1\}$, the fused feature from the previous level $U_{l+1}$ is bi-linearly upsampled by a factor of 2 to match the spatial resolution of the current stage feature $P_l$. These features are then concatenated along the channel dimension, followed by a convolution layer for information aggregation to generate the current level's fused feature $U_l$. The process is formulated as:

$$U_l = \phi\big(\big[\mathrm{Up}_{\times 2}(U_{l+1});\, P_l\big]\big), \quad l \in \{3, 2, 1\},$$

where $[\cdot\,;\,\cdot]$ denotes channel-wise concatenation and $\phi$ denotes the aggregation convolution.
Finally, based on the final fused feature $U_1$, we employ two cascaded transposed convolution layers to gradually restore the spatial resolution to match the original input image. Notably, this operation inherently performs a smoothing effect that suppresses potential checkerboard artifacts arising from the upsampling process [42]. A $1 \times 1$ convolution is then used to compress the channel dimension to 1, followed by a Sigmoid function to normalize the output into the range $[0, 1]$. This yields the final boundary probability map $B \in [0, 1]^{H \times W}$, which is of the same resolution as the input image:

$$B = \sigma\Big(\phi_{1 \times 1}\big(\mathrm{TConv}_{\times 2}\big(\mathrm{TConv}_{\times 2}(U_1)\big)\big)\Big),$$

where $\sigma$ denotes the Sigmoid activation function, and $\mathrm{TConv}_{\times 2}$ denotes a stride-2 transposed convolution operation; the two cascaded layers are employed to achieve an overall $4\times$ upsampling from the Stage-1 resolution $\frac{H}{4} \times \frac{W}{4}$. The value of $B$ at each pixel represents the probability of that pixel belonging to an object boundary. This boundary probability map serves as an explicit geometric constraint to guide the semantic flow during the subsequent decoding process, forcing the network to focus on the reconstruction of boundary regions.
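A compact sketch of this boundary head is given below; the shared width $C_b$, the channel halving inside the upsampling path, and the $2 \times 2$ transposed-convolution kernels are assumptions made for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryHead(nn.Module):
    """Sketch: 1x1 projections, top-down fusion, then x4 upsampling + sigmoid."""

    def __init__(self, stage_dims=(128, 256, 512, 1024), c_b: int = 256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, c_b, kernel_size=1) for c in stage_dims)
        self.fuse = nn.ModuleList(nn.Conv2d(2 * c_b, c_b, kernel_size=1)
                                  for _ in stage_dims[:-1])
        # Two cascaded stride-2 transposed convs restore H/4 -> H.
        self.up1 = nn.ConvTranspose2d(c_b, c_b // 2, kernel_size=2, stride=2)
        self.up2 = nn.ConvTranspose2d(c_b // 2, c_b // 4, kernel_size=2, stride=2)
        self.out = nn.Conv2d(c_b // 4, 1, kernel_size=1)

    def forward(self, feats):            # feats = [F_1, ..., F_4], high-res first
        p = [proj(f) for proj, f in zip(self.proj, feats)]
        u = p[-1]                        # U_4 = P_4
        for l in (2, 1, 0):              # top-down fusion, stages 3 -> 1
            u = F.interpolate(u, size=p[l].shape[-2:],
                              mode="bilinear", align_corners=False)
            u = self.fuse[l](torch.cat([u, p[l]], dim=1))
        b = self.up2(self.up1(u))        # x4 spatial restoration
        return torch.sigmoid(self.out(b))  # boundary probability map B in [0, 1]
```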
Following previous open-vocabulary segmentation methods [17,18], we generate the image–text cost volume from the output of CLIP's image encoder. The output feature map $F_v \in \mathbb{R}^{h \times w \times D}$ serves as the final aligned visual representation, where $h \times w$ is the spatial resolution of the visual tokens and $D$ represents the CLIP latent dimension. Meanwhile, given an arbitrary set of $C$ category names, we adopt the prompt template strategy to generate textual descriptions for each category. Feeding these descriptions into the CLIP text encoder yields the text embeddings $E \in \mathbb{R}^{C \times P \times D}$, where $P$ denotes the number of prompt templates per category.
To compute the similarity between dense visual features and text embeddings, we calculate the cosine similarity between $F_v$ and $E$ at each spatial location $(x, y)$:

$$\mathrm{CV}(x, y, c, p) = \frac{F_v(x, y) \cdot E(c, p)}{\lVert F_v(x, y) \rVert\, \lVert E(c, p) \rVert},$$

where $c$ and $p$ represent the indices of the category and the prompt template, respectively. Consequently, we obtain the initial multi-template cost volume $\mathrm{CV} \in \mathbb{R}^{h \times w \times C \times P}$. To condense this representation for the subsequent decoding layers, a convolution layer is employed to aggregate information along the template dimension, projecting it into a structured feature representation $F_{cv} \in \mathbb{R}^{h \times w \times C \times d_{cv}}$, where $d_{cv}$ denotes the channel dimension of the cost volume feature, and $C$ denotes the number of open-vocabulary categories.
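Under the shapes defined above, the cost volume construction reduces to a normalized inner product followed by a convolution over the template dimension. The sketch below is a hypothetical re-implementation; the value of $d_{cv}$ and the $1 \times 1$ kernel are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_cost_volume(f_v: torch.Tensor, e: torch.Tensor) -> torch.Tensor:
    """f_v: (B, h, w, D) visual features; e: (C, P, D) text embeddings.
    Returns the multi-template cosine cost volume of shape (B, h, w, C, P)."""
    f_v = F.normalize(f_v, dim=-1)   # unit-norm visual vectors
    e = F.normalize(e, dim=-1)       # unit-norm text vectors
    return torch.einsum("bhwd,cpd->bhwcp", f_v, e)  # dot of unit vectors = cosine

class TemplateAggregator(nn.Module):
    """Condense the P template scores into d_cv channels per (pixel, category)."""

    def __init__(self, num_templates: int, d_cv: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(num_templates, d_cv, kernel_size=1)

    def forward(self, cv: torch.Tensor) -> torch.Tensor:
        b, h, w, c, p = cv.shape
        x = cv.permute(0, 3, 4, 1, 2).reshape(b * c, p, h, w)  # P as conv channels
        x = self.conv(x)                                       # (B*C, d_cv, h, w)
        return x.reshape(b, c, -1, h, w)                       # (B, C, d_cv, h, w)
```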
3.3. Boundary-Constrained Decoding Strategy
In semantic segmentation, boundary information plays a critical role in improving the discriminability of object contours. However, in open-vocabulary scenarios, capitalizing on boundary information to dynamically steer feature fusion during decoding is still in its infancy. Therefore, in this section, in addition to incorporating the multi-scale features constructed in the previous section, we further propose a boundary-constrained decoding strategy that converts the explicitly predicted boundary map into spatial attention, guiding the decoder's focus toward boundary regions and thus achieving accurate boundary reconstruction.
The detailed architecture of our boundary-constrained decoder is visualized in Figure 4. The decoding process commences from the most semantically enriched image–text alignment feature $F_{cv}$, progressively restoring the spatial resolution through multiple decoding stages. Formally, let $D_i$ denote the input feature at the $i$-th decoding stage, where $i \in \{1, 2, 3\}$. For the first stage ($i = 1$), $D_1$ is initialized from the cost volume feature $F_{cv}$. Across the three stages, the spatial resolution is progressively restored such that $h_{i+1} = 2 h_i$ and $w_{i+1} = 2 w_i$, while the channel dimension is halved at each subsequent stage: $d_{i+1} = d_i / 2$.
To extract fine-grained textures and high-frequency geometric details of object boundaries, we first employ a depthwise separable convolution to perform independent spatial filtering on the input features. This operation decouples standard convolution into depthwise and pointwise components, significantly reducing computational overhead while forcing each channel to focus on local spatial structures:

$$F_{dw} = \mathrm{DWConv}(D_i),$$

where $F_{dw}$ represents the result after depthwise separable convolution, and $\mathrm{DWConv}$ denotes the depthwise separable convolution operation; the experiments regarding the kernel size setting can be found in Section 4.3.3.
Subsequently, to enhance semantic representation while maintaining the spatial structure, a Feed-Forward Network (MLP) is utilized for non-linear cross-channel information fusion. The MLP consists of two linear transformation layers and a GELU activation function. Specifically, the first linear layer maps the features into a higher-dimensional space with non-linear activation, followed by a second linear layer that restores the original dimensions:

$$F_{mlp} = W_2 \cdot \mathrm{GELU}\big(W_1 \cdot F_{dw}\big),$$

where $F_{mlp}$ denotes the output feature map of the MLP module, $W_1$ and $W_2$ represent the linear transformations, and $\mathrm{GELU}$ denotes the GELU activation function.
Simultaneously, the original boundary probability map $B$ is adjusted to the resolution of the current decoding stage via downsampling. It is then transformed into a spatial attention weight map $A_i$ through a $1 \times 1$ convolution and Sigmoid normalization:

$$A_i = \sigma\big(\phi_{1 \times 1}\big(\mathrm{Down}_i(B)\big)\big),$$

where $\mathrm{Down}_i$ represents the downsampling operation. At each stage, a different downsampling factor is adopted: at the $i$-th stage, it downsamples $H$ and $W$ to the corresponding resolutions $h_i$ and $w_i$, respectively.
This attention map is subsequently broadcast and applied to the stage features via element-wise multiplication within a residual connection. This mechanism implements a selective gating strategy to constrain the semantic flow using boundary priors:

$$F_g = F_{mlp} + \mathcal{B}(A_i) \odot F_{mlp},$$

where $F_g$ denotes the boundary-constrained gated feature map, $\odot$ denotes element-wise multiplication, and $\mathcal{B}(\cdot)$ signifies the dimension broadcasting operation.
Subsequently, the multi-scale encoder feature $\hat{F}_l$ is first aligned in resolution and channel dimension to produce $\tilde{F}_i$. This feature is then concatenated with the upsampled gated representation and fused via a $1 \times 1$ convolution, achieving an effective fusion of local spatial details and global semantic context:

$$D_{i+1} = \phi_{1 \times 1}\big(\big[\mathrm{TConv}_{\times 2}(F_g);\, \tilde{F}_i\big]\big),$$

where $\mathrm{Up}_{\times 2}$ denotes bi-linear interpolation upsampling with a factor of 2 ($2\times$ spatial upsampling), used for the resolution alignment of $\hat{F}_l$, and $\mathrm{TConv}_{\times 2}$ denotes a transposed convolution with stride = 2 that also performs $2\times$ upsampling.
These operations are executed iteratively across the three decoding stages ($i = 1, 2, 3$), where feature resolution and semantic discriminability are concurrently enhanced. Upon completion of these stages, the top-layer feature $D_4$ reaches half of the original image resolution. Finally, a lightweight segmentation head maps these features into the pixel-level semantic prediction map $\hat{Y} \in \mathbb{R}^{H \times W \times C}$:

$$\hat{Y} = \mathrm{Up}_{\times 2}\big(\phi_{1 \times 1}(D_4)\big).$$
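Putting the pieces together, one decoding stage can be sketched as follows. This is a minimal, hypothetical rendering of Figure 4: the depthwise kernel size, the MLP expansion ratio, the gating form $F_g = F_{mlp} + A_i \odot F_{mlp}$, and the treatment of the per-category dimension (folded into the batch axis) are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BoundaryConstrainedStage(nn.Module):
    """Sketch of one stage: DWConv -> MLP -> boundary gating -> upsample + skip."""

    def __init__(self, d_in: int, skip_dim: int, k: int = 7):
        super().__init__()
        d_out = d_in // 2                                   # channels halve per stage
        self.dw = nn.Sequential(                            # depthwise separable conv
            nn.Conv2d(d_in, d_in, k, padding=k // 2, groups=d_in),
            nn.Conv2d(d_in, d_in, kernel_size=1),
        )
        self.mlp = nn.Sequential(nn.Conv2d(d_in, 4 * d_in, 1), nn.GELU(),
                                 nn.Conv2d(4 * d_in, d_in, 1))
        self.attn = nn.Conv2d(1, 1, kernel_size=1)          # 1x1 conv on boundary map
        self.up = nn.ConvTranspose2d(d_in, d_out, kernel_size=2, stride=2)
        self.skip_proj = nn.Conv2d(skip_dim, d_out, kernel_size=1)
        self.fuse = nn.Conv2d(2 * d_out, d_out, kernel_size=1)

    def forward(self, d_i, skip, boundary):
        x = self.mlp(self.dw(d_i))                          # F_mlp
        # Boundary map -> stage-resolution spatial attention A_i.
        a = torch.sigmoid(self.attn(F.interpolate(
            boundary, size=x.shape[-2:], mode="bilinear", align_corners=False)))
        x = x + a * x                                       # gated residual F_g
        x = self.up(x)                                      # 2x transposed-conv upsampling
        skip = self.skip_proj(F.interpolate(                # align encoder feature
            skip, size=x.shape[-2:], mode="bilinear", align_corners=False))
        return self.fuse(torch.cat([x, skip], dim=1))       # D_{i+1}
```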
3.4. Multi-Task Hybrid Loss
To facilitate end-to-end optimization, we formulate a multi-task hybrid loss function that jointly supervises semantic region generation and geometric boundary delineation. The overarching objective is to harmonize semantic fidelity and boundary crispness, fostering a mutually beneficial learning paradigm between the two sub-tasks.
For region-level semantic consistency, the segmentation prediction is penalized by a combined loss $\mathcal{L}_{seg}$. Specifically, a standard pixel-wise cross-entropy loss $\mathcal{L}_{ce}$ is applied to secure local classification accuracy. Concurrently, a Dice loss $\mathcal{L}_{dice}$ is integrated to maximize the Intersection-over-Union between the prediction and the ground truth. This effectively safeguards the topological completeness of the extracted regions and mitigates the severe foreground–background class imbalance problem. The compound segmentation loss is formulated as:

$$\mathcal{L}_{seg} = \mathcal{L}_{ce}\big(\hat{Y}, Y\big) + \mathcal{L}_{dice}\big(\hat{Y}, Y\big),$$

where $\hat{Y}$ and $Y$ represent the predicted segmentation mask and the corresponding ground-truth label, respectively.
Meanwhile, to sharpen the object contours and enforce boundary continuity, we impose a dedicated boundary loss $\mathcal{L}_{bd}$. Given that boundary pixels constitute only a fraction of the entire image, a weighted binary cross-entropy loss is deployed to prevent the optimization from being dominated by massive background pixels:

$$\mathcal{L}_{bd} = -\frac{1}{N} \sum_{i=1}^{N} \Big[\, w_{pos}\, G_i \log B_i + w_{neg}\, (1 - G_i) \log (1 - B_i) \,\Big],$$

where $G$ denotes the binary edge map derived from $Y$, and $N$ is the total pixel count. The balancing weights $w_{pos}$ and $w_{neg}$ ensure that the network remains highly sensitive to sparse boundary signals.
By aggregating these task-specific constraints, the total loss function for training our CLIP-HBD is defined as:

$$\mathcal{L}_{total} = \mathcal{L}_{seg} + \lambda\, \mathcal{L}_{bd},$$

where $\lambda$ is a hyperparameter balancing the primary segmentation task and the auxiliary boundary detection task. This joint optimization strategy circumvents the limitations of single supervision information, leveraging both semantic and geometric dimensions to collaboratively drive the network toward learning more robust and spatially precise feature representations. A detailed analysis of this hybrid loss is provided in Section 4.3.4.
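A compact sketch of this hybrid objective is given below; the weights `w_pos`, `w_neg`, and `lam` are illustrative placeholders rather than the tuned values analyzed in Section 4.3.4:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, target, boundary_pred, boundary_gt,
                w_pos=0.9, w_neg=0.1, lam=1.0):
    """logits: (B, C, H, W); target: (B, H, W) class indices;
    boundary_pred / boundary_gt: (B, 1, H, W) probabilities in [0, 1]."""
    # Region terms: pixel-wise cross-entropy + Dice over softmax probabilities.
    ce = F.cross_entropy(logits, target)
    probs = logits.softmax(dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(2, 3))
    dice = 1.0 - (2 * inter /
                  (probs.sum(dim=(2, 3)) + onehot.sum(dim=(2, 3)) + 1e-6)).mean()

    # Boundary term: class-weighted binary cross-entropy against the edge map G.
    eps = 1e-6
    bce = -(w_pos * boundary_gt * torch.log(boundary_pred + eps)
            + w_neg * (1.0 - boundary_gt) * torch.log(1.0 - boundary_pred + eps)).mean()

    return (ce + dice) + lam * bce   # L_total = L_seg + lambda * L_bd
```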