Article

Text-Guided Refinement for Referring Image Segmentation

1 School of Artificial Intelligence, Taiyuan University of Technology, Taiyuan 030024, China
2 School of Control and Computer Engineering, North China Electric Power University, Baoding 071003, China
3 The State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing 100044, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5047; https://doi.org/10.3390/app15095047
Submission received: 27 March 2025 / Revised: 29 April 2025 / Accepted: 30 April 2025 / Published: 1 May 2025

Abstract

Referring image segmentation aims to segment an object described by a natural language expression from an image. Existing methods perform multi-modal fusion during encoding, typically integrating image and text features before predicting masks via upsampling networks. However, this approach often lacks sufficient multi-modal interaction during decoding, making precise edge prediction difficult for objects of varying scales. Additionally, the isolated interaction between linguistic and visual features at each scale fails to exploit the continuous guidance that language can provide over multi-scale visual features. To address these issues, we propose the Text-Guided Refinement Network (TGRN). It employs a cascaded pyramid structure with a text-guided gating mechanism to enable selective and efficient integration of multi-modal features across multiple scales at the decoding stage. The proposed TGRN offers the following advantages: (a) It enhances information flow across feature scales, improving the network’s capacity to represent multi-scale semantics and achieve accurate segmentation. (b) It leverages textual information to guide feature fusion, strengthening multi-modal interactions and refining edge perception during decoding. (c) It facilitates effective multi-modal information integration through a language-embedded visual encoder. Extensive experiments on three benchmark datasets validate the effectiveness of the proposed approach, demonstrating its superior performance in referring segmentation.

1. Introduction

Referring image segmentation, also known as text-prompt-based segmentation, is a growing research field focused on segmenting objects in images based on textual descriptions provided by users. The task requires reasoning over the appearance, attributes, spatial relationships, and contextual cues relevant to the description. Unlike traditional segmentation methods, it bridges the gap between language and images by requiring an integrated understanding of both modalities. It holds significant promise for diverse applications in technology and healthcare, including human–computer interaction, medical imaging, and immersive AR/VR experiences.
In referring image segmentation, significant scale variations among objects in input images often challenge segmentation networks. These variations stem from two primary factors. First, objects of different categories inherently exhibit scale differences. Background categories like “clouds”, “walls”, and “roads” often occupy large and contiguous regions, while foreground objects like “people”, “cars”, and “computers” are typically segmented into smaller individual instances and present greater segmentation challenges. Second, scale differences also result from variations in camera perspective, where objects of the same category appear at different scales, further complicating recognition and often producing inaccurate segmentation results. Large-scale objects may suffer internal misclassification, resulting in discontinuous predicted pixel labels, while small-scale objects may blur into the background, leading to imprecise edges or complete omission. To address these issues, continuous interaction between multi-scale visual features and textual information is essential. Such integration enhances the model’s ability to recognize objects across varying scales and achieve precise segmentation, especially for complex edge structures.
Early methods [1,2,3,4,5] primarily relied on the powerful learning capabilities of deep neural networks, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), to combine linguistic and visual features for segmentation mask prediction. However, these approaches often neglect variations in the representational capacity of multi-modal features for targets at different scales, leading to inaccuracies in segmentation results. To address this, subsequent research introduced multi-scale feature perspectives. Some methods [6,7] adopt multi-resolution input strategies, such as image pyramids, to model objects at varying scales by fusing features across resolutions. This approach incurs high computational costs and overlooks the inherent multi-scale capabilities of convolutional networks. Other approaches [8,9] utilize multi-scale features extracted directly from networks, avoiding the computational burden of image pyramids. However, simple fusion techniques like addition or concatenation often fail to capture the varying importance of scales, leading to suboptimal segmentation results. In Figure 1a, we present two referring segmentation examples based on the same image, showing the mask annotations for the referring targets and the segmentation results using an existing multi-scale feature fusion method. For a small target like “light”, the method fails to detect it, leading to missed detections. Conversely, for a larger object like a “person”, the target is accurately located, though partial omissions, such as missing leg areas, occur due to dim lighting. These examples highlight the need for more effective multi-scale feature fusion mechanisms to address such challenges.
While multi-scale feature fusion methods improve the network’s ability to perceive objects at varying scales, they often overlook the critical role of linguistic guidance. Previous approaches primarily rely on encoder-stage fusion strategies, where images and referring expressions are processed separately by CNNs or RNNs and then combined into multi-modal features. The final segmentation mask is obtained through subsequent upsampling operations and decoding networks. However, this strategy typically confines multi-modal interaction to the encoding stage and combines linguistic and visual features independently at each scale [10,11,12], as illustrated in Figure 1b. Such methods fail to leverage the continuous guidance of language to enhance cross-scale multi-modal fused features. Moreover, repeated strides and pooling operations in convolutional neural networks often lead to the loss of critical fine-structural information, yet few referring segmentation methods explicitly address the issue of detail recovery. To overcome these challenges, we propose to integrate multi-modal and multi-scale information with continuous linguistic guidance at the decoding stage. This approach allows the model to progressively recover fine details and improve segmentation accuracy, delivering precise and fine-grained results even for complex targets.
Recently, some approaches [13,14,15,16] have utilized pre-trained multi-modal large models, such as CLIP [17] and Segment Anything [18], which have gained attention for their versatility across diverse tasks. However, their optimization primarily targets global semantic alignment rather than pixel-level details, resulting in poor edge precision for object boundaries described by natural language prompts. Additionally, their large size poses practical difficulties in deployment, particularly for resource-constrained applications. These drawbacks highlight the value of designing lightweight referring segmentation methods that do not rely on pre-trained large models but instead focus on integrating multi-modal and multi-scale information to achieve precise and fine-grained segmentation results.
To resolve the aforementioned problems, we propose the Text-Guided Refinement Network (TGRN) for referring segmentation, leveraging gated mechanisms to selectively aggregate multi-scale features. Instead of performing cross-modal information fusion at the feature encoding stage, we adopt a decoder fusion strategy that progressively aligns and refines multi-level cross-modal features with linguistic guidance, as illustrated in Figure 1c. A cascaded pyramid structure is utilized to construct multiple feature pyramids from multi-modal features at varying scales, significantly enhancing the network’s capacity to capture and perceive objects of diverse sizes. This design comprises two key components: TGFusionNet for multi-modal cross-scale feature alignment and RefineNet for progressive detail refinement. In TGFusionNet, a text-guided gated fusion mechanism dynamically highlights image regions most relevant to the referring expression, reducing interference from irrelevant areas and enhancing semantic consistency. The gating function ensures precise feature selection and adaptive fusion, enabling the model to handle objects of diverse sizes effectively. To further refine the segmentation output, RefineNet incorporates multi-scale context and boundary cues, progressively recovering fine details and addressing the critical challenge of accurate edge prediction. By integrating linguistic guidance throughout both components, the TGRN bridges the gap between referring expressions and images, achieving fine-grained and precise segmentation results. The main contributions of this paper are as follows:
  • We propose the Text-Guided Refinement Network, which utilizes a novel text-guided multi-scale feature decoding approach, integrating hierarchical image features through a dynamic gating mechanism. This approach reduces irrelevant information and ensures precise feature alignment, addressing key challenges in segmentation.
  • We design a cascaded pyramid structure, consisting of TGFusionNet and RefineNet, to enhance the network’s ability to perceive and represent objects of varying scales. This structure ensures robustness in segmenting targets of diverse sizes and progressively refines fine details, addressing the challenge of accurate edge prediction.
  • Extensive experiments on three benchmarks demonstrate the superior performance of our method, achieving significant improvements over baselines, including a +7.82% gain on the UNC dataset.

2. Related Work

In this section, we review the works most closely related to ours in two areas: semantic segmentation and referring expression grounding.

2.1. Semantic Segmentation

Semantic segmentation has advanced significantly with the adoption of convolutional neural networks (CNNs), particularly fully convolutional networks (FCNs) [8], which enable pixel-wise dense labeling by replacing fully connected layers with convolutional layers. However, early FCN-based methods struggle with low-resolution segmentation maps due to pooling layers, which enlarge the receptive field but reduce spatial detail. To address this, solutions such as dilated convolutions [19,20] expand the receptive field without sacrificing resolution, while skip connections [6,8] preserve high-resolution features from earlier layers. For example, DeepLab [6,19] utilizes atrous convolutions to eliminate pooling operations and improve segmentation performance. Multi-scale context modeling [21] and pyramid pooling operations [22] further enhance segmentation accuracy by capturing diverse visual contexts. Recent advancements include attention mechanisms [23,24], such as those in DANet [25] and CFNet [26], which leverage self-attention to model long-range dependencies and improve contextual understanding. Encoder–decoder architectures [21] are also widely used to mitigate detail loss from continuous downsampling, while RGB-D methods [27,28] incorporate depth information for more accurate predictions. These developments provide valuable insights and foundational techniques for advancing referring image segmentation.

2.2. Referring Expression Grounding

Referring expression grounding is a task that aims to localize objects in an image based on a natural language description. It includes two branches: localization and segmentation. Some approaches to referring image localization adopt a two-stage framework in which object detectors generate candidate regions that are then ranked by their relevance to the referring expression. Methods such as CNN-LSTM architectures [29] select the region with the highest posterior probability, while others [30,31] optimize the joint probability of the target object and the expression. More recently, one-stage frameworks [32,33] have emerged, directly predicting target region coordinates in an end-to-end manner and reducing reliance on excessive candidate boxes. For referring image segmentation, early methods [1,2,3,4,5] rely on simple concatenation of language and visual features, followed by fully convolutional networks for pixel-wise mask prediction. Recent advancements [10,11,12,34,35] introduce self-attention and cross-attention mechanisms to better integrate linguistic and visual information. Cycle-consistency learning [36] and adversarial training [37] have also been explored to boost segmentation performance. For example, Ye et al. [12] use non-local modules to enhance pixel-word mixed features, and Wang et al. [38] propose asymmetric cross-guided attention between visual and linguistic modalities. Bi-directional relationship inference networks [10] provide mutual guidance between modalities, and Huang et al. [35] model relationships among entities and attributes. LSCM [11] uses dependency parsing to guide multi-modal context learning. While these methods have advanced the field, most rely on encoder-stage fusion, potentially reducing the consistency between language and vision in the semantic space. Different from previous works, we propose the Text-Guided Refinement Network, which focuses on decoder-stage multi-modal fusion. Unlike methods that perform fusion exclusively during encoding, our approach leverages a cascaded pyramid structure and a text-guided gating mechanism to enable continuous interaction between linguistic and visual features throughout the decoding process and to ensure the precise alignment of multi-modal features across scales.

3. Method

The task of referring image segmentation involves identifying and segmenting specific objects described by natural language expressions within an image. To address this, we propose the Text-Guided Refinement Network, a model designed to integrate seamlessly into any encoder–decoder architecture while enhancing multi-modal feature fusion. The overall architecture of the proposed method is illustrated in Figure 2. The model processes natural images and corresponding referring expressions by extracting features through the image and language feature encoders. The original features of the two modalities are then fused to generate hierarchical multi-scale multi-modal features M, which are fed into a cascaded pyramid structure to enable robust multi-scale integration. Within the cascaded pyramid structure, a text-guided gating mechanism ensures precise multi-modal feature fusion at the decoding stage, effectively enhancing the network’s ability to capture objects of varying scales and refine fine-grained details. The final decoded multi-modal features are used to produce accurate and precise segmentation predictions.
Specifically, we adopt the method from [35] as the multi-modal encoder to extract features at three scales, denoted as $M = \{M_3, M_4, M_5\}$, corresponding to the multi-modal fused features obtained from the Res3, Res4, and Res5 layers of the backbone network, with each $M_i \in \mathbb{R}^{W \times H \times C_m}$, where $W$, $H$, and $C_m$ denote the feature’s width, height, and channel dimensions, respectively. The extracted multi-modal features M are further enhanced through the ASPP module [19] to improve the model’s ability to detect small objects. These features are then passed into the proposed decoder. Each component of the TGRN is elaborated in the following subsections.

3.1. Cascaded Pyramid Structure

To effectively capture referring targets of various scales, we designed a cascaded pyramid network inspired by [39]. It consists of two main components: TGFusionNet and RefineNet (as shown in Figure 2). This architecture enables multi-modal features extracted at different encoder stages to be progressively aligned, fused, and refined during decoding. We first extract three levels of multi-modal features from the encoder, denoted as $M_3$, $M_4$, and $M_5$, corresponding to the outputs of Res3, Res4, and Res5 in the backbone network. These features encode visual and linguistic information at increasing levels of semantic abstraction and decreasing resolution. To prepare the features for fusion, a $1 \times 1$ convolution is applied to each of $M_3$, $M_4$, and $M_5$, yielding the initially processed features $M_3'$, $M_4'$, and $M_5'$. TGFusionNet consists of two pyramids, and its structure is similar to the feature upsampling schemes commonly used in semantic segmentation networks. It performs hierarchical feature fusion across these three scales. At each level, higher-level features are integrated into lower-level ones using a text-guided gating mechanism (TG), which selectively emphasizes spatial regions relevant to the referring expression. Specifically, $M_4^F$, which integrates information from two scales, is obtained as follows:
$$M_4^F = \mathrm{TG}(M_5', M_4') \tag{1}$$
Here, TG represents the text-guided gating mechanism, and $M_i^F$ denotes the fused feature at scale $i$, enriched by contextual and cross-scale information. Similarly, the feature $M_3^F$, which integrates information from three scales, is computed as follows:
$$M_3^F = \mathrm{TG}(M_5', M_4', M_3') \tag{2}$$
where $M_5', M_4', M_3' \in \mathbb{R}^{W \times H \times C_m}$, and the final fused feature $M_3^F \in \mathbb{R}^{W \times H \times C_F}$.
The output features $M_3^F$, $M_4^F$, and $M_5'$ are passed to RefineNet, which focuses on progressively restoring spatial detail and enhancing boundary precision. To improve the learning of scale-aware features and constrain TGFusionNet during training, each of these three features is also passed through two $1 \times 1$ convolutions for auxiliary segmentation prediction, and the corresponding loss is computed against the ground-truth mask. In the refinement stage, RefineNet applies a variable number of bottleneck modules [40] to each input, followed by an additional text-guided fusion step to further align and integrate features across scales. The structure of the bottleneck module is shown in Figure 3b. This stage addresses challenging regions that may not be sufficiently resolved by TGFusionNet alone. Overall, the cascaded pyramid structure enables the coarse-to-fine alignment of visual–linguistic features, adaptive feature selection via text-guided gating, and progressive refinement for accurate mask prediction. The specific structure of the text-guided gating mechanism is detailed in the following section.
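To make the data flow of the cascaded pyramid concrete, the following NumPy sketch mirrors the structure described above. It is an illustrative sketch under stated assumptions, not the released implementation: the tg function is reduced to a simple average (the actual gating is given in Section 3.2), the per-branch bottleneck counts are placeholder values, and all three scales are assumed to share a common spatial size after the encoder.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear projection: (H, W, C_in) @ (C_in, C_out).
    return x @ w

def tg(text, *feats):
    # Stand-in for the text-guided gating mechanism of Section 3.2;
    # here it simply averages its channel-aligned inputs.
    return np.mean(feats, axis=0)

def bottleneck(x, n_blocks):
    # Placeholder for n stacked bottleneck modules [40]; returns its input unchanged.
    return x

def cascaded_pyramid(m3, m4, m5, text, params):
    # TGFusionNet: project each scale with a 1x1 convolution, then fuse top-down.
    m3p = conv1x1(m3, params["w3"])
    m4p = conv1x1(m4, params["w4"])
    m5p = conv1x1(m5, params["w5"])
    m4f = tg(text, m5p, m4p)            # Eq. (1): two-scale fusion
    m3f = tg(text, m5p, m4p, m3p)       # Eq. (2): three-scale fusion

    # RefineNet: a (here arbitrary) number of bottleneck blocks per branch,
    # followed by one more text-guided fusion to obtain the decoded feature.
    refined = [bottleneck(f, n) for f, n in ((m5p, 3), (m4f, 2), (m3f, 1))]
    return tg(text, *refined)
```

In the full model, the intermediate features m4f, m3f, and m5p would additionally feed the auxiliary segmentation heads used during training, as described above.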

3.2. Text-Guided Gating Mechanism

For each referring expression, a text feature $R \in \mathbb{R}^{C_r}$ is generated by the text feature encoder. Let $F_i$ denote an input feature of the text-guided gated fusion mechanism, where $i \in \{1, 2, \dots, N\}$ and $N$ is the number of input features. A spatial attention weight is learned between the text feature and the multi-modal feature $F_i$ to quantify the importance of each spatial location according to its relevance to the text. The spatial attention weight is defined as follows:
$$a_i = (R W_R)(F_i W_F)^{T} \tag{3}$$
where $W_R \in \mathbb{R}^{C_r \times C_h}$ and $W_F \in \mathbb{R}^{C_m \times C_h}$ are learnable parameters, and $a_i \in \mathbb{R}^{W \times H}$ represents the spatial attention weights. These weights are used to compute a weighted sum over the multi-modal feature $F_i$, producing a global feature that emphasizes the regions of $F_i$ relevant to the referring description. Next, the global feature is concatenated with the text feature and passed through a fully connected layer to obtain $G_i \in \mathbb{R}^{C_m}$. This feature is used to select the information from other scales to be fused, as expressed by the following equation:
$$Y_{ij} = (F_j W_y)\,\sigma(G_i W_g) \tag{4}$$
where $W_y \in \mathbb{R}^{C_m \times C_y}$ and $W_g \in \mathbb{R}^{C_m \times C_y}$ are learnable parameters, and $\sigma$ denotes the softmax function. $F_j$ is a feature from another scale to be fused with $F_i$, and $Y_{ij}$ denotes the information selected from $F_j$ for fusion with $F_i$. This procedure is repeated to obtain the information for fusion from each of the other multi-modal features in the TG structure. The selected information is then aggregated into $F_i$ as shown in Equation (5), yielding the preliminary fused feature $F_i'$ that combines multi-modal information across scales:
$$F_i' = F_i W_i + \sum_{j=1}^{N-1} Y_{ij} \tag{5}$$
where $W_i \in \mathbb{R}^{C_m \times C_y}$ are learnable parameters, and $F_i' \in \mathbb{R}^{W \times H \times C_y}$ represents the fused feature at scale $i$, enriched with multi-modal information guided by the referring expression. This process is shown in Figure 3a.
The features $F_i'$ obtained at different scales are fed into the gating structure. Similar to the approach in [12,21], a gating function is used for further multi-scale feature fusion. Specifically, for each input feature $F_i'$, a memory gate $m_i$ and a reset gate $r_i$ are generated, where $r_i, m_i \in \mathbb{R}^{W \times H}$. These gates resemble those of an LSTM, but they are computed independently for each input, without any dependence on input order. Each input feature $F_i'$ is associated with a context controller $C_i$, which regulates the information flow from the other input features $F_j'$ to $F_i'$. This process is expressed as follows:
$$C_i = (1 - m_i) \odot F_i' + \sum_{j=1}^{N-1} \gamma_j\, m_j \odot F_j' \tag{6}$$
The fused feature $\mathrm{Fusion}_i$ is then computed as follows:
$$\mathrm{Fusion}_i = r_i \odot \tanh(C_i) + (1 - r_i) \odot F_i' \tag{7}$$
where $\odot$ denotes the Hadamard product, and $\gamma_j$ is a learnable parameter that adjusts the proportion of information retained by the memory gate for each input feature, controlling the flow of features from the other scales $j$ into the current scale $i$. Finally, the multi-scale features are aggregated by summing $\mathrm{Fusion}_i$ over all scales and passed to the subsequent network layers for segmentation prediction. The proposed gating mechanism is designed not only to align features across scales under textual supervision but also to enhance edge localization by filtering out irrelevant visual content. Language-derived cues guide the model to emphasize semantically relevant regions, enabling it to distinguish fine object contours even under large scale variations.
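A minimal NumPy rendering of Equations (3)–(7) is sketched below for clarity. It is not the authors' implementation: the softmax normalisation of the spatial attention, the shared fully connected weight W_fc, and the generation of the memory and reset gates from 1×1 projections of the fused features are all assumptions made to obtain a runnable example; parameter names follow the dimension symbols used above.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tg_gating(R, F, p):
    """Text-guided gating over N multi-modal features.
    R: (C_r,) text feature.  F: list of N features, each (H, W, C_m).
    p: dict of parameters -- W_R (C_r, C_h), W_F (C_m, C_h), W_fc (C_m + C_r, C_m),
       W_g (C_m, C_y), W_y (C_m, C_y), W_i (list of (C_m, C_y)),
       W_m / W_r (C_y, 1), gamma (list of N scalars)."""
    N = len(F)
    H, W, _ = F[0].shape

    # Eqs. (3)-(5): text-guided cross-scale selection -> preliminary fused features F_i'.
    F_prime = []
    for i in range(N):
        a = (F[i] @ p["W_F"]) @ (R @ p["W_R"])              # Eq. (3): (H, W)
        a = softmax(a.reshape(-1)).reshape(H, W)            # spatial normalisation (assumed)
        g_global = (a[..., None] * F[i]).sum(axis=(0, 1))   # text-weighted global feature (C_m,)
        G_i = np.concatenate([g_global, R]) @ p["W_fc"]     # fully connected layer -> (C_m,)
        gate = softmax(G_i @ p["W_g"])                      # channel selection weights (C_y,)
        out = F[i] @ p["W_i"][i]                            # self term of Eq. (5)
        for j in range(N):
            if j != i:
                out = out + (F[j] @ p["W_y"]) * gate        # Eq. (4): Y_ij
        F_prime.append(out)                                 # (H, W, C_y)

    # Eqs. (6)-(7): gated multi-scale aggregation.
    m = [sigmoid((f @ p["W_m"])[..., 0]) for f in F_prime]  # memory gates, each (H, W)
    r = [sigmoid((f @ p["W_r"])[..., 0]) for f in F_prime]  # reset gates, each (H, W)
    fused = []
    for i in range(N):
        C_i = (1 - m[i])[..., None] * F_prime[i]
        for j in range(N):
            if j != i:                                       # Eq. (6): context controller
                C_i = C_i + p["gamma"][j] * m[j][..., None] * F_prime[j]
        fusion_i = (r[i][..., None] * np.tanh(C_i)
                    + (1 - r[i])[..., None] * F_prime[i])    # Eq. (7)
        fused.append(fusion_i)
    return np.sum(fused, axis=0)  # summed over scales and passed to later layers
```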

3.3. Language-Embedded Visual Encoder

To integrate the multi-modal information effectively and enable a preliminary alignment between the referring expression and the image content, we designed a language-embedded visual encoder that injects textual context into early visual features. This allows the model to be guided by language before decoding begins. Given a referring expression $S = \{s_1, s_2, \dots, s_T\}$, each word $s_t$ is first embedded into a 300-dimensional vector using GloVe embeddings [41], following [34]. These embeddings are passed through an LSTM, and the final hidden state $h_T$ is used as the global text representation. To condition the visual encoder on this language representation, we enhance the visual feature map $V_i$ extracted from the backbone network by embedding linguistic relevance. For each spatial location $(x, y)$, we compute the similarity between the visual feature $V_i(x, y)$ and the text embedding $h_T$. The attention weight is defined as follows:
$$A(x, y) = \mathrm{softmax}\big(\langle V_i(x, y), h_T \rangle\big) \tag{8}$$
where $\langle \cdot, \cdot \rangle$ denotes the dot product, and the softmax ensures spatial normalization. The text feature is then modulated by the attention weights to form a spatially aware text embedding:
$$T(x, y) = A(x, y) \cdot h_T \tag{9}$$
Subsequently, we concatenate $T(x, y)$ with $V_i(x, y)$ to form the final language-embedded feature $\tilde{V}_i$:
$$\tilde{V}_i(x, y) = \mathrm{Concat}\big(V_i(x, y), T(x, y)\big) \tag{10}$$
Finally, a $1 \times 1$ convolution is applied to $\tilde{V}_i$ to reduce the channel dimension, ensuring compatibility with downstream modules. This process ensures that the resulting visual features are contextually modulated by the referring expression, producing language-aware feature maps that guide the model’s attention toward semantically relevant regions. As a result, it benefits subsequent multi-modal fusion, particularly in complex scenes.
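The three steps above can be written compactly as follows. This NumPy sketch is an illustration, not the released code; it assumes the LSTM state h_T has been projected to the same dimensionality as the visual channels so that the dot product in Equation (8) is well defined, and W_proj stands in for the final 1×1 convolution.

```python
import numpy as np

def spatial_softmax(x):
    # Softmax over all spatial positions of a (H, W) map.
    e = np.exp(x - x.max())
    return e / e.sum()

def language_embedded_encoder(V, h_T, W_proj):
    """V: (H, W, C) visual feature map; h_T: (C,) text representation;
    W_proj: (2C, C_out) weights standing in for the 1x1 reduction convolution."""
    A = spatial_softmax(V @ h_T)                 # Eq. (8): attention map (H, W)
    T = A[..., None] * h_T                       # Eq. (9): spatially aware text embedding (H, W, C)
    V_tilde = np.concatenate([V, T], axis=-1)    # Eq. (10): channel-wise concatenation (H, W, 2C)
    return V_tilde @ W_proj                      # channel reduction to (H, W, C_out)
```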

4. Experiment

We evaluated the performance of our Text-Guided Refinement Network on the referring segmentation task through extensive experiments conducted on three benchmark datasets. We report the experimental results of our method in comparison with previous methods, along with ablation studies that verify the effectiveness of each proposed module.

4.1. Datasets

We trained and evaluated the proposed TGRN on three widely used referring image segmentation datasets: UNC [42], UNC+ [42], and G-Ref [30]. All three datasets are built upon MS-COCO [43], containing 19,994, 19,992, and 26,711 images, respectively. They provide 142,209, 141,564, and 104,560 referring expressions, describing over 50,000 unique objects. The UNC dataset contains referring expressions that are relatively short, averaging fewer than 4 words, often relying on spatial descriptions. In contrast, the UNC+ dataset prohibits the use of spatial location terms during data collection, forcing expressions to focus exclusively on the appearance attributes of the target objects. This restriction makes the task more challenging, as the matching between language and visual regions depends entirely on appearance-based information. The G-Ref dataset further increases complexity with longer referring expressions, averaging 8.4 words, which include richer contextual details and diverse linguistic structures. Overall, the combination of these datasets provides a diverse and representative testbed for evaluating referring image segmentation. They differ in sentence complexity, spatial grounding, and linguistic structure, thereby covering a broad range of real-world challenges. Their complementary nature enables a comprehensive validation of our model’s generalization capability across various scenarios.

4.2. Evaluation Metrics

In line with previous studies [1,12,34], we evaluated referring segmentation performance using two region-based metrics: Overall Intersection over Union (IoU) and Prec@X. Overall IoU is computed as the total intersection area divided by the total union area across all test samples. Prec@X is the percentage of test samples whose IoU between the predicted and ground-truth masks exceeds the threshold X, with X taken from $\{0.5, 0.6, 0.7, 0.8, 0.9\}$.
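For reference, both metrics can be computed directly from binary masks; the short NumPy sketch below reflects the definitions above (how samples with an empty union are handled is our assumption, as the paper does not specify it).

```python
import numpy as np

def overall_iou(preds, gts):
    # Overall IoU: total intersection divided by total union over the whole test set.
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return inter / union

def prec_at_x(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    # Prec@X: fraction of samples whose per-sample IoU exceeds each threshold X.
    ious = []
    for p, g in zip(preds, gts):
        union = np.logical_or(p, g).sum()
        ious.append(np.logical_and(p, g).sum() / union if union > 0 else 1.0)
    ious = np.asarray(ious)
    return {x: float((ious > x).mean()) for x in thresholds}
```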

4.3. Implementation Details

The proposed framework was implemented using the public TensorFlow toolbox and trained on an Nvidia GTX 1080 GPU for 200,000 iterations. Consistent with prior works [2,12], we adopted DeepLab-ResNet101 [19], pretrained on the PASCAL-VOC dataset [44], as the CNN backbone for extracting visual features. Specifically, the outputs of the Res3, Res4, and Res5 layers were utilized for multi-level feature fusion and subsequently fed into the proposed decoder. For pre-processing, input images and their corresponding ground-truth segmentation masks were resized to $320 \times 320$. The feature channel dimensions were unified across the framework and set as $C_m = C_r = C_h = C_y = 500$. Model training employed the binary cross-entropy loss averaged over all pixels, with optimization performed using the Adam optimizer. The initial learning rate was set to $2.5 \times 10^{-4}$, and weight decay was applied at $5 \times 10^{-4}$. Notably, the CNN backbone parameters were frozen during training to focus the optimization process on the proposed decoder and multi-modal fusion components.
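For concreteness, a minimal TensorFlow 2 training-step sketch consistent with these hyper-parameters is given below. It is not the released implementation: the model call signature is hypothetical, and realising weight decay as an explicit L2 penalty on the trainable (decoder) weights is our assumption.

```python
import tensorflow as tf

LR, WEIGHT_DECAY = 2.5e-4, 5e-4          # hyper-parameters reported above
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(model, images, expressions, gt_masks):
    # `model` is a hypothetical Keras model with a frozen backbone; only the
    # decoder and fusion components expose trainable variables.
    with tf.GradientTape() as tape:
        logits = model([images, expressions], training=True)   # (B, 320, 320, 1)
        loss = bce(gt_masks, logits)                            # pixel-averaged BCE
        loss += WEIGHT_DECAY * tf.add_n(
            [tf.nn.l2_loss(v) for v in model.trainable_variables])
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```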

4.4. Comparison with State-of-the-Arts

To verify the effectiveness of the proposed model, we compared it with fourteen methods, namely, RMI [3], ASGN [37], RRN [2], MAttNet [45], CMSA [12], CAC [36], STEP [34], BCAM [10], CMPC [35], LSCM [11], CMPC+ [46], EFN [47], TGMI [48], and RBVL [49]. To ensure a fair comparison, we selected methods that utilize backbone architectures similar to DeepLab and related structures, such as ResNet101, for extracting image features. These models are pre-trained on the PASCAL-VOC dataset, which contains approximately 10K images. In contrast, some existing methods, such as LTS [50], depend on Darknet [51], pre-trained on the much larger MS-COCO dataset containing around 110K images. This discrepancy in pre-training data size may introduce biases, making direct performance comparisons less equitable. The results, presented in Table 1, demonstrate the advantages of our approach across the test subsets of the three benchmark datasets. Our proposed approach outperforms all state-of-the-art methods with similar backbones in terms of Overall IoU on most datasets, except for G-Ref, where it achieves performance comparable to EFN. Notably, compared to the TGMI method, the TGRN achieves a performance gain of approximately 1% across all three datasets. In particular, on the test subsets of the UNC+ dataset, our method demonstrates an improvement of around 2%, highlighting its effectiveness in accurately segmenting and recognizing referred objects. These results validate the robustness and precision of the proposed TGRN model, further indicating its potential in enhancing the performance of referring image segmentation tasks. Visual examples in Figure 4 illustrate the capability of our method to accurately segment specific regions based on the provided query expressions.
In addition, since UNC, UNC+, and G-Ref were all collected from MS-COCO, we followed [47] in using the MS-COCO dataset for pre-training to facilitate comparisons with more state-of-the-art methods. The results of the model trained in this setting, denoted TGRNcoco, are shown in Table 2 and demonstrate that sufficient training data yields better results. Our method still achieves superior performance compared to existing methods, for example a 1% improvement on G-Ref over the M3Dec method.

4.5. Ablation Study

To evaluate the contribution of each module in the proposed method, we conducted ablation experiments using three test sets (val, testA, and testB) of the UNC dataset. The following models were tested in the ablation study:
  • Baseline: Since this method primarily focuses on designing a multi-scale feature fusion decoder, the encoder from the widely used CMPC method [35] was chosen for multi-modal feature encoding. To evaluate the necessity of multi-scale information fusion, segmentation predictions were made using only the multi-modal feature.
  • Baseline + RRN: This model incorporates the RRN method [2] following the encoder of the CMPC model. The multi-modal features from three different scales are sequentially fed into a convLSTM network for multi-scale feature fusion.
  • Baseline + CPN: This model adds the cascaded pyramid structure for multi-scale information fusion on top of the baseline model. In this variant, features of different scales are fused by simple feature addition rather than by the gating function.
  • Baseline + CPN + TG: This model extends the previous one by adding the text-guided gating mechanism (TG) for enhanced fusion within the cascaded pyramid structure.
  • Baseline + ALL: This model incorporates the language-embedded visual encoder for subsequent processing, building upon the previous model. It also includes the ASPP module to enhance the network’s ability to capture fine-grained image details, thus forming the complete network structure proposed in this paper.
The quantitative and qualitative results of the ablation experiments are presented in Table 3 and Figure 5, respectively. The table includes the results on the three test subsets—val, testA, and testB—of the UNC dataset. From the results, all three models proposed in this paper outperform the baseline models. The visualization results in Figure 5 further illustrate the superiority of the proposed models, particularly in segmenting complex and small-scale targets.
Specifically, when compared to the baseline model that only utilizes single-scale features for prediction, the Baseline + CPN method demonstrates significant improvements across several evaluation metrics. On the val test set, the multi-scale method’s Prec@X indicators at five thresholds are 72.05%, 65.45%, 55.38%, 39.30%, and 12.32%, respectively, reflecting an average improvement of nearly 10% over the baseline model. These results underscore the effectiveness of the cascaded pyramid structure in achieving robust multi-scale feature fusion.
Compared to the existing RRN method, the CPN model shows a more than 1% improvement in Overall IoU on the testA set. This indicates that the proposed cascaded pyramid structure better captures the directional flow of information across different scales, enhancing its ability to recognize targets with higher accuracy.
Further enhancement is observed with the inclusion of the TG module. On the testB test set, the model incorporating the text-guided gating mechanism achieves a Prec@0.5 score of 71.19%, representing an improvement of over 4% compared to the model without this module (67.11%). Similarly, on the Overall IoU, the TG module contributes a more than 1% improvement, reaching 60.96%. These results demonstrate that the text-guided gating method effectively selects relevant information for referring segmentation from multi-scale features, thereby facilitating the flow of information at various scales.
For the language-embedded visual encoder, the results on the testA set show that the Baseline + ALL method achieves a 1% improvement in Overall IoU compared to the model without this module. This suggests that the visual feature representation method effectively filters out irrelevant areas in the image, reducing noise and enabling more precise segmentation. Additionally, the ASPP module, with its dilated convolution and downsampling capabilities, enhances the recognition ability of smaller-scale targets, further strengthening the model’s overall performance. The baseline single-scale model achieves Overall IoU values of 56.00%, 59.77%, and 54.12% on the val, testA, and testB sets, respectively. In comparison, the proposed method achieves values of 63.82%, 66.76%, and 61.93%, leading to an average improvement of 7%. These results confirm the superiority and robustness of the proposed method in addressing the challenges of referring segmentation.
Figure 4 and Figure 5 present the segmentation results obtained from the aforementioned models. The examples demonstrate that the proposed models achieve superior performance compared to the baseline model, particularly in scenarios requiring complex multi-modal reasoning. Specifically, the model excels in scenes involving multiple similar objects and weak or missing spatial cues. In such cases, the ability of the TGRN to maintain continuous language guidance throughout the decoding process proves essential. The text-guided gating selectively activates spatial features that correspond to linguistically relevant regions, thereby preserving detail at the object boundaries during the multi-scale fusion process. For instance, in the fifth row of Figure 5, only the full model (“Baseline + ALL”) correctly recovers the entire arm of the woman in white—a region that is small and low in contrast, requiring both accurate semantic understanding and fine-grained feature preservation. Similarly, in the second row, the model precisely segments the human at the top of the image, demonstrating improved boundary recovery. Moreover, in Figure 4, for the referring expression “the real cat not the reflection” which relies on visual attributes rather than spatial location, the proposed model successfully distinguishes the correct target from similar distractors.
To further illustrate the impact of the multi-scale feature fusion, Figure 6 presents a comparison of features before and after applying the cascaded pyramid structure. The second, third, and fourth columns depict the multi-modal features extracted from the convolutional layers Res3, Res4, and Res5, respectively. It is evident that the cascaded pyramid structure enables the model to retain and integrate information related to the referring target across different scales, leading to accurate segmentation results. For instance, in the first row, the features at the three scales identify different regions of the “banana” but fail to capture the complete object. After applying the cascaded pyramid structure, the fused feature effectively recognizes the entire “banana”. Similarly, in the third row, the scale 5 feature initially misidentifies the target. However, after the fusion process, this erroneous information is mitigated, allowing the network to correct the target’s location and boundaries. Figure 7 provides a visual comparison of segmentation heatmaps before and after applying the text-guided gating structure. The input features at different scales (columns 2 to 4) show general activation in relevant regions, but the activations tend to be diffuse or misaligned around the object edges. After applying gated fusion (fifth column), the features become more concentrated around object contours, suggesting that the text-guided gating mechanism helps to refine spatial focus and reduce noise from irrelevant areas. For example, in the third row, initial features highlight a broad region including distractors. After text-guided fusion, the region of interest is narrowed to the person’s body and legs with greater boundary precision, showing that the reinforced multi-modal interaction directly enhances edge-aware segmentation.

4.6. Generalizability Analysis

The proposed method can seamlessly integrate as a decoder into any encoder–decoder architecture for referring segmentation tasks. To evaluate the generalizability of this approach, we incorporated CMSA [12], a representative model for multi-modal feature encoding, as the encoder in our experiments. After feature encoding with the CMSA model, we compared the performance of the original decoding method with our proposed fusion structure. The results on the three test subsets of the UNC dataset, as presented in Table 4, demonstrate the effectiveness of our proposed fusion model when applied to multi-modal features encoded by CMSA. Notably, on the val test set, the proposed model achieves an average 6% improvement in the Prec@X indicator compared to the CMSA method. Furthermore, on the testA set, the Overall IoU increases from 60.61% to 65.13%, reflecting a significant improvement of nearly 5%. These findings validate the robustness and generalizability of the proposed method, emphasizing its applicability across different encoder architectures.

4.7. Computational Analysis

We compared the computational efficiency of our proposed TGRN model with the baseline CMPC on the UNC testA set. All models were evaluated at the same input resolution (320 × 320) on a PC equipped with an Nvidia GTX 1080 GPU. The results are summarized in Table 5. As shown, the TGRN achieves an IoU of 66.76%, outperforming CMPC by 2.23%, with a moderate increase in training and inference time. Memory usage also rises slightly, by 2.1%, primarily due to the cascaded pyramid structure and the text-guided gating mechanism. Despite the additional computation, the performance improvement justifies the trade-off. Furthermore, the TGRN does not rely on large pre-trained vision-language models (e.g., CLIP or SAM), making it a more practical and lightweight solution for real-world applications with limited resources.

4.8. Limitations and Future Work

Our method is not without limitations. Particularly, the model may struggle with ambiguous expressions, especially when multiple similar objects are present, as the current global text encoding lacks fine-grained disambiguation. Additionally, it assumes well-formed input, making it less robust to noisy or incomplete language, such as typos or informal phrases. To address these issues, future work could explore scene-level reasoning, uncertainty-aware prediction, and LLM-based text encoding for improved robustness. Moreover, interactive refinement with user feedback may further enhance practical applicability. Finally, while validated on three widely used datasets, future evaluation on cross-domain scenarios and interactive dialogue-based expression is necessary to assess generalization in real-world settings.

5. Conclusions

In this paper, we address the challenge of coarse target edge segmentation caused by scale discrepancies in referring image segmentation tasks. To enhance the network’s ability to perceive targets across varying scales, we propose a multi-modal feature fusion framework integrated into the decoder stage. This framework enables the dynamic integration of multi-scale and multi-modal features during decoding, leveraging the complementary strengths of linguistic and visual information to refine segmentation accuracy. Unlike existing methods that treat all features equally during fusion, our framework incorporates a novel text-guided gating mechanism within a cascaded pyramid structure. This mechanism prioritizes relevant features while suppressing noise, ensuring the effective alignment of visual and linguistic cues. By integrating these features selectively and dynamically, the proposed method significantly enhances the model’s precision and robustness in multi-scale target segmentation. To validate the effectiveness of our method, extensive experiments were conducted on three benchmark datasets, demonstrating consistent performance improvements over existing state-of-the-art approaches. These results highlight the potential of the TGRN to advance the field of referring image segmentation by offering a precise and scalable solution that bridges the gap between multi-modal feature alignment and multi-scale target recognition.

Author Contributions

Conceptualization, S.Q.; methodology, S.Q.; investigation, S.Q. and S.Z.; resources, S.Q. and T.R.; writing—original draft preparation, S.Q.; writing—review and editing, S.Q., S.Z., and T.R.; visualization, S.Q.; funding acquisition, S.Q. and S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Program of Shanxi Province (No. 202203021212236) and the National Natural Science Foundation of China (Grant No. 62403199).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data available in a publicly accessible repository.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from natural language expressions. In Proceedings of the ECCV, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  2. Li, R.; Li, K.; Kuo, Y.C.; Shu, M.; Qi, X.; Shen, X.; Jia, J. Referring Image Segmentation via Recurrent Refinement Networks. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  3. Liu, C.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Yuille, A. Recurrent Multimodal Interaction for Referring Image Segmentation. arXiv 2017, arXiv:1703.07939. [Google Scholar]
  4. Margffoy-Tuay, E.; Pérez, J.C.; Botero, E.; Arbeláez, P. Dynamic Multimodal Instance Segmentation guided by natural language queries. arXiv 2018, arXiv:1807.02257. [Google Scholar]
  5. Shi, H.; Li, H.; Meng, F.; Wu, Q. Key-Word-Aware Network for Referring Expression Image Segmentation. In Proceedings of the ECCV, Munich, Germany, 8–14 September 2018. [Google Scholar]
  6. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. arXiv 2018, arXiv:1802.02611. [Google Scholar]
  7. Lin, G.; Shen, C.; Van Den Hengel, A.; Reid, I. Efficient piecewise training of deep structured models for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 3194–3203. [Google Scholar]
  8. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the CVPR, Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  9. Shuai, B.; Zuo, Z.; Wang, B.; Wang, G. Scene segmentation with dag-recurrent neural networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 1480–1493. [Google Scholar] [CrossRef] [PubMed]
  10. Hu, Z.; Feng, G.; Sun, J.; Zhang, L.; Lu, H. Bi-directional relationship inferring network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, DC, USA, 14–19 June 2020; pp. 4424–4433. [Google Scholar]
  11. Hui, T.; Liu, S.; Huang, S.; Li, G.; Yu, S.; Zhang, F.; Han, J. Linguistic structure guided context modeling for referring image segmentation. In Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020; Proceedings, Part X; Springer: Cham, Switzerland, 2020; pp. 59–75. [Google Scholar]
  12. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Los Angeles, CA, USA, 16–20 June 2019; pp. 10502–10511. [Google Scholar]
  13. Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 18155–18165. [Google Scholar]
  14. Ding, H.; Liu, C.; Wang, S.; Jiang, X. VLT: Vision-language transformer and query generation for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 7900–7916. [Google Scholar] [CrossRef] [PubMed]
  15. Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 11686–11695. [Google Scholar]
  16. Kim, N.; Kim, D.; Lan, C.; Zeng, W.; Kwak, S. Restr: Convolution-free referring image segmentation using transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 18145–18154. [Google Scholar]
  17. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  18. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  19. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  20. Wei, Y.; Xiao, H.; Shi, H.; Jie, Z.; Feng, J.; Huang, T.S. Revisiting Dilated Convolution: A Simple Approach for Weakly-and Semi-Supervised Semantic Segmentation. In Proceedings of the CVPR, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  21. Ding, H.; Jiang, X.; Shuai, B.; Liu, A.Q.; Wang, G. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2393–2402. [Google Scholar]
  22. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the CVPR, Hawaii, HI, USA, 21–26 July 2017. [Google Scholar]
  23. Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C.C.; Lin, D.; Jia, J. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 267–283. [Google Scholar]
  24. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  25. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Los Angeles, CA, USA, 16–20 June 2019; pp. 3146–3154. [Google Scholar]
  26. Zhang, H.; Zhang, H.; Wang, C.; Xie, J. Co-occurrent features in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Los Angeles, CA, USA, 16–20 June 2019; pp. 548–557. [Google Scholar]
  27. Rizzoli, G.; Shenaj, D.; Zanuttigh, P. Source-free domain adaptation for rgb-d semantic segmentation with vision transformers. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Hawaii, HI, USA, 4–8 January 2024; pp. 615–624. [Google Scholar]
  28. Du, S.; Wang, W.; Guo, R.; Wang, R.; Tang, S. Asymformer: Asymmetrical cross-modal representation learning for mobile platform real-time rgb-d semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, DC, USA, 17–21 June 2024; pp. 7608–7615. [Google Scholar]
  29. Hu, R.; Xu, H.; Rohrbach, M.; Feng, J.; Saenko, K.; Darrell, T. Natural language object retrieval. In Proceedings of the CVPR, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  30. Liu, J.; Wang, L.; Yang, M.H. Referring expression generation and comprehension via attributes. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4856–4864. [Google Scholar]
  31. Zhao, P.; Zheng, S.; Zhao, W.; Xu, D.; Li, P.; Cai, Y.; Huang, Q. Rethinking Two-Stage Referring Expression Comprehension: A Novel Grounding and Segmentation Method Modulated by Point. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 7487–7495. [Google Scholar]
  32. Lu, M.; Li, R.; Feng, F.; Ma, Z.; Wang, X. LGR-NET: Language Guided Reasoning Network for Referring Expression Comprehension. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7771–7784. [Google Scholar] [CrossRef]
  33. Liao, Y.; Liu, S.; Li, G.; Wang, F.; Chen, Y.; Qian, C.; Li, B. A real-time cross-modality correlation filtering method for referring expression comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, DC, USA, 14–19 June 2020; pp. 10880–10889. [Google Scholar]
  34. Chen, D.J.; Jia, S.; Lo, Y.C.; Chen, H.T.; Liu, T.L. See-through-text grouping for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7454–7463. [Google Scholar]
  35. Huang, S.; Hui, T.; Liu, S.; Li, G.; Wei, Y.; Han, J.; Liu, L.; Li, B. Referring image segmentation via cross-modal progressive comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, DC, USA, 14–19 June 2020; pp. 10488–10497. [Google Scholar]
  36. Chen, Y.W.; Tsai, Y.H.; Wang, T.; Lin, Y.Y.; Yang, M.H. Referring expression object segmentation with caption-aware consistency. arXiv 2019, arXiv:1910.04748. [Google Scholar]
  37. Qiu, S.; Zhao, Y.; Jiao, J.; Wei, Y.; Wei, S. Referring image segmentation by generative adversarial learning. IEEE Trans. Multimed. 2020, 22, 1333–1344. [Google Scholar] [CrossRef]
  38. Wang, H.; Deng, C.; Yan, J.; Tao, D. Asymmetric cross-guided attention network for actor and action video segmentation from natural language query. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3939–3948. [Google Scholar]
  39. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7103–7112. [Google Scholar]
  40. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar]
  41. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  42. Yu, L.; Poirson, P.; Yang, S.; Berg, A.C.; Berg, T.L. Modeling context in referring expressions. In Proceedings of the ECCV, Amsterdam, The Netherlands, 10–16 October 2016. [Google Scholar]
  43. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the ECCV, Zurich, Switzerland, 5–12 September 2014. [Google Scholar]
  44. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  45. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1307–1315. [Google Scholar]
  46. Liu, S.; Hui, T.; Huang, S.; Wei, Y.; Li, B.; Li, G. Cross-Modal Progressive Comprehension for Referring Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 4761–4775. [Google Scholar] [CrossRef] [PubMed]
  47. Feng, G.; Hu, Z.; Zhang, L.; Lu, H. Encoder fusion network with co-attention embedding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 15506–15515. [Google Scholar]
  48. Qiu, S.; Wang, W. Referring Image Segmentation via Text Guided Multi-Level Interaction. In Proceedings of the 2023 IEEE/CIC International Conference on Communications in China (ICCC), Dalian, China, 10–12 August 2023; pp. 1–6. [Google Scholar]
  49. Pu, M.; Luo, B.; Zhang, C.; Xu, L.; Xu, F.; Kong, M. Text-Vision Relationship Alignment for Referring Image Segmentation. Neural Process. Lett. 2024, 56, 64. [Google Scholar] [CrossRef]
  50. Jing, Y.; Kong, T.; Wang, W.; Wang, L.; Li, L.; Tan, T. Locate then segment: A strong pipeline for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 9858–9867. [Google Scholar]
  51. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  52. Luo, G.; Zhou, Y.; Sun, X.; Cao, L.; Wu, C.; Deng, C.; Ji, R. Multi-task collaborative network for joint referring expression comprehension and segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, DC, USA, 14–19 June 2020; pp. 10034–10043. [Google Scholar]
  53. Liu, C.; Jiang, X.; Ding, H. Instance-specific feature propagation for referring segmentation. IEEE Trans. Multimed. 2022, 25, 3657–3667. [Google Scholar] [CrossRef]
  54. Huang, Z.; Xue, M.; Liu, Y.; Xu, K.; Li, J.; Yu, C. DCMFNet: Deep Cross-Modal Fusion Network for Different Modalities with Iterative Gated Fusion. In Proceedings of the 50th Graphics Interface Conference, Halifax, NS, Canada, 3–6 June 2024; pp. 1–12. [Google Scholar]
  55. Liu, C.; Ding, H.; Zhang, Y.; Jiang, X. Multi-modal mutual attention and iterative interaction for referring image segmentation. IEEE Trans. Image Process. 2023, 32, 3054–3065. [Google Scholar] [CrossRef] [PubMed]
Figure 1. (a) Two referring segmentation examples of objects at different scales in the same image, as predicted by an existing method. (b,c) Two multi-modal fusion mechanisms: existing methods fuse language and vision at the encoder stage, whereas the proposed method performs the fusion in the decoder, incorporating multi-scale interactions across three levels.
Figure 2. The overall architecture of the proposed approach. The encoder extracts features from the input image and the description, then performs an initial fusion to generate multi-modal features. The decoder, the Text-Guided Refinement Network, comprises two main components, TGFusionNet and RefineNet, enclosed in dashed boxes of different colors. TGFusionNet performs cross-scale fusion of multi-modal features, while RefineNet further refines the fused features to enhance segmentation accuracy.
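To make the decoder path in Figure 2 concrete, the following is a minimal PyTorch-style sketch of the data flow described in the caption. Only the module names (TGFusionNet, RefineNet) come from the caption; every layer, channel width, and the cascaded top-down merge are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the decoder path in Figure 2 (assumed layers and shapes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TGFusionNetSketch(nn.Module):
    """Cross-scale fusion of multi-modal features (assumed cascaded top-down merge)."""
    def __init__(self, channels=256):
        super().__init__()
        self.smooth = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):  # feats: multi-modal maps ordered coarse to fine
        fused = feats[0]
        for f in feats[1:]:
            fused = F.interpolate(fused, size=f.shape[-2:], mode="bilinear",
                                  align_corners=False)
            fused = self.smooth(fused + f)  # merge with the next finer scale
        return fused

class RefineNetSketch(nn.Module):
    """Refines the fused map and predicts a single-channel mask."""
    def __init__(self, channels=256):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1))

    def forward(self, x):
        return self.refine(x)

# Toy usage: three multi-modal feature maps at 1/32, 1/16, and 1/8 resolution.
feats = [torch.randn(1, 256, s, s) for s in (10, 20, 40)]
mask_logits = RefineNetSketch()(TGFusionNetSketch()(feats))
print(mask_logits.shape)  # torch.Size([1, 1, 40, 40])
```

In such a pipeline, the predicted logits would then be upsampled to the input resolution and supervised with a per-pixel loss, as is standard for referring segmentation.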
Figure 3. (a) The architecture of text-guided fusion. ⊕ denotes element-wise sum, © denotes concatenation, and ⊗ denotes matrix multiplication. C, H, and W are the channel number, height, and width of the feature maps, respectively. (b) The architecture of the bottleneck.
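As a companion to the caption, the snippet below sketches one plausible reading of the text-guided fusion in Figure 3a and the bottleneck in Figure 3b: a sentence feature produces a sigmoid channel gate, the gated features enter the element-wise sum (⊕), the two scales are concatenated (©), and a bottleneck restores the channel count. The gate design, layer sizes, and reduction ratio are assumptions, and the matrix multiplication (⊗, presumably a word–pixel attention) is omitted for brevity.

```python
# One plausible sketch of the text-guided fusion (Figure 3a) and bottleneck (Figure 3b).
# The sigmoid gate, the wiring order, and all layer sizes are assumptions.
import torch
import torch.nn as nn

class TextGuidedFusionSketch(nn.Module):
    def __init__(self, channels=256, text_dim=1024):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(text_dim, channels), nn.Sigmoid())
        # Bottleneck: reduce -> 3x3 conv -> restore (assumed reduction ratio of 4).
        self.bottleneck = nn.Sequential(
            nn.Conv2d(2 * channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels // 4, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1))

    def forward(self, low, high, text):
        # low, high: C x H x W multi-modal maps from adjacent scales; text: sentence feature.
        g = self.gate(text)[:, :, None, None]   # text-guided channel gate
        gated = high * g + high                 # element-wise sum (⊕) with the gated branch
        fused = torch.cat([low, gated], dim=1)  # concatenation (©) along channels
        return self.bottleneck(fused)           # bottleneck (Figure 3b)

low, high = torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40)
sentence = torch.randn(1, 1024)
print(TextGuidedFusionSketch()(low, high, sentence).shape)  # torch.Size([1, 256, 40, 40])
```

The gating keeps the text in the loop at every decoding scale, which is the behavior the captions of Figures 6 and 7 compare before and after fusion.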
Figure 4. Visual examples of referring image segmentation by our method.
Figure 5. Visual examples of the proposed modules.
Figure 6. Comparison of segmentation heatmaps from Res3, Res4, and Res5 before and after the fusion by the proposed cascaded pyramid structure.
Figure 7. Comparison of segmentation heatmaps from Res3, Res4, and Res5 before and after the fusion by the proposed text-guided gating mechanism.
Table 1. Comparison with state-of-the-art methods on three benchmark datasets using Overall IoU as the metric.
Methods       Vis. Encoder  Lang. Encoder  UNC val  UNC testA  UNC testB  UNC+ val  UNC+ testA  UNC+ testB  G-Ref val
RMI [3]       DL-101        LSTM           45.18    45.69      45.57      29.86     30.48       29.50       34.52
ASGN [37]     DL-101        LSTM           50.46    51.20      49.27      38.41     39.79       35.97       41.36
RRN [2]       DL-101        LSTM           55.33    57.26      53.95      39.75     42.15       36.11       36.45
MAttNet [45]  M-RCN         LSTM           56.51    62.37      51.70      46.67     52.39       40.08       -
CMSA [12]     DL-101        LSTM           58.32    60.61      55.09      43.76     47.60       37.89       39.98
CAC [36]      R-101         LSTM           58.90    61.77      53.81      -         -           -           44.32
STEP [34]     DL-101        LSTM           60.04    63.46      57.97      48.19     52.33       40.41       46.40
BCAM [10]     DL-101        LSTM           61.35    63.37      59.57      48.57     52.87       42.13       48.04
CMPC [35]     DL-101        LSTM           61.36    64.53      59.64      49.56     53.44       43.23       49.05
LSCM [11]     DL-101        LSTM           61.47    64.99      59.55      49.34     53.12       43.50       -
CMPC+ [46]    DL-101        LSTM           62.47    65.08      60.82      50.25     54.04       43.47       49.89
EFN [47]      R-101         GRU            62.76    65.69      59.67      51.50     55.24       43.01       51.93
TGMI [48]     DL-101        LSTM           62.47    65.17      60.30      49.36     53.37       43.62       50.07
RBVL [49]     DL-101        LSTM           62.89    65.01      61.52      51.99     54.27       45.34       50.14
TGRN          DL-101        LSTM           63.82    66.76      61.93      51.65     56.18       44.21       51.63
Table 2. Comparison with state-of-the-art methods pre-trained on the MS-COCO dataset, evaluated on three benchmark datasets using Overall IoU as the metric.
Methods       Vis. Encoder  Lang. Encoder  UNC val  UNC testA  UNC testB  UNC+ val  UNC+ testA  UNC+ testB  G-Ref val
MCN [52]      DN-53         GRU            62.44    64.20      59.71      50.62     54.99       44.69       49.22
ISFP [53]     DN-53         GRU            65.19    68.45      62.73      52.70     56.77       46.39       52.67
LTS [50]      DN-53         GRU            65.43    67.76      63.08      54.21     58.32       48.02       54.40
VLT [14]      DN-53         GRU            65.65    68.29      62.73      55.50     59.20       49.36       52.99
DCMFNet [54]  DN-53         LSTM           65.84    69.34      63.09      54.78     60.03       49.30       51.99
M3Dec [55]    DN-53         GRU            67.88    70.82      65.02      56.98     61.26       50.11       54.79
TGRNcoco      DL-101        LSTM           68.59    70.39      66.10      57.28     61.44       50.66       55.83
Table 3. Prec@X and Overall IoU results of different module combinations on the UNC dataset; a short sketch of both metrics follows the table.
Split  Methods              Pr@0.5  Pr@0.6  Pr@0.7  Pr@0.8  Pr@0.9  IoU
val    Baseline             62.87   54.91   44.16   28.43   7.24    56.00
val    Baseline + RRN       70.78   63.25   52.54   35.55   9.42    60.42
val    Baseline + CPN       72.05   65.45   55.38   39.30   12.32   61.23
val    Baseline + CPN + TG  74.45   60.34   57.42   35.35   11.34   62.45
val    Baseline + ALL       76.13   69.62   60.15   44.96   15.83   63.82
testA  Baseline             70.60   61.59   48.98   28.09   4.67    59.77
testA  Baseline + RRN       75.38   68.29   57.20   38.02   8.84    62.84
testA  Baseline + CPN       76.83   70.59   60.65   41.86   11.84   64.22
testA  Baseline + CPN + TG  77.44   71.46   61.45   43.65   13.65   65.34
testA  Baseline + ALL       80.35   74.97   65.12   47.78   14.72   66.76
testB  Baseline             60.65   51.03   37.98   22.32   4.71    54.12
testB  Baseline + RRN       67.91   59.41   48.87   33.27   11.03   58.70
testB  Baseline + CPN       67.11   60.08   51.56   38.00   15.31   59.23
testB  Baseline + CPN + TG  71.19   62.96   53.23   41.61   16.18   60.96
testB  Baseline + ALL       72.13   64.72   56.26   42.71   17.56   61.93
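For reference, the two metrics reported in Tables 1–4 follow the standard definitions in the referring-segmentation literature: Overall IoU accumulates intersection and union over the whole split before dividing, and Prec@X is the fraction of expressions whose per-sample IoU exceeds the threshold X. The snippet below is a minimal sketch of these definitions, not code from the paper.

```python
# Minimal sketch of Overall IoU and Prec@X over one evaluation split.
import numpy as np

def overall_iou_and_prec(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """preds, gts: lists of binary masks (boolean np.ndarray) for one split."""
    total_inter, total_union, per_sample = 0, 0, []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        total_inter += inter
        total_union += union
        per_sample.append(inter / union if union > 0 else 1.0)
    per_sample = np.array(per_sample)
    # Values are fractions; multiply by 100 for the percentages shown in the tables.
    prec = {f"Pr@{t}": float((per_sample > t).mean()) for t in thresholds}
    return total_inter / total_union, prec

# Toy example with two random masks.
rng = np.random.default_rng(0)
preds = [rng.random((5, 5)) > 0.5 for _ in range(2)]
gts = [rng.random((5, 5)) > 0.5 for _ in range(2)]
print(overall_iou_and_prec(preds, gts))
```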
Table 4. Prec@X and Overall IoU results on the UNC dataset using CMSA as the multi-modal encoder.
UNC    Methods     Pr@0.5  Pr@0.6  Pr@0.7  Pr@0.8  Pr@0.9  IoU
val    CMSA        66.44   59.70   50.77   35.52   10.96   58.32
val    CMSA + ALL  72.26   66.11   57.38   42.51   14.79   61.69
testA  CMSA        70.28   63.64   54.07   38.38   11.21   60.61
testA  CMSA + ALL  77.18   71.42   62.03   45.36   13.42   65.13
testB  CMSA        61.51   53.60   45.40   32.40   12.90   55.09
testB  CMSA + ALL  68.42   61.12   51.14   37.31   14.90   59.16
Table 5. Comparison of computational time and memory costs.
Methods  IoU    Training Time (h)  Testing Time (ms)  Memory Cost (KB)
CMPC     64.53  18.46              85                 1,057,705
TGRN     66.76  20.30              103                1,080,341
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
