MSMamba: A Multi-Semantic Mamba Framework for Referring Remote Sensing Image Segmentation

Zhang, Tianxiang; Li, Junbai; Feng, Yanqiang; Wen, Zhaokun; Liu, Li; Li, Jiangyun

doi:10.3390/rs18121949

by Tianxiang Zhang ¹,², Junbai Li ¹,², Yanqiang Feng ¹,², Zhaokun Wen ¹,², Li Liu ¹,² and Jiangyun Li ¹,²,*

¹ School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China

² Key Laboratory of Knowledge Automation for Industrial Processes, Ministry of Education, Beijing 100083, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(12), 1949; https://doi.org/10.3390/rs18121949 (registering DOI)

Submission received: 16 March 2026 / Revised: 6 June 2026 / Accepted: 9 June 2026 / Published: 12 June 2026

(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?

MSMamba introduces a Mamba-driven multi-semantic framework for referring remote sensing image segmentation.
The proposed Visual–Text Fine-grained Block (VTFB) and multi-scale fusion decoder (MSFD) significantly enhance fine-grained language grounding and scale-aware boundary refinement.

What is the implication of the main finding?

State-space modeling provides an efficient, scalable alternative to heavy attention mechanisms for long-range context modeling in large-scale remote sensing imagery.
Experiments on four public benchmarks show consistent improvements in segmentation accuracy, especially with long-text and cluttered scene descriptions.

Abstract

Remote sensing referring segmentation aims to extract the exact region of an object in an aerial image based on a natural language description, but it remains challenging because remote sensing scenes cover large areas, many objects look similar, and the descriptions are often long and detailed. Existing attention-based models are computationally expensive on large images and may underuse fine-grained language cues, which can lead to inaccurate or incomplete masks. To address this, we present MSMamba, an efficient framework built on a state space model for stable long-range context modeling over large spatial grids. We further strengthen language grounding by identifying descriptive words in the expression and using them to guide visual features from coarse localization to boundary refinement. In addition, we design a scale-aware decoding strategy that fuses multi-scale features with adaptive gating to better handle severe size variation and thin structures. Experiments on four public benchmarks show that MSMamba consistently improves segmentation quality. On RefSegRS, MSMamba improves Pr@0.8 on the test set by 25.53% and increases mIoU by 6.65%. On RRSIS-HR, MSMamba improves Pr@0.8 by 9.09% and increases mIoU by 3.02%. These results suggest that combining a state space model with structured language guidance and scale-aware fusion is a practical alternative to attention-only designs for remote sensing referring segmentation.

Keywords:

remote sensing; referring image segmentation; state space model

1. Introduction

With the rapid increase in spatial resolution and diversity of remote sensing imagery, there is a growing need for fine-grained, human-interpretable scene understanding. Referring remote sensing image segmentation (RRSIS) [1,2,3] addresses this demand by taking a remote sensing image and a free-form natural language expression as input, and producing a segmentation mask for the uniquely referred target. By explicitly bridging visual content and linguistic descriptions, RRSIS enables more flexible human–machine interaction in applications such as intelligent monitoring [4], land-use and land-cover analysis [5], disaster assessment [6], and urban planning [7], and represents an important step from perceptual to more cognitive remote sensing interpretation.

To make RRSIS useful in the above applications, the model needs reliable vision–language interaction in high-resolution, large-area scenes. Figure 1 summarizes the main interaction paradigms and their limitations. In Figure 1a, self-attention concatenates visual tokens $I$ and text tokens $T$ to produce the multimodal self-attention feature $F_{SA}$. Since each token attends to all others, the cost grows quickly with the sequence length, which is expensive for large images. It can also become less focused in scenes that contain many look-alike regions, or large areas with little texture and weak contrast. This makes it harder to keep clear long-range and multi-scale links, and it may lead to incomplete or broken masks. In Figure 1b, cross-attention uses pooled text queries $T'$ to attend to visual features $I'$ and outputs the multimodal cross-attention feature $F_{CA}$. This is more efficient, but pooling the text tokens into a smaller set of queries may remove fine-text details, and the fixed query capacity limits step-by-step refinement, especially for small or low-contrast targets. Overall, current attention-based designs still face a trade-off between efficiency and fine-grained alignment.

Recent Mamba-style state space models have shown strong potential for remote sensing vision [8,9,10]. Together with the trade-off described above, this motivates us to consider a Mamba-based design, as shown in Figure 1c. With selective scanning and state-space updates on joint visual-text states $(I \& T)'$, Mamba produces the multimodal Mamba feature $F_{Mamba}$ with linear cost and refines context gradually across layers and spatial positions. Because the state is updated during scanning, the model can build global context over large areas and maintain more continuous long-structure representations, which suits remote sensing scenes. However, using Mamba directly is still not enough. Remote sensing images often contain many similar objects, such as dense buildings and repeated farmland. In such cases, state propagation alone can blur the distinction among candidates and confuse the referred target. Stronger attribute cues from the text are therefore required; otherwise, the model may yield incomplete masks under strict evaluation settings. In addition, RRSIS expressions are often long and detailed, containing multiple attributes and complex spatial relationships. If language is only used coarsely, the model may underutilize these constraints and fail to ground complex descriptions. These issues motivate a Mamba-driven framework that strengthens fine-grained vision–text alignment and improves robustness under visual confusion, leading to our proposed method.

Building on the above analysis, we propose MSMamba, a multi-semantic Mamba-driven framework for referring remote sensing image segmentation. By leveraging Mamba-style state space modeling, MSMamba scales vision-language reasoning to high-resolution, large-area imagery through efficient context propagation on dense spatial grids. The continuous state update helps build a stable global context, which is particularly important for remote sensing scenes with wide spatial extents and strong visual ambiguity among candidate regions.

While a global context is necessary, accurate referring segmentation also requires fine-grained grounding between the expression and local visual details. To meet this need, we design a lightweight Visual–Text Fine-grained Block (VTFB) that decomposes the input expression into attribute-level features and further builds complementary global and local text features. These features guide the segmentation process in a stage-wise manner so that language information can modulate visual features at different scales and support more precise alignment. Meanwhile, remote sensing targets vary greatly in size and shape, and many targets are small or have thin boundaries, so strong cross-scale fusion is also required. Therefore, we further introduce an enhanced multi-scale fusion decoder (MSFD), which strengthens cross-scale feature interaction through differential modeling and gated fusion. This design improves cross-resolution consistency and helps delineate small targets and complex boundaries.

With these designs, MSMamba achieves state-of-the-art (SOTA) performance on four mainstream RRSIS benchmarks, including RefSegRS, RRSIS-D, RISBench, and RRSIS-HR. Across datasets, MSMamba provides more reliable grounding for long and attribute-rich expressions and produces higher-quality masks in cluttered scenes, especially on thin structures and boundary-sensitive regions. These results support the effectiveness of combining state-space multimodal modeling with explicit fine-grained semantic injection and strengthened multi-scale decoding.

In summary, our main contributions are as follows:

We explore a Mamba-driven framework for referring remote sensing image segmentation and design a Visual–Text Fine-grained Block (VTFB) based on state space modeling to improve fine-grained vision–language alignment under long and complex descriptions.
We design an enhanced multi-scale fusion decoder (MSFD) that strengthens cross-scale feature interaction and improves segmentation of small objects and boundary regions.
Extensive experiments on four public benchmarks (RRSIS-D, RRSIS-HR, RefSegRS, and RISBench) demonstrate the effectiveness and state-of-the-art performance of MSMamba. Further analysis shows more evident advantages in long-text scenarios.

2. Related Works

2.1. Referring Image Segmentation

Referring Image Segmentation (RIS) extends visual grounding by predicting a pixel-accurate mask for a target specified by a natural language expression. It requires precise boundaries and reliable vision–language alignment [11]. Early RIS methods typically adopted a two-stream design with convolutional visual backbones and recurrent language encoders, and then used iterative refinement or recurrent multimodal interaction to reduce boundary ambiguity [12,13,14]. However, these approaches often encode expressions coarsely and provide limited long-range, multi-scale reasoning. This can lead to missed small or thin targets and fragmented masks in cluttered scenes.

Later work strengthened linguistic reasoning and fine-grained grounding through structured semantics and tighter cross-modal alignment. Typical directions include decomposing expressions into components for modular alignment [15], emphasizing keyword-aware cues at the pixel level [16], modeling cross-modal dependencies with attention mechanisms [17], and explicitly representing entities and attributes for more interpretable grounding [18]. Multi-stage coarse-to-fine alignment with consistency constraints further improves robustness in complex scenes [19]. Nevertheless, many methods still rely on attention-heavy interactions or pooled sentence-level representations, which makes it difficult to scale to larger receptive fields while preserving fine-grained semantic constraints.

With the rise of Transformers, RIS shifted toward unified cross-modal modeling with layered self-attention. Language-aware mechanisms are integrated into visual Transformers [20,21], and linguistic cues are incorporated early in the encoder to maintain semantic consistency across feature hierarchies [22,23,24]. These frameworks generally improve global dependency modeling and boundary quality, but dense attention is computationally expensive for high-resolution inputs. It can also be less stable in scenes with repetitive textures and large areas of similar regions, which motivates more scalable long-range modeling and more structured semantic conditioning.

Recent studies further advance RIS by improving semantic granularity and contextual reasoning, including joint word-pixel and sentence-mask alignment [25], modeling implicit linguistic cues [26], reducing cross-modal gaps with masking or reconstruction objectives [27], and extending RIS to more open-domain referring settings [28]. These advances have also encouraged referring segmentation to move beyond natural images into domain-specific imagery, including remote sensing, where large-scale scenes and fine-grained descriptions make efficient and reliable grounding particularly important.

2.2. Referring Remote Sensing Image Segmentation

Referring remote sensing image segmentation (RRSIS) adapts RIS to aerial imagery and requires grounding free-form expressions on large-scale remote sensing scenes. Remote sensing images are extremely large and exhibit strong scale variation, which challenges both efficiency and the delineation of small objects. Many regions look visually similar from a top-down viewpoint, and repeated patterns increase instance ambiguity. Expressions are often long and attribute-rich, and the layered constraints must be explicitly grounded to avoid drifting into similar-looking regions. These properties make RRSIS more demanding than natural-image RIS.

The task and the RefSegRS benchmark were introduced by Yuan et al. [1], who fused language-guided embeddings into multi-level visual features to reduce missed detections of small or sparse objects. In parallel, geometry-aware designs were developed to address severe scale variation and rotation effects. RMSIN enhances multi-scale integration and rotation robustness, releasing the more challenging RRSIS-D benchmark [3]. Subsequent work further strengthens grounding with finer word-level interaction and pixel-level alignment, such as FIANet [29]. MAFN [30] further enhances structural consistency and boundary quality through pixel-level relevance modeling and unified multi-scale and rotation-aware decoding.

More recent approaches emphasize stronger bidirectional vision–language alignment during encoding and introduce more challenging evaluation settings. CroBIM incorporates prompt learning and cascaded cross-modal refinement modules and releases the RISBench benchmark with attribute-rich expressions and complex constraints [31]. CADFormer adopts closed-loop mutual guidance and provides the high-resolution RRSIS-HR benchmark to better reflect large-area, high-detail imagery [32].

Overall, existing RRSIS methods perform well on short expressions and medium-to-large targets, but three issues remain. Many models still scale poorly to high-resolution, large-area imagery due to heavy backbones and attention-intensive interaction. Fine-grained grounding is often insufficient under strong visual ambiguity because attribute-level cues are not continuously exploited. Cross-scale fusion can also be weak for large-scale variation and thin boundaries, leading to incomplete masks and degraded boundary quality.

3. Proposed Methodology

3.1. Preliminaries: State Space Model (SSM)

An SSM is inspired by linear time-invariant systems and models the mapping from inputs to outputs through a latent state. For an input signal $x(t) \in \mathbb{R}^{d_s}$ and output $y(t) \in \mathbb{R}^{d_s}$, an SSM maintains a hidden state $h(t) \in \mathbb{R}^{N_s}$, where $N_s$ denotes the state dimension and $d_s$ denotes the input feature dimension. The dynamics are governed by the following linear ordinary differential equation:

$$
\begin{aligned} h'(t) &= A h(t) + B x(t), \\ y(t) &= C h(t), \end{aligned} \tag{1}
$$

where $A \in \mathbb{R}^{N_s \times N_s}$ is the state transition matrix, and $B \in \mathbb{R}^{N_s \times d_s}$ and $C \in \mathbb{R}^{d_s \times N_s}$ map the input to the state and the state to the output, respectively. In vision tasks, images are processed as discrete token sequences by serializing two-dimensional features, so the continuous SSM needs to be discretized to operate on sampled sequences.

Mamba [33] adopts the zero-order hold (ZOH) scheme to discretize the continuous system with a discretization step size $\Delta$. Specifically, the continuous parameters $A$ and $B$ are transformed into discrete counterparts $\bar{A}$ and $\bar{B}$ as

$$
\begin{aligned} \bar{A} &= \exp(\Delta A), \\ \bar{B} &= (\Delta A)^{-1} (\exp(\Delta A) - I) \, \Delta B, \end{aligned} \tag{2}
$$

where $I$ denotes the identity matrix. After discretization, the state update can be written in a recurrent form:

$$
\begin{aligned} h_t &= \bar{A} h_{t-1} + \bar{B} x_t, \\ y_t &= C h_t, \end{aligned} \tag{3}
$$

which enables long-range dependency modeling through state propagation with linear complexity in sequence length.
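
The discretization in Equation (2) and the recurrence in Equation (3) can be illustrated with a minimal sketch. It assumes a diagonal state matrix with non-zero entries and a single scalar input channel, so that the matrix exponential and inverse reduce to element-wise operations. The function names and toy values are hypothetical and are not taken from the authors' implementation.

```python
import math


def zoh_discretize(a_diag, b, delta):
    """Zero-order hold for a diagonal A and a single-input B (Equation (2)).

    a_diag: diagonal entries of A (assumed non-zero), length N_s
    b:      entries of B for one scalar input channel, length N_s
    delta:  discretization step size
    """
    a_bar = [math.exp(delta * a) for a in a_diag]
    # (delta*A)^{-1} (exp(delta*A) - I) (delta*B), evaluated element-wise.
    b_bar = [(ab - 1.0) / (delta * a) * (delta * bi)
             for ab, a, bi in zip(a_bar, a_diag, b)]
    return a_bar, b_bar


def ssm_scan(a_bar, b_bar, c, xs):
    """Recurrent form of Equation (3) for a scalar input sequence xs."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        # h_t = A_bar * h_{t-1} + B_bar * x_t (element-wise for diagonal A)
        h = [ab * hi + bb * x for ab, bb, hi in zip(a_bar, b_bar, h)]
        # y_t = C h_t
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys


# Impulse response of a two-state toy system.
a_bar, b_bar = zoh_discretize(a_diag=[-1.0, -0.5], b=[1.0, 1.0], delta=0.1)
ys = ssm_scan(a_bar, b_bar, c=[1.0, 1.0], xs=[1.0, 0.0, 0.0, 0.0])
```

The loop visits each element once and carries a fixed-size state, which is where the linear cost in sequence length comes from. With negative diagonal entries the impulse response decays over time.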

A key limitation of the standard SSM is that its parameters remain constant across different inputs, which restricts content-adaptive modeling. Mamba addresses this by introducing a selective mechanism that makes key parameters input-dependent, notably the input-to-state and state-to-output mappings (and in many implementations also the timescale $\Delta$), allowing the model to dynamically adjust its contextual representations while preserving linear-time training and inference.
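
As a rough illustration of the selective mechanism, the sketch below recomputes $\Delta$, $B$, and $C$ from every input element before applying the ZOH update. The scalar weights `w_delta`, `w_b`, and `w_c` are hypothetical stand-ins for learned projections, the softplus on $\Delta$ is an assumption borrowed from common Mamba implementations, and the state matrix is again assumed diagonal with non-zero entries.

```python
import math


def softplus(v):
    """Smooth positive mapping, used here to keep the step size above zero."""
    return math.log1p(math.exp(v))


def selective_scan(xs, a_diag, w_delta, w_b, w_c):
    """Scalar-channel selective scan with a diagonal state matrix.

    Delta, B and C are recomputed from every input x_t, so the state update
    is content-dependent while the loop stays linear in sequence length.
    """
    h = [0.0] * len(a_diag)
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)        # input-dependent step size
        b_t = [wb * x for wb in w_b]         # input-to-state mapping
        c_t = [wc * x for wc in w_c]         # state-to-output mapping
        new_h = []
        for a, hi, bi in zip(a_diag, h, b_t):
            a_bar = math.exp(delta * a)      # ZOH, as in Equation (2)
            b_bar = (a_bar - 1.0) / a * bi   # (dA)^{-1}(exp(dA) - 1) dB
            new_h.append(a_bar * hi + b_bar * x)
        h = new_h
        ys.append(sum(ci * hi for ci, hi in zip(c_t, h)))
    return ys


ys = selective_scan([0.5, -1.0, 2.0], a_diag=[-1.0, -2.0],
                    w_delta=1.0, w_b=[1.0, 0.5], w_c=[1.0, 1.0])
```

Real implementations replace this Python loop with a hardware-aware parallel scan, and the sketch only shows where the input dependence enters the update.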

Due to the strong long-sequence modeling ability of Mamba, its exploration has gradually extended from language modeling to the vision domain. Recent studies have developed Mamba-based visual backbones for image classification and segmentation [34]. Since Mamba is originally designed for one-dimensional sequences rather than two-dimensional images, ViM [34] adapts it through image serialization and bidirectional scanning, while VMamba further introduces four-directional scanning to enhance spatial modeling [35]. In remote sensing, existing works mainly focus on the influence of different scanning strategies on semantic segmentation performance [36,37]. However, existing evidence shows that multi-directional scanning does not bring consistently significant gains [38]. Therefore, for remote sensing scenarios, improving performance requires not only effective spatial modeling, but also stronger semantic interaction and feature alignment.

3.2. Architecture Overview

We propose a remote sensing referring segmentation framework named MSMamba, as illustrated in Figure 2. The architecture consists of three components: a visual encoder enhanced by the proposed Visual–Text Fine-grained Block (VTFB), a text backbone, and a multi-scale fusion decoder (MSFD). Given an input image I and a referring expression E, the visual encoder follows the VMamba [35] paradigm, where VSS blocks and VTFBs are alternately stacked to extract hierarchical visual representations across stages. The MSFD takes the stage-wise features produced by the visual encoder as input and performs cross-scale interaction and adaptive fusion. A lightweight segmentation head is then applied to the fused representation to generate the final prediction corresponding to the target referred by E.

VTFB serves as the core unit for multimodal interaction within the visual encoder. It begins with a word-level semantic processor (WLSP) that parses the expression, locates attribute-related token indices in the BERT sequence, and constructs a bridge feature that encodes the attribute-level semantics of the referred target. The Global–Local Cross-Modal Alignment (GL-CA) then integrates linguistic cues with the visual representation to produce a global alignment feature and a local alignment feature. Finally, the visual feature, global alignment feature, bridge feature, and local alignment feature are concatenated in a fixed order and fed into an SS2D block [35] to propagate multimodal context over the spatial grid. After this interaction, we retain only the visual-related channels as the stage-wise features for downstream decoding.

3.3. Visual–Text Fine-Grained Block (VTFB)

3.3.1. Word-Level Semantic Processor

As shown in Figure 3, the Visual–Text Fine-grained Block (VTFB) enhances fine-grained vision–language interaction. VTFB first employs a Word-Level Semantic Processor (WLSP) to construct a stage-aware semantic bridge from descriptive words in the referring expression, which help distinguish the target from visually similar candidates. The bridge feature is mainly derived from attribute-level semantics, and its composition is further analyzed in the subsequent ablation study.

Given an input expression, we tokenize it using BERT and obtain contextual token embeddings $L \in \mathbb{R}^{N \times D}$, where $N$ is the token length and $D$ is the embedding dimension. In parallel, WLSP parses the textual expression corresponding to $L$ and extracts attribute-related words for constructing the semantic bridge. When spaCy [39] is available, we use its POS tags and dependency relations to select words in four categories: adjectives/modifiers, numerals/quantities, spatial/positional words, and morphological modifiers.

WLSP then aligns the extracted words with the BERT token sequence. Each word is tokenized into WordPiece tokens, and the positions of these tokens are located in the full BERT sequence using a two-step matching procedure. First, exact WordPiece sequence matching is performed, where the tokenized word is searched as a continuous subsequence in the BERT token sequence. If exact matching fails and the attribute word is tokenized into a single token, partial token matching is applied. In this step, the method searches for a BERT token that contains the attribute word or is contained within it, which helps handle minor tokenization inconsistencies. If a word cannot be aligned using these strategies, it is discarded. To ensure stable batch processing, at most $M$ attribute words are kept per expression, with a validity mask indicating valid positions. Features corresponding to invalid words are set to zero. The selected token embeddings are gathered to form an attribute token sequence $z_{bridge} \in \mathbb{R}^{M \times D}$, where $M$ is the number of selected tokens. We apply mean pooling to obtain a single attribute vector $\bar{z}_{bridge} \in \mathbb{R}^{D}$. At each stage, $\bar{z}_{bridge}$ is projected to $C_i$ channels and broadcast to $H_i \times W_i$, producing the bridge feature map $X_B \in \mathbb{R}^{C_i \times H_i \times W_i}$, where $C_i$ is the channel dimension and $H_i$ and $W_i$ respectively denote the feature-map height and width.
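
The two-step matching procedure can be sketched as follows. This is a minimal illustration and not the authors' code: the WordPiece pieces of each attribute word are passed in as plain lists instead of calling a real tokenizer, one slot is kept per word, and all function names are hypothetical.

```python
def find_subsequence(tokens, pieces):
    """Start index of `pieces` as a contiguous run in `tokens`, or -1."""
    n, m = len(tokens), len(pieces)
    if m == 0:
        return -1
    for start in range(n - m + 1):
        if tokens[start:start + m] == pieces:
            return start
    return -1


def align_word(tokens, pieces, word):
    """Step 1: exact WordPiece subsequence. Step 2: partial single-token match."""
    start = find_subsequence(tokens, pieces)
    if start >= 0:
        return list(range(start, start + len(pieces)))
    if len(pieces) == 1:
        for i, tok in enumerate(tokens):
            core = tok.lstrip("#")  # drop the WordPiece continuation prefix
            if core and (word in core or core in word):
                return [i]
    return []  # words that cannot be aligned are discarded


def build_attribute_slots(tokens, words_with_pieces, max_words):
    """Keep at most `max_words` aligned words and return a validity mask."""
    slots = []
    for word, pieces in words_with_pieces:
        indices = align_word(tokens, pieces, word)
        if indices:
            slots.append(indices)
        if len(slots) == max_words:
            break
    mask = [1] * len(slots) + [0] * (max_words - len(slots))
    slots = slots + [[] for _ in range(max_words - len(slots))]
    return slots, mask


tokens = ["[CLS]", "the", "larger", "red", "building", "[SEP]"]
words = [("large", ["large"]), ("red", ["red"]), ("leftmost", ["left", "##most"])]
slots, mask = build_attribute_slots(tokens, words, max_words=4)
```

In this toy input, "red" is found by exact matching, "large" falls back to the partial match against "larger", and "leftmost" is discarded because its two pieces are absent and the fallback only applies to single-token words. The padded slots with mask value 0 correspond to the invalid positions whose features are set to zero.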

The attribute features provide auxiliary fine-grained cues, while the complete BERT sentence embedding is still preserved for multimodal fusion. This design strengthens attribute-aware understanding without making the model solely dependent on successful attribute extraction. When spaCy parsing is unavailable, a lightweight heuristic using predefined vocabulary and simple lexical rules serves as a deterministic fallback.

3.3.2. Global–Local Cross-Modal Alignment

Building on the stage-wise bridge feature constructed by WLSP, the Global–Local Cross-Modal Alignment (GL-CA) establishes semantic correspondences between text and visual features at both global and local levels and fuses these cues with the visual stream through SS2D. At each stage $i$, we use the sentence-level [CLS] representation as a semantic query and perform cross-attention over the visual feature $V \in \mathbb{R}^{C_i \times H_i \times W_i}$ to obtain a global alignment feature. This feature aggregates visual evidence according to the overall intent of the expression. In parallel, we compute a local alignment feature by projecting pixel embeddings and token embeddings into a shared latent space and measuring their dot-product similarity. This captures fine-grained word-pixel correspondences that are important for boundary refinement and small-object discrimination. We then project the global and local alignment features to $C_i$ channels, yielding $X_G, X_L \in \mathbb{R}^{C_i \times H_i \times W_i}$.
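
The local alignment step can be illustrated with a small sketch of word-pixel dot-product similarity. The text only specifies the similarity in a shared latent space, so the softmax over tokens and the weighted sum used here to turn similarities into a per-pixel feature are assumptions made for illustration, as are the function names and toy vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def local_alignment(pixels, tokens):
    """Word-pixel alignment on already-projected embeddings.

    pixels: list of P vectors (flattened H x W grid), tokens: list of N vectors,
    all with the same latent size. Returns (similarity, features), where
    similarity[p][t] is the dot product and features[p] is a softmax-weighted
    mix of token vectors (assumed aggregation, not stated in the text).
    """
    similarity = [[dot(p, t) for t in tokens] for p in pixels]
    dim = len(tokens[0])
    features = []
    for row in similarity:
        peak = max(row)
        weights = [math.exp(s - peak) for s in row]  # numerically stable softmax
        total = sum(weights)
        weights = [w / total for w in weights]
        features.append([sum(w * t[d] for w, t in zip(weights, tokens))
                         for d in range(dim)])
    return similarity, features


pixels = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
similarity, features = local_alignment(pixels, tokens)
```

Each pixel ends up dominated by the token it is most similar to, which is the kind of fine-grained correspondence the local branch is meant to capture.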

Together with the bridge feature $X_B$, these features are concatenated with the visual feature along the channel dimension in a coarse-to-fine order:

$$
X = \mathrm{Concat}(V, X_G, X_B, X_L) \in \mathbb{R}^{4 C_i \times H_i \times W_i}. \tag{4}
$$

The influence of alternative concatenation orders is further analyzed in the subsequent ablation study. The fused tensor $X$ is then processed by the SS2D block, which serves as the core selective scanning operator within the VSS block. SS2D propagates multimodal context over the spatial grid and performs cross-directional merging to produce a modality-aware representation with long-range interactions. By operating on the concatenated channels, SS2D enables explicit interaction between visual and textual features. After this interaction, we retain only the channels corresponding to the visual stream, as the textual cues have already been incorporated. The resulting tensor is used as the refined visual feature $V^{*} \in \mathbb{R}^{C_i \times H_i \times W_i}$ for downstream decoding.

3.4. Multi-Scale Fusion Decoder (MSFD)

Inspired by RMSIN [3], we propose a multi-scale fusion decoder (MSFD) to handle severe scale variation and complex spatial layouts in remote sensing imagery. As shown in Figure 4, MSFD consists of a Multi-branch Cross-scale Parsing (MCP) module and a Selective Kernel Scale-Aware Gate (SKSAG) executed sequentially, where MCP incorporates a Bidirectional Feature Pyramid Network (BiFPN) [40] to enhance multi-level features before cross-scale interaction.

3.4.1. Multi-Branch Cross-Scale Parsing

The Multi-branch Cross-scale Parsing (MCP) first enhances multi-level features using BiFPN. BiFPN performs top-down and bottom-up aggregation to inject high-level semantics into shallow layers while preserving fine-grained structural details, producing BiFPN-enhanced features $\{V_{BiFPN}^{i}\}_{i=1}^{4}$.

To enable cross-scale reasoning in a shared spatial domain, we align $\{V_{BiFPN}^{i}\}_{i=1}^{4}$ to a unified resolution, concatenate them along the channel dimension, and compress the result into a compact representation $V_{uni}$. Based on $V_{uni}$, we use two complementary branches. The Cross-Scale Aware (CSA) module first constructs multiple receptive-field representations from $V_{uni}$ by keeping the original-resolution feature and generating two downsampled features with depthwise convolutions of different strides. These multi-scale features are then flattened and concatenated into a unified token sequence, allowing fine-resolution tokens to interact with coarse contextual tokens through self-attention. After the attention operation, only the tokens corresponding to the original resolution are reshaped back to the spatial feature map, producing $V'_{attn}$. In parallel, a Depthwise Separable Convolution (DSConv) branch further enhances localized structural patterns from the value features of the original-resolution tokens and outputs $V'_{conv}$. Their outputs are fused with a residual connection to obtain the first-stage cross-scale representation:

$$
V'_{mcp} = V'_{attn} + V'_{conv} + V_{uni}. \tag{5}
$$

Next, we maintain scale diversity by applying grouped feedforward processing to $V'_{mcp}$. Specifically, $V'_{mcp}$ is split into several channel groups, and each group is processed by a lightweight feedforward network consisting of point-wise convolutions, SwiGLU activation, depthwise convolution, and Global Response Normalization. The group outputs are concatenated and passed to a second CSA module for another round of cross-scale interaction and contextual aggregation. Finally, a unified feedforward network with the same design integrates the aggregated features to generate a cross-scale enhanced representation $V_{mcp}$, which is then fed into SKSAG for adaptive structure-semantic fusion.

3.4.2. Selective Kernel Scale-Aware Gate

At each decoding stage, SKSAG combines a global cue extracted from $V_{mcp}$ with a local cue from $V_{BiFPN}$. Concretely, we split $V_{mcp}$ along the channel dimension into four stage-specific groups according to the channel dimensions of the four feature levels, each corresponding to one stage in the unified construction. For the current stage, we take its corresponding channel group and upsample it to the stage resolution, yielding the stage-specific global feature $V'_{glob}$. In parallel, we apply a Selective Kernel convolution to the BiFPN feature at the same stage to obtain the local structural feature $V'_{loc}$.

We then predict a spatial-channel gate $w \in [0, 1]^{C \times H \times W}$ to modulate $V'_{loc}$ and combine it with $V'_{glob}$ by element-wise addition:

$$
V_{out} = w \odot V'_{loc} + V'_{glob}. \tag{6}
$$
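
Equation (6) can be illustrated with a minimal element-wise sketch. How the gate is predicted is not detailed here, so the sigmoid over hypothetical gate logits is an assumption chosen only because it yields values in $[0, 1]$. Flat lists stand in for $C \times H \times W$ tensors flattened in the same order.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_fusion(gate_logits, v_loc, v_glob):
    """Element-wise V_out = w * V_loc + V_glob with w = sigmoid(gate_logits)."""
    if not (len(gate_logits) == len(v_loc) == len(v_glob)):
        raise ValueError("inputs must share one shape")
    return [sigmoid(g) * loc + glob
            for g, loc, glob in zip(gate_logits, v_loc, v_glob)]


# Closed gate, half-open gate, and open gate on identical features.
v_out = gated_fusion(gate_logits=[-20.0, 0.0, 20.0],
                     v_loc=[4.0, 4.0, 4.0],
                     v_glob=[1.0, 1.0, 1.0])
```

The gate only scales the local structural feature, so a closed gate leaves the global cue untouched and an open gate adds the full local detail on top of it.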

The resulting stage feature $V_{out}$ is forwarded to the segmentation head for mask prediction. The segmentation head is implemented with lightweight convolutional layers that progressively upsample and fuse multi-scale features via bilinear interpolation to regress a single-channel logit map $\hat{M}$. During training, $\hat{M}$ is supervised by the binary ground-truth mask using the sum of Dice loss and BCEWithLogits [29,32].
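
The training objective can be sketched in plain Python as the sum of a soft Dice loss and a binary cross-entropy computed from logits. This is an illustrative sketch under assumptions: the smoothing term `eps`, the equal weighting of the two terms, and the averaging over pixels are not specified in the text, and flat lists stand in for the logit map and the mask.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed from raw logits (numerically stable)."""
    total = 0.0
    for z, y in zip(logits, targets):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def dice_loss(logits, targets, eps=1.0):
    """Soft Dice loss on sigmoid probabilities; eps is an assumed smoothing term."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    inter = sum(p * y for p, y in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def segmentation_loss(logits, targets):
    """Sum of the Dice and BCE-with-logits terms."""
    return dice_loss(logits, targets) + bce_with_logits(logits, targets)


targets = [1.0, 0.0, 1.0, 0.0]
good = segmentation_loss([10.0, -10.0, 10.0, -10.0], targets)
bad = segmentation_loss([-10.0, 10.0, -10.0, 10.0], targets)
```

The Dice term measures region overlap and is less sensitive to the foreground-background imbalance that small targets create, while the cross-entropy term supplies a per-pixel gradient.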

4. Experiments

4.1. Metrics and Datasets

To evaluate MSMamba on the RRSIS task, we report three complementary metrics: precision at multiple IoU thresholds (Pr@0.5-Pr@0.9), mean Intersection-over-Union (mIoU), and overall Intersection-over-Union (oIoU). Pr reflects segmentation correctness under different strictness levels, where higher thresholds (e.g., Pr@0.8/0.9) are more sensitive to boundary quality and the completeness of thin or small structures. mIoU averages IoU over samples to provide a balanced view across object scales, while oIoU aggregates pixels globally and emphasizes overall mask quality. Together, these metrics provide a comprehensive assessment of referring segmentation performance.
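
The difference between the three metrics can be made concrete with a short sketch. It assumes flat binary masks, counts a sample as correct when its IoU is greater than or equal to the threshold, and assigns an IoU of 0 to a sample whose union is empty. These conventions and the function names are assumptions for illustration.

```python
def iou_parts(pred, gt):
    """Intersection and union pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute mIoU, oIoU and Pr@X over a list of predictions."""
    ious, total_inter, total_union = [], 0, 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_parts(pred, gt)
        ious.append(inter / union if union else 0.0)
        total_inter += inter
        total_union += union
    n = len(ious)
    return {
        "mIoU": sum(ious) / n,                                # mean over samples
        "oIoU": total_inter / total_union if total_union else 0.0,  # pooled pixels
        "Pr": {t: sum(iou >= t for iou in ious) / n for t in thresholds},
    }


# One small target segmented poorly and one large target segmented perfectly.
preds = [[1, 0, 0, 0], [1] * 8]
gts = [[1, 1, 0, 0], [1] * 8]
result = evaluate(preds, gts)
```

In this toy case the per-sample IoUs are 0.5 and 1.0, so mIoU is 0.75 while oIoU is 9/10 = 0.9, showing how pooling pixels lets the large target dominate. Pr@0.5 is 1.0, whereas Pr@0.6 and all stricter thresholds drop to 0.5.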

We conduct experiments on four public datasets: RefSegRS [1], RRSIS-D [3], RISBench [31], and RRSIS-HR [32]. RefSegRS and RRSIS-D mainly evaluate referring segmentation under challenging target conditions, where objects are often small, low-saliency, or visually ambiguous in cluttered scenes with large-scale and orientation variations. In contrast, RISBench and RRSIS-HR place greater emphasis on language complexity, featuring attribute-rich annotations and longer descriptions that require fine-grained semantic parsing and multi-constraint grounding. Together, these benchmarks provide a rigorous testbed for assessing the generalization and robustness of MSMamba.

4.1.1. RefSegRS

RefSegRS contains 4420 image–expression–mask triplets, split into 2172 training, 431 validation, and 1817 testing samples. Each image is of size $512 \times 512$ and covers diverse remote sensing categories, including buildings, vehicles, vegetation, and water bodies. This dataset is characterized by containing many small or low-salience objects embedded in cluttered backgrounds, making it ideal for evaluating the performance of fine-grained boundary recovery and robust segmentation under challenging visual conditions.

4.1.2. RRSIS-D

RRSIS-D comprises 17,402 triplets, with 12,181 for training, 1740 for validation, and 3481 for testing. Images are high-resolution at $800 \times 800$ and exhibit substantial scale variation and diverse object orientations. The scenes span urban infrastructure, agricultural landscapes, and natural environments, providing a strong testbed for multi-scale aggregation and robustness to rotation and viewpoint-induced ambiguity.

4.1.3. RISBench

RISBench is a large-scale benchmark with 52,472 triplets (26,300 training, 10,013 validation, and 16,159 testing). All images are resized to $512 \times 512$ for consistent evaluation, while the underlying spatial resolution varies from 0.1 m to 30 m, covering a wide range of scales and details. RISBench defines 26 semantic classes, and each class is annotated with eight attributes, enabling evaluation of multi-attribute reasoning and cross-scene robustness in attribute-rich referring settings.

4.1.4. RRSIS-HR

RRSIS-HR contains 2650 triplets, split into 2118 training, 268 validation, and 264 testing samples. Each image is $1024 \times 1024$ and covers areas ranging from 0.06 km² to 25 km², with large variations in object sizes and scene complexity. The referring expressions are longer and more descriptive (19.6 words on average, with a minimum of 6 and a maximum of 41), which particularly emphasizes long-text semantic parsing, multi-instance disambiguation, and fine-grained segmentation on high-resolution imagery.

4.2. Implementation Details

Our framework is implemented in PyTorch 2.1.1 and built upon Mamba and VMamba. The visual backbone is initialized with ImageNet-pretrained weights, and the entire network is trained end-to-end. For language encoding, we adopt BERT-base [41] from the HuggingFace Transformers library [42] to extract contextualized text features. All experiments are conducted on two NVIDIA RTX 4090 GPUs using distributed data parallel training. We train the model for 40 epochs with a total batch size of 8 (4 per GPU). Following standard practice in prior work for fair comparison, all inputs are resized to $512 \times 512$ for both training and inference. We optimize the network using AdamW with a weight decay of $1 \times 10^{-4}$. The learning rate is set to $5 \times 10^{-5}$ for newly introduced modules and $2.5 \times 10^{-5}$ for the backbone and VSSM components, and is scheduled with cosine annealing and a 5-epoch warm-up.
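The learning-rate schedule described above can be sketched as a function of the epoch index. This is a minimal illustration under stated assumptions: the warm-up is taken to be linear, the schedule is stepped once per epoch, and the cosine phase decays to a minimum of zero. The text specifies none of these details, and the function name `lr_at_epoch` is hypothetical.

```python
import math


def lr_at_epoch(epoch, base_lr, total_epochs=40, warmup_epochs=5, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine annealing.

    Assumptions (not stated in the text): linear warm-up shape, per-epoch
    stepping, and min_lr = 0 at the end of training.
    """
    if epoch < warmup_epochs:
        # Ramp up so that base_lr is reached at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the cosine phase that has elapsed, in [0, 1).
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


NEW_MODULE_LR = 5e-5   # newly introduced modules
BACKBONE_LR = 2.5e-5   # backbone and VSSM components

# One (epoch, new-module lr, backbone lr) triple per training epoch.
schedule = [
    (e, lr_at_epoch(e, NEW_MODULE_LR), lr_at_epoch(e, BACKBONE_LR))
    for e in range(40)
]

for epoch, lr_new, lr_backbone in schedule[:7]:
    print(f"epoch {epoch:2d}  new modules {lr_new:.2e}  backbone {lr_backbone:.2e}")
```

Because both parameter groups share the same schedule shape, the backbone rate stays at half the rate of the new modules throughout training.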

4.3. Performance Comparison

In the RRSIS task, our method achieves highly competitive performance. We compare MSMamba with representative RIS models originally developed for natural images, including LAVT (2022) [23], CARIS (2023) [26], and RIS-DMMI (2023) [28]. We further compare MSMamba with recent methods designed specifically for referring remote sensing image segmentation, namely FIANet (2024) [29], RMSIN (2024) [3], MAFN (2025) [30], and CADFormer (2025) [32].

For fairness, we reimplement the compared methods under a unified evaluation protocol whenever possible. For early baselines, we follow the results reported by the corresponding benchmark papers and explicitly annotate their sources in Tables 1–4. For RefSegRS, RRSIS-D, and RISBench, these results are adopted from [31], which introduced the RISBench benchmark and reported the corresponding baseline results. For RRSIS-HR, they are adopted from [32], which introduced the RRSIS-HR benchmark. Notably, MSMamba shows clear gains on high-precision and fine-grained metrics, with improvements in Pr@0.8, Pr@0.9, and mIoU.

4.3.1. Quantitative Evaluations on RefSegRS

The RefSegRS dataset is characterized by a large number of small object instances and cluttered backgrounds. As shown in Table 1, MSMamba achieves highly competitive performance on this benchmark and leads on most evaluation metrics, outperforming competing Transformer-based methods. In terms of mIoU and oIoU, MSMamba provides consistent gains over the strongest baseline methods, indicating that our method provides more reliable segmentation results in cluttered backgrounds.

Notably, the advantages of MSMamba are most significant under high-IoU precision metrics, which directly reflect boundary accuracy and small-object recognition. Compared with FIANet, MSMamba improves Pr@0.8 and Pr@0.9 on the test set by +25.53% and +14.04%, respectively, and increases mIoU by +6.65%. Such gains suggest that MSMamba is not only better at coarse localization, but also significantly stronger at recovering thin structures, weak-texture regions, and complex boundaries. We attribute this improvement primarily to two factors: (1) Mamba-based state space modeling, which maintains stable long-range context aggregation in complex scenes; and (2) the VTFB module, where WLSP constructs an attribute-aware semantic bridge. Additionally, GL-CA performs global and local cross-modal alignment to strengthen fine-grained grounding for small instances.

4.3.2. Quantitative Evaluations on RRSIS-D

The RRSIS-D dataset contains objects with large scale variations and strong visual ambiguity, and the corresponding referring expressions are usually long and contain rich modifiers. As shown in Table 2, our method achieves highly competitive results, and the improvement is more evident at higher IoU thresholds. Compared to the strongest competitor MAFN, MSMamba improves Pr@0.8 and Pr@0.9 on the test set by +1.67% and +2.50%, indicating better boundary quality under challenging scale variations.

We also observe clear gains in mIoU. Compared to MAFN, MSMamba increases mIoU on the test set by +1.46%. Compared to other Swin-B baselines, MSMamba improves mIoU by +3.23% over FIANet and by +3.16% over RMSIN, indicating more accurate pixel-level segmentation. Our method does not show a significant improvement in oIoU. This is expected because oIoU aggregates pixels globally and is often dominated by large-area targets and background regions. In contrast, our design mainly boosts small targets, thin structures, and boundary-sensitive regions. Importantly, MSMamba maintains competitive oIoU, suggesting that enhancing fine-grained localization does not compromise the overall segmentation quality on large objects. This behavior is consistent with the complementary roles of VTFB and MSFD: VTFB strengthens attribute-aware grounding from global localization to local refinement, while MSFD improves cross-scale fusion via MCP and SKSAG to better preserve subtle structures and boundary details.
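The difference between mIoU and oIoU can be made concrete with a short sketch. It assumes the definitions commonly used in referring segmentation: mIoU averages per-sample IoU, oIoU divides the summed intersections by the summed unions, and Pr@X is the fraction of samples whose IoU is at least X. The function name and the toy pixel counts below are illustrative and are not taken from any benchmark.

```python
def referring_seg_metrics(pairs, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute (mIoU, oIoU, Pr@X) from per-sample pixel counts.

    pairs: list of (intersection_pixels, union_pixels), one entry per sample.
    """
    ious = [inter / union if union > 0 else 0.0 for inter, union in pairs]
    # mIoU: every sample contributes equally, regardless of target size.
    miou = sum(ious) / len(ious)
    # oIoU: pixels are pooled first, so large targets dominate the ratio.
    oiou = sum(inter for inter, _ in pairs) / sum(union for _, union in pairs)
    # Pr@X: share of samples whose IoU reaches the threshold X.
    precision = {t: sum(iou >= t for iou in ious) / len(ious) for t in thresholds}
    return miou, oiou, precision


# Toy example: one large target and three small ones.
before = [(9000, 10000), (40, 100), (40, 100), (40, 100)]
# Only the three small targets improve; the large one is unchanged.
after = [(9000, 10000), (80, 100), (80, 100), (80, 100)]

miou_b, oiou_b, pr_b = referring_seg_metrics(before)
miou_a, oiou_a, pr_a = referring_seg_metrics(after)
print(f"mIoU gain: {miou_a - miou_b:.4f}")
print(f"oIoU gain: {oiou_a - oiou_b:.4f}")
print(f"Pr@0.8: {pr_b[0.8]:.2f} -> {pr_a[0.8]:.2f}")
```

In this toy case, doubling the overlap on the three small targets raises mIoU and Pr@0.8 substantially while oIoU barely moves, which matches the pattern described above.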

4.3.3. Quantitative Evaluations on RISBench

RISBench provides a larger training set with rich attribute annotations, and the referring expressions typically contain multiple intertwined semantic constraints. As shown in Table 3, MSMamba consistently outperforms the strongest Transformer-based baseline model, MAFN. Specifically, MSMamba improves Pr@0.8 and Pr@0.9 on the test set by +4.01% and +7.44%, indicating stronger fine-grained discrimination under complex semantics and better preservation of boundary details. Meanwhile, MSMamba achieves the best mIoU and improves it by +1.43% over MAFN on the test set, demonstrating robust mask quality across different scenarios and object scales. For oIoU, MSMamba remains highly competitive, improving it by +0.39% over MAFN on the test set and staying close to the best-performing baseline.

4.3.4. Quantitative Evaluations on RRSIS-HR

As shown in Table 4, MSMamba achieves highly competitive performance on RRSIS-HR and leads across all metrics. Compared with CADFormer, which obtains the strongest overall performance among the previous baselines, MSMamba improves Pr@0.5 by 3.41% and Pr@0.8 by 9.09%. Under stricter standards, the advantage of our model becomes more evident. Compared with the best previous result on Pr@0.9, MSMamba improves this metric by 3.78%, indicating stronger boundary-sensitive localization capability in high-resolution scenarios. We also observe clear improvements in both oIoU and mIoU, which are increased by 5.41% and 3.02% compared with CADFormer, respectively.

We attribute these gains to the proposed VTFB module. This module integrates fine-grained linguistic features with visual features through word-level semantic decomposition and global–local cross-modal alignment. This approach enhances the perception of attributes and relationships during the encoding process, which is crucial for the long sentences and complex expressions in RRSIS-HR. This is also reflected by the larger improvements at high IoU thresholds, especially Pr@0.8 and Pr@0.9. We note that RRSIS-HR is a newly released and challenging benchmark, and the original paper only provides limited baseline results. To further strengthen the comparison, we additionally evaluate MAFN on RRSIS-HR. Since MAFN is specifically designed for referring remote sensing image segmentation and performs competitively on other RRSIS datasets, all compared methods on RRSIS-HR are remote-sensing-specific referring segmentation models, making the evaluation more consistent and task-relevant.

4.3.5. Computational Efficiency Analysis

We report the parameter size, FLOPs, GPU memory consumption, and inference time of MSMamba to examine the balance between model complexity and segmentation quality. As shown in Table 5, MSMamba uses 264.77 M parameters, 416.80 G FLOPs, and 19.54 GB of GPU memory. During inference, MSMamba achieves an average inference time of 59.1 ms per image. These results indicate that MSMamba maintains a moderate computational cost while delivering strong segmentation performance, demonstrating an effective balance between efficiency and accuracy. This advantage is attributed to the state-space-based vision–language interaction in VTFB, which improves multimodal alignment and contextual modeling with relatively low computational overhead. The current inference performance is partially limited by incomplete CUDA operator support for Mamba modules, which imposes some efficiency bottlenecks that could be further optimized in future implementations.

4.3.6. Qualitative Comparison

We conduct qualitative comparisons with CADFormer on RefSegRS, RRSIS-D, and RRSIS-HR. On RefSegRS (Figure 5), we mainly analyze the small-target setting. As shown in Columns 1–3, our method produces accurate masks for small and highly similar vehicle targets (e.g., bus/truck/trailer) in cluttered road scenes. This benefit is primarily attributed to Mamba, whose long-range sequence modeling supports coherent context aggregation over large-area imagery and improves small-object delineation.

We further conduct a size-based analysis on the RefSegRS test set. Specifically, all test samples are first sorted according to the pixel area of the target mask, and then divided into five groups with approximately balanced sample numbers. The five groups are defined as Tiny ($0 \leq \mathrm{area} < 555$), Small ($555 \leq \mathrm{area} < 1770$), Medium ($1770 \leq \mathrm{area} < 4774$), Large ($4774 \leq \mathrm{area} < 18848$), and Huge ($\mathrm{area} \geq 18848$). As shown in Figure 6, MSMamba achieves consistently strong segmentation accuracy across different object sizes, demonstrating its robustness to scale variation. More importantly, our method shows clear advantages on small-scale targets. For the Tiny and Small groups, MSMamba outperforms the second-best method FIANet by 6.59% and 8.22% in mIoU, respectively. This indicates that MSMamba is particularly effective in segmenting small and visually ambiguous objects in cluttered remote sensing scenes.
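The size-based grouping can be reproduced with a simple threshold lookup. The sketch below uses the five area ranges given in the text; the helper names `size_group` and `miou_by_size` and the sample values are hypothetical and serve only to show the bookkeeping.

```python
from bisect import bisect_right

# Lower bounds (in pixels) of the Small, Medium, Large, and Huge groups.
SIZE_EDGES = [555, 1770, 4774, 18848]
SIZE_NAMES = ["Tiny", "Small", "Medium", "Large", "Huge"]


def size_group(area):
    """Map a target mask area in pixels to its size group."""
    # bisect_right counts how many lower bounds are <= area, which makes each
    # interval closed on the left and open on the right, as in the text.
    return SIZE_NAMES[bisect_right(SIZE_EDGES, area)]


def miou_by_size(samples):
    """Average IoU per size group.

    samples: iterable of (target_area_pixels, iou) pairs.
    Groups without samples are omitted from the result.
    """
    buckets = {name: [] for name in SIZE_NAMES}
    for area, iou in samples:
        buckets[size_group(area)].append(iou)
    return {name: sum(v) / len(v) for name, v in buckets.items() if v}


# Made-up samples, for illustration only.
toy_samples = [(120, 0.50), (300, 0.70), (900, 0.65), (20000, 0.90)]
print(miou_by_size(toy_samples))
```

The thresholds themselves come from sorting the test samples by area and cutting the sorted list into five roughly equal parts, so they are specific to the RefSegRS test set.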

We further focus on long and attribute-rich expressions on RRSIS-D and RRSIS-HR. As illustrated in Figure 7, our predictions on RRSIS-D remain instance-specific and better aligned with the referred region when the expression provides only weakly discriminative attribute cues. Figure 8 provides quantitative support for this observation by reporting mIoU across different expression-length bins. After splitting samples into four bins by expression length, our model maintains strong mIoU and shows a more significant advantage in the case of long texts, indicating that richer attribute cues in longer expressions are effectively utilized. Consistently, on RRSIS-HR (Figure 9, Columns 1 and 4), where descriptions are typically longer and contain multiple attributes and hierarchical modifiers, our method yields cleaner masks for small targets and maintains more regular geometric boundaries, with fewer fragments and reduced leakage. We attribute these improvements to fine-grained text decomposition in VTFB. WLSP explicitly extracts attribute-related tokens and constructs an attribute-aware semantic bridge, enabling the model to exploit attribute-level constraints more effectively as the description becomes longer and more complex.

4.4. Ablation Study

We conduct ablation experiments on a validation subset of RRSIS-D to evaluate the effectiveness of key components in our proposed network.

4.4.1. Evaluation of GL-CA and Bridge Feature Design

Table 6 analyzes the roles of the global alignment feature $X_{G}$ and the local alignment feature $X_{L}$ in GL-CA. The bridge feature $X_{B}$ is designed as an intermediate cue between $X_{G}$ and $X_{L}$. When either $X_{G}$ or $X_{L}$ is removed, $X_{B}$ no longer serves its intended role, so we disable $X_{B}$ in the first three settings for consistency. Experiments show that enabling either $X_{G}$ or $X_{L}$ alone results in lower scores. This is because $X_{G}$ primarily provides coarse-grained intent guidance, while $X_{L}$ focuses on fine-grained local details. When both branches are used simultaneously, the mIoU increases by 0.44% and the oIoU increases by 1.22%. This indicates that the dual-branch structure exhibits complementarity in coarse global localization and local structure refinement. Building on this coarse-to-fine complementarity, inserting $X_{B}$ between $X_{G}$ and $X_{L}$ yields a further 0.91% gain in mIoU, increasing it to 66.93%.

Table 7 compares different ways to construct the bridge feature $X_{B}$. We consider two prompt sources: an attribute prompt derived from attribute-level modifiers and an entity prompt taken from the entity token (as in FIANet). Using only the attribute prompt performs best and improves all metrics, with gains of +0.91% mIoU and +1.56% Pr@0.9. In contrast, the entity-level prompt alone shows limited effectiveness, as it brings almost no overall improvement and even leads to drops on Pr@0.8. Combining both prompt sources is also inferior to the attribute-only setting, indicating that the entity cue may dilute the contribution of attribute-level semantics. Therefore, we use the attribute-level prompt to construct $X_{B}$ by default. To further examine the importance of the extracted attribute words, we conduct an additional perturbation experiment by replacing the most relevant attribute words with the least relevant words in the expression when constructing the bridge feature. This perturbation leads to clear performance degradation, with mIoU, oIoU, Pr@0.8, and Pr@0.9 decreasing by 2.54%, 1.37%, 2.06%, and 1.26%, respectively. These results further verify that semantically relevant attribute-level cues are important for constructing an effective bridge feature.

Table 8 analyzes the position of the bridging feature $X_{B}$ in the concatenated sequence. We always place the visual feature $V$ first to maintain visual information dominance in subsequent multimodal interactions. We also test the effect of the order of the semantic cues on model performance. As shown in the last row of the table, the coarse-to-fine order $X_{G} \to X_{L}$ performs better than the reverse order $X_{L} \to X_{G}$. Therefore, we use $V \to X_{G} \to X_{L}$ as the default order and investigate the impact of the insertion position of $X_{B}$. Placing $X_{B}$ directly after $V$ yields poorer results because the bridging cues are introduced before a stable global reference frame is formed, which interferes with the early visual encoding process. Appending $X_{B}$ after $X_{L}$ also underperforms, since the local refinement has already been applied and the extra cue may disturb the learned fine structures. In contrast, inserting $X_{B}$ between $X_{G}$ and $X_{L}$ yields the best performance (66.93% mIoU/77.86% oIoU) by completing a coherent coarse-to-fine semantic pathway, where $X_{B}$ serves as a transition cue from global grounding to local refinement.
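Why the position of a segment in the concatenated sequence can matter at all is easiest to see in a toy recurrence. The sketch below is a deliberately simplified, one-directional linear scan with a fixed decay; it is not the selective scan used in SS2D, which is input-dependent and scans in several directions. The scalar token values and the names `linear_scan` and `build_sequence` are illustrative assumptions.

```python
def linear_scan(tokens, decay=0.5):
    """Toy recurrence h_t = decay * h_{t-1} + x_t; returns all hidden states."""
    h, states = 0.0, []
    for x in tokens:
        # Each state mixes the current token with a decayed summary of
        # everything that came earlier in the sequence.
        h = decay * h + x
        states.append(h)
    return states


def build_sequence(parts, order):
    """Concatenate named token groups in the given order."""
    seq = []
    for name in order:
        seq.extend(parts[name])
    return seq


# Scalar stand-ins for visual tokens and the three semantic cues.
parts = {"V": [1.0, 1.0], "X_G": [2.0], "X_B": [3.0], "X_L": [4.0]}

default_order = build_sequence(parts, ["V", "X_G", "X_B", "X_L"])
bridge_first = build_sequence(parts, ["V", "X_B", "X_G", "X_L"])

print(linear_scan(default_order))
print(linear_scan(bridge_first))
```

Both sequences contain the same tokens, yet the states seen by the later tokens differ, because each token is processed in the context of whatever precedes it. This is only a motivation for studying the order empirically, as Table 8 does; it makes no claim about which order is better.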

4.4.2. Evaluation of Multi-Scale Fusion Decoder

Table 9 compares four decoding configurations. Without any explicit multi-scale decoding module, the model already achieves a strong baseline (66.24% mIoU/77.70% oIoU), suggesting that the Mamba-based backbone captures certain cross-scale cues through long-range dependency modeling. Introducing a standard FPN [43] improves high-precision metrics (Pr@0.8 +0.69% and Pr@0.9 +0.17%) but degrades region-level quality (mIoU −0.63% and oIoU −0.48%), indicating that conventional top-down fusion may introduce scale interference or over-smoothing under extreme scale variation. We additionally evaluate a standard CIM [3] module as a direct replacement decoder. Compared with the baseline, CIM improves mIoU by 0.19% and Pr@0.8 by 0.75%, while oIoU decreases by 0.12% and Pr@0.9 drops by 0.35%. This suggests that CIM brings a modest gain in mIoU, which is beneficial for remote sensing targets, but it is less effective at improving fine-grained delineation under strict overlap criteria.

In contrast, our proposed MSFD achieves the best overall performance, improving mIoU by +0.69% and oIoU by +0.16% over the baseline without a multi-scale decoding module, while maintaining Pr@0.8 at the best level. MSFD remains highly competitive on Pr@0.9, with only a small gap of 0.23% to the best Pr@0.9 result. These results demonstrate that combining MCP and SKSAG provides a more balanced and effective scale-aware fusion strategy for remote sensing scenarios.

Table 10 compares the proposed learnable gate in SKSAG with two simple fusion strategies. In the table, Add denotes directly adding the MCP feature and the corresponding BiFPN feature, while Concat denotes concatenating these two features along the channel dimension and then using a $1 \times 1$ convolution to restore the original channel number. Compared with the proposed SKSAG, the other two settings both lead to clear drops in mIoU and oIoU, indicating weaker segmentation performance.

The performance degradation mainly results from the limited adaptability of simple fusion operations. Direct addition treats the two feature sources equally and cannot adjust their contributions according to different spatial regions or feature channels. In the Concat setting, the two feature sources are concatenated along the channel dimension and then passed through a $1 \times 1$ convolution. Although this introduces learnable channel mixing, it still lacks an explicit gating mechanism to selectively emphasize useful global and local cues. In contrast, the proposed SKSAG uses a spatial-channel gate, which is both spatial-aware and channel-aware, enabling more adaptive fusion between MCP and BiFPN features and better multi-scale representation.
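The three fusion strategies compared in Table 10 can be contrasted in a few lines of plain Python. This is a conceptual sketch only: features are flat lists of floats, the $1 \times 1$ convolution is written as a per-location matrix product, and the gate is a generic element-wise sigmoid gate whose logits are passed in directly. How SKSAG actually predicts its gate (its selective-kernel design) is not reproduced here, and all function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fuse_add(a, b):
    """Add: both sources always contribute with equal, fixed weight."""
    return [x + y for x, y in zip(a, b)]


def fuse_concat(a, b, weight):
    """Concat followed by a 1x1 convolution, for the channels at one location.

    weight has shape C x 2C. The same matrix is applied at every location,
    so the mixing is learnable but does not adapt to the spatial position.
    """
    cat = a + b  # channel-wise concatenation of two C-dimensional vectors
    return [sum(w * v for w, v in zip(row, cat)) for row in weight]


def fuse_gated(a, b, gate_logits):
    """Gated fusion: one weight in (0, 1) per channel and per location.

    In a real module the logits would be predicted from the features; here
    they are given as inputs to keep the sketch short.
    """
    return [
        sigmoid(g) * x + (1.0 - sigmoid(g)) * y
        for x, y, g in zip(a, b, gate_logits)
    ]


mcp = [2.0, 2.0]     # stand-in for an MCP feature
bifpn = [4.0, 4.0]   # stand-in for the matching BiFPN feature

print(fuse_add(mcp, bifpn))
print(fuse_concat(mcp, bifpn, [[1, 0, 1, 0], [0, 1, 0, 1]]))
# First element trusts both sources equally, second leans on the MCP feature.
print(fuse_gated(mcp, bifpn, [0.0, 4.0]))
```

With the particular weight matrix shown, Concat reduces to Add, which illustrates that fixed channel mixing alone cannot favor one source in some regions and the other elsewhere. The gate can, because its value may differ for every element.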

5. Conclusions

In this study, we present MSMamba, a Mamba-driven framework for referring remote sensing image segmentation. It is designed for large-area remote sensing imagery, where models must handle efficient long-range context modeling, fine-grained grounding under visual ambiguity, and robust cross-scale fusion. Built on a VMamba encoder, MSMamba incorporates a Visual–Text Fine-grained Block (VTFB) to improve fine-grained vision–language alignment. VTFB decomposes referring expressions into attribute-level cues and leverages complementary global and local alignment signals to inject structured linguistic information through state space modeling. For decoding, we further introduce a multi-scale fusion decoder (MSFD), in which Multi-branch Cross-scale Parsing (MCP) enhances cross-scale interaction and the Selective Kernel Scale-Aware Gate (SKSAG) enables scale-aware fusion to better preserve thin structures and boundary details.

Experiments on the RefSegRS, RRSIS-D, RISBench, and RRSIS-HR datasets demonstrate that our proposed method is highly competitive across various metrics. It shows a strong advantage, particularly at Pr@0.8 and Pr@0.9, and achieves a reliable improvement in mIoU, especially for thin structures, small targets, and long attribute-rich expressions. These results validate the effectiveness of combining state space modeling with explicit semantic injection for referring remote sensing image segmentation in large-area scenes with complex and attribute-rich descriptions. Further analysis across different expression lengths also shows that the proposed framework exhibits more evident advantages in long-text scenarios.

Despite these gains, precise boundary delineation for large objects remains challenging, and the visual encoder pre-trained on natural images may still suffer from a domain gap when transferred to remote sensing imagery. In future work, we will advance our framework along three directions. First, we will explore stronger language encoders and more explicit multi-constraint reasoning to better handle complex expressions and visually similar instances. Second, we will incorporate large vision–language models and remote-sensing-specific foundation models pre-trained on large-scale aerial data to reduce the domain gap and further improve boundary-sensitive segmentation. Third, we will investigate cross-dataset generalization and domain adaptation strategies to enhance the robustness of MSMamba across different remote sensing benchmarks, addressing variations in category definitions, expression styles, and annotation protocols.

Author Contributions

Conceptualization, T.Z., J.L. (Junbai Li) and Z.W.; Methodology, T.Z., J.L. (Junbai Li) and Z.W.; Software, J.L. (Junbai Li), Y.F. and Z.W.; Validation, T.Z., J.L. (Junbai Li) and Y.F.; Formal analysis, T.Z., J.L. (Junbai Li) and Z.W.; Investigation, T.Z., J.L. (Junbai Li) and Y.F.; Resources, T.Z. and J.L. (Jiangyun Li); Data curation, J.L. (Junbai Li) and Y.F.; Writing—original draft, J.L. (Junbai Li); Writing—review & editing, T.Z., Y.F., Z.W. and J.L. (Jiangyun Li); Visualization, J.L. (Junbai Li) and Y.F.; Supervision, T.Z., Z.W., L.L. and J.L. (Jiangyun Li); Project administration, T.Z., L.L. and J.L. (Jiangyun Li); Funding acquisition, T.Z. and J.L. (Jiangyun Li). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Natural Science Foundation of China under Grant 42201386 and in part by the Fundamental Research Funds for the Central Universities of USTB under Grant FRF-TP-24-060A.

Data Availability Statement

Datasets and codes are available at https://github.com/ljb1227zy/MSMamba (accessed on 8 June 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yuan, Z.; Mou, L.; Hua, Y.; Zhu, X.X. Rrsis: Referring remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12.
2. Zhang, T.; Wen, Z.; Kong, B.; Liu, K.; Zhang, Y.; Zhuang, P.; Li, J. Referring Remote Sensing Image Segmentation via Multi-Scale Spatially-Guided Joint Prediction. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 19, 2796–2811.
3. Liu, S.; Ma, Y.; Zhang, X.; Wang, H.; Ji, J.; Sun, X.; Ji, R. Rotated multi-scale interaction network for referring remote sensing image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 26658–26668.
4. Sun, Y.; Wang, D.; Li, L.; Ning, R.; Yu, S.; Gao, N. Application of remote sensing technology in water quality monitoring: From traditional approaches to artificial intelligence. Water Res. 2024, 267, 122546.
5. Yang, H.; Jiang, Z.; Zhang, Y.; Wu, Y.; Luo, H.; Zhang, P.; Wang, B. A high-resolution remote sensing land use/land cover classification method based on multi-level features adaptation of segment anything model. Int. J. Appl. Earth Obs. Geoinf. 2025, 141, 104659.
6. Harb, M.M.; Dell’Acqua, F. Remote sensing in multirisk assessment: Improving disaster preparedness. IEEE Geosci. Remote Sens. Mag. 2017, 5, 53–65.
7. Coutts, A.M.; Harris, R.J.; Phan, T.; Livesley, S.J.; Williams, N.S.; Tapper, N.J. Thermal infrared remote sensing of urban heat: Hotspots, vegetation, and an assessment of techniques for use in urban planning. Remote Sens. Environ. 2016, 186, 637–651.
8. Wang, H.; Zhuang, P.; Zhang, X.; Li, J. DBMGNet: A Dual-Branch Mamba-GCN Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4410517.
9. Zhuang, P.; Zhang, X.; Wang, H.; Zhang, T.; Liu, L.; Li, J. FAHM: Frequency-Aware Hierarchical Mamba for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6299–6313.
10. Li, J.; Wang, H.; Zhang, X.; Wang, J.; Zhang, T.; Zhuang, P. DVR: Towards Accurate Hyperspectral Image Classifier via Discrete Vector Representation. Remote Sens. 2025, 17, 351.
11. Li, J.; Wen, Z.; Zhang, Y.; Wang, W.; Cai, Y.; Zhang, T.; He, X.; Liu, J. Generalized referring expression segmentation driven by instance-oriented queries. Pattern Recognit. 2025, 172, 112524.
12. Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from natural language expressions. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–10 October 2016; pp. 108–124.
13. Liu, C.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Yuille, A. Recurrent multimodal interaction for referring image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Honolulu, HI, USA, 21–26 July 2017; pp. 1271–1280.
14. Li, R.; Li, K.; Kuo, Y.C.; Shu, M.; Qi, X.; Shen, X.; Jia, J. Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5745–5753.
15. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1307–1315.
16. Shi, H.; Li, H.; Meng, F.; Wu, Q. Key-word-aware network for referring expression image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 38–54.
17. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 July 2019; pp. 10502–10511.
18. Margffoy-Tuay, E.; Pérez, J.C.; Botero, E.; Arbeláez, P. Dynamic multimodal instance segmentation guided by natural language queries. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 630–645.
19. Liu, S.; Hui, T.; Huang, S.; Wei, Y.; Li, B.; Li, G. Cross-modal progressive comprehension for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4761–4775.
20. Ding, H.; Liu, C.; Wang, S.; Jiang, X. Vision-language transformer and query generation for referring segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 16321–16330.
21. Kim, N.; Kim, D.; Lan, C.; Zeng, W.; Kwak, S. Restr: Convolution-free referring image segmentation using transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18145–18154.
22. Feng, G.; Hu, Z.; Zhang, L.; Lu, H. Encoder fusion network with co-attention embedding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Montreal, QC, Canada, 11–17 October 2021; pp. 15506–15515.
23. Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18155–18165.
24. Ouyang, S.; Wang, H.; Xie, S.; Niu, Z.; Tong, R.; Chen, Y.W.; Lin, L. SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation. In Proceedings of the IJCAI, Macao, China, 19–25 August 2023; pp. 1294–1302.
25. Zhang, Z.; Zhu, Y.; Liu, J.; Liang, X.; Ke, W. Coupalign: Coupling word-pixel with sentence-mask alignments for referring image segmentation. Adv. Neural Inf. Process. Syst. 2022, 35, 14729–14742.
26. Liu, S.A.; Zhang, Y.; Qiu, Z.; Xie, H.; Zhang, Y.; Yao, T. CARIS: Context-aware referring image segmentation. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 779–788.
27. Chng, Y.X.; Zheng, H.; Han, Y.; Qiu, X.; Huang, G. Mask grounding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26573–26583.
28. Hu, Y.; Wang, Q.; Shao, W.; Xie, E.; Li, Z.; Han, J.; Luo, P. Beyond one-to-one: Rethinking the referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4067–4077.
29. Lei, S.; Xiao, X.; Zhang, T.; Li, H.-C.; Shi, Z.; Zhu, Q. Exploring fine-grained image-text alignment for referring remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5604611.
30. Shi, L.; Zhang, J. Multimodal-aware fusion network for referring remote sensing image segmentation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8001805.
31. Dong, Z.; Sun, Y.; Liu, T.; Zuo, W.; Gu, Y. Cross-modal bidirectional interaction model for referring remote sensing image segmentation. arXiv 2024, arXiv:2410.08613.
32. Liu, M.; Jiang, X.; Zhang, X. CADFormer: Fine-Grained Cross-modal Alignment and Decoding Transformer for Referring Remote Sensing Image Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 14557–14569.
33. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 29 March–9 July 2024.
34. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417.
35. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. Vmamba: Visual state space model. Adv. Neural Inf. Process. Syst. 2024, 37, 103031–103063.
36. Zhao, S.; Chen, H.; Zhang, X.; Xiao, P.; Bai, L.; Ouyang, W. Rs-mamba for large remote sensing image dense prediction. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5633314.
37. Ma, X.; Zhang, X.; Pun, M.O. Rs 3 mamba: Visual state space model for remote sensing image semantic segmentation. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6011405.
38. Zhu, Q.; Fang, Y.; Cai, Y.; Chen, C.; Fan, L. Rethinking scanning strategies with vision mamba in semantic segmentation of remote sensing imagery: An experimental study. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 18223–18234.
39. Altinok, D. Mastering spaCy: An End-to-End Practical Guide to Implementing NLP Applications Using the Python Ecosystem; Packt Publishing Ltd.: Birmingham, UK, 2021.
40. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
41. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4171–4186.
42. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45.
43. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.

Figure 1. Comparison of multimodal interaction paradigms for referring remote sensing image segmentation: (a) self-attention over concatenated visual and text tokens, (b) cross-attention using pooled text queries, and (c) Mamba-based selective scanning with linear complexity for context-aware multi-scale alignment.

Figure 2. Overview of MSMamba for referring remote sensing image segmentation. A VMamba backbone with Visual State Space (VSS) blocks and interleaved Visual–Text Fine-grained Block (VTFB) modules extracts multimodally enhanced features from the input image and text. The Word-Level Semantic Processor (WLSP) in VTFB decomposes word-level semantics, and the Global–Local Cross-Modal Alignment (GL-CA) module performs dual-scale text-image alignment. The aligned multimodal features are fused with visual features via 2D Selective Scan (SS2D). A multi-scale fusion decoder (MSFD) then aggregates multi-stage features to predict the referred-object mask.

Figure 3. Architecture of the proposed Visual–Text Fine-grained Block, comprising a Word-Level Semantic Processor (WLSP), a Global–Local Cross-Modal Alignment (GL-CA) module, and an SS2D block for fine-grained vision–language interaction.

Figure 4. Illustration of the proposed multi-scale fusion decoder (MSFD), consisting of a Multi-branch Cross-scale Parsing (MCP) module and a Selective Kernel Scale-Aware Gate (SKSAG) for scale-aware feature fusion and segmentation.

Figure 5. Qualitative comparisons between MSMamba and previous state-of-the-art methods on the RefSegRS dataset. The yellow dashed boxes in the figures highlight false positive regions that do not belong to the target.

Figure 6. mIoU trends of different methods on the RefSegRS test set grouped by target mask size. Targets are divided into five groups according to pixel area.

Figure 7. Qualitative comparisons between MSMamba and previous state-of-the-art methods on the RRSIS-D dataset.

Figure 8. mIoU trends with respect to expression length on the RRSIS-D dataset, where expressions are grouped into four length bins.

Figure 9. Qualitative comparisons between MSMamba and previous state-of-the-art methods on the RRSIS-HR dataset. The yellow dashed boxes in the figures highlight false positive regions that do not belong to the target.

Table 1. Quantitative comparison with state-of-the-art methods on the RefSegRS dataset. R-101 and Swin-B denote ResNet-101 and Swin Transformer-Base visual backbones, respectively. All metrics are reported in %, and the best result for each column is highlighted in bold. Results for methods marked with ^† are taken directly from [31].

Method	Visual Encoder	Text Encoder	Pr@0.5 Val	Pr@0.5 Test	Pr@0.6 Val	Pr@0.6 Test	Pr@0.7 Val	Pr@0.7 Test	Pr@0.8 Val	Pr@0.8 Test	Pr@0.9 Val	Pr@0.9 Test	oIoU Val	oIoU Test	mIoU Val	mIoU Test
BRINet ^†	R-101	LSTM	36.86	20.72	35.53	14.26	19.93	9.87	10.66	2.98	2.84	1.14	61.59	58.22	38.73	31.51
LSCM ^†	R-101	LSTM	56.82	31.54	41.24	20.41	21.85	9.51	12.11	5.29	2.51	0.84	62.82	61.27	40.59	35.54
RRN ^†	R-101	LSTM	55.43	30.26	42.98	23.01	23.11	14.87	13.72	7.17	2.64	0.98	69.24	65.06	50.81	41.88
ETRIS ^†	R-101	CLIP	54.99	35.77	35.03	23.00	25.06	13.98	12.53	6.44	1.62	1.10	72.89	65.96	54.03	43.11
CRIS ^†	R-101	CLIP	53.13	35.77	36.19	24.11	24.36	14.36	11.83	6.38	2.55	1.21	72.14	65.87	53.74	43.26
CrossVLT ^†	Swin-B	BERT	67.52	41.94	43.85	25.43	25.99	15.19	14.62	3.71	1.87	1.76	76.12	69.73	55.27	42.81
LAVT ^†	Swin-B	BERT	80.97	51.84	58.70	30.27	31.09	17.34	15.55	9.52	4.64	2.09	78.50	71.86	61.53	47.40
RIS-DMMI ^†	Swin-B	BERT	86.17	63.89	74.71	44.30	38.05	19.81	18.10	6.49	3.25	1.00	74.02	68.58	65.72	52.15
MAFN	Swin-B	BERT	93.74	70.72	86.54	55.37	71.00	31.37	28.77	11.45	7.42	2.31	80.17	72.03	72.88	57.56
LGCE	Swin-B	BERT	92.58	74.79	90.49	62.30	80.05	38.69	41.07	16.68	12.76	4.73	85.80	77.28	75.67	60.10
CADFormer	Swin-B	BERT	89.79	75.34	81.67	63.18	62.65	38.97	23.43	15.63	5.57	2.97	79.11	74.08	70.46	60.88
RMSIN	Swin-B	BERT	90.72	77.44	86.31	65.44	71.93	42.43	29.70	17.39	7.19	3.08	79.30	74.53	72.26	62.04
FIANet	Swin-B	BERT	95.82	82.61	92.34	73.75	87.24	54.71	54.52	25.21	10.67	4.62	83.94	76.62	78.28	66.12
MSMamba (ours)	VMamba	BERT	95.36	83.65	94.90	78.48	90.49	68.74	83.06	50.74	45.01	18.66	86.85	79.45	85.11	72.77

Note: The gray background indicates the proposed method. Bold values indicate the best result in each column.
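The metrics in Tables 1–4 can be illustrated with a minimal sketch. It assumes the definitions commonly used in referring segmentation: Pr@X is the share of samples whose IoU reaches the threshold X, mIoU is the mean of per-sample IoUs, and oIoU is the cumulative intersection divided by the cumulative union over all samples. The function name `referring_seg_metrics`, the use of `>=` at the threshold, and the handling of empty unions are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def referring_seg_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return Pr@X, mIoU and oIoU (all in %) for lists of binary masks."""
    ious = []
    total_inter, total_union = 0, 0
    for pred, gt in zip(preds, gts):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        inter = int(np.logical_and(pred, gt).sum())
        union = int(np.logical_or(pred, gt).sum())
        # Assumption: a sample with an empty union contributes IoU = 0.
        ious.append(inter / union if union > 0 else 0.0)
        total_inter += inter
        total_union += union
    ious = np.array(ious)
    # Assumption: a sample counts as correct when IoU >= threshold.
    metrics = {f"Pr@{t}": 100.0 * float((ious >= t).mean()) for t in thresholds}
    metrics["mIoU"] = 100.0 * float(ious.mean())
    metrics["oIoU"] = 100.0 * total_inter / total_union if total_union > 0 else 0.0
    return metrics

# Toy example: one half-correct mask (IoU = 0.5) and one perfect mask (IoU = 1.0).
preds = [np.array([[1, 1], [0, 0]]), np.ones((2, 2))]
gts = [np.array([[1, 0], [0, 0]]), np.ones((2, 2))]
print(referring_seg_metrics(preds, gts))
```

Because oIoU pools pixels across samples, it is dominated by large targets, whereas mIoU weights every sample equally; this is one reason the two columns can rank methods differently.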

Table 2. Comparison with state-of-the-art methods on the RRSIS-D dataset. R-101 and Swin-B denote ResNet-101 and Swin Transformer-Base visual backbones, respectively. All metrics are reported in %. The best result in each column is highlighted in bold. Methods marked with ^† use the scores reported in [31]. Methods marked with * use the scores reported in the original paper [30], since our re-implementations under the common setting obtain slightly lower performance.

Method	Visual Encoder	Text Encoder	Pr@0.5 Val	Pr@0.5 Test	Pr@0.6 Val	Pr@0.6 Test	Pr@0.7 Val	Pr@0.7 Test	Pr@0.8 Val	Pr@0.8 Test	Pr@0.9 Val	Pr@0.9 Test	oIoU Val	oIoU Test	mIoU Val	mIoU Test
RRN ^†	R-101	LSTM	51.09	51.07	42.47	42.11	33.04	32.77	20.80	21.57	6.14	6.37	66.53	66.43	46.06	45.64
BRINet ^†	R-101	LSTM	58.79	56.90	49.54	48.77	39.65	38.61	28.21	27.03	9.19	8.93	70.73	69.68	51.41	49.45
LSCM ^†	R-101	LSTM	57.12	56.02	48.04	46.25	37.87	37.70	26.35	25.28	7.93	7.86	69.28	69.10	50.36	49.92
CRIS ^†	R-101	CLIP	56.44	54.84	47.87	46.77	39.77	38.06	29.31	28.15	11.84	11.52	70.98	70.46	50.75	49.69
ETRIS ^†	R-101	CLIP	62.10	61.07	53.73	50.99	43.12	40.94	30.79	29.30	12.90	11.43	72.75	71.06	55.21	54.21
LAVT ^†	Swin-B	BERT	65.23	63.98	58.79	57.57	50.29	49.30	40.11	38.06	23.05	22.29	76.27	76.16	57.72	56.82
CrossVLT ^†	Swin-B	BERT	67.07	66.42	59.54	59.41	50.80	49.76	40.57	38.67	23.51	23.30	76.25	75.48	59.78	58.48
RIS-DMMI ^†	Swin-B	BERT	70.40	68.74	63.05	60.96	54.14	50.33	41.95	38.38	23.85	21.63	77.01	76.20	61.70	60.25
LGCE	Swin-B	BERT	71.32	69.69	64.54	63.49	55.06	53.49	43.51	41.11	25.40	23.96	77.46	76.84	62.29	60.60
FIANet	Swin-B	BERT	73.33	73.43	66.21	66.96	55.46	54.64	42.47	41.20	23.68	23.01	77.04	76.16	63.35	62.99
RMSIN	Swin-B	BERT	73.39	72.16	66.26	65.96	56.84	54.75	43.45	41.54	24.60	23.96	77.03	76.32	64.28	63.06
CADFormer	Swin-B	BERT	75.57	74.72	67.87	67.74	55.92	56.10	43.39	41.71	24.31	23.67	77.47	77.24	65.12	64.23
MAFN *	Swin-B	BERT	76.32	75.27	69.31	68.14	58.33	56.79	44.54	43.49	24.71	23.76	78.33	77.41	66.03	64.76
MSMamba (ours)	VMamba	BERT	77.24	77.68	71.72	71.76	59.94	60.01	46.55	45.16	28.16	26.26	77.86	77.54	66.93	66.22

Note: The gray background indicates the proposed method, and bold values indicate the best result in each column.

Table 3. Quantitative comparison with state-of-the-art methods on the RISBench dataset. R-101 and Swin-B denote ResNet-101 and Swin Transformer-Base visual backbones, respectively. All metrics are reported in %, and the best result for each column is highlighted in bold. Results for methods marked with ^† are taken directly from [31], where the RISBench benchmark was introduced.

Method	Visual Encoder	Text Encoder	Pr@0.5 Val	Pr@0.5 Test	Pr@0.6 Val	Pr@0.6 Test	Pr@0.7 Val	Pr@0.7 Test	Pr@0.8 Val	Pr@0.8 Test	Pr@0.9 Val	Pr@0.9 Test	oIoU Val	oIoU Test	mIoU Val	mIoU Test
BRINet ^†	R-101	LSTM	52.11	52.87	45.17	45.39	37.98	38.64	30.88	30.79	10.28	11.86	46.27	48.73	41.54	42.91
RRN ^†	R-101	LSTM	54.62	55.04	46.88	47.31	39.57	39.86	32.64	32.58	11.57	13.24	47.28	49.67	42.65	43.18
LSCM ^†	R-101	LSTM	55.87	55.26	47.24	47.14	40.22	40.10	33.55	33.29	12.78	13.91	47.99	50.08	43.21	43.69
ETRIS ^†	R-101	CLIP	59.87	60.98	49.91	51.88	35.88	39.87	20.10	24.49	8.54	11.18	64.09	67.61	51.13	53.06
CRIS ^†	R-101	CLIP	63.42	63.67	54.32	55.73	41.15	44.42	24.66	28.80	10.27	13.27	66.26	69.11	53.64	55.18
LAVT ^†	Swin-B	BERT	68.27	69.40	62.71	63.66	54.46	56.10	43.13	44.95	21.61	25.21	69.39	74.15	60.45	61.93
CrossVLT ^†	Swin-B	BERT	70.05	70.62	64.29	65.05	56.97	57.40	44.49	45.80	21.47	26.10	69.77	74.33	61.54	62.84
LGCE	Swin-B	BERT	71.01	71.45	65.62	66.30	58.66	58.77	47.05	47.43	24.41	27.84	69.39	73.50	62.86	63.67
RIS-DMMI ^†	Swin-B	BERT	71.27	72.05	66.02	66.48	58.22	59.07	45.57	47.16	22.43	26.57	70.58	74.82	62.62	63.93
FIANet	Swin-B	BERT	75.70	75.51	70.89	70.73	64.37	63.59	53.22	52.52	30.12	31.96	69.86	74.30	67.19	67.44
CADFormer	Swin-B	BERT	75.94	76.27	70.90	71.35	64.17	64.21	52.62	52.91	29.07	31.59	70.23	74.34	66.96	67.80
RMSIN	Swin-B	BERT	75.86	76.37	70.88	71.17	64.32	64.17	53.34	53.22	30.44	32.50	70.81	75.51	67.44	68.32
MAFN	Swin-B	BERT	76.87	76.98	72.32	72.46	65.57	65.73	54.47	54.77	31.11	33.09	70.90	74.90	67.95	68.79
MSMamba (ours)	VMamba	BERT	77.34	77.56	73.30	73.38	68.00	67.60	59.48	58.78	40.16	40.53	71.93	75.29	69.66	70.22

Note: The gray background indicates the proposed method, and bold values indicate the best result in each column.

Table 4. Quantitative comparison on the RRSIS-HR dataset. Swin-B denotes the Swin Transformer-Base visual backbone. All metrics are reported in %, and the best result in each column is highlighted in bold. Results for methods marked with ^† are taken directly from [32], where the RRSIS-HR benchmark was introduced.

Method	Visual Encoder	Text Encoder	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	oIoU	mIoU
LAVT ^†	Swin-B	BERT	23.11	20.08	13.64	5.30	0.38	27.94	22.78
FIANet ^†	Swin-B	BERT	31.06	27.65	22.35	15.15	1.89	28.89	27.13
LGCE ^†	Swin-B	BERT	35.98	31.06	23.86	15.15	3.79	38.20	33.48
RMSIN ^†	Swin-B	BERT	50.00	46.97	39.77	29.92	6.44	45.97	43.70
MAFN	Swin-B	BERT	46.21	44.32	40.53	32.95	17.05	46.18	44.34
CADFormer	Swin-B	BERT	63.26	57.95	46.21	34.09	9.09	53.64	54.88
MSMamba (ours)	VMamba	BERT	66.67	61.74	57.58	43.18	20.83	59.05	57.90

Note: The gray background indicates the proposed method, and bold values indicate the best result in each column.
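The gains quoted in the abstract can be reproduced from Tables 1 and 4 as absolute percentage-point differences against the strongest prior method in the respective column (FIANet on the RefSegRS test set, CADFormer on RRSIS-HR). This reading is inferred from the numbers rather than stated in the captions; the short check below uses a hypothetical helper, `absolute_gain`, with the scores copied from the tables.

```python
# Test-set scores (in %) copied from Table 1 (RefSegRS) and Table 4 (RRSIS-HR).
test_scores = {
    "RefSegRS": {
        "MSMamba": {"Pr@0.8": 50.74, "mIoU": 72.77},
        "FIANet": {"Pr@0.8": 25.21, "mIoU": 66.12},
    },
    "RRSIS-HR": {
        "MSMamba": {"Pr@0.8": 43.18, "mIoU": 57.90},
        "CADFormer": {"Pr@0.8": 34.09, "mIoU": 54.88},
    },
}

def absolute_gain(dataset, baseline, metric):
    """Percentage-point difference between MSMamba and a baseline."""
    ours = test_scores[dataset]["MSMamba"][metric]
    ref = test_scores[dataset][baseline][metric]
    return round(ours - ref, 2)

for dataset, baseline in (("RefSegRS", "FIANet"), ("RRSIS-HR", "CADFormer")):
    for metric in ("Pr@0.8", "mIoU"):
        gain = absolute_gain(dataset, baseline, metric)
        print(f"{dataset} {metric}: +{gain} points over {baseline}")
```

The four differences are 25.53 and 6.65 points on RefSegRS and 9.09 and 3.02 points on RRSIS-HR, matching the figures in the abstract.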

Table 5. Comparison of model complexity, GPU memory consumption, and inference speed.

Method	Params (M)	FLOPs (G)	GPU Memory (GB)	Inference Time (ms)
RMSIN	240.04	433.02	14.08	37.0
FIANet	256.17	435.87	12.97	44.7
MAFN	350.06	450.66	22.74	85.4
CADFormer	359.25	466.28	17.81	41.8
MSMamba (Ours)	264.77	416.80	19.54	59.1

Note: Bold values indicate the best result in each column.

Table 6. Ablation study of GL-CA, analyzing the effects of global alignment and local alignment.

Global	Local	Bridge Feature	mIoU	oIoU
✓	×	without	65.78	76.67
×	✓	without	65.58	76.56
✓	✓	without	66.02	77.78
✓	✓	with	66.93	77.86

Note: The gray background indicates the proposed setting, and bold values indicate the best result in each column. ✓ and × denote whether the corresponding component is used or not, respectively.

Table 7. Effect of different prompt sources for constructing the bridge feature $X_{B}$.

Attribute Level	Entity Level	mIoU	oIoU	Pr@0.8	Pr@0.9
×	×	66.02	77.78	45.80	26.60
✓	×	66.93	77.86	46.55	28.16
×	✓	66.06	77.84	45.34	27.07
✓	✓	66.23	77.26	46.09	26.67

Note: The gray background indicates the proposed setting, and bold values indicate the best result in each column. ✓ and × denote whether the corresponding prompt source is used or not, respectively.

Table 8. Ablation study on the insertion position of the bridge feature $X_{B}$ within the WLSP concatenation order.

Position 1	Position 2	Position 3	Position 4	mIoU	oIoU
V	$X_{B}$	$X_{G}$	$X_{L}$	66.21	77.44
V	$X_{G}$	$X_{B}$	$X_{L}$	66.93	77.86
V	$X_{G}$	$X_{L}$	$X_{B}$	65.48	76.71
V	$X_{L}$	$X_{B}$	$X_{G}$	65.46	77.58

Note: The gray background indicates the proposed setting, and bold values indicate the best result in each column.

Table 9. Ablation study on different decoding configurations.

Config	mIoU	oIoU	Pr@0.8	Pr@0.9
NONE	66.24	77.70	45.80	28.22
FPN	65.61	77.22	46.49	28.39
CIM	66.43	77.58	46.55	27.87
MSFD (Ours)	66.93	77.86	46.55	28.16

Note: The gray background indicates the proposed setting, and bold values indicate the best result in each column.

Table 10. Ablation study on different fusion strategies in SKSAG.

Fusion Strategy	Pr@0.8	Pr@0.9	mIoU	oIoU
Add	45.86	27.36	65.24	76.21
Concat	45.34	27.59	65.14	76.16
SKSAG (Ours)	46.55	28.16	66.93	77.86

Note: The gray background indicates the proposed setting, and bold values indicate the best result in each column.
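To make the three fusion strategies compared in Table 10 concrete, the sketch below contrasts element-wise addition, concatenation followed by a projection, and a generic selective-kernel-style gate that weights two branches with a channel-wise softmax. It is an illustrative assumption based on the general selective-kernel idea, not the authors' SKSAG implementation; all function names, weight shapes, and the reduction ratio are hypothetical.

```python
import numpy as np

def fuse_add(a, b):
    """Element-wise sum of two (C, H, W) feature maps."""
    return a + b

def fuse_concat(a, b, proj):
    """Concatenate along channels, then project 2C -> C (a 1x1 convolution)."""
    stacked = np.concatenate([a, b], axis=0)        # (2C, H, W)
    return np.einsum("oc,chw->ohw", proj, stacked)  # (C, H, W)

def fuse_selective_gate(a, b, w_reduce, w_a, w_b):
    """Softmax-gated fusion: each channel picks a convex mix of the branches."""
    pooled = (a + b).mean(axis=(1, 2))               # global context, (C,)
    hidden = np.maximum(w_reduce @ pooled, 0.0)      # squeeze + ReLU, (C // r,)
    logits = np.stack([w_a @ hidden, w_b @ hidden])  # per-branch scores, (2, C)
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    gates = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    fused = gates[0][:, None, None] * a + gates[1][:, None, None] * b
    return fused, gates

# Hypothetical sizes: C = 4 channels, 8 x 8 maps, reduction ratio r = 2.
rng = np.random.default_rng(0)
C, H, W, r = 4, 8, 8, 2
a, b = rng.normal(size=(C, H, W)), rng.normal(size=(C, H, W))
proj = rng.normal(size=(C, 2 * C))
w_reduce = rng.normal(size=(C // r, C))
w_a, w_b = rng.normal(size=(C, C // r)), rng.normal(size=(C, C // r))

fused, gates = fuse_selective_gate(a, b, w_reduce, w_a, w_b)
print(fuse_add(a, b).shape, fuse_concat(a, b, proj).shape, fused.shape)
print(gates.sum(axis=0))  # each channel's two gate weights sum to 1
```

Unlike addition or concatenation, whose mixing is fixed once trained, the gate recomputes the branch weights from the input itself, which is the property a scale-aware gate relies on when target sizes vary widely.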

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, T.; Li, J.; Feng, Y.; Wen, Z.; Liu, L.; Li, J. MSMamba: A Multi-Semantic Mamba Framework for Referring Remote Sensing Image Segmentation. Remote Sens. 2026, 18, 1949. https://doi.org/10.3390/rs18121949

