Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation

Dong, Shan; Xie, Jianlin; Chen, Liang; Chen, He; Qi, Baogui; Ge, Yunqiu

doi:10.3390/rs18071015

by Shan Dong ¹, Jianlin Xie ¹, Liang Chen ¹, He Chen ¹,*, Baogui Qi ² and Yunqiu Ge ³

¹ National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, Beijing Institute of Technology, Beijing 100081, China

² Innovative Equipment Research Institute of Beijing Institute of Technology in Sichuan Tianfu New Area, Chengdu 610213, China

³ School of Integrated Circuits, Tsinghua University, Beijing 100084, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(7), 1015; https://doi.org/10.3390/rs18071015

Submission received: 13 February 2026 / Revised: 23 March 2026 / Accepted: 25 March 2026 / Published: 28 March 2026

(This article belongs to the Special Issue Deep Learning for Multi-Source Remote Sensing Image Interpretation: Exploring, Rethinking, and Limiting Breakthroughs)

Highlights

What are the main findings?

A structural consistency regularization mechanism is proposed to align linguistic dependency structures with visual spatial configurations, constraining cross-modal attention patterns to suppress grounding drift and enhance semantic–spatial stability.
A Grounding Modulated Segmentation strategy for SAM is proposed, which generates grounding-aware prompts and injects grounding cues to enhance the target-aware contextual modeling of the SAM encoder, thereby providing reliable spatial priors and improving robustness to scale variations and cluttered backgrounds.

What are the implications of the main findings?

The structural consistency regularization and Grounding Modulated Segmentation strategy validate the effectiveness of explicit cross-modal alignment and grounding-aware prior guidance for Referring Remote Sensing Image Segmentation.
The findings provide insights for multimodal remote sensing understanding and pre-trained model deployment, with potential applications in high-precision Earth observation and intelligent remote sensing interpretation.

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) is a representative multimodal understanding task for remote sensing, which segments designated targets from remote sensing images according to free-form natural language descriptions. However, complex remote sensing characteristics, such as cluttered backgrounds, large scale variations, small scattered targets and repetitive textures, lead to unstable visual grounding and further spatial grounding drift, resulting in inaccurate segmentation results. Existing approaches typically perform implicit visual–linguistic fusion across encoding and decoding stages, entangling spatial grounding with mask refinement. This tightly coupled formulation lacks explicit structural constraints and is prone to cross-modal ambiguity, especially in complex remote sensing layouts. To address these limitations, we propose a Structurally consistent and Grounding-aware Stagewise Reasoning Framework (SGSRF) that follows a grounding-first, segmentation-second paradigm. The framework decomposes inference into three cascaded stages with progressively imposed structural constraints. First, Cross-modal Consistency Refinement (CCR) lays the foundation for stable spatial grounding by enhancing visual–textual structural alignment via CLIP-based features and Structural Consistency Regularization (SCR), producing well-aligned multimodal representations and reliable grounding cues. Second, Grounding-aware Prompt Generation (GPG) bridges grounding and segmentation by converting aligned representations into complementary sparse and dense prompts, which serve as explicit grounding guidance for the segmentation model. Third, Grounding Modulated Segmentation (GMS) leverages the Segment Anything Model (SAM) to generate fine-grained mask predictions under the joint guidance of prompts and grounding cues, improving spatial grounding stability and robustness to background interference and scale variation. 
Extensive experiments on three remote sensing benchmarks, namely RefSegRS, RRSIS-D, and RISBench, demonstrate that SGSRF achieves state-of-the-art performance. The proposed stagewise paradigm integrates structural alignment, explicit grounding, and prompt-driven segmentation into a unified framework, providing a practical and robust solution for RRSIS in real-world Earth observation applications.

Keywords:

remote sensing; referring segmentation; multimodal understanding; spatial grounding; segment anything model

1. Introduction

With the rapid development of Earth observation technologies, remote sensing imagery has become a fundamental data source for high-level geospatial analysis. Multimodal learning, which integrates heterogeneous information such as visual content and textual descriptions, has emerged as an effective paradigm to overcome the limitations of single-modal interpretation and enhance both flexibility and accuracy in remote sensing understanding.

Existing multimodal tasks in remote sensing [1,2,3], including visual question answering and image–text retrieval, share a common objective of establishing reliable semantic correspondence between visual scenes and natural language under complex spatial structures, which reflects a general consistency without requiring explicit localization. However, these tasks also face fundamental challenges, such as modality gaps, ambiguous semantic grounding where textual semantics cannot be reliably mapped to relevant visual content, and weak spatial anchoring where the identified regions lack precise and stable localization, particularly in scenarios with large spatial extent, repetitive patterns, and subtle inter-object differences. Among these tasks, RRSIS aims to delineate target objects or regions in aerial imagery according to free-form natural language descriptions. As a representative cross-modal task for remote sensing understanding, RRSIS enables flexible, user-driven interaction beyond predefined categories. It supports a wide range of practical remote sensing scenarios, including building identification in urban planning [4], water and vegetation analysis for environmental monitoring [5], land-use management [6], and disaster-affected region extraction [7]. Benefiting from its intrinsic cross-modal modeling capability, RRSIS also plays an important role in high-level vision tasks such as image editing [8] and visual question answering [1,2].

Compared with other multimodal tasks, RRSIS imposes stricter requirements on spatial precision and structural consistency, where even minor grounding errors can be amplified in pixel-level segmentation results. Consequently, RRSIS intrinsically involves two interdependent processes: spatial grounding, which further requires precise localization and the modeling of spatial relationships among targets in complex scenes, and pixel-level segmentation. Spatial grounding establishes the semantic–spatial correspondence between linguistic semantics and spatial locations in the image, while segmentation refines grounded regions into precise masks. Since segmentation operates on grounded regions, its performance is ultimately constrained by grounding stability. Therefore, the principal bottleneck of RRSIS lies not in mask generation itself but in unreliable cross-modal semantic–spatial alignment, where textual tokens and visual regions are required to maintain consistent and structurally coherent interactions across modalities. Existing approaches attempt to adapt referring image segmentation frameworks to remote sensing imagery, yet substantial limitations persist. Most methods inherit architectures [9,10] developed for natural scenes and focus on improving feature fusion through multi-scale aggregation [11], rotation-aware operations [12], long-term semantic guidance [13], or bidirectional visual–language interaction [14]. However, these strategies primarily enhance representation capacity rather than explicitly stabilizing semantic–spatial correspondence, resulting in limited robustness under the complex spatial layouts of remote sensing imagery. Recent advances in foundation models provide new opportunities for RRSIS. Foundation-model-based approaches introduce SAM [15,16] under a “localization–segmentation” paradigm [17,18,19]. While SAM provides strong mask priors and generalization ability, its performance critically depends on prompt quality. 
Localization errors in early stages are difficult to rectify, and encoder representations lack target-aware contextual anchoring. As a result, segmentation remains susceptible to incomplete masks and boundary inaccuracies, indicating that prompt-driven segmentation alone does not resolve the grounding instability.

The core challenge of RRSIS stems from the complexity of visual–textual correspondence in remote sensing imagery. Grounding in remote sensing imagery is considerably more challenging than in natural scenes. From the visual perspective, remote sensing images exhibit large spatial coverage, significant scale variation, repetitive textures, and high appearance similarity among objects. Large composite regions (Figure 1a) often possess boundaries entangled with the surrounding context, making it difficult for single-scale features to simultaneously capture global semantics and local details. Scenes containing multiple similar objects (Figure 1b) require grounding based on relative spatial relationships rather than appearance cues. Small-scale targets (Figure 1c) are easily overwhelmed by background clutter and lack strong visual anchors. From the language perspective, referring expressions typically encode spatial orientations, adjacency relations, and functional semantics, forming structured spatial logic that must be translated into explicit visual cues, whether it is the functional and spatial constraints of the composite target in Figure 1a, the relative spatial orientation in Figure 1b, or the combination of functional landmarks and spatial positions in Figure 1c. However, textual features lack inherent visual anchors, making grounding highly dependent on cross-modal modeling. From the cross-modal modeling perspective, many existing approaches rely on appearance-driven fusion or single-scale feature aggregation, which struggle to establish stable visual–textual associations. Such designs inadequately address boundary ambiguity in large composite regions, fail to explicitly model spatial relations among similar objects, and cannot reliably anchor small targets guided by functional language cues, often leading to spatial grounding drift.

Therefore, effective RRSIS requires a reasoning paradigm that first stabilizes cross-modal grounding and then performs segmentation under reliable spatial priors, which provide high-level guidance on the expected location and spatial extent of the target, rather than entangling both processes in a single step. To this end, we propose SGSRF, which explicitly follows a grounding-first, segmentation-second pipeline. The framework decomposes RRSIS into three progressively constrained stages. Cross-modal Consistency Refinement (CCR) lays the foundation for accurate spatial grounding and establishes reliable semantic–spatial alignment. A CLIP-based dual encoder extracts initial visual and textual features, which are further enhanced through cross-attention with learnable context tokens for global semantic interaction. An Attention-induced Structural Consistency Regularization (ASCR) mechanism aligns linguistic dependency structures with visual spatial configurations, mitigating cross-modal drift and producing consistency-enhanced representations along with reliable spatial grounding cues, which serve as explicit spatial signals indicating candidate target regions for the subsequent stages. Grounding-aware Prompt Generation (GPG) bridges grounding and segmentation by transforming aligned multimodal representations into complementary sparse and dense prompts. These prompts encode semantic constraints and spatial priors of the referred target, providing explicit grounding guidance and compensating for the inherent prompt-generation limitation of SAM. Grounding Modulated Segmentation (GMS) performs fine-grained mask prediction under dual guidance. GPG-generated prompts condition the prompt-driven decoding process, while CCR-derived grounding cues are injected into multi-scale SAM2 encoder features to reinforce target-aware contextual modeling. 
This coordinated guidance corrects spatial deviations, mitigates background interference and scale variation, and enables more complete and structurally coherent masks in complex remote sensing scenes. Extensive experiments are conducted on three representative and widely used RRSIS benchmarks: RefSegRS, RRSIS-D, and RISBench. These datasets cover diverse aerial scenes, multi-scale objects, complex backgrounds, and rich linguistic expressions, allowing us to fully validate the generalization, robustness, and effectiveness of the proposed framework across various real-world remote sensing scenarios.

In summary, the main contributions of this work are three-fold:

We reconceptualize RRSIS as a grounding-first stagewise reasoning problem, revealing that unstable semantic–spatial correspondence is the fundamental bottleneck. Based on this insight, we design a Structurally Consistent and Grounding-Aware framework that explicitly decouples grounding from segmentation.
We introduce a structural consistency regularization mechanism that aligns linguistic dependency structures with visual spatial configurations, constraining cross-modal attention patterns to suppress grounding drift and enhance semantic–spatial stability.
We develop a dual-guidance strategy that combines grounding-aware prompt generation with grounding-modulated encoder adaptation in SAM, enabling segmentation under reliable spatial priors and significantly improving robustness to scale variation and background clutter.

2. Related Works

2.1. Referring Segmentation in Natural Images

Referring image segmentation (RIS) in natural images aims to predict pixel-level masks for targets described by free-form text via joint visual–linguistic understanding. As a pioneering work in this field, ref. [20] established the early three-stage framework: visual–linguistic feature extraction, multimodal feature fusion, and mask prediction. It adopted CNNs [21] for visual feature encoding, LSTMs [22] for text semantic embedding, and FCNs for multimodal fusion and segmentation, which defined the core challenges of RIS: efficient multimodal feature extraction, effective cross-modal fusion, and fine-grained target segmentation.

Early works mostly followed the CNN+LSTM paradigm, with improvements focusing on multimodal fusion and visual reasoning mechanisms under language-conditioned constraints. Researchers promoted cross-modal interaction via progressive refinement [23,24], which iteratively optimizes feature fusion results to enhance the consistency between visual regions and textual descriptions; bidirectional cross-modal modeling [25,26], which enables mutual information exchange between visual and linguistic features rather than one-way guidance; and dynamic filter strategies [23,27], which adapt visual features to diverse language queries by dynamically adjusting filter parameters according to textual semantics. Meanwhile, semantic alignment was enhanced through task decoupling [28], which separates the localization and segmentation sub-tasks to reduce mutual interference; keyword guidance [29,30], which highlights core textual information to guide visual feature selection; and language structure-aware visual reasoning [31,32,33], which leverages the syntactic and semantic structure of text to guide fine-grained visual reasoning.

Driven by the strong representation ability of Transformers, recent RIS models shifted to Transformer-based architectures. CMSA [34] explored cross-modal self-attention for long-range dependency modeling, which effectively captures the global correlations between visual pixels and textual tokens to address the limitation of local feature interaction in CNNs. EFN [35] employed co-attention to unify visual–linguistic features in a shared space, realizing the mutual enhancement of dual-modal features through interactive attention mechanisms. Methods such as DMMI [36] and FAN [37] used transformer encoders for joint visual–linguistic encoding and bidirectional decoders for enhanced interaction, enabling bidirectional information flow between visual and textual modalities to improve feature alignment. LAVT [9] verified the effectiveness of early fusion with language-aware visual enhancement, introducing pixel-wise attention modules and language gates to integrate textual information into visual feature extraction. CrossVLT [38] further strengthened bidirectional early fusion with feature alignment modules, reducing cross-modal semantic deviation. PBDF-Net [39] introduced prompt learning and prompt-guided cross-modal interaction (PCI) to alleviate semantic misunderstanding in one-way fusion, enhancing the utilization of high-level contextual information. To address the inflexibility of fixed query vectors in Transformer decoders, Refs. [10,40,41] proposed language-aware dynamic query generation, which adaptively generates query vectors based on textual descriptions to better fit the randomness and diversity of language.

Recent studies [42,43] also leveraged large-scale pre-trained vision–language models (VLMs) to improve cross-domain generalization. ETRIS [44] and BARLERIA [45] adopted parameter-efficient tuning with adapters for early cross-modal fusion. Prompt-RIS [46] built an end-to-end pipeline by integrating CLIP [47] and SAM [15], where cross-modal prompt learning was used to strengthen text-pixel alignment and generate high-quality prompts for SAM. Nevertheless, due to the unique challenges of remote sensing imagery such as complex object boundaries, large-scale target variations and high-precision localization demands, directly applying generic natural-image RIS methods to remote sensing datasets leads to severe performance degradation, highlighting the necessity of developing task-specific models for remote sensing scenes.

2.2. Referring Remote Sensing Image Segmentation

RRSIS aims to extract pixel-level masks of ground targets from aerial images guided by natural language expressions. Compared with natural image scenarios, RRSIS faces additional challenges such as complex backgrounds, large-scale variations, small and scattered targets, and pronounced vision–language semantic gaps. As a result, research on RRSIS remains relatively limited.

RRSIS [11] first introduced a dedicated remote sensing benchmark for referring segmentation and adapted vision–language segmentation frameworks to aerial imagery. To address the limited scale and diversity of early datasets, RMSIN [12] constructed the larger RRSIS-D dataset and proposed a rotated multi-scale interaction network to handle scale variation and target rotation. CroBIM [48] further expanded the benchmark to RISBench and emphasized bidirectional cross-modal interaction, demonstrating the importance of mutual vision–language reasoning for complex geospatial relationships.

Beyond dataset construction, several works focused on enhancing fine-grained image–text alignment in remote sensing scenarios. FIANet [49] decomposed referring expressions into contextual, object-level, and spatial components to facilitate fine-grained multimodal fusion. BTDNet [14] improved small-target localization and boundary accuracy through bidirectional alignment and joint prediction strategies. LSCF [13] and MAFN [50] highlighted semantic forgetting issues caused by early fusion and introduced long-term semantic guidance mechanisms to maintain consistent vision–language alignment during decoding.

Recent Transformer-based RRSIS methods further revealed that one-way language-guided visual fusion is insufficient for complex remote sensing scenes. SBANet [51] proposed scale-wise bidirectional alignment to synchronously update visual and linguistic features, enabling balanced interaction between global context and local details.

In recent years, large-scale foundation models have also been introduced into RRSIS. CLIP [47] provides transferable vision–language representations, while several methods adapt SAM to remote sensing imagery through parameter-efficient tuning. RSTG-SAM [17] integrates SAM [15], and RS2-SAM2 [18] integrates SAM2 [16] into staged pipelines for coarse target localization and fine-grained segmentation. However, RSTG-SAM suffers from insufficient cross-modal alignment by using independent backbones (ResNet-50 and RoBERTa) with limited fine-grained representation ability. Meanwhile, RS2-SAM2 adopts a unified multimodal encoder with low input resolution (224 × 224) to reduce computation, which leads to weak spatial cues and missing small targets during prompt generation, even though the SAM2 image encoder operates at high resolution.

RSRefSeg2 [19] integrates CLIP and SAM2 to exploit their complementary strengths in cross-modal representation learning and segmentation generalization for remote sensing imagery. In most existing SAM-based methods, SAM is incorporated at the decoding stage, where segmentation is guided by prompts derived from cross-modal representations. However, most existing approaches still rely on tightly coupled pipelines or suboptimal cross-modal alignment, which limits robustness under complex remote sensing conditions.

Beyond referring segmentation, remote sensing multimodal learning also includes tasks such as visual question answering (VQA) [52,53] and image–text retrieval [54,55], which require reasoning over semantic relationships between aerial imagery and natural language. Representative works [3,56,57,58] demonstrate that these tasks share a common challenge with RRSIS, namely establishing reliable vision–language correspondence under complex spatial structures and semantic ambiguity. This broader perspective further highlights the importance of robust cross-modal alignment and grounding mechanisms for remote sensing multimodal understanding.

3. Methodology

3.1. Method Overview

Motivated by the challenges analyzed in Section 1, we contend that RRSIS is intrinsically limited by unreliable semantic–spatial alignment between textual descriptions and visual observations. When semantic alignment and mask prediction are coupled in a single-step inference paradigm, background redundancy, significant scale variations, and repetitive ground object textures often exacerbate cross-modal semantic ambiguity. This ambiguity further results in spatial grounding drift and inaccurate boundary delineation, which are core bottlenecks restricting RRSIS performance. To mitigate this structural flaw, we formulate RRSIS as a grounding-prior stagewise reasoning task, as opposed to a direct end-to-end mask prediction paradigm. Specifically, the proposed framework SGSRF decomposes the inference process into three decoupled yet cascaded stages. These stages include semantic alignment refinement, explicit spatial grounding, and Grounding Modulated Segmentation, with each stage imposing structured constraints on subsequent reasoning to ensure the progressive optimization of cross-modal consistency and segmentation accuracy. The overall architecture is illustrated in Figure 2.

Stage I: Cross-modal Consistency Refinement (CCR). The first stage aims to establish stable semantic–spatial correspondence. Given an input image $I$ and referring expression $T$, a CLIP-based dual encoder extracts initial visual embeddings $R_{I}$ and textual embeddings $R_{T}$:

$$R_{I}, R_{T} = \Phi_{\mathrm{CLIP}}(I, T) \tag{1}$$

To mitigate cross-modal drift, the ASCR module explicitly aligns linguistic dependency structures with visual spatial configurations. This process yields consistency-enhanced multimodal representations $R_{I}^{'}$ and $R_{T}^{'}$ and reliable grounding cues $G$:

$$R_{I}^{'}, R_{T}^{'}, G = \Phi_{\mathrm{ASCR}}(R_{I}, R_{T}) \tag{2}$$

By constraining attention patterns according to structural correspondence, CCR produces grounded representations that serve as stable semantic anchors for subsequent localization.

Stage II: Grounding-aware Prompt Generation (GPG). Built upon the aligned multimodal representations $R^{'}$, the second stage translates grounded semantics into structured localization signals. Specifically, the GPG module generates complementary sparse and dense prompts $E_{s}$ and $E_{d}$:

$$E_{s}, E_{d} = \Phi_{\mathrm{GPG}}(R_{I}^{'}, R_{T}^{'}) \tag{3}$$

The sparse prompts $E_{s}$ encode high-level semantic constraints, while the dense prompts $E_{d}$ capture spatial priors derived from grounding cues. By externalizing localization into prompt space, this stage reduces the reliance on implicit alignment during mask decoding and enables explicit grounding guidance for segmentation.

Stage III: Grounding Modulated Segmentation (GMS). In the final stage, segmentation is performed under dual grounding guidance. A SAM2-based model predicts the mask $\hat{M}$ while incorporating both prompt signals and contextual grounding cues:

$$\hat{M} = \Phi_{\mathrm{GMS}}(I, G, E_{s}, E_{d}) \tag{4}$$

The prompts $(E_{s}, E_{d})$ condition the prompt encoder and decoding process, whereas grounding cues $G$ are injected into multi-scale encoder features to reinforce target-aware contextual modeling. This coordinated modulation corrects spatial deviations, mitigates background interference and scale variation, and enables accurate boundary delineation in complex remote sensing scenes.

Overall, the proposed framework progressively transforms raw visual–textual inputs into structurally grounded representations, explicit localization prompts, and finally target-aware segmentation masks, forming a coherent grounding-first reasoning pipeline.
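The data flow of Equations (1)–(4) can be sketched as three chained functions. This is only an illustrative skeleton under strong assumptions: the function names (`stage1_ccr`, `stage2_gpg`, `stage3_gms`), the random features standing in for the CLIP encoders, the toy grounding map, and the thresholded score map standing in for the SAM2 decoder are all hypothetical and do not come from the paper. The sketch shows only which quantities each stage consumes and produces.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stage1_ccr(image, n_text_tokens, grid=4, d=8, seed=0):
    """Stand-in for Eqs. (1)-(2): returns R_I', R_T' and grounding cues G."""
    rng = np.random.default_rng(seed)
    r_img = rng.standard_normal((grid * grid, d))       # placeholder for R_I'
    r_txt = rng.standard_normal((n_text_tokens, d))     # placeholder for R_T'
    # Toy grounding cue: text-to-patch attention averaged over text tokens.
    attn = softmax(r_txt @ r_img.T / np.sqrt(d), axis=-1)   # (N_t, N_i)
    g = attn.mean(axis=0).reshape(grid, grid)               # (grid, grid)
    return r_img, r_txt, g

def stage2_gpg(r_img, r_txt, g):
    """Stand-in for Eq. (3): sparse prompts E_s and dense prompts E_d."""
    e_s = r_txt.mean(axis=0, keepdims=True)                 # (1, d) sparse prompt
    e_d = r_img.reshape(*g.shape, -1) * g[..., None]        # (grid, grid, d) dense prompt
    return e_s, e_d

def stage3_gms(image, g, e_s, e_d):
    """Stand-in for Eq. (4): returns a binary mask with the image's spatial size."""
    score = e_d @ e_s[0] + g                                # (grid, grid) coarse score
    h, w = image.shape[:2]
    reps = (h // g.shape[0], w // g.shape[1])
    up = np.kron(score, np.ones(reps))                      # nearest-neighbour upsampling
    return up > up.mean()

def sgsrf_forward(image, n_text_tokens):
    r_img, r_txt, g = stage1_ccr(image, n_text_tokens)      # Stage I
    e_s, e_d = stage2_gpg(r_img, r_txt, g)                  # Stage II
    return stage3_gms(image, g, e_s, e_d)                   # Stage III

image = np.zeros((32, 32, 3))
mask = sgsrf_forward(image, n_text_tokens=5)
print(mask.shape, mask.dtype)
```

Note that the grounding cues `g` are passed to the third stage separately from the prompts, mirroring the dual guidance in Equation (4).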

3.2. Cross-Modal Consistency Refinement

CCR addresses the core bottleneck of RRSIS, namely unreliable semantic–spatial correspondence between remote sensing imagery and referring expressions. Remote sensing scenes with cluttered backgrounds, repetitive textures, and pronounced scale variations, coupled with referring descriptions containing fine-grained semantic modifiers and spatial relations, often lead conventional token–patch alignment to produce inconsistent visual activations, causing unstable spatial anchoring, grounding drift, and degraded prompt generation and segmentation quality. The goal of CCR is to generate consistency-enhanced cross-modal representations and reliable spatial grounding cues that act as structured semantic and spatial priors for subsequent stages. This is achieved via a stepwise refinement strategy. First, a CLIP-based dual encoder extracts initial coarse-grained cross-modal features, with Low-Rank Adaptation (LoRA) introduced to lightly adapt the visual and textual encoders to the domain discrepancy between natural-image pre-training and remote sensing imagery. Second, cross-modal interactions are strengthened to facilitate global semantic exchange, enhancing the scene-level context captured by textual tokens and the target awareness of visual tokens. Finally, ASCR enforces relational consistency between linguistic dependency structures and visual response patterns, constraining cross-modal attention at the structural level to promote coherent spatial responses of semantically related words and suppress spurious background activations. This progressive refinement enables CCR to output semantically aligned and spatially stable cross-modal representations, establishing reliable grounding cues to support precise prompt generation and downstream segmentation.

3.2.1. Initial Cross-Modal Representation Encoding

Remote sensing images and natural language descriptions exhibit significant modality and semantic discrepancies, especially when descriptions emphasize fine-grained target attributes. This gap limits the direct application of visual–language pre-trained models (VLMs) from natural scenes to cross-modal alignment in RRSIS. To address this issue, we employ a pre-trained CLIP model as the dual-modality encoder and take advantage of its powerful cross-modal representation ability with lightweight adaptation for remote sensing scenarios. To further reduce the domain gap between natural images and remote sensing imagery, we integrate Low-Rank Adaptation (LoRA) into both visual and textual encoders of CLIP. Given a remote sensing image $I \in \mathbb{R}^{H \times W \times 3}$ (where $H$ and $W$ are the height and width) and a textual description $T$ (a sequence of $N_{t}$ tokens), the CLIP image and text encoders extract visual and textual tokens, respectively, as follows:

$$F = E_{i}(I), \quad L = E_{t}(T) \tag{5}$$

where $F \in \mathbb{R}^{N_{i} \times d}$ and $L \in \mathbb{R}^{N_{t} \times d}$ denote the visual and linguistic token embeddings, respectively.

LoRA freezes the pre-trained parameters of CLIP and injects two low-rank decomposition matrices $W_{Q}^{\Delta}$ and $W_{K}^{\Delta}$ into the query and key projection layers of each multi-head attention block, where $d$ denotes the input feature dimension and $r$ denotes the LoRA rank ($r = 16$). The value projection remains unchanged to preserve the original feature representation ability.

The updated query and key projections are formulated as

$$Q = X W_{Q} + X W_{Q}^{\Delta} W_{Q}^{\Delta\top} \tag{6}$$

$$K = X W_{K} + X W_{K}^{\Delta} W_{K}^{\Delta\top} \tag{7}$$

where $X \in \mathbb{R}^{N \times d}$ is the input feature of the attention layer, $N$ is the sequence length, $W_{Q} \in \mathbb{R}^{d \times d}$, $W_{K} \in \mathbb{R}^{d \times d}$ are the original projection matrices in CLIP, and $W_{Q}^{\Delta} \in \mathbb{R}^{d \times r}$, $W_{K}^{\Delta} \in \mathbb{R}^{d \times r}$ are the low-rank matrices to be optimized. This design efficiently adapts CLIP to remote sensing scenes while maintaining its cross-modal alignment ability, with significantly fewer tunable parameters than full fine-tuning.
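A minimal NumPy sketch of the update in Equations (6) and (7) may help make the parameter saving concrete. Only $r = 16$ is taken from the text; the sequence length, the feature dimension $d = 64$, the random stand-in for the frozen CLIP weights, and the small random initialization of the low-rank factor are illustrative assumptions, and the helper name `lora_projection` is hypothetical.

```python
import numpy as np

def lora_projection(x, w_frozen, w_delta):
    """Eqs. (6)-(7) as printed: X W + X W_delta W_delta^T, with W kept frozen."""
    return x @ w_frozen + x @ (w_delta @ w_delta.T)

rng = np.random.default_rng(0)
n, d, r = 6, 64, 16                      # r = 16 as in the text; n and d are illustrative

x = rng.standard_normal((n, d))          # input features X of one attention layer
w_q = rng.standard_normal((d, d))        # frozen query projection (random stand-in for CLIP)
w_q_delta = 0.01 * rng.standard_normal((d, r))   # trainable low-rank factor (init is an assumption)

q = lora_projection(x, w_q, w_q_delta)   # updated queries, same shape as X W_Q

# The additive update W_delta W_delta^T is a d x d matrix of rank at most r.
update_rank = np.linalg.matrix_rank(w_q_delta @ w_q_delta.T)

full_params = 2 * d * d                  # tuning W_Q and W_K directly
lora_params = 2 * d * r                  # tuning only the two low-rank factors
print(q.shape, update_rank, full_params, lora_params)
```

With these illustrative sizes, the two low-rank factors hold $2dr = 2048$ trainable values per attention block, compared with $2d^{2} = 8192$ for tuning the query and key projections directly. The same function applies to the key projection by passing $W_{K}$ and $W_{K}^{\Delta}$.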

3.2.2. Attention-Induced Structural Consistency Regularization Module

Although the CLIP encoder and LoRA adaptation provide preliminary cross-modal alignment, RRSIS still suffers from a critical limitation. Textual descriptions rarely describe the entire remote sensing scene, but instead precisely locate the target region through multiple semantic modifiers and spatial relations. However, the complex characteristics of remote sensing imagery, such as cluttered backgrounds, repetitive textures, and large-scale variations, often cause semantically related text tokens to attend to inconsistent or erroneous visual regions when relying only on conventional cross-modal attention. This issue further leads to unstable spatial anchoring and localization drift. Most existing cross-modal alignment methods establish token–patch or region–word correspondences via mechanisms such as cross-attention (e.g., TRANSVG [59], IMCA [60]), token-level similarity matching (e.g., TokenFlow [61], CHAN [62], BOOM [63]), or region-based grounding (e.g., MSSAN [64,65]). These methods typically model fine-grained interactions through point-to-point matching between textual tokens and visual regions, but largely neglect the structural relationships among text tokens—such as syntactic dependencies, modifier–head relations, and compositional semantics—and fail to leverage complementary structural cues in the visual space. Robust cross-modal grounding, especially in complex remote sensing scenes, requires iterative interactions between text and image structures, where linguistic dependencies and visual spatial patterns dynamically reinforce and constrain each other to achieve consistent and semantically coherent alignment. From the linguistic perspective, text self-attention naturally encodes rich semantic and syntactic dependencies, including attribute–entity relations, modifier–head relations, and relative spatial constraints. 
In principle, these relational structures should correspond to consistent and cooperative visual response patterns in RRSIS, since semantically related text tokens should attend to the same or spatially adjacent visual regions. Based on this motivation, we propose the ASCR module. The core idea is to construct token-level relational structures from visual response distributions induced by cross-modal cross-attention, and explicitly align these structures with the linguistic dependencies encoded in text self-attention via structural consistency regularization. Such a mechanism stabilizes cross-modal semantic alignment at the structural level, effectively suppressing conflicting visual matching and alleviating localization drift.

The implementation of ASCR is divided into four key steps: context-enhanced linguistic structure modeling, bidirectional cross-modal interaction, vision-induced token relational structure construction, and structural consistency regularization. As illustrated in Figure 3, the specific process and parameter settings are elaborated as follows:

Context-enhanced Linguistic Structure Modeling (CLSM): To enhance the ability of textual tokens to capture target-specific contextual information in remote sensing scenes, we introduce a set of learnable context tokens

C \in R^{N_{g} \times d}

, where

N_{g}

denotes the number of context tokens. These context tokens are initialized from pooled representations of the original text embeddings L and augmented with learnable positional embeddings during the attention operation. They are concatenated with the original textual tokens

L \in R^{N_{t} \times d}

to form an enhanced textual sequence:

U = [L; C] \in R^{(N_{t} + N_{g}) \times d}

(8)

Self-attention is then applied to U to generate the initial linguistic semantic description

H_{u 1}

, which encodes both the semantic and syntactic relations among text tokens and context tokens. The textual self-attention process is formulated as

A_{u u} = Softmax (\frac{Q_{u 1} K_{u 1}^{⊤}}{\sqrt{d_{t}}})

(9)

H_{u 1} = A_{u u} V_{u 1}

(10)

Q_{u 1}, K_{u 1}, V_{u 1} = U W_{Q}^{1}, U W_{K}^{1}, U W_{V}^{1}

(11)

where

W_{Q}^{1}, W_{K}^{1}, W_{V}^{1} \in R^{d \times d}

are learnable projection matrices, and

d_{t}

denotes the dimension of the query and key vectors. The self-attention matrix

A_{u u} \in R^{(N_{t} + N_{g}) \times (N_{t} + N_{g})}

encodes the latent linguistic structural and syntactic relations among the tokens in U, which serves as the reference structure for subsequent consistency regularization.
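The CLSM step in Equations (8)–(11) can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the mean pooling used to initialize the context tokens, the identity projection matrices, and the toy dimensions are all assumptions made for illustration, and positional embeddings are omitted.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(u, w_q, w_k, w_v):
    """Return (A_uu, H_u1) for the enhanced sequence U, following Eqs. (9)-(11)."""
    q, k, v = matmul(u, w_q), matmul(u, w_k), matmul(u, w_v)
    d_t = len(q[0])  # dimension of the query and key vectors
    scores = matmul(q, transpose(k))
    a_uu = [softmax([s / math.sqrt(d_t) for s in row]) for row in scores]
    return a_uu, matmul(a_uu, v)

# Toy setting (hypothetical sizes): N_t = 3 text tokens, N_g = 2 context tokens, d = 2.
L_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Assumption: context tokens start from the mean-pooled text embedding.
pooled = [sum(col) / len(L_tokens) for col in zip(*L_tokens)]
C_tokens = [pooled[:] for _ in range(2)]
U = L_tokens + C_tokens  # U = [L; C], Eq. (8)
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for the learnable projections
A_uu, H_u1 = self_attention(U, identity, identity, identity)
```

Each row of `A_uu` is a probability distribution over all text and context tokens, which is the property the later consistency regularization relies on.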

Bidirectional Cross Modal Interaction (BCMI): To realize bidirectional information interaction between the enhanced textual sequence and visual tokens, we perform two-stage cross-attention between

H_{u 1}

(enhanced textual features) and F (visual tokens from the CLIP image encoder). Such interaction enables the textual tokens to guide the extraction of target-related visual features, while the visual tokens enrich the semantic representation of the textual tokens, thus enhancing the semantic–spatial correspondence between the two modalities.

The first stage of cross-attention (text-guided visual feature refinement) is formulated as

A_{u f} = Softmax (\frac{Q_{u 2} K_{f}^{⊤}}{\sqrt{d}})

(12)

H_{u} = A_{u f} V_{f}

(13)

Q_{u 2}, K_{f}, V_{f} = H_{u 1} W_{Q}^{u}, F W_{K}^{f}, F W_{V}^{f}

(14)

where

W_{Q}^{u} \in R^{d \times d}

,

W_{K}^{f} \in R^{d \times d}

, and

W_{V}^{f} \in R^{d \times d}

are learnable projection matrices for the query, key, and value, respectively.

A_{u f} \in R^{(N_{t} + N_{g}) \times N_{i}}

is the cross-attention matrix between textual tokens and visual tokens, and

H_{u} \in R^{(N_{t} + N_{g}) \times d}

is the refined textual feature after interacting with visual features.

The second stage of cross-attention (visual-guided textual feature refinement) is formulated as

A_{f u} = Softmax (\frac{Q_{f} K_{u 2}^{⊤}}{\sqrt{d}})

(15)

H_{f} = A_{f u} V_{u 2}

(16)

Q_{f}, K_{u 2}, V_{u 2} = F W_{Q}^{f}, H_{u} W_{K}^{u}, H_{u} W_{V}^{u}

(17)

where

W_{Q}^{f} \in R^{d \times d}

,

W_{K}^{u} \in R^{d \times d}

, and

W_{V}^{u} \in R^{d \times d}

are learnable projection matrices.

A_{f u} \in R^{N_{i} \times (N_{t} + N_{g})}

is the cross-attention matrix between visual tokens and textual tokens, and

H_{f} \in R^{N_{i} \times d}

is the refined visual feature after interacting with textual features.

Through this two-stage cross-attention interaction, the learnable context tokens C in U can effectively aggregate target-relevant global contextual information from the visual tokens F, further enhancing the ability of textual tokens to distinguish the target from cluttered background. The cross-attention matrix

A_{f u}

characterizes the visual response distribution induced by language queries, which forms the basis for constructing the vision-induced relational structure.
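The data flow of the two-stage interaction in Equations (12)–(17) can be sketched as follows. This is only an illustration of the tensor shapes: the learnable projections are replaced by identity maps (an assumption), and the toy inputs are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def cross_attention(queries, keys_values):
    """Scaled dot-product attention with identity projections (simplifying assumption)."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    attn = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return attn, matmul(attn, keys_values)

def bidirectional_interaction(h_u1, f):
    # Eqs. (12)-(14): queries come from the text side, keys/values from the visual tokens.
    a_uf, h_u = cross_attention(h_u1, f)
    # Eqs. (15)-(17): queries come from the visual tokens, keys/values from H_u.
    a_fu, h_f = cross_attention(f, h_u)
    return a_uf, h_u, a_fu, h_f

H_u1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]            # 3 text/context tokens, d = 2
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]   # N_i = 4 visual tokens
A_uf, H_u, A_fu, H_f = bidirectional_interaction(H_u1, F)
```

In this sketch `A_uf` has one row per text token and `A_fu` has one row per visual token, so the columns of `A_fu` are the per-token visual response distributions used in the next step.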

Vision-induced Token Relational Structure Construction (VTRSC): The cross-attention matrix

A_{f u}

reflects the distribution of visual responses to language descriptions induced by cross-modal cross-attention, which serves as the foundation for constructing the vision-induced token relational structure. We compute the cosine similarity between the column vectors of

A_{f u}

corresponding to the original textual tokens to obtain the vision-induced token-to-token similarity matrix

C_{u u}

, where the context tokens C are excluded during computation to avoid introducing bias into the alignment process. The specific formulation is as follows:

C_{u u} (i, j) = cos (A_{f u} (:, i), A_{f u} (:, j)), 1 \leq j < i \leq N_{t}

(18)

where

cos (\cdot, \cdot)

denotes the cosine similarity function, and

C_{u u} (i, i) = 1

(the similarity of a token to itself is 1). Each element

C_{u u} (i, j)

in the matrix characterizes the similarity or overlap between the visual regions attended to by the i-th and j-th text tokens in the visual space. A higher value of

C_{u u} (i, j)

indicates that the two language tokens attend to the same or spatially proximate visual regions, while a lower value indicates that they attend to different visual regions. The schematic of this process is illustrated in Figure 4a. For clear visualization, textual tokens are grouped by semantics (target entity, spatial relation, reference entity, and location constraint) and color coded to intuitively present the target-relation-reference-location semantic relational structure. In the attention matrix

A_{f u}

, each row represents the correlation between the color-coded textual tokens and the image embeddings. In the

C_{u u}

matrix, color intensity denotes the correlation and similarity between each pair of tokens.
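As a concrete illustration of Equation (18), the sketch below builds the vision-induced similarity matrix from the columns of a toy attention matrix. The numerical values are invented, the context tokens are assumed to occupy the trailing columns, and the upper triangle is filled by symmetry purely for convenience (Equation (18) only defines the entries with j < i).

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def vision_induced_similarity(a_fu, n_t):
    """Build C_uu from the first n_t columns of A_fu; context-token columns are skipped."""
    cols = [[row[j] for row in a_fu] for j in range(n_t)]
    c_uu = [[0.0] * n_t for _ in range(n_t)]
    for i in range(n_t):
        c_uu[i][i] = 1.0  # a token is fully similar to itself
        for j in range(i):
            c_uu[i][j] = c_uu[j][i] = cosine(cols[i], cols[j])
    return c_uu

# Toy A_fu: 4 visual tokens x (3 text tokens + 1 context token).
A_fu = [
    [0.5, 0.4, 0.0, 0.1],
    [0.4, 0.5, 0.0, 0.1],
    [0.0, 0.1, 0.8, 0.1],
    [0.0, 0.0, 0.9, 0.1],
]
C_uu = vision_induced_similarity(A_fu, n_t=3)
```

Here the first two text tokens respond to the same visual tokens and obtain a similarity close to 1, whereas the first and third tokens respond to disjoint visual tokens and obtain a similarity of 0.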

Structural Consistency Regularization (SCR): The core of SCR lies in imposing a consistency constraint between the linguistic structural matrix

A_{u u}

(derived from textual self-attention, encoding the inherent semantic and syntactic dependencies of language) and the vision-induced relational matrix

C_{u u}

(derived from cross-modal cross-attention, capturing the association patterns between targets and contexts in visual space), as illustrated in Figure 4a. This module adopts a bidirectional alignment mechanism to synergize linguistic semantics and visual spatial relations. On the one hand, the linguistic structure provides explainable semantic priors that keep otherwise unconstrained visual matching from becoming ambiguous. On the other hand, the visual relational structure injects spatial constraints that correct potential biases in the textual representations. With such bidirectional regularization, our model suppresses inconsistent visual alignments and improves the stability, interpretability, and robustness of cross-modal grounding in RRSIS.

To simplify computation, we only calculate the cosine similarity of corresponding elements in the lower-triangular matrices of

A_{u u}

and

C_{u u}

as the consistency regularization loss. Specifically, we denote the lower-triangular part (excluding the main diagonal) of a matrix M as

tril (M, - 1)

, where

tril (\cdot, - 1)

is the lower-triangular matrix extraction operator that retains elements below the main diagonal and sets all other elements to 0. For the sub-blocks of the two matrices that correspond to the original textual tokens,

A_{u u} \in R^{N_{t} \times N_{t}}

and

C_{u u} \in R^{N_{t} \times N_{t}}

(where

N_{t}

is the length of the textual token sequence), we extract the i-th row vectors of

tril (A_{u u}, - 1)

and

tril (C_{u u}, - 1)

as

p_{i}

and

q_{i}

, respectively, which are formally defined as

p_{i} = tril (A_{u u}, - 1) {(i, :)}^{T}

(19)

q_{i} = tril (C_{u u}, - 1) {(i, :)}^{T}

(20)

Notably, the first token in the textual sequence

L

(usually the [CLS] token) represents global category information, which is excluded from the consistency computation to reduce the influence of global errors on fine-grained alignment. Thus, we restrict the index i to the range

i \in [2, N_{t}]

(retaining only the vectors corresponding to the actual text tokens that describe the target’s attributes and spatial relations). The consistency regularization loss is formulated as

L_{c o n s} = \frac{1}{B} \sum_{b = 1}^{B} \sum_{i = 2}^{N_{t}} [1 - cos (p_{i}^{(b)}, q_{i}^{(b)})]

(21)

where B is the batch size. This constraint aligns the structural patterns encoded in the textual modality with the visual response patterns, without enforcing exact positional attention, thereby mitigating contradictory cross-modal activations and improving grounding stability.
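The consistency loss can be sketched in a few lines. The example assumes that both matrices have already been restricted to the original textual tokens and starts the row index at the second token, as stated in the text. Because the zero entries of a strictly lower-triangular row do not change a cosine similarity, the rows are simply truncated. Returning a similarity of 0 for an all-zero row is an assumption of this sketch.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def consistency_loss(a_uu_batch, c_uu_batch):
    """Sum of 1 - cos(p_i, q_i) over rows i = 2..N_t, averaged over the batch."""
    total = 0.0
    for a_uu, c_uu in zip(a_uu_batch, c_uu_batch):
        n_t = len(a_uu)
        for i in range(1, n_t):      # 0-based row 1 is token i = 2; row 0 is skipped
            p_i = a_uu[i][:i]        # non-zero part of row i of tril(A_uu, -1)
            q_i = c_uu[i][:i]        # non-zero part of row i of tril(C_uu, -1)
            total += 1.0 - cosine(p_i, q_i)
    return total / len(a_uu_batch)

A_uu = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
C_uu = [[1.0, 1.0, 0.3], [1.0, 1.0, 0.2], [0.3, 0.2, 1.0]]
loss_same = consistency_loss([A_uu], [A_uu])   # identical structures
loss_diff = consistency_loss([A_uu], [C_uu])   # mismatched structures
```

Identical structures give a loss of 0, and the loss grows as the row-wise relational patterns of the two matrices diverge, regardless of their absolute scale.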

The effect of the ASCR module is visualized in Figure 4b. The first panel shows the ground truth, where the correct target is the middle tennis court (marked in red) immediately below the upper-left tennis court, corresponding to the spatial reference in the sentence “The tennis court is below the tennis court on the upper left.” The second panel presents the result without structural consistency constraints: the semantic binding between “below” and the reference phrase “the tennis court on the upper left” is weak, so the model incorrectly associates “below” with the standalone phrase “the tennis court” and fails to parse the hierarchical spatial relation. Consequently, the query is erroneously localized to the bottom-most tennis court. With the introduction of SCR, text tokens form distinct semantic groups (red for target entity, purple for spatial relation, light blue for reference entity, yellow + green for location constraint), and the structural matrix exhibits compact semantic blocks with aggregated similarity. By enforcing consistency between linguistic syntax and visual spatial layout, the model accurately localizes the target tennis court in the middle. The overlaid attention heatmaps show that each semantic group is aligned to its corresponding visual region, supporting instance-level localization and hierarchical spatial relation parsing.

Finally, the ASCR module outputs three components: the optimized visual tokens

H_{f}

, the optimized text tokens

H_{l}

, and the context tokens G which are extracted from

H_{u}

according to their original positions in U. The outputs are computed as

H_{f}, H_{u} = A (F, L), [H_{l}, G] = H_{u}

(22)

Here,

A (\cdot)

denotes the attention-induced cross-modal structural alignment operation applied to the input visual tokens F and text tokens L. These three components are then fed into the subsequent module to provide stable grounding cues for the SAM2 encoder.

This design stabilizes grounding in cluttered, repetitive remote sensing scenes, producing reliable multimodal anchors that serve as high-quality semantic priors for downstream spatial prompting and pixel-level segmentation.

3.3. Grounding-Aware Prompts Generation

Grounding-aware Prompt Generation (GPG) serves as a critical bridge connecting cross-modal semantic alignment (visual and spatial grounding) and pixel-level mask prediction. It transforms the consistency-enhanced cross-modal representations produced by ASCR into complementary dense and sparse segmentation prompts. These prompts encode the target’s semantic constraints and spatial priors, providing the SAM2 decoder with accurate localization guidance to mitigate the impact of cluttered backgrounds and scale variations in remote sensing scenes.

As illustrated in Figure 5, the detailed implementation begins with constructing a unified prompt token set that integrates dense and sparse components, aiming to fully leverage the enhanced multimodal representations. The specific processes and parameter settings are elaborated as follows.

First, we construct the prompt token set:

P = [P_{dense}; P_{sparse}] \in R^{d \times (1 + N_{p})}

(23)

where

P_{dense} \in R^{d \times 1}

is a single dense token that provides preliminary spatial guidance for coarse target localization, and

P_{sparse} \in R^{d \times N_{p}}

comprises

N_{p}

learnable sparse tokens. The initial prompt tokens are randomly initialized and, like the context tokens, are equipped with learnable positional embeddings to enhance spatial expressiveness. Both dense and sparse tokens are fed into a series of transformer layers, which perform self-attention, prompt-to-image cross-attention, and prompt-to-text cross-attention operations. These attention mechanisms integrate vision–language information from the enhanced visual tokens

H_{f}

and text tokens

H_{l}

, ensuring that the generated prompts are fully target-aware and spatially aligned with both visual and textual semantics.

Next, the sparse prompt tokens updated by the transformer layers

{\hat{P}}_{sparse}

are projected through a multi-layer perceptron (MLP) to generate embeddings that are compatible with the SAM2 decoder input. The formulation is as follows:

E_{s} = MLP ({\hat{P}}_{sparse}) \in R^{d \times N_{p}}

(24)

where

E_{s}

denotes the final sparse prompt embeddings, which serve as high-confidence anchor points to guide the SAM2 decoder in precise local feature refinement.

The dense prompt token

{\hat{P}}_{dense}

is designed to generate a coarse, unnormalized spatial probability map that captures the overall target region. This map is derived from the element-wise multiplication of the optimized visual tokens

V^{'}

(output from the ASCR module) and the MLP-projected dense token, which enhances the spatial relevance of the visual features. The formulation is

{\hat{Y}}_{dense} = V^{'} ⊙ MLP ({\hat{P}}_{dense}) \in R^{1 \times H_{v} \times W_{v}}

(25)

where

V^{'}

are the optimized visual tokens from the ASCR module, and ⊙ denotes element-wise multiplication. This map provides preliminary target localization, facilitating downstream segmentation refinement.

To align the dense prompt with the spatial dimension of the SAM2 decoder input, the dense spatial map

{\hat{Y}}_{dense}

is further encoded using a convolutional block, following the structure of the SAM2 prompt encoder. The formulation is

E_{d} = Conv ({\hat{Y}}_{dense}) \in R^{d \times H_{e} \times W_{e}}

(26)

where

Conv (\cdot)

consists of two 2 × 2 stride-2 convolutions followed by a 1 × 1 convolution. The resulting dense prompt embedding

E_{d}

is spatially aligned with the SAM2 decoder input, ensuring effective integration with the sparse prompt embeddings

E_{s}

for joint segmentation guidance.
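The spatial effect of the convolutional block in Equation (26) can be checked with simple arithmetic. The sketch below assumes zero padding and uses a hypothetical 256 × 256 dense map; neither value is stated in the text.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def dense_prompt_spatial_size(h_v, w_v):
    """Apply two 2x2 stride-2 convolutions followed by a 1x1 convolution."""
    h, w = h_v, w_v
    for kernel, stride in [(2, 2), (2, 2), (1, 1)]:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    return h, w

H_e, W_e = dense_prompt_spatial_size(256, 256)
```

Under these assumptions each spatial side is reduced by a factor of four, so a 256 × 256 map yields a 64 × 64 dense prompt embedding.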

3.4. Grounding Modulated Segmentation

Grounding Modulated Segmentation (GMS) leverages the mask generation capability of the SAM2 model to achieve fine-grained pixel-level segmentation. The core design principle of this stage is to integrate target-specific spatial contextual information into the SAM2 framework through two sequential steps: Grounding-cues Modulated Multi-resolution Feature Enhancement (GMFE) and Prompts Driven Segmentation. In the first step, the spatial grounding cues G derived from the ASCR module are injected into the multi-scale SAM2 encoder to enhance the target-aware contextual information and strengthen the spatial grounding of the SAM2 encoder features. The second step uses the grounding-aware sparse and dense prompts output by the Grounding-aware Prompt Generation (GPG) stage to drive the SAM2 decoder, enabling accurate pixel-level classification and contour delineation. This design compensates for potential localization deviations in prompt-activated regions, improves the structural coherence of object boundaries, and ultimately generates more complete and accurate segmentation masks

\hat{M}

under the conditions of complex remote sensing scenes such as cluttered backgrounds and large target scale variations.

3.4.1. Grounding-Cues Modulated Multi-Resolution Feature Enhancement

To bridge the domain gap between natural-image pre-training and remote sensing imagery, we employ LoRA modules (rank

r = 16

) within the vision encoder of SAM2, and directly fine-tune its prompt encoder and mask decoder. Such lightweight adaptation maintains the pre-trained representation capacity while enabling domain-specific feature adjustment. The SAM2 encoder generates multi-resolution visual features that are inherently suitable for handling the significant scale variations prevalent in remote sensing imagery. These multi-resolution features capture complementary information: high-resolution features encode fine-grained spatial details, while low-resolution features capture global semantic context. To further inject target-specific spatial context into visual features and strengthen their target-aware representation ability, we propose a Grounding-cues Modulated Multi-resolution Feature Enhancement (GMFE) module. This module relates the grounding cues G derived from the ASCR module to the multi-resolution SAM2 encoder features via cross-attention. Such a cross-attention mechanism enhances the activation of features corresponding to the target region and suppresses interference from irrelevant background clutter. It improves the discriminability and target awareness of the multi-resolution encoder features, laying a foundation for subsequent fine-grained segmentation. The operation of GMFE is illustrated in Figure 2. Formally, the multi-resolution features output by the SAM2 encoder are denoted as follows:

F^{(0)} \in R^{M_{0} \times d_{s}}, F^{(1)} \in R^{M_{1} \times d_{s}}, F^{(2)} \in R^{M_{2} \times d_{s}}

(27)

where

F^{(0)}

,

F^{(1)}

, and

F^{(2)}

correspond to the first high-resolution feature map, the second high-resolution feature map, and the final image embedding of the SAM2 encoder, respectively.

M_{0}

,

M_{1}

,

M_{2}

denote the number of tokens for each resolution, and

d_{s}

denotes the feature dimension of the SAM2 encoder. The cross-attention operation between the multi-resolution encoder features and the grounding cues G is formulated as

Z_{g}^{(s)} = Attn (Q = F^{(s)}, K = G, V = F^{(s)}), s \in {0, 1, 2}

(28)

where the attention function

Attn (Q, K, V)

follows the standard scaled dot-product attention mechanism, formulated as

Attn (Q, K, V) = Softmax (\frac{Q K^{⊤}}{\sqrt{d}}) V

(29)

This cross-attention process enhances object-related contextual consistency across different resolutions and strengthens the target-aware representations within the SAM2 encoder. The enhanced multi-resolution features are denoted as

Z_{g}^{(0)}, Z_{g}^{(1)}, Z_{g}^{(2)}

(30)

Notably, multi-resolution querying gathers complementary localization cues ranging from coarse semantic context in low-resolution features to fine-grained spatial details in high-resolution features. This is particularly critical in remote sensing imagery where objects exhibit substantial scale variations and are often interspersed with cluttered backgrounds. Such feature enhancement ensures that the SAM2 encoder features are fully aligned with the target region specified by the textual description.

3.4.2. Prompts Driven Segmentation

In the final step, the SAM2 decoder integrates the enhanced multi-scale encoder features

Z_{g}^{(0)}, Z_{g}^{(1)}, Z_{g}^{(2)}

, and the target-aware segmentation prompt embeddings (sparse

E_{s}

and dense

E_{d}

) to generate the refined pixel-level segmentation mask. The decoder leverages the complementary information provided by the prompts: sparse prompts

E_{s}

provide high-confidence anchor points for local feature refinement, while dense prompts

E_{d}

provide coarse spatial guidance to ensure global consistency. The formulation of the final segmentation mask is as follows:

\hat{M} = SAM 2 Decoder (Z_{g}^{(0)}, Z_{g}^{(1)}, Z_{g}^{(2)}, E_{d}, E_{s})

(31)

where

\hat{M} \in R^{H \times W}

denotes the final predicted pixel-level segmentation mask, with H and W consistent with the height and width of the input remote sensing image I. Although SAM2 is capable of producing multiple candidate masks, we empirically observe that selecting the first predicted mask yields stable and competitive performance; therefore, only this output is retained during inference. This mask accurately delineates the target region specified by the textual description T, with improved robustness to cluttered backgrounds, scale variations, and spatial inconsistencies—key challenges in RRSIS.

3.5. Loss Function

We optimize the model with a joint objective consisting of the proposed consistency regularization and the SAM segmentation loss:

L = L_{s a m} + λ_{c o n s} L_{c o n s}

(32)

where

λ_{c o n s}

controls the strength of the proposed regularization.

The consistency loss

L_{c o n s}

is defined in Equation (21), which aligns the token-to-token relational structure encoded by textual self-attention with the vision-induced similarity structure derived from cross-modal cross-attention. This regularization term is only applied during the training phase and introduces no additional computational overhead in the inference stage.

We supervise the final prediction

\hat{M}

with ground-truth mask Y:

L_{s a m} = L_{s e g} (\hat{M}, Y)

(33)

where

L_{s e g}

is a combined loss function of Binary Cross-Entropy (BCE) loss and Dice loss. This combination effectively alleviates the class imbalance problem common in remote sensing image segmentation and refines the boundary of the target region, with the formulation

L_{s e g} (\hat{M}, Y) = BCE (\hat{M}, Y) + Dice (\hat{M}, Y)

(34)
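The joint objective in Equations (32)–(34) can be sketched on flattened masks as follows. The sketch assumes that the prediction has already been converted to per-pixel probabilities, and it uses a common smoothed Dice formulation and a clipping constant that are not specified in the text.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened per-pixel probabilities."""
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(pred)

def dice(pred, target, smooth=1.0):
    """Smoothed Dice loss (assumed variant)."""
    inter = sum(p * y for p, y in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)

def total_loss(pred, target, l_cons, lambda_cons=0.5):
    l_sam = bce(pred, target) + dice(pred, target)   # Eqs. (33)-(34)
    return l_sam + lambda_cons * l_cons              # Eq. (32)

target = [1, 0, 1, 0]
good = total_loss([1.0, 0.0, 1.0, 0.0], target, l_cons=0.2)
poor = total_loss([0.4, 0.6, 0.5, 0.5], target, l_cons=0.2)
```

With a perfect prediction the segmentation term vanishes up to the clipping constant, so the total reduces to the weighted consistency term (0.5 × 0.2 = 0.1 in this toy case).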

4. Experiments

4.1. Experimental Datasets

The proposed approach is comprehensively evaluated on three widely adopted RRSIS benchmarks, namely RefSegRS [11], RRSIS-D [12], and RISBench [48]. These datasets jointly cover diverse aerial scenes, object categories, spatial resolutions, and linguistic complexities, thereby providing a rigorous testbed for assessing cross-modal grounding and fine-grained segmentation performance. A quantitative overview of the datasets is reported in Table 1.

RefSegRS [11] is developed upon the SkyScapes dataset [66] by augmenting high-resolution aerial imagery with pixel-level instance masks and corresponding referring expressions. The dataset comprises 4420 image–text–mask triplets collected from 285 distinct scenes, which are split into 2172 training samples (151 scenes), 431 validation samples (31 scenes), and 1817 test samples (103 scenes). Fourteen hierarchical semantic classes are annotated, covering common urban elements such as roads, buildings, cars, and vans. In addition to category labels, each target instance is described using multiple attribute dimensions to enrich semantic variability. All images are uniformly resized to

512 \times 512

pixels, while maintaining a fine spatial resolution of 0.13 m per pixel.

RRSIS-D [12] is constructed by transforming bounding-box annotations from the RSVGD dataset [67] into high-quality instance masks using the Segment Anything Model (SAM). The dataset includes 17,402 image–description–mask triplets, distributed across training (12,181), validation (1740), and test (3481) splits. It covers 20 semantic categories, including both compact objects (e.g., aircraft) and large-scale structures (e.g., stadiums and highway service areas), each further supplemented with attribute annotations to enhance descriptive expressiveness. A prominent characteristic of RRSIS-D lies in its extreme object scale variation, with target regions ranging from small localized instances to extensive areas exceeding hundreds of thousands of pixels. All images are resized to

800 \times 800

pixels, with spatial resolutions spanning from 0.5 to 30 m per pixel.

RISBench [48] is derived from the VRSBench dataset [68] by integrating referring expressions with visual localization annotations and corresponding segmentation masks. It consists of 52,472 samples, partitioned into 26,300 training, 10,013 validation, and 16,159 test instances. Each image is standardized to a resolution of

512 \times 512

pixels, while encompassing a wide range of ground sampling distances (0.1–30 m per pixel) to reflect diverse geographic contexts and observation scales. The dataset annotates 26 semantic categories, each associated with multiple attribute types to support multi-granularity semantic reasoning. The textual descriptions exhibit notable linguistic diversity, with an average length of 14.31 words and a vocabulary size of 4431 unique terms, posing substantial challenges for precise language-guided segmentation.

4.2. Implementation Details

SGSRF adopts a stagewise referring segmentation framework built upon pre-trained vision–language and segmentation foundation models. CLIP is used for cross-modal representation learning, and SAM2 is employed as the segmentation backbone. Unless otherwise specified, SAM2.1-Hiera-Large is used as the segmentation model and SigLIP2-So400M-Patch16-512 [69] as the vision–language encoder.

In ASCR, the number of learnable context tokens is set to

N_{g} = 3

, enabling the model to capture target-related global contextual cues. For the prompt generation module, the number of sparse prompt tokens is

N_{p} = 9

, following common practice in prompt-based SAM variants. The weight of the structural consistency regularization is

λ_{c o n s} = 0.5

.

During training, images are resized to 512 × 512 for CLIP and 1024 × 1024 for SAM2, following the native input settings of the two models. No additional data augmentation is employed, as arbitrary spatial transformations would disrupt the precise semantic–spatial correspondence between textual descriptions and visual targets. All newly introduced modules are trained from scratch, while foundation model parameters follow the adaptation strategy described in the Method section. AdamW is adopted as the optimizer. The initial learning rates are set to

2 \times 10^{- 4}

for RefSegRS,

1 \times 10^{- 4}

for RRSIS-D, and

2 \times 10^{- 5}

for RISBench, with a batch size of 8 and a maximum of 100 epochs. A linear warm-up schedule followed by cosine annealing decay is used for learning rate scheduling. All experiments are implemented in PyTorch 2.4.1+cu121 based on the OpenMMLab framework. Training is conducted on NVIDIA A100 GPUs using BF16 mixed precision and DeepSpeed ZeRO Stage-2 for memory-efficient distributed training.
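The learning rate schedule described above can be sketched as a plain function of the training step. The warm-up length, the total number of steps, and the minimum learning rate are not reported in the text and are hypothetical here, as is the exact form of the linear ramp.

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_steps, min_lr=0.0):
    """Linear warm-up to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical run: 1000 steps, 100 warm-up steps, RefSegRS base rate of 2e-4.
schedule = [learning_rate(s, 1000, 2e-4, 100) for s in range(1001)]
```

The rate rises linearly during warm-up, reaches the base value at the end of warm-up, and then decays smoothly to the minimum value at the final step.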

4.3. Evaluation Protocol and Metrics

We evaluate referring segmentation performance using generalized Intersection over Union (gIoU), cumulative Intersection over Union (cIoU), and precision at different IoU thresholds (Pr@X).

Let

P_{i}

and

G_{i}

denote the predicted segmentation mask and the ground-truth mask of the i-th image, respectively, and let N be the total number of test samples. The IoU-based metrics are defined as

gIoU = \frac{1}{N} \sum_{i = 1}^{N} \frac{|P_{i} \cap G_{i}|}{|P_{i} \cup G_{i}|},

(35)

cIoU = \frac{\sum_{i = 1}^{N} |P_{i} \cap G_{i}|}{\sum_{i = 1}^{N} |P_{i} \cup G_{i}|} .

(36)

The gIoU metric computes the mean IoU across test samples and assigns equal importance to each instance, whereas cIoU aggregates intersections and unions over the entire dataset before computing the ratio. As a result, cIoU tends to be biased toward larger target regions. In this work, gIoU is treated as the primary evaluation metric, as it better reflects per-instance segmentation quality under large-scale variations commonly observed in remote sensing imagery.

To further assess localization accuracy under different strictness levels, we report precision at IoU threshold X, denoted as Pr@X, where

X \in {0.5, 0.6, \dots, 0.9}

. Pr@X measures the proportion of test samples whose predicted masks achieve an IoU higher than the specified threshold:

\Pr @ X = \frac{1}{N} \sum_{i = 1}^{N} I (\frac{|P_{i} \cap G_{i}|}{|P_{i} \cup G_{i}|} > X),

(37)

where

I (\cdot)

is the indicator function.

Lower IoU thresholds mainly reflect coarse localization capability, while higher thresholds impose stricter requirements on spatial alignment and boundary accuracy. Reporting Pr@X at multiple thresholds provides a comprehensive evaluation of segmentation robustness under increasing localization precision demands.
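The three metrics in Equations (35)–(37) can be computed from flattened binary masks as in the sketch below. Assigning an IoU of 0 to a sample whose prediction and ground truth are both empty is an assumption of this sketch, and the toy masks are invented.

```python
def intersection_union(pred, gt):
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    per_sample, inter_sum, union_sum = [], 0, 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        per_sample.append(inter / union if union else 0.0)
        inter_sum += inter
        union_sum += union
    n = len(per_sample)
    g_iou = sum(per_sample) / n                               # Eq. (35)
    c_iou = inter_sum / union_sum if union_sum else 0.0       # Eq. (36)
    precision = {x: sum(v > x for v in per_sample) / n for x in thresholds}  # Eq. (37)
    return g_iou, c_iou, precision

# One small target segmented poorly and one large target segmented perfectly.
preds = [[1, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
gts = [[1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
g_iou, c_iou, precision = evaluate(preds, gts)
```

In this toy case gIoU is 0.75 while cIoU is 0.9, which illustrates how cIoU is dominated by the larger target. The strict inequality in Equation (37) also means the first sample, with an IoU of exactly 0.5, does not count toward Pr@0.5.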

All metrics are computed on both validation and test sets under identical evaluation settings for fair comparison.

4.4. Comparison Results

4.4.1. Quantitative Results on the RefSegRS Dataset

Table 2 reports quantitative comparisons on the RefSegRS benchmark, including threshold-based precision (Pr@0.5–0.9) and composite metrics (cIoU, gIoU), against 12 state-of-the-art referring segmentation methods. Our method achieves the best overall test-set performance, with 77.85% gIoU and 83.55% cIoU, and outperforms competitors across most criteria. Notably, gains are more pronounced at stricter thresholds, with Pr@0.9 reaching 37.15%, indicating stronger fine-grained mask alignment and boundary delineation.

Compared with early CNN–RNN baselines (e.g., RNN, CMSA, BRINet, LSCM, and CMPC+), our approach yields substantial improvements. These methods rely on shallow cross-modal fusion and limited spatial reasoning, leading to coarse localization and fragmented masks in cluttered, scale-variant aerial scenes, and their performance degrades rapidly with increasing IoU thresholds, while our method remains robust. Against transformer-based RRSIS methods (LGCE, RMSIN, FIANet, BTDNet, and LSCF), which use stronger backbones and BERT encoders, our method surpasses the top baseline LSCF by +0.41% in test gIoU (77.85% vs. 77.44%), with larger gains at stricter thresholds (e.g., +3.38% at Pr@0.9). While LSCM achieves a slightly higher Pr@0.6 (Val) score, our method outperforms it on Pr@0.6 (Test), indicating stronger generalization and robustness to distribution shifts. LSCM leads in Pr@0.7 on both splits, reflecting its strength in moderate-precision localization. For all other metrics across validation and test sets, our method ranks first, demonstrating superior fine-grained spatial detail capture and stable high-precision segmentation.

Finally, compared with foundation-model-based approaches (RS2-SAM2, RSRefSeg2), our method achieves consistently better results, especially at high thresholds. Specifically, we improve test gIoU by +1.68% (77.85% vs. 76.17%) and test Pr@0.9 by +4.18% (37.15% vs. 32.97%) over RSRefSeg2, demonstrating more accurate localization-to-mask transmission and reliable fine-grained segmentation under strict IoU constraints. These results validate the effectiveness of our design in enhancing high-precision referring segmentation on RefSegRS.

4.4.2. Quantitative Results on the RRSIS-D Dataset

The RRSIS-D dataset features a larger scale than RefSegRS, covering 20 object categories and providing approximately four times the number of image–text–mask triplets, posing greater challenges for cross-modal reasoning and segmentation. Table 3 reports quantitative comparisons on this benchmark.

Our method achieves the best overall performance on the test set, with a gIoU of 69.67% and a cIoU of 79.37%, and outperforms competitors across most evaluation criteria. Notably, it scores higher on the test set than on the validation set for Pr@0.5, Pr@0.6, and cIoU, a pattern not exhibited by any competing method. On Pr@0.5, it attains 80.23% on the validation set and 80.93% on the test set; on Pr@0.6, it reaches 75.5% (Val) and 76.07% (Test); and on cIoU, it achieves 79.18% (Val) and 79.37% (Test). This phenomenon can be reasonably explained by the inherent difficulty discrepancy between the validation and test sets of RRSIS-D. The validation set is constructed to contain more challenging and heterogeneous samples, such as tiny targets, heavy occlusion, cluttered backgrounds, ambiguous text descriptions, and atypical scenes, which are deliberately included to facilitate rigorous model selection during training. In contrast, the test set follows a more natural and representative distribution of real-world remote sensing scenes with moderate difficulty. Owing to its generalization ability and robustness, our model performs stably on both sets and achieves slight improvements on the test set. This behavior suggests that the model avoids overfitting to the validation distribution and generalizes well to real-world remote sensing scenarios, which is particularly valuable for interpreting the complex semantic–spatial relations in the RRSIS-D dataset.
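As a quick arithmetic check on the validation-to-test gaps quoted above, the snippet below computes the differences in percentage points. The numbers are taken directly from the text. The dictionary layout is only an illustrative choice.

```python
# Validation and test scores (in %) for the three metrics quoted in the text.
val = {"Pr@0.5": 80.23, "Pr@0.6": 75.5, "cIoU": 79.18}
test = {"Pr@0.5": 80.93, "Pr@0.6": 76.07, "cIoU": 79.37}

# Test-minus-validation gap in percentage points, rounded to two decimals.
gaps = {metric: round(test[metric] - val[metric], 2) for metric in val}

for metric, gap in gaps.items():
    print(f"{metric}: {gap:+.2f} points")
# prints:
# Pr@0.5: +0.70 points
# Pr@0.6: +0.57 points
# cIoU: +0.19 points
```

All three gaps are below one percentage point, so the effect is small in absolute terms even though its direction is consistent.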

Specifically, our method ranks first on Pr@0.5, Pr@0.6, Pr@0.7, and Pr@0.8 for both splits, demonstrating stable performance across diverse precision requirements. On Pr@0.9 (Test), RSRefSeg2 obtains a slightly higher score, reflecting its strength in ultra-high-precision localization. On validation cIoU, RS2-SAM2 holds a marginal advantage, while our method leads in both validation and test gIoU as well as test cIoU, confirming its advantage in overall segmentation accuracy. These results validate the effectiveness of our approach in handling the larger-scale and more complex semantic–spatial relations of the RRSIS-D dataset.

4.4.3. Per-Class Performance on the RRSIS-D Dataset

Table 4 reports per-class performance on the RRSIS-D test set. Overall, our method (SGSRF) achieves the highest average performance, reaching 72.26%, slightly surpassing the strongest baseline RSRefSeg2. This result indicates that the proposed framework delivers more consistent improvements across diverse object categories rather than relying on gains from a small subset of classes.

For many categories involving clear semantic structure or distinctive visual patterns, our method shows notable advantages. Specifically, SGSRF achieves the best performance on storage tank, chimney, train station, ship, bridge, vehicle, windmill, and airplane. These categories often contain multiple similar instances or appear in cluttered environments, where accurate instance-level grounding is crucial. The observed improvements suggest that our approach better distinguishes the referred instance and transfers localization cues into precise segmentation masks. Compared with transformer-based methods (RMSIN, LGCE, and FIANet), SGSRF consistently improves performance on most categories, particularly those requiring fine-grained spatial reasoning (e.g., “bridge” and “vehicle”). In comparison with RSRefSeg2, our method exhibits stronger performance on several small or structurally complex targets, such as “ship”, “bridge”, and “windmill”, highlighting enhanced robustness under challenging referring scenarios. It is also worth noting that SGSRF performs competitively but not optimally on certain large-area categories (e.g., “airport”, “basketball court”, and “harbor”). These categories typically involve extensive spatial extent or ambiguous boundaries, where coarse region-level cues may dominate the referring process. Nevertheless, the overall performance remains comparable, and the gains on instance-centric categories contribute to a higher average score. In summary, the per-class analysis demonstrates that SGSRF provides balanced improvements across categories with varying scales and structural complexity, reinforcing its effectiveness for robust RRSIS on the RRSIS-D dataset.

4.4.4. Quantitative Results on the RISBench Dataset

The RISBench benchmark features a larger volume of image–text pairs, a broader set of object categories, and longer, more complex referring expressions compared with existing datasets, presenting higher demands for cross-modal semantic parsing and spatial grounding. Table 5 summarizes quantitative comparisons on this dataset, including threshold-based precision and composite segmentation metrics against state-of-the-art methods.

Overall, our method attains the top performance across nearly all metrics. It only falls slightly behind RSRefSeg2 on Pr@0.5 and gIoU of the validation split, while achieving the highest scores on all remaining evaluation metrics for both validation and test sets. More importantly, our method exhibits a more significant performance increment from validation to test set across every metric, in contrast to other competing approaches. This further verifies its superior generalization and robustness to distribution shifts, as well as stronger adaptability to the long-text, large-scale, and multi-category characteristics of RISBench. The consistent advantages across most metrics validate the effectiveness of our framework in handling complex referring semantics under this more challenging remote sensing benchmark.

4.4.5. Qualitative Visualizations on the RefSegRS Dataset

Figure 6 provides qualitative comparisons on the RefSegRS dataset, where we visualize predicted masks from representative methods and our approach under diverse referring expressions. Overall, our method produces segmentation masks that are more faithful to the referred regions, exhibiting clearer boundaries, fewer spurious activations, and better consistency with the linguistic intent, particularly in scenes with cluttered backgrounds and ambiguous semantics. For challenging linear-structured targets such as the “paved road along with tree”, LGCE, RMSIN, FIANet, and RSRefSeg2 fail to extract complete road segments in the first row of image examples. In the second-row examples, these methods misclassify building regions with similar visual characteristics to roads as road areas. In contrast, our method effectively suppresses surrounding interferences such as vegetation while better preserving the continuity of road segments, yielding more coherent and compact segmentation masks. For referring expressions demanding fine-grained discrimination (e.g., “road marking”), some methods tend to over-segment large road surfaces or cause predictions to spill over into adjacent regions. By comparison, our method precisely concentrates on the thin road markings with significantly reduced edge spillover, indicating an improved alignment between referring semantic cues and pixel-level segmentation contours. Clear advantages are also observed for small and sparse targets (e.g., “light-duty vehicle in the parking area” and “light-duty vehicle”). In such cases, existing methods may incorrectly segment nearby vehicles or produce additional false-positive responses beyond the referred instance, while our method yields more accurate instance isolation with more complete target coverage and fewer background activations.

4.4.6. Qualitative Visualizations on the RRSIS-D Dataset

Figure 7 illustrates qualitative comparisons on the RRSIS-D dataset, which contains more complex spatial layouts, denser object distributions, and stronger background interference than RefSegRS. Overall, our method produces segmentation masks that are more consistent with the referring expressions, especially in scenarios involving relative spatial relations, multiple similar instances, and small or partially occluded targets.

For expressions requiring relative positional reasoning (e.g., “a ship on the lower right” or “the airplane on the right of the airplane on the left”), several baseline methods tend to respond to visually salient but incorrect instances, resulting in target confusion or incomplete coverage. In contrast, our approach more reliably distinguishes the referred instance by maintaining consistent spatial correspondence, yielding masks that better align with the specified relational constraints. In scenes with large structures and surrounding distractors (e.g., “an overpass is on the right of the vehicle at the bottom”), existing methods may over-activate extended regions or leak into adjacent roads and background areas. Our method produces more compact and coherent masks, effectively suppressing irrelevant regions while preserving the main structure indicated by the referring expression. We further observe advantages in cases involving fine-grained instance differentiation (e.g., “the tennis court below the tennis court on the upper left”). While existing methods may partially segment multiple similar objects or miss subtle positional cues, our approach yields more accurate instance-level separation, reflecting improved grounding of relational language into spatially precise mask predictions. Compared with the foundation-model baseline RSRefSeg2, our method shows reduced over-segmentation and clearer boundary delineation across diverse scenes. These qualitative results complement the quantitative improvements reported in Table 4, demonstrating that our framework enhances both spatial reasoning robustness and high-precision segmentation performance on the challenging RRSIS-D benchmark.

4.4.7. Qualitative Visualizations on the RISBench Dataset

Figure 8 presents qualitative comparisons on the RISBench dataset, which is characterized by diverse object categories, complex referring expressions, and frequent small or partially visible targets. Overall, our method generates segmentation masks that better align with the linguistic descriptions, showing improved instance discrimination, boundary tightness, and robustness to cluttered backgrounds.

For expressions involving relative spatial descriptions and exclusion cues (e.g., “the helicopter positioned on the right side of the image, away from the main grouping of planes”), several baseline methods tend to activate multiple nearby instances or focus on visually salient but irrelevant objects. In contrast, our approach more effectively suppresses distractors and accurately isolates the referred instance, indicating stronger grounding of relational language into spatial responses. In scenarios with small or partially visible targets (e.g., “the ship in the center of the image is small and docked next to the quay” and “the small vehicle positioned at the right-most edge of the image”), existing methods often yield incomplete masks with coarse boundaries, and may mistakenly include adjacent instances with similar appearance. In contrast, our method produces more compact and complete predictions with tighter boundaries, indicating improved sensitivity to fine-grained cues and more reliable localization-to-segmentation transfer. We also observe clear improvements in cases involving ambiguous boundaries or truncated regions (e.g., “the harbor located in the top-right corner extends partly beyond the captured image frame”). Rather than precisely grounding the referred region, several competing methods may localize an incorrect area or even select a wrong instance, leading to mismatched masks despite plausible visual responses. Our predictions better adhere to the intended location and extent implied by the expression, resulting in more accurate instance grounding and mask coverage. Compared with RSRefSeg2, our method more consistently identifies the correct referred instance and delivers finer mask delineation across diverse RISBench scenes, highlighting improved spatial grounding and boundary refinement under complex referring expressions. 
These qualitative observations are consistent with the quantitative gains reported in Table 5, further validating the effectiveness of our framework for handling complex referring expressions and high-precision segmentation on RISBench.

4.4.8. Model Complexity and Efficiency Analysis

We analyze the computational overhead and inference latency of the proposed framework. As shown in Table 6, our method and RSRefSeg2 both exhibit considerably larger model size, higher GFLOPs, and longer inference time than traditional lightweight methods such as RMSIN, LGCE, and FIANet. This is mainly attributed to the adoption of large-scale pre-trained foundation models, including SAM2 and CLIP, as backbones, which inherently bring a large number of parameters and computations. Although such a design sacrifices some efficiency, it endows the model with stronger feature representation ability, cross-modal generalization capacity, and more robust segmentation performance in complex remote sensing scenes. Compared with RSRefSeg2, our method has a comparable parameter count, computational complexity (GFLOPs), and inference speed. Notably, it uses fewer parameters and less computation despite incorporating additional attention mechanisms and regularization terms. This indicates that the proposed attention mechanisms and structural consistency regularization boost segmentation performance with negligible additional computation, supporting the rationality and efficiency of our framework. Overall, our framework achieves a reasonable accuracy–complexity trade-off, which is common and acceptable in high-performance remote sensing interpretation systems based on large pre-trained models.
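The measurement protocol behind the inference times in Table 6 is not described in this section. As an illustration only, the sketch below shows one common way to time a model's forward pass: discard a few warm-up runs, then report the median and mean over repeated runs. The helper name `benchmark_latency`, its parameters, and the stand-in workload are hypothetical and do not come from the paper.

```python
import statistics
import time
from typing import Callable


def benchmark_latency(fn: Callable[[], object], warmup: int = 5, runs: int = 20) -> dict[str, float]:
    """Time repeated calls of `fn` and summarize the latency in milliseconds."""
    # Warm-up calls are excluded so one-off costs (caches, lazy init) do not skew results.
    for _ in range(warmup):
        fn()
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1e3)
    return {
        "median_ms": statistics.median(samples_ms),
        "mean_ms": statistics.fmean(samples_ms),
        "min_ms": min(samples_ms),
    }


# Stand-in workload; in practice `fn` would wrap a single forward pass of the model.
stats = benchmark_latency(lambda: sum(i * i for i in range(50_000)))
```

When the workload runs on a GPU, the device must be synchronized before each timestamp is read. Otherwise asynchronous kernel launches make the measured time too short. The median is usually more robust than the mean to occasional stalls.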

4.5. Ablation Study

4.5.1. Effectiveness of Model Components

Table 7 reports the ablation results evaluated on the RRSIS-D test set. Compared with the validation set, the test set has a larger sample size and more diverse data distribution, which supports a more comprehensive and reliable assessment of each component’s contribution to generalization. To avoid selection bias from validation-based checkpoint selection, all ablation variants are trained with a fixed unified training schedule and evaluated directly on the test set under identical settings. A progressive ablation study on GPG and prompt configurations is first conducted. A baseline using CLIP textual features for sparse prompt (SP) generation (without GPG) achieves a gIoU of 66.48%, reflecting a purely semantic-driven scheme lacking spatial adaptation. Introducing GPG to generate SPs from aligned image–text representations yields a consistent gain (gIoU = 66.74%), demonstrating that GPG enhances the semantic–spatial coupling of prompt embeddings to improve location awareness. Further incorporating dense prompts (DPs) into the GPG framework leads to a substantial performance boost (gIoU = 68.72%), as dense visual prompts provide complementary spatial priors and region-level structural cues for SAM, which is critical for handling irregular shapes, scale variations, and fragmented layouts in remote sensing scenes. Subsequently, ablation of the CCR representation-level refinement components (bidirectional cross-modal interaction (BCMI), Context-enhanced Linguistic Structure Modeling (CLSM), and structural consistency regularization (SCR)) is performed. Replacing one-way language guidance with BCMI improves gIoU to 69.32%, as mutual feature interaction enables more accurate semantic–spatial alignment and reduces coarse matching errors. 
Adding CLSM brings moderate but consistent gains, as context tokens extract target-relevant global context and enhance the network’s ability to capture implicit spatial-language dependencies (e.g., relative position or attribute constraints). Incorporating Grounding-cues Modulated multi-resolution Feature Enhancement (GMFE) further improves performance, verifying that grounding information enhances the spatial representation of encoder features and preserves target-aware context under complex scene conditions.

The impact of the SCR loss is then evaluated. BCMI + SCR outperforms BCMI alone, confirming that structural constraints stabilize cross-modal correspondence. Additional gains are achieved with BCMI + CLSM + SCR, as richer linguistic structure strengthens consistency supervision. The combination of BCMI + CLSM + SCR + GMFE achieves the optimal performance, indicating complementary effects of SCR and GMFE. SCR stabilizes semantic–spatial alignment at the representation level, while GMFE propagates reliable grounding cues to segmentation features.

Notably, CCR improves performance more significantly, suggesting its primary contribution to precise spatial grounding rather than coarse localization. Incorporating grounding cues into GMFE yields the best overall performance, with the full model achieving a gIoU of 70.31% and consistent improvements in Pr@0.8 and Pr@0.9, highlighting enhanced high-precision segmentation. These results confirm the necessity of jointly optimizing cross-modal consistency alignment, prompt generation, and GMFE for robust RRSIS.

4.5.2. Sensitivity Analysis of Key Hyperparameters

To verify the rationality of our hyperparameter settings, we conduct a sensitivity analysis on three critical configurations: the weight of the structural consistency regularization λ_{cons}, the number of context tokens, and the number of sparse prompt tokens. Detailed results are reported in Table 8.

For context tokens, we evaluate the number of tokens as {1, 3, 6, 9, 12} and compare two initialization schemes: zero initialization (Z) and random initialization (R). Our default setting initializes context tokens from pooled representations of the original text embeddings. Experimental results show that 6 context tokens yield the best overall performance. Insufficient tokens lack sufficient expressive power, while excessive tokens introduce redundant information and lead to performance degradation. For sparse prompt tokens, which are randomly initialized in all experiments, we test {3, 6, 9, 12}. The model achieves the best trade-off between representation capability and efficiency when the number is set to 9. For the structural consistency regularization weight λ_{cons}, we conducted experiments with values of 0, 0.1, 0.3, 0.5, and 0.7. The results show that removing the constraint (λ_{cons} = 0) leads to a clear performance drop, directly verifying the effectiveness of the proposed regularization. Smaller coefficients (0.1, 0.3) only bring limited improvements, as weak constraints fail to fully enhance both global localization and fine-grained segmentation. An overlarge coefficient (0.7) causes over-regularization and degrades overall performance. Only at λ_{cons} = 0.5 does the model achieve the best balance between global localization (gIoU) and fine-grained segmentation accuracy (cIoU), verifying that this value is well-optimized for our benchmarks. Overall, the results validate the effectiveness and rationality of the proposed hyperparameter settings.
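The sensitivity analysis above can be organized as a one-factor-at-a-time sweep, sketched below. The candidate values are those listed in the text, and the best-reported values (6 context tokens, 9 sparse prompt tokens, λ_{cons} = 0.5) are used as defaults. Whether the authors varied one factor at a time is an assumption, and the `Config` class and its field names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    # Defaults follow the best-performing values reported in the text;
    # the field names are hypothetical.
    num_context_tokens: int = 6
    num_sparse_prompts: int = 9
    lambda_cons: float = 0.5


# Candidate values listed in the sensitivity analysis.
SWEEPS = {
    "num_context_tokens": [1, 3, 6, 9, 12],
    "num_sparse_prompts": [3, 6, 9, 12],
    "lambda_cons": [0.0, 0.1, 0.3, 0.5, 0.7],
}


def one_factor_sweep(default: Config = Config()) -> list[Config]:
    """Vary one hyperparameter at a time while holding the others at their defaults."""
    configs: list[Config] = []
    for name, values in SWEEPS.items():
        for value in values:
            cfg = replace(default, **{name: value})
            if cfg not in configs:  # the default configuration appears once per factor
                configs.append(cfg)
    return configs


configs = one_factor_sweep()
```

With these grids, the sweep yields 12 distinct configurations, compared with 5 × 4 × 5 = 100 for a full grid search. The saving comes at a price: interactions between the hyperparameters are not explored.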

4.6. Failure Case Analysis

Despite the superior performance of our proposed SGSRF in RRSIS, we identify several extreme conditions where the model still struggles with stable cross-modal grounding and accurate segmentation, as visualized in Figure 9. In case (a), the target vehicle is occluded by dense trees, resulting in insufficient visual features for the cross-modal consistency regularization (CCR) module to establish reliable text–image alignment. The model fails to capture the occluded target’s spatial configuration and incorrectly grounds to another visible vehicle, demonstrating that severe occlusion can break the semantic–spatial stability of cross-modal attention. In case (b), the target vehicle is extremely small and embedded in a complex harbor background with mixed buildings and water areas; the limited distinct visual cues prevent the CCR module from extracting effective structural patterns, leading the model to misidentify a parking lot as the target, revealing the challenge of tiny targets in cluttered backgrounds. In case (c), the target vehicle has a color highly similar to the surrounding road, making it visually indistinguishable from the background; the grounding-aware prompt generation (GPG) module cannot capture sufficient target-specific features, resulting in missed detection, which highlights the limitation of low inter-class contrast between target and background. In case (d), the target is a boat towed by a truck on a highway, which violates the common contextual prior that boats typically appear near water; meanwhile, the boat’s appearance is highly similar to the towing vehicle, causing the Grounding Modulated Segmentation (GMS) module to confuse the target with the vehicle and fail to align the text query with the correct visual entity, exposing the model’s sensitivity to contextual outliers and high inter-class similarity. 
In case (e), two vehicles of nearly identical size with similar shadow patterns are present, making it difficult for the GPG module to distinguish the subtle size difference required by the text query “the tiny vehicle”; the model cannot accurately map the size-related semantic description to the corresponding visual target, revealing the challenge of ambiguous size relationships between similar objects. In case (f), among five visually similar airplanes, the model struggles to judge the subtle size differences specified by the query “similar in size to the airplane on the top”; the GMS module’s multi-scale feature encoding cannot fully capture fine-grained size variations, leading to incorrect grounding, which demonstrates the limitation in handling fine-grained size ambiguity among homogeneous objects. In case (g), the text query “the storage tank is on the lower right of the large storage tank” is insufficiently precise, as two small storage tanks both lie in the lower-right region of the large tank; the CCR module cannot resolve the spatial ambiguity in the text, resulting in unstable cross-modal alignment and incorrect segmentation, indicating the model’s dependence on precise textual spatial descriptions. These failure cases reveal that the SGSRF framework still faces challenges under extreme conditions including severe occlusion, tiny targets in cluttered backgrounds, high inter-class similarity, contextual inconsistency, ambiguous size relationships, and vague textual spatial descriptions. Future work will focus on enhancing occlusion-resistant feature learning, multi-scale size-aware modeling, and robust text–image alignment under ambiguous queries to further improve the framework’s generalization.

5. Discussion

The proposed SGSRF framework is built upon the core hypothesis that the performance bottleneck of RRSIS lies in unstable cross-modal semantic–spatial alignment, particularly under the structural complexity of remote sensing imagery. Existing approaches typically rely on token–patch similarity learning or cross-attention-based fusion, which mainly capture local correspondences but often overlook global structural consistency, making them prone to grounding drift in scenarios with repetitive textures, large-scale variations, and weak object boundaries. To address this limitation, SGSRF reformulates RRSIS as a stagewise reasoning process that decouples semantic alignment, spatial grounding, and mask prediction, enabling progressive constraint of multimodal interactions. In particular, the proposed ASCR module introduces structural consistency regularization by aligning linguistic dependency structures with visual spatial layouts, extending cross-modal alignment from local similarity matching to structure-aware reasoning. The experimental results consistently support this design, demonstrating that explicit decoupling and structured intermediate supervision significantly improve both alignment stability and segmentation accuracy, especially under strict IoU thresholds and complex referring expressions involving spatial relations.

Building upon this stagewise formulation, the effectiveness of the proposed pipeline can be further interpreted from a functional decomposition perspective. The GPG module translates aligned multimodal representations into structured sparse–dense prompts, bridging the gap between semantic understanding and spatially explicit segmentation. Meanwhile, the GMFE module enhances SAM2 by injecting grounding cues into its encoding process, compensating for its limited sensitivity to textual semantics. In contrast to prior works that directly apply prompts to SAM-like models, this grounding-guided adaptation provides more reliable spatial priors and improves robustness against background interference. These findings suggest that prompt quality and grounding fidelity, rather than model scale alone, are critical factors in achieving high-precision segmentation.

These design choices are reflected in the empirical results. Extensive experiments on RefSegRS, RRSIS-D, and RISBench validate the superiority of SGSRF across multiple evaluation protocols. The consistent gains in gIoU and high-threshold metrics (e.g., Pr@0.9) indicate that the proposed framework not only improves coarse localization but also enhances fine-grained boundary accuracy. Importantly, these improvements are achieved without introducing significant computational overhead, as confirmed by the comparable parameters, GFLOPs, and inference speed relative to RSRefSeg2. This suggests that incorporating structural constraints and stagewise reasoning is an efficient alternative to simply scaling model capacity.

Despite these advantages, the current framework still exhibits limitations. Failure cases are observed in scenarios involving severe occlusion, extremely small targets embedded in cluttered backgrounds, low inter-class contrast, and ambiguous or underspecified textual descriptions. These challenges indicate that current grounding mechanisms are still sensitive to incomplete visual evidence and vague semantic cues. Future research could explore occlusion-robust feature learning, finer-grained size-adaptive representations, and more advanced relative spatial reasoning mechanisms. In addition, adaptive alignment strategies that dynamically adjust grounding confidence under ambiguous queries may further improve robustness.

From a broader perspective, this work suggests that explicit structural modeling and progressive grounding are essential for reliable multimodal understanding in remote sensing. The proposed framework extends beyond RRSIS and can be viewed as a general paradigm for cross-modal tasks requiring both semantic reasoning and spatial precision, such as remote sensing visual question answering (VQA) and image–text retrieval. By bridging representation learning, spatial grounding, and task-specific decoding, SGSRF provides a practical pathway for integrating foundation models into complex Earth observation applications.

6. Conclusions

This study addresses the issues of unstable semantic–spatial alignment and grounding drift in RRSIS under complex aerial scenarios. We propose the SGSRF framework, which decomposes the RRSIS task into three cascaded stages: cross-modal consistency optimization, grounding-aware prompt generation, and grounding-modulated segmentation. A structural consistency regularization mechanism is designed to align linguistic dependency structures with visual spatial layouts, effectively mitigating grounding drift. In addition, a SAM-oriented grounding-modulated segmentation strategy is introduced, injecting grounding cues to enhance robustness against scale variations and complex backgrounds. Experiments on three benchmark datasets demonstrate that SGSRF achieves state-of-the-art performance while maintaining a favorable balance between accuracy and efficiency, meeting practical application requirements. The proposed strategies validate the effectiveness of explicit cross-modal alignment and grounding-prior guidance, providing insights for multimodal remote sensing understanding and the deployment of pre-trained models. Future work will focus on improving model robustness in occluded scenes and for extremely small targets, as well as extending the framework to other cross-modal tasks such as remote sensing visual question answering and image–text retrieval.

Author Contributions

Conceptualization, S.D. and J.X.; methodology, S.D.; software, S.D.; formal analysis, H.C.; investigation, L.C.; resources, H.C.; writing—original draft preparation, S.D.; writing—review and editing, B.Q.; visualization, Y.G.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China under Grant JYB2025XDXM115 and the Ye Qisun Science Foundation of the National Natural Science Foundation of China under Grant U2341202.

Data Availability Statement

The original contributions presented in the study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Correction Statement

This article has been republished with a minor change. The change does not affect the scientific content of the article and further details are available within the backmatter of the website version of this article.

References

1. Kawar, B.; Zada, S.; Lang, O.; Tov, O.; Chang, H.; Dekel, T.; Mosseri, I.; Irani, M. Imagic: Text-based real image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6007–6017.
2. Zhang, X.; Li, Y.; Li, F.; Jiang, H.; Wang, Y.; Zhang, L.; Zheng, L.; Ding, Z. Ship-Go: SAR ship images inpainting via instance-to-image generative diffusion models. ISPRS J. Photogramm. Remote Sens. 2024, 207, 203–217.
3. Sha, Y.; Feng, Y.; He, M.; Jin, Y.; You, S.; Ji, Y.; Wu, F.; Liu, S.; Che, S. Cross-Modality Consistency Network for Remote Sensing Text-Image Retrieval. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 17539–17551.
4. Weng, Q. Remote sensing of impervious surfaces in the urban areas: Requirements, methods, and trends. Remote Sens. Environ. 2012, 117, 34–49.
5. Blaschke, T.; Lang, S.; Lorup, E.; Strobl, J.; Zeil, P. Object-oriented image processing in an integrated GIS/remote sensing environment and perspectives for environmental applications. Environ. Inf. Plan. Politics Public 2000, 2, 555–570.
6. Zang, N.; Cao, Y.; Wang, Y.; Huang, B.; Zhang, L.; Mathiopoulos, P.T. Land-use mapping for high-spatial resolution remote sensing image via deep learning: A review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5372–5391.
7. Zheng, Z.; Zhong, Y.; Wang, J.; Ma, A.; Zhang, L. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. Remote Sens. Environ. 2021, 265, 112636.
8. Lin, Y.; Xie, Y.; Chen, D.; Xu, Y.; Zhu, C.; Yuan, L. Revive: Regional visual representation matters in knowledge-based visual question answering. Adv. Neural Inf. Process. Syst. 2022, 35, 10560–10571.
9. Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 18155–18165.
10. Ding, H.; Liu, C.; Wang, S.; Jiang, X. VLT: Vision-language transformer and query generation for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7900–7916.
11. Yuan, Z.; Mou, L.; Hua, Y.; Zhu, X.X. Rrsis: Referring remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12.
12. Liu, S.; Ma, Y.; Zhang, X.; Wang, H.; Ji, J.; Sun, X.; Ji, R. Rotated multi-scale interaction network for referring remote sensing image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 26658–26668.
13. Ma, Q.; Li, L.; Lu, X.; Jiao, L.; Liu, F.; Ma, W.; Liu, X.; Sun, L. LSCF: Long-term Semantic-guidance ConvFormer for Referring Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5628313.
14. Zhang, T.; Wen, Z.; Kong, B.; Liu, K.; Zhang, Y.; Zhuang, P.; Li, J. Referring remote sensing image segmentation via bidirectional alignment guided joint prediction. arXiv 2025, arXiv:2502.08486.
15. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4015–4026.
16. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714.
17. Yuan, L.; Kong, B.; Wen, Z.; Liu, K.; Zhang, Y.; Liu, L. RSTG-SAM: A Two-Stage Text-Guided SAM-Based Model for Referring Remote Sensing Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6013205.
18. Rong, F.; Lan, M.; Zhang, Q.; Zhang, L. Customized SAM 2 for Referring Remote Sensing Image Segmentation. arXiv 2025, arXiv:2503.07266.
19. Chen, K.; Liu, C.; Chen, B.; Zhang, J.; Zou, Z.; Shi, Z. Rsrefseg 2: Decoupling referring remote sensing image segmentation with foundation models. arXiv 2025, arXiv:2507.06231.
20. Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from natural language expressions. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 108–124.
21. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
22. Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232.
23. Margffoy-Tuay, E.; Pérez, J.C.; Botero, E.; Arbeláez, P. Dynamic multimodal instance segmentation guided by natural language queries. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 630–645.
24. Li, R.; Li, K.; Kuo, Y.C.; Shu, M.; Qi, X.; Shen, X.; Jia, J. Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5745–5753.
25. Chen, D.J.; Jia, S.; Lo, Y.C.; Chen, H.T.; Liu, T.L. See-through-text grouping for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7454–7463.
26. Hu, Z.; Feng, G.; Sun, J.; Zhang, L.; Lu, H. Bi-directional relationship inferring network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4424–4433.
27. Chen, Y.W.; Tsai, Y.H.; Wang, T.; Lin, Y.Y.; Yang, M.H. Referring expression object segmentation with caption-aware consistency. arXiv 2019, arXiv:1910.04748.
28. Jing, Y.; Kong, T.; Wang, W.; Wang, L.; Li, L.; Tan, T. Locate then segment: A strong pipeline for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9858–9867.
29. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1307–1315.
30. Shi, H.; Li, H.; Meng, F.; Wu, Q. Key-word-aware network for referring expression image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 38–54.
31. Hui, T.; Liu, S.; Huang, S.; Li, G.; Yu, S.; Zhang, F.; Han, J. Linguistic structure guided context modeling for referring image segmentation. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 59–75.
32. Yang, S.; Xia, M.; Li, G.; Zhou, H.Y.; Yu, Y. Bottom-up shift and reasoning for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 11266–11275.
33. Liu, S.; Hui, T.; Huang, S.; Wei, Y.; Li, B.; Li, G. Cross-modal progressive comprehension for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4761–4775.
34. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10502–10511.
35. Feng, G.; Hu, Z.; Zhang, L.; Lu, H. Encoder fusion network with co-attention embedding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15506–15515.
36. Hu, Y.; Wang, Q.; Shao, W.; Xie, E.; Li, Z.; Han, J.; Luo, P. Beyond one-to-one: Rethinking the referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 4067–4077.
37. Liu, Y.; Xu, R.; Tang, Y. Fully Aligned Network for Referring Image Segmentation. In 2024 IEEE International Conference on Visual Communications and Image Processing (VCIP); IEEE: Piscataway, NJ, USA, 2024; pp. 1–5.
38. Cho, Y.; Yu, H.; Kang, S.J. Cross-aware early fusion with stage-divided vision and language transformer encoders for referring image segmentation. IEEE Trans. Multimed. 2023, 26, 5823–5833.
39. Wu, J.; Zhang, Y.; Kampffmeyer, M.; Zhao, X. Prompt-guided bidirectional deep fusion network for referring image segmentation. Neurocomputing 2025, 616, 128899.
40. Nguyen-Truong, H.; Nguyen, E.R.; Vu, T.A.; Tran, M.T.; Hua, B.S.; Yeung, S.K. Vision-aware text features in referring image segmentation: From object understanding to context understanding. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); IEEE: Piscataway, NJ, USA, 2025; pp. 4988–4998.
41. Shah, N.A.; VS, V.; Patel, V.M. Lqmformer: Language-aware query mask transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–21 June 2024; pp. 12903–12913.
42. Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11686–11695.
43. Yue, P.; Lin, J.; Zhang, S.; Hu, J.; Lu, Y.; Niu, H.; Ding, H.; Zhang, Y.; Jiang, G.; Cao, L.; et al. Adaptive selection based referring image segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 1101–1110.
44. Xu, Z.; Chen, Z.; Zhang, Y.; Song, Y.; Wan, X.; Li, G. Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17503–17512.
45. Wang, Y.; Li, J.; Zhang, X.; Shi, B.; Li, C.; Dai, W.; Xiong, H.; Tian, Q. Barleria: An efficient tuning framework for referring image segmentation. In Proceedings of the Twelfth International Conference on Learning Representations, Vienna, Austria, 7–11 May 2024.
46. Shang, C.; Song, Z.; Qiu, H.; Wang, L.; Meng, F.; Li, H. Prompt-driven referring image segmentation with instance contrasting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 4124–4134.
47. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763.
48. Dong, Z.; Sun, Y.; Liu, T.; Zuo, W.; Gu, Y. Cross-modal bidirectional interaction model for referring remote sensing image segmentation. arXiv 2024, arXiv:2410.08613.
49. Lei, S.; Xiao, X.; Zhang, T.; Li, H.C.; Shi, Z.; Zhu, Q. Exploring fine-grained image-text alignment for referring remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5604611.
50. Shi, L.; Zhang, J. Multimodal-aware fusion network for referring remote sensing image segmentation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8001805.
51. Li, K.; Vosselman, G.; Yang, M.Y. Scale-wise Bidirectional Alignment Network for referring remote sensing image segmentation. ISPRS J. Photogramm. Remote Sens. 2025, 226, 350–363.
52. Lobry, S.; Marcos, D.; Murray, J.; Tuia, D. RSVQA: Visual question answering for remote sensing data. IEEE Trans. Geosci. Remote Sens. 2020, 58, 8555–8566.
53. Saha, A.; Maji, S.K. Enhanced RSVQA Insight Through Synergistic Visual-Linguistic Attention Models. IEEE Geosci. Remote Sens. Lett. 2025, 22, 8002905.
54. Yang, X.; Li, C.; Wang, Z.; Xie, H.; Mao, J.; Yin, G. Remote sensing cross-modal text-image retrieval based on attention correction and filtering. Remote Sens. 2025, 17, 503.
55. Wang, Y.; Tang, X.; Ma, J.; Zhang, X.; Liu, F.; Jiao, L. Cross-modal remote sensing image–text retrieval via context and uncertainty-aware prompt. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 11384–11398.
56. Lyu, Y.; Yan, H.; Liu, Y.; Chen, H. Discriminative Representation Learning for Remote Sensing Visual Question Answering. ACM Trans. Multimedia Comput. Commun. Appl. 2025.
57. Zi, X.; Xiao, J.; Shi, Y.; Tao, X.; Li, J.; Braytee, A.; Prasad, M. RSVLM-QA: A Benchmark Dataset for Remote Sensing Vision Language Model-based Question Answering. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland, 27–31 October 2025; pp. 12905–12911.
58. Xu, L.; Wang, L.; Zhang, J.; Ha, D.; Zhang, H. A Review of Cross-Modal Image–Text Retrieval in Remote Sensing. Remote Sens. 2025, 17, 3995.
59. Deng, J.; Yang, Z.; Chen, T.; Zhou, W.; Li, H. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 1769–1779.
60. Zheng, Y.H.; Lin, G.S.; Chang, K.Y. Transformer-based Visual Grounding with Inter-Modality Cross Attention. In 2025 19th International Conference on Machine Vision and Applications (MVA); IEEE: Piscataway, NJ, USA, 2025; pp. 1–6.
61. Zou, X.; Wu, C.; Cheng, L.; Wang, Z. Tokenflow: Rethinking fine-grained cross-modal alignment in vision-language retrieval. arXiv 2022, arXiv:2209.13822.
62. Pan, Z.; Wu, F.; Zhang, B. Fine-grained image-text matching by cross-modal hard aligning network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19275–19284.
63. Luo, Z.; Meng, M.; Wu, J. Dynamic patch selection and dual-granularity alignment for cross-modal retrieval. Neurocomputing 2026, 600, 132999.
64. Wang, W.; Di, X.; Liu, M.; Gao, F. Multi-level Symmetric Semantic Alignment Network for image–text matching. Neurocomputing 2024, 599, 128082.
65. Li, Z.; Zhang, L.; Zhang, K.; Zhang, Y.; Mao, Z. Improving image-text matching with bidirectional consistency of cross-modal alignment. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6590–6607.
66. Azimi, S.M.; Henry, C.; Sommer, L.; Schumann, A.; Vig, E. Skyscapes fine-grained semantic understanding of aerial scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7393–7403.
67. Zhan, Y.; Xiong, Z.; Yuan, Y. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604513.
68. Li, X.; Ding, J.; Elhoseiny, M. Vrsbench: A versatile vision-language benchmark dataset for remote sensing image understanding. Adv. Neural Inf. Process. Syst. 2024, 37, 3229–3242.
69. Tschannen, M.; Gritsenko, A.; Wang, X.; Naeem, M.F.; Alabdulmohsin, I.; Parthasarathy, N.; Evans, T.; Beyer, L.; Xia, Y.; Mustafa, B.; et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv 2025, arXiv:2502.14786.
70. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
71. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022.
72. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019.

Figure 1. Illustration of RRSIS challenges. (a) Challenge of large-scale composite regional targets with ambiguous boundaries. (b) Challenge of identical-type target confusion and fine-grained referential discrimination. (c) Challenge of small-target localization and accurate recognition under complex backgrounds.

Figure 2. The overall framework of SGSRF disentangles semantic alignment, spatial grounding, and pixel-level segmentation into three decoupled yet cascaded stages. CCR enforces cross-modal structural consistency to produce grounding cues; GPG externalizes localization priors into prompt space; and GMS performs grounding-modulated mask prediction under dual guidance from prompts and contextual cues.

Figure 3. The architecture of the proposed ASCR. Green: initialized context tokens (input)/optimized grounding cues (output); Pink: text embeddings (input)/refined text descriptions (output). The enhanced textual sequence U (concatenated from L and C) first undergoes self-attention to produce the linguistic structural matrix $A_{uu}$. The VTRSC module constructs the vision-induced token relational matrix $C_{uu}$ from the visual-to-text cross-attention matrix $A_{fu}$. The SCR aligns $A_{uu}$ and $C_{uu}$ to compute the consistency loss $L_{cons}$.

Figure 4. Schematic diagram of ASCR. (a) Pipeline of VTRSC and SCR: $C_{uu}$ is derived from $A_{fu}$, and the similarity constraint between $A_{uu}$ and $C_{uu}$ is enforced via the consistency loss $L_{cons}$. For clarity, text tokens are grouped by semantics (target entity, spatial relation, reference entity, location constraint) and color-coded to show the structural dependencies among tokens. (b) Effectiveness visualization: without SCR, the model fails to parse hierarchical spatial relations, leading to ambiguous localization; with SCR, semantic groups align with visual layouts, enabling accurate instance-level localization and spatial relation parsing.
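The relation between $A_{fu}$, $C_{uu}$, $A_{uu}$ and $L_{cons}$ described in the caption can be illustrated with a minimal sketch. The caption does not give the exact formulas, so two forms are assumed here purely for illustration: $C_{uu}$ is taken as the row-normalized product $A_{fu}^{\top} A_{fu}$ (tokens attended by the same pixels count as related), and $L_{cons}$ is taken as the mean squared difference between $A_{uu}$ and $C_{uu}$. The function names are hypothetical and are not taken from the authors' code.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax of a list-of-lists matrix."""
    out = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def vision_induced_relations(a_fu):
    """ASSUMED VTRSC form: tokens are related when the same pixels attend to both.

    a_fu: (num_pixels x num_tokens) visual-to-text attention matrix.
    Returns a row-normalized (num_tokens x num_tokens) matrix.
    """
    n_tok = len(a_fu[0])
    co = [[sum(px[i] * px[j] for px in a_fu) for j in range(n_tok)]
          for i in range(n_tok)]
    return [[v / sum(row) for v in row] for row in co]

def consistency_loss(a_uu, c_uu):
    """ASSUMED SCR form: mean squared difference between the two matrices."""
    n = len(a_uu)
    return sum((a_uu[i][j] - c_uu[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Toy case: 3 pixels and 3 tokens; pixels 0 and 1 attend mostly to tokens 0 and 1.
a_fu = softmax_rows([[2.0, 1.8, 0.1], [1.9, 2.0, 0.2], [0.1, 0.2, 2.5]])
c_uu = vision_induced_relations(a_fu)
# Stand-in for the text self-attention matrix of the same three tokens.
a_uu = softmax_rows([[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]])
loss = consistency_loss(a_uu, c_uu)
```

In this toy case, tokens 0 and 1 receive a larger mutual entry in `c_uu` than tokens 0 and 2, because the same pixels attend to both. Minimizing the loss would push the text self-attention toward that visually induced grouping.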

Figure 5. The architecture of the proposed GPG. Red blocks denote dense prompts, and orange blocks denote sparse prompts. Initial prompt tokens P pass through a dual cross-attention block ($\times 2$) with $H_{f}$ and $H_{l}$, then are processed to generate the sparse prompt $E_{s}$. In parallel, $H_{f}$ is upsampled, fused with $H_{l}$, and convolved to produce the dense prompt $E_{d}$. This dual-prompt design outputs complementary cues for the SAM2 decoder.

Figure 6. Qualitative comparisons of different methods on the RefSegRS dataset. (The green box denotes the zoomed-in local region, and the red color indicates the target region described in the text).

Figure 7. Qualitative comparisons of different methods on the RRSIS-D dataset. (The red color indicates the target region described in the text).

Figure 8. Qualitative comparisons of different methods on the RISBench dataset. (The red color indicates the target region described in the text).

Figure 9. Visualization of typical failure cases of the proposed SGSRF framework. ((a–g) Different test cases selected to analyze the model’s performance, where the red regions denote the target objects corresponding to the referring text).

Table 1. Overview of Referring Remote Sensing Image Segmentation benchmarks. #Samples and #Classes denote the number of samples and the number of classes, respectively.

Dataset	Image Size	#Samples	#Classes	Key Challenges
RefSegRS	$512 \times 512$	4420	14	Urban clutter, attribute ambiguity
RRSIS-D	$800 \times 800$	17,402	20	Extreme scale variation
RISBench	$512 \times 512$	52,472	26	Linguistic diversity, large category space

Table 2. Quantitative comparison with state-of-the-art methods on the RefSegRS benchmark. (Bold values indicate the best performance among all comparison methods).

Method	Vision	Text	Pr@0.5 Val	Pr@0.5 Test	Pr@0.6 Val	Pr@0.6 Test	Pr@0.7 Val	Pr@0.7 Test	Pr@0.8 Val	Pr@0.8 Test	Pr@0.9 Val	Pr@0.9 Test	cIoU Val	cIoU Test	gIoU Val	gIoU Test
RNN [24]	R-101 [21]	LSTM [70]	55.43	30.26	42.98	23.01	23.11	14.87	13.72	7.17	2.64	0.98	69.24	65.06	50.81	41.88
CMSA [34]	R-101 [21]	–	39.24	28.07	38.44	20.25	20.39	12.71	11.79	5.61	1.52	0.83	65.84	64.53	43.62	41.47
BRINet [26]	R-101 [21]	LSTM [70]	36.86	20.72	35.53	14.26	19.93	9.87	10.66	2.98	2.84	1.14	61.59	58.22	38.73	31.51
LSCM [31]	R-101 [21]	LSTM [70]	56.82	31.54	41.24	20.41	21.85	9.51	12.11	5.29	2.51	0.84	62.82	61.27	40.59	35.54
CMPC+ [33]	R-101 [21]	LSTM [70]	56.84	49.19	37.59	28.31	20.42	15.31	10.67	8.12	2.78	2.55	70.62	66.53	47.13	43.65
LGCE [11]	Swin-B [71]	BERT [72]	95.82	80.35	93.04	69.24	82.37	48.76	42.92	18.77	10.21	3.91	84.60	77.04	76.85	64.12
RMSIN [12]	Swin-B [71]	BERT [72]	93.74	76.88	89.33	62.47	75.87	36.87	33.41	13.54	7.89	2.53	79.93	73.03	74.18	60.27
FIANet [48]	Swin-B [71]	BERT [72]	96.06	82.44	93.50	74.68	89.79	57.73	70.53	27.85	15.55	5.67	85.90	76.95	80.96	66.80
BTDNet [14]	Swin-B [71]	BERT [72]	95.13	83.60	94.20	75.07	90.72	62.69	68.91	34.40	19.49	9.14	87.92	80.57	80.61	67.95
LSCF [13]	Swin-B [71]	BERT [72]	97.22	87.51	96.30	82.89	93.75	75.85	89.58	62.65	72.92	33.77	90.80	83.27	89.45	77.44
RS2-SAM2 [18]	SAM2-L [15]	BEIT-3 [72]	95.36	84.31	94.90	79.42	92.58	70.89	83.76	55.70	36.66	21.19	88.03	80.87	85.21	73.90
RSRefSeg2 [19]	SAM [15]	CLIP [47]	96.52	87.45	94.66	81.29	92.34	72.26	87.70	57.73	72.85	32.97	90.54	80.33	89.70	76.17
Ours	SAM [15]	CLIP [47]	97.22	88.06	95.13	82.94	93.04	73.80	90.26	59.60	77.49	37.15	91.42	83.55	91.06	77.85
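
For readers unfamiliar with the column headers, the sketch below shows how Pr@X, cIoU and gIoU are commonly computed in the referring segmentation literature: Pr@X is the percentage of samples whose IoU exceeds X, cIoU is the cumulative intersection over the cumulative union, and gIoU is the mean of per-sample IoUs. These definitions are an assumption here; the paper's Section 4.3 is the authoritative source, including whether the threshold comparison is strict. Masks are flat lists of 0/1 values, and the function names are illustrative.

```python
def iou_parts(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union

def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute Pr@X, gIoU and cIoU (all in percent) over a set of samples."""
    ious, total_inter, total_union = [], 0, 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_parts(pred, gt)
        ious.append(inter / union if union else 0.0)
        total_inter += inter
        total_union += union
    n = len(ious)
    # Assumed convention: a sample counts as a hit when its IoU is strictly above X.
    report = {f"Pr@{t}": 100.0 * sum(i > t for i in ious) / n for t in thresholds}
    report["gIoU"] = 100.0 * sum(ious) / n
    report["cIoU"] = 100.0 * total_inter / total_union if total_union else 0.0
    return report

# Two tiny 4-pixel samples: IoU = 2/3 for the first, 0 for the second.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 1, 1, 0], [0, 1, 0, 0]]
report = evaluate(preds, gts)
```

Because cIoU pools pixels across samples, it is dominated by large targets, whereas gIoU weights every sample equally. This is one reason the two columns can rank methods differently.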

Table 3. Quantitative comparison with state-of-the-art methods on the RRSIS-D benchmark. (Bold values indicate the best performance among all comparison methods).

Method	Vision	Text	Pr@0.5 Val	Pr@0.5 Test	Pr@0.6 Val	Pr@0.6 Test	Pr@0.7 Val	Pr@0.7 Test	Pr@0.8 Val	Pr@0.8 Test	Pr@0.9 Val	Pr@0.9 Test	cIoU Val	cIoU Test	gIoU Val	gIoU Test
RNN [24]	R-101 [21]	LSTM [70]	51.09	51.07	42.47	42.11	33.04	32.77	20.80	21.57	6.14	6.37	66.53	66.43	46.06	45.64
CMSA [34]	R-101 [21]	–	55.68	55.32	48.04	46.45	38.27	37.43	26.55	25.39	9.02	8.15	69.68	69.39	48.85	48.54
LSCM [31]	R-101 [21]	LSTM [70]	57.12	56.02	48.04	46.25	37.87	37.70	26.37	25.28	7.93	8.27	69.28	69.05	50.36	49.92
BRINet [26]	R-101 [21]	LSTM [70]	58.79	56.90	49.54	48.77	39.65	39.12	28.21	27.03	9.19	8.73	70.73	69.88	51.14	49.65
CMPC+ [33]	R-101 [21]	LSTM [70]	59.19	57.65	49.36	47.51	38.67	36.97	25.91	24.33	8.16	7.78	70.14	68.64	51.41	50.24
LGCE [11]	Swin-B [71]	BERT [72]	73.10	73.48	66.38	66.62	55.86	55.04	42.87	41.51	23.33	23.33	77.10	76.28	63.49	63.12
RMSIN [12]	Swin-B [71]	BERT [72]	75.40	74.20	67.70	68.00	56.67	56.59	43.68	42.40	24.94	23.70	77.91	77.51	65.11	64.21
FIANet [48]	Swin-B [71]	BERT [72]	75.13	74.47	67.45	66.96	56.95	56.31	44.37	42.83	24.48	24.13	77.53	76.91	64.97	64.01
MAFN [50]	Swin-B [71]	BERT [72]	76.32	75.27	69.31	68.14	58.33	56.79	44.54	43.49	24.71	23.76	78.33	77.41	66.03	64.76
BTDNet [14]	Swin-B [71]	BERT [72]	77.87	75.93	71.26	69.92	60.11	59.29	47.53	46.25	27.70	27.46	79.29	79.23	66.89	66.04
LSCF [13]	Swin-B [71]	BERT [72]	75.17	74.30	67.93	67.69	57.99	56.32	44.94	43.08	25.98	25.67	78.14	77.42	65.15	64.25
RS2-SAM2 [18]	SAM2-L [15]	BEIT-3 [72]	79.25	77.56	74.08	72.34	63.85	61.76	50.57	47.92	30.40	29.73	80.16	78.99	68.81	66.72
RSRefSeg2 [19]	SAM [15]	CLIP [47]	79.71	79.49	74.83	74.06	65.17	64.55	52.64	50.13	32.36	31.34	79.19	78.50	69.68	68.78
Ours	SAM [15]	CLIP [47]	80.23	80.93	75.57	76.07	66.95	66.27	53.10	51.02	32.47	31.14	79.18	79.37	69.93	69.67

Table 4. Per-class performance comparison on the RRSIS-D test set. (Bold values indicate the best performance among all comparison methods).

Category	RMSIN	LGCE	FIANet	RSRefSeg2	SGSRF (Ours)
Airport	68.08	68.11	68.66	72.95	63.13
Golf field	56.11	56.43	57.07	79.09	79.06
Expressway service area	76.68	77.19	77.35	76.49	77.83
Baseball field	66.93	70.93	70.44	87.27	87.25
Stadium	83.09	84.90	84.87	88.73	87.90
Ground track field	81.91	82.54	82.00	79.88	83.08
Storage tank	73.65	73.33	76.99	79.05	80.93
Basketball court	72.26	74.37	74.86	74.91	71.82
Chimney	68.42	68.44	68.41	83.97	85.15
Tennis court	76.68	75.63	78.48	79.98	79.05
Overpass	70.14	67.67	70.01	66.17	68.91
Train station	62.67	58.19	61.30	67.74	71.19
Ship	64.64	63.48	65.96	71.65	74.11
Expressway toll station	65.71	61.63	64.82	75.03	73.37
Dam	68.70	64.54	71.31	65.83	69.25
Harbor	60.40	60.47	62.03	46.79	43.18
Bridge	36.74	34.24	37.94	55.56	57.52
Vehicle	47.63	43.12	49.66	55.10	55.84
Windmill	41.99	40.76	46.72	62.06	62.74
Airplane	60.17	56.43	60.32	72.95	73.86
Average	65.13	64.12	66.46	72.06	72.26
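
The final row of Table 4 appears to be an unweighted (macro) mean over the 20 categories. The short check below reproduces it for the last two columns using the values listed in the table. The variable names are illustrative, and the macro-mean reading is an inference from the numbers rather than something stated in the caption.

```python
# Per-class scores copied from Table 4, in row order (airport ... airplane).
rsrefseg2 = [72.95, 79.09, 76.49, 87.27, 88.73, 79.88, 79.05, 74.91, 83.97, 79.98,
             66.17, 67.74, 71.65, 75.03, 65.83, 46.79, 55.56, 55.10, 62.06, 72.95]
ours = [63.13, 79.06, 77.83, 87.25, 87.90, 83.08, 80.93, 71.82, 85.15, 79.05,
        68.91, 71.19, 74.11, 73.37, 69.25, 43.18, 57.52, 55.84, 62.74, 73.86]

def macro_mean(scores):
    """Unweighted mean: every category counts equally, whatever its sample count."""
    return sum(scores) / len(scores)

# Number of categories where the last column is ahead of RSRefSeg2.
wins = sum(o > r for o, r in zip(ours, rsrefseg2))
```

Rounded to two decimals, the two means match the reported 72.06 and 72.26. A macro mean gives rare categories such as harbor the same weight as frequent ones, so a large drop on a single class can offset several small gains.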

Table 5. Quantitative comparison with state-of-the-art methods on the RISBench benchmark (Val/Test). (Bold values indicate the best performance among all comparison methods).

Method	Vision	Text	Pr@0.5 Val	Pr@0.5 Test	Pr@0.6 Val	Pr@0.6 Test	Pr@0.7 Val	Pr@0.7 Test	Pr@0.8 Val	Pr@0.8 Test	Pr@0.9 Val	Pr@0.9 Test	cIoU Val	cIoU Test	gIoU Val	gIoU Test
RNN [24]	R-101 [21]	LSTM [70]	54.62	55.04	46.88	47.13	39.57	39.86	32.64	32.58	11.57	13.24	47.28	49.67	42.65	43.18
LSCM [31]	R-101 [21]	LSTM [70]	55.87	55.26	47.24	47.14	40.22	40.10	33.55	33.29	12.78	13.91	47.99	50.08	43.21	43.69
BRINet [26]	R-101 [21]	LSTM [70]	52.11	52.87	45.17	45.39	37.98	38.64	30.88	30.79	10.28	11.86	46.27	48.73	41.54	42.91
CMPC+ [33]	R-101 [21]	LSTM [70]	57.84	58.02	49.24	49.00	42.34	42.53	35.77	35.26	14.55	17.88	50.29	53.98	45.81	46.73
LGCE [11]	Swin-B [71]	BERT [72]	70.58	71.19	65.72	65.92	58.60	58.94	46.87	47.87	24.68	27.86	70.55	74.34	62.39	63.39
RMSIN [12]	Swin-B [71]	BERT [72]	74.43	74.56	69.37	69.47	62.75	62.39	51.30	51.56	26.90	30.71	69.41	73.81	65.75	66.79
FIANet [48]	Swin-B [71]	BERT [72]	74.66	74.80	70.21	69.86	63.41	63.16	52.49	52.70	29.51	31.87	70.17	74.23	66.16	66.83
LSCF [13]	Swin-B [71]	BERT [72]	75.67	76.08	70.99	71.29	64.58	64.96	54.92	55.13	36.38	36.73	69.93	74.88	67.88	68.53
RSRefSeg2 [19]	SAM [15]	CLIP [47]	78.14	78.50	74.70	74.94	70.38	70.30	63.74	63.43	50.50	49.91	70.54	74.77	72.05	72.62
Ours	SAM [15]	CLIP [47]	78.11	79.48	74.85	76.03	70.43	71.36	63.87	63.95	50.94	50.68	70.61	75.74	72.03	73.30

Table 6. Computational complexity comparison with state-of-the-art methods. (Bold values indicate the best performance among all comparison methods).

Method	Total Parameters	Trainable Parameters	GFLOPs	Inference Time (ms)	Inference FPS
RMSIN	240.04 M	240.04 M	1712.19	71.3	14.02
LGCE	276.26 M	276.26 M	1745.23	66.23	15.1
FIANet	251.92 M	251.92 M	1712.90	73.49	13.61
RSRefSeg2	1447.13 M	97.87 M	2563.51	267.29	3.74
Ours	1407.38 M	58.12 M	2570.23	263.2	3.8
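
Two relations in Table 6 can be checked directly from the listed numbers. The FPS column equals 1000 divided by the inference time to within rounding, which implies the time is in milliseconds per image. The trainable share of each SAM-based model follows from the two parameter columns. The helper names below are illustrative.

```python
# (total params in M, trainable params in M, inference time, reported FPS) from Table 6.
table6 = {
    "RMSIN": (240.04, 240.04, 71.3, 14.02),
    "LGCE": (276.26, 276.26, 66.23, 15.1),
    "FIANet": (251.92, 251.92, 73.49, 13.61),
    "RSRefSeg2": (1447.13, 97.87, 267.29, 3.74),
    "Ours": (1407.38, 58.12, 263.2, 3.8),
}

def fps_from_latency_ms(latency_ms):
    """Frames per second implied by a per-image latency given in milliseconds."""
    return 1000.0 / latency_ms

def trainable_share(total_m, trainable_m):
    """Percentage of parameters that are updated during training."""
    return 100.0 * trainable_m / total_m

def frozen_params(total_m, trainable_m):
    """Parameters (in M) that stay fixed, for example a frozen foundation model."""
    return total_m - trainable_m

shares = {name: trainable_share(t, tr) for name, (t, tr, _, _) in table6.items()}
```

With these values, about 4.1% of the proposed model's parameters are trainable, against about 6.8% for RSRefSeg2. Both leave the same 1349.26 M parameters frozen, so the difference in total size comes entirely from the trainable part.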

Table 7. Ablation study on the RRSIS-D validation set. (√: The corresponding module is used; –: The module is not used. Bold values indicate the best performance among all comparison methods).

Foundation Model: SAM2	Foundation Model: CLIP-T	Foundation Model: CLIP-I	GPG: SP	GPG: DP	ASCR: BCMI	ASCR: CLSM	ASCR: SCR	GMFE	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	cIoU	gIoU
√	√	–	√	–	–	–	–	–	76.96	71.42	61.59	47.92	29.65	76.28	66.48
√	√	√	√	–	–	–	–	–	77.31	72.08	62.02	48.18	29.99	77.35	66.72
√	√	√	√	√	–	–	–	–	79.84	74.63	64.78	49.64	30.62	78.54	68.72
√	√	√	√	√	√	–	–	–	80.21	75.09	65.01	51.11	31.34	79.20	69.32
√	√	√	√	√	√	√	–	–	80.58	75.18	65.12	50.53	31.72	79.49	69.33
√	√	√	√	√	√	√	–	√	80.84	75.87	66.02	50.13	31.23	79.50	69.62
√	√	√	√	√	√	–	√	–	80.95	76.01	65.58	50.96	30.97	79.31	69.53
√	√	√	√	√	√	√	√	–	81.10	75.98	65.44	50.65	31.69	79.83	69.56
√	√	√	√	√	√	√	√	√	81.67	76.59	66.50	51.59	31.74	80.54	70.31

Table 8. Ablation study on the number of embeddings and the consistency constraint coefficient $\lambda$.

Module	Embeddings	Pr@0.5	Pr@0.6	Pr@0.7	Pr@0.8	Pr@0.9	cIoU	gIoU
Context	6 (Z)	80.82	75.93	66.47	52.22	31.79	79.86	69.98
Context	6 (R)	80.82	76.13	66.76	51.79	31.84	80.50	70.26
Context	1	81.69	76.21	65.98	51.36	31.53	80.09	70.05
Context	3	81.18	76.50	66.33	51.73	31.99	80.22	70.06
Context	6	81.67	76.59	66.50	51.59	31.74	80.54	70.31
Context	9	80.87	76.01	65.69	51.44	31.30	79.57	70.08
Context	12	80.43	75.51	66.43	51.66	31.56	80.41	69.99
Sparse	3	80.72	76.15	65.82	50.98	31.25	79.68	70.06
Sparse	6	80.88	76.31	66.09	51.15	31.51	79.86	70.29
Sparse	9	81.67	76.59	66.50	51.59	31.74	80.54	70.31
Sparse	12	80.97	75.58	65.44	50.19	31.53	79.93	70.06
$\lambda_{cons}$	0	80.84	75.87	66.02	50.13	31.23	79.50	69.62
$\lambda_{cons}$	0.1	81.01	76.04	65.53	50.70	31.49	79.73	69.60
$\lambda_{cons}$	0.3	81.50	76.13	65.73	50.56	31.26	80.05	69.85
$\lambda_{cons}$	0.5	81.67	76.59	66.50	51.59	31.74	80.54	70.31
$\lambda_{cons}$	0.7	80.75	75.67	65.99	50.76	31.94	80.01	69.80

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dong, S.; Xie, J.; Chen, L.; Chen, H.; Qi, B.; Ge, Y. Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation. Remote Sens. 2026, 18, 1015. https://doi.org/10.3390/rs18071015
