CCR addresses the core bottleneck of RRSIS, namely unreliable semantic–spatial correspondence between remote sensing imagery and referring expressions. Remote sensing scenes with cluttered backgrounds, repetitive textures, and pronounced scale variations, coupled with referring descriptions containing fine-grained semantic modifiers and spatial relations, often cause conventional token–patch alignment to produce inconsistent visual activations, resulting in unstable spatial anchoring, grounding drift, and degraded prompt generation and segmentation quality. The goal of CCR is to generate consistency-enhanced cross-modal representations and reliable spatial grounding cues that serve as structured semantic and spatial priors for subsequent stages, which it achieves via a stepwise refinement strategy. First, a CLIP-based dual encoder extracts initial coarse-grained cross-modal features, with Low-Rank Adaptation (LoRA) applied to the visual and textual encoders as a lightweight means of bridging the domain discrepancy between natural-image pre-training and remote sensing imagery. Second, cross-modal interactions are strengthened to facilitate global semantic exchange, enhancing the textual tokens’ capture of scene-level context and the visual tokens’ target awareness. Finally, ASCR enforces relational consistency between linguistic dependency structures and visual response patterns, constraining cross-modal attention at the structural level to promote coherent spatial responses for semantically related words and to suppress spurious background activations. This progressive refinement enables CCR to output semantically aligned and spatially stable cross-modal representations, establishing reliable grounding cues to support precise prompt generation and downstream segmentation.
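To make the LoRA adaptation step concrete, the following NumPy sketch shows the standard low-rank update applied to a single frozen linear projection. It is a minimal illustration under stated assumptions, not the configuration used in this work: the function name `lora_linear`, the rank `r`, the scaling factor `alpha`, and all dimensions are hypothetical, and the frozen weight `W` merely stands in for one projection inside a CLIP encoder.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16.0):
    """Apply a frozen weight W plus a low-rank update (alpha / r) * B @ A.

    x: (n, d_in) inputs, W: (d_out, d_in) frozen weight,
    A: (r, d_in) and B: (d_out, r) are the only trainable parameters.
    """
    r = A.shape[0]                        # rank of the update
    delta = (alpha / r) * (B @ A)         # (d_out, d_in), rank at most r
    return x @ (W + delta).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2                  # hypothetical sizes
W = rng.normal(size=(d_out, d_in))        # stands in for a frozen CLIP projection
A = 0.01 * rng.normal(size=(r, d_in))     # trainable down-projection
B_init = np.zeros((d_out, r))             # zero init: the adapted layer starts as W
B_tuned = rng.normal(size=(d_out, r))     # stands in for B after some training
x = rng.normal(size=(4, d_in))

y_init = lora_linear(x, W, A, B_init)     # identical to the frozen layer
y_tuned = lora_linear(x, W, A, B_tuned)   # frozen layer plus low-rank correction
```

With `B` initialized to zero the adapted layer reproduces the pre-trained one exactly, and only `r * (d_in + d_out)` parameters per adapted projection are trained, which is what makes the adaptation lightweight.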
3.2.2. Attention-Induced Structural Consistency Regularization Module
Although the CLIP encoder and LoRA adaptation provide preliminary cross-modal alignment, RRSIS still suffers from a critical limitation. Textual descriptions rarely describe the entire remote sensing scene; instead, they locate the target region through multiple semantic modifiers and spatial relations. However, the complex characteristics of remote sensing imagery, such as cluttered backgrounds, repetitive textures, and large scale variations, often cause semantically related text tokens to attend to inconsistent or erroneous visual regions when relying only on conventional cross-modal attention, which in turn leads to unstable spatial anchoring and localization drift. Most existing cross-modal alignment methods establish token–patch or region–word correspondences via mechanisms such as cross-attention (e.g., TRANSVG [59], IMCA [60]), token-level similarity matching (e.g., TokenFlow [61], CHAN [62], BOOM [63]), or region-based grounding (e.g., MSSAN [64,65]). These methods typically model fine-grained interactions through point-to-point matching between textual tokens and visual regions, but largely neglect the structural relationships among text tokens (such as syntactic dependencies, modifier–head relations, and compositional semantics) and fail to leverage complementary structural cues in the visual space. Robust cross-modal grounding, especially in complex remote sensing scenes, requires iterative interactions between text and image structures, where linguistic dependencies and visual spatial patterns dynamically reinforce and constrain each other to achieve consistent and semantically coherent alignment.
From the linguistic perspective, text self-attention naturally encodes rich semantic and syntactic dependencies, including attribute–entity relations, modifier–head relations, and relative spatial constraints. In principle, these relational structures should correspond to consistent and cooperative visual response patterns in RRSIS, since semantically related text tokens should attend to the same or spatially adjacent visual regions. Based on this motivation, we propose the ASCR module. The core idea is to construct token-level relational structures from the visual response distributions induced by cross-modal cross-attention, and to explicitly align these structures with the linguistic dependencies encoded in text self-attention via structural consistency regularization. This mechanism stabilizes cross-modal semantic alignment at the structural level, suppressing conflicting visual matching and alleviating localization drift.
The implementation of ASCR is divided into four key steps: Context-enhanced Linguistic Structure Modeling (CLSM), Bidirectional Cross-Modal Interaction (BCMI), Vision-induced Token Relational Structure Construction (VTRSC), and Structural Consistency Regularization (SCR). As illustrated in Figure 3, the process and parameter settings are elaborated as follows:
Context-enhanced Linguistic Structure Modeling (CLSM): To enhance the ability of textual tokens to capture target-specific contextual information in remote sensing scenes, we introduce a set of learnable context tokens
, where
denotes the number of context tokens. These context tokens are initialized from pooled representations of the original text embeddings
L, and augmented with learnable positional embeddings during attention operation. They are concatenated with the original textual tokens
to form an enhanced textual sequence:
Self-attention is then applied to
U to generate the initial linguistic semantic description
, which encodes both the semantic and syntactic relations among text tokens and context tokens. The textual self-attention process is formulated as
where
are learnable projection matrices, and
denotes the dimension of the query and key vectors. The self-attention matrix
encodes the latent linguistic structural and syntactic relations among the tokens in
U, which serves as the reference structure for subsequent consistency regularization.
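As an illustration of the CLSM step, the following NumPy sketch builds context tokens from pooled text embeddings, concatenates them with the text tokens, and computes a scaled dot-product self-attention matrix over the enhanced sequence. It is a minimal single-head sketch under stated assumptions rather than the implementation used in this work: the names `clsm`, `n_ctx`, `ctx_offset`, and `A_self`, the use of mean pooling, the `[L; C]` concatenation order, and the omission of positional embeddings are all choices made here for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clsm(L, ctx_offset, W_q, W_k, W_v):
    """L: (N, d) text embeddings; ctx_offset: (n_ctx, d) learnable parameters.

    Returns the enhanced sequence U, its self-attention matrix, and the
    attended features. Mean pooling and the [L; C] order are assumptions.
    """
    pooled = L.mean(axis=0, keepdims=True)             # (1, d) pooled text summary
    C = pooled + ctx_offset                            # (n_ctx, d) context tokens
    U = np.concatenate([L, C], axis=0)                 # (N + n_ctx, d)
    Q, K, V = U @ W_q, U @ W_k, U @ W_v
    A_self = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # each row sums to 1
    return U, A_self, A_self @ V

rng = np.random.default_rng(0)
N, d, n_ctx = 7, 16, 2                                 # hypothetical sizes
L = rng.normal(size=(N, d))
ctx_offset = 0.01 * rng.normal(size=(n_ctx, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
U, A_self, H = clsm(L, ctx_offset, W_q, W_k, W_v)
```

In this sketch, `A_self[i, j]` is the attention weight that token `i` places on token `j`, so the matrix as a whole plays the role of the linguistic reference structure described above.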
Bidirectional Cross-Modal Interaction (BCMI): To enable bidirectional information exchange between the enhanced textual sequence and the visual tokens, we perform two-stage cross-attention between the enhanced textual features and the visual tokens F from the CLIP image encoder. This interaction allows the textual tokens to guide the extraction of target-related visual features, while the visual tokens enrich the semantic representation of the textual tokens, thus strengthening the semantic–spatial correspondence between the two modalities.
The first stage of cross-attention (text-guided visual feature refinement) is formulated as
where
,
, and
are learnable projection matrices for the query, key, and value, respectively.
is the cross-attention matrix between textual tokens and visual tokens, and
is the refined textual feature after interacting with visual features.
The second stage of cross-attention (visual-guided textual feature refinement) is formulated as
where
,
, and
are learnable projection matrices.
is the cross-attention matrix between visual tokens and textual tokens, and
is the refined visual feature after interacting with textual features.
Through this two-stage cross-attention interaction, the learnable context tokens C in U aggregate target-relevant global contextual information from the visual tokens F, further enhancing the ability of the textual tokens to distinguish the target from cluttered backgrounds. The cross-attention matrix characterizes the visual response distribution induced by language queries, which forms the basis for constructing the vision-induced relational structure.
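The two-stage interaction can be sketched with a single-head cross-attention helper applied twice. This is an illustrative sketch under stated assumptions, not the exact formulation of this work: the names `cross_attention`, `A_tv`, `A_vt`, `U_ref`, and `F_ref` are hypothetical, residual connections and normalization layers are omitted, and feeding the refined textual features (rather than the original ones) into the second stage is an assumption.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(X_q, X_kv, W_q, W_k, W_v):
    """Queries come from X_q; keys and values come from X_kv."""
    Q, K, V = X_q @ W_q, X_kv @ W_k, X_kv @ W_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (len(X_q), len(X_kv))
    return A, A @ V

rng = np.random.default_rng(0)
T, P, d = 9, 25, 16                    # text + context tokens, visual patches, width
U = rng.normal(size=(T, d))            # enhanced textual features
F = rng.normal(size=(P, d))            # visual tokens

def proj():
    """Random stand-in for a learnable (d, d) projection matrix."""
    return rng.normal(size=(d, d)) / np.sqrt(d)

# Stage 1: textual tokens query the visual tokens.
A_tv, U_ref = cross_attention(U, F, proj(), proj(), proj())
# Stage 2: visual tokens query the refined textual tokens (assumed input).
A_vt, F_ref = cross_attention(F, U_ref, proj(), proj(), proj())
```

Each row of `A_tv` is a distribution over visual patches for one textual token, which is the kind of language-induced visual response distribution that the next step builds on.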
Vision-induced Token Relational Structure Construction (VTRSC): The cross-attention matrix between textual and visual tokens reflects the distribution of visual responses to the language description, and it serves as the foundation for constructing the vision-induced token relational structure. We compute the cosine similarity between the column vectors of this matrix
corresponding to the original textual tokens to obtain the vision-induced token-to-token similarity matrix
, where the context tokens
C are excluded during computation to avoid introducing bias into the alignment process. The specific formulation is as follows:
with
, where
denotes the cosine similarity function, and
(the similarity of a token to itself is 1). Each element
in the matrix characterizes the similarity or overlap between the visual regions attended to by the
i-th and
j-th text tokens in the visual space. A higher value of
indicates that the two language tokens attend to the same or spatially proximate visual regions, while a lower value indicates that they attend to different visual regions. The schematic of this process is illustrated in
Figure 4a. For clear visualization, textual tokens are grouped by semantics (target entity, spatial relation, reference entity, and location constraint) and color-coded to present the target–relation–reference–location semantic relational structure. In the attention matrix
, each row represents the correlation between the color-coded textual tokens and the image embeddings. In the
matrix, color intensity denotes the correlation and similarity between each pair of tokens.
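The construction of the vision-induced similarity matrix can be illustrated on a toy attention matrix. The sketch below is a minimal example under stated assumptions: the function name `vision_induced_similarity` and the variable `A_tv` are hypothetical, the attention matrix is stored with one row per token (so each token's visual response is a row here, not a column), and the context tokens are assumed to come after the original textual tokens.

```python
import numpy as np

def vision_induced_similarity(A_tv, n_text, eps=1e-8):
    """Cosine similarity between the visual response vectors of text tokens.

    A_tv: (T, P) attention from each token (row) to each visual patch.
    Only the first n_text rows (original text tokens) are used, so the
    context tokens are excluded from the relational structure.
    """
    R = A_tv[:n_text]
    R = R / (np.linalg.norm(R, axis=1, keepdims=True) + eps)  # unit-length rows
    S = R @ R.T                                               # pairwise cosine
    np.fill_diagonal(S, 1.0)                                  # self-similarity is 1
    return S

A_tv = np.array([
    [0.50, 0.50, 0.00, 0.00],   # text token 0
    [0.50, 0.50, 0.00, 0.00],   # text token 1: attends to the same patches as token 0
    [0.00, 0.00, 0.50, 0.50],   # text token 2: attends to disjoint patches
    [0.25, 0.25, 0.25, 0.25],   # context token, excluded below
])
S = vision_induced_similarity(A_tv, n_text=3)
```

In this toy case tokens 0 and 1 share their visual support and obtain a similarity close to 1, while token 2 attends elsewhere and obtains a similarity of 0 with both, matching the interpretation of high and low entries given above.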
Structural Consistency Regularization (SCR): The core of SCR lies in imposing a consistency constraint between the linguistic structural matrix
(derived from textual self-attention, encoding the inherent semantic and syntactic dependencies of language) and the vision-induced relational matrix
(derived from cross-modal cross-attention, capturing the association patterns between targets and contexts in visual space), as illustrated in
Figure 4a. This module adopts a bidirectional alignment mechanism to couple linguistic semantics with visual spatial relations. On the one hand, the linguistic structure provides explainable semantic priors that keep otherwise unconstrained visual matching from becoming ambiguous. On the other hand, the visual relational structure injects spatial constraints that refine the textual representations and correct their potential biases. With this bidirectional regularization, the model suppresses inconsistent visual alignments and improves the stability, interpretability, and robustness of cross-modal grounding in RRSIS.
To simplify computation, we compute the cosine similarity only between corresponding rows of the lower-triangular parts of
and
as the consistency regularization loss. Specifically, we denote the lower-triangular part (excluding the main diagonal) of a matrix
M as
, where
is the lower-triangular matrix extraction operator that retains elements below the main diagonal and sets all other elements to 0. For
and
(where
is the length of the textual token sequence), we extract the
i-th row vectors of
and
as
and
, respectively, which are formally defined as
Notably, the first token in the textual sequence
(usually the [CLS] token) represents global category information and is therefore excluded from the consistency computation to reduce the influence of global errors on fine-grained alignment. Thus, we restrict the index
i to the range
(retaining only the vectors corresponding to the actual text tokens that describe the target’s attributes and spatial relations). The consistency regularization loss is formulated as
where
B is the batch size. This constraint aligns the structural patterns encoded in the textual modality with the visual response patterns, without enforcing exact positional attention, thereby mitigating contradictory cross-modal activations and improving grounding stability.
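One plausible reading of the consistency loss can be sketched as follows. This is an assumption-laden illustration rather than the exact loss of this work: the function name `scr_loss` is hypothetical, the loss is taken to be one minus the row-wise cosine similarity averaged over rows and over the batch, both input matrices are assumed to be already restricted to the original textual tokens, and the [CLS] token is removed by dropping its row and column.

```python
import numpy as np

def scr_loss(A_self, S_vis, eps=1e-12):
    """Structural consistency loss between two (B, N, N) relation matrices.

    A_self: text self-attention restricted to the original text tokens.
    S_vis:  vision-induced token-to-token similarity.
    Index 0 along both token axes is assumed to be the [CLS] token.
    """
    # Drop [CLS], then keep only entries strictly below the main diagonal.
    a = np.tril(A_self[:, 1:, 1:], k=-1)
    s = np.tril(S_vis[:, 1:, 1:], k=-1)
    # The first remaining row has no entries below the diagonal; skip it.
    a, s = a[:, 1:], s[:, 1:]
    num = (a * s).sum(axis=-1)
    den = np.maximum(np.linalg.norm(a, axis=-1) * np.linalg.norm(s, axis=-1), eps)
    cos = num / den                      # (B, N - 2) row-wise cosine similarity
    return float(np.mean(1.0 - cos))     # average over rows and batch

rng = np.random.default_rng(0)
B, N = 2, 6                              # hypothetical batch size and token count
A_self = rng.uniform(0.1, 1.0, size=(B, N, N))
S_vis = rng.uniform(0.1, 1.0, size=(B, N, N))

loss_same = scr_loss(A_self, A_self)     # identical structures
loss_diff = scr_loss(A_self, S_vis)      # mismatched structures
```

Because cosine similarity is invariant to the scale of each row, a loss of this form compares relational patterns rather than absolute attention values, which is consistent with aligning structure without enforcing exact positional attention.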
The effect of the ASCR module is visualized in Figure 4b. The first panel shows the ground truth, where the correct target is the middle tennis court (marked in red) immediately below the upper-left tennis court, corresponding to the spatial reference in the sentence “The tennis court is below the tennis court on the upper left.” The second panel presents the result without structural consistency constraints: the semantic binding between “below” and the reference phrase “the tennis court on the upper left” is weak, so the model incorrectly associates “below” with the standalone phrase “the tennis court” and fails to parse the hierarchical spatial relation. Consequently, the query is erroneously localized to the bottom-most tennis court. With the introduction of SCR, text tokens form distinct semantic groups (red for target entity, purple for spatial relation, light blue for reference entity, yellow and green for location constraint), and the structural matrix exhibits compact semantic blocks with aggregated similarity. By enforcing consistency between linguistic syntax and visual spatial layout, the model correctly localizes the target tennis court in the middle. The overlaid attention heatmaps show that each semantic group is aligned with its corresponding visual region, supporting instance-level localization and hierarchical spatial relation parsing.
Finally, the ASCR module outputs three components: the optimized visual tokens
, the optimized text tokens
, and the context tokens
G which are extracted from
according to their original positions in
U. The outputs are computed as
Here, denotes the attention-induced cross-modal structural alignment operation applied to the input visual tokens F and text tokens L. These three components are then fed into the subsequent module to provide stable grounding cues for the SAM2 encoder.
This design stabilizes grounding in cluttered, repetitive remote sensing scenes, producing reliable multimodal anchors that serve as high-quality semantic priors for downstream spatial prompting and pixel-level segmentation.