Article

CLIP-Driven with Dynamic Feature Selection and Alignment Network for Referring Remote Sensing Image Segmentation

1 College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
2 College of Computer Science and Engineering, Changsha University, Changsha 410082, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(22), 3675; https://doi.org/10.3390/rs17223675
Submission received: 16 September 2025 / Revised: 1 November 2025 / Accepted: 4 November 2025 / Published: 8 November 2025
(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?
  • We propose CD2FSAN, a CLIP-driven single-stage framework that integrates dynamic feature selection, multi-scale aggregation and alignment, and a dynamic rotation correction decoder, explicitly addressing scale variation, orientation diversity, and cross-modal misalignment in referring remote sensing image segmentation.
  • Extensive experiments demonstrate state-of-the-art performance, while ablation studies and qualitative visualizations confirm the effectiveness of the proposed modules and the robustness of CD2FSAN in small-object localization, rotated target delineation, and fine-grained vision–language alignment.
What are the implications of the main findings?
  • This work pioneers a new CLIP-based paradigm for remote sensing segmentation that does not rely on SAM-based prompt engineering, achieving both high accuracy and computational efficiency as verified by efficiency analysis experiments.
  • The framework effectively narrows the modality gap between natural-image–trained CLIP features and domain-specific remote sensing imagery, enabling accurate segmentation under complex spatial and geometric variations.

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) aims to accurately locate and segment target objects in high-resolution aerial imagery based on natural language descriptions. Most existing approaches either directly modify Referring Image Segmentation (RIS) frameworks originally designed for natural images or employ image-based foundation models such as SAM to improve segmentation accuracy. However, current RRSIS models still face substantial challenges due to the domain gap between remote sensing and natural images, including large-scale variations, arbitrary object rotations, and complex spatial–linguistic relationships. Consequently, such transfers often lead to weak cross-modal interaction, inaccurate semantic alignment, and reduced localization precision, particularly for small or rotated objects. In addition, approaches that rely on multi-stage alignment pipelines, redundant high-level feature fusion, or the incorporation of large foundation models generally incur substantial computational overhead and training inefficiency, especially when dealing with complex referring expressions in high-resolution remote sensing imagery. To address these challenges, we propose CD2FSAN, a CLIP-driven dynamic feature selection and alignment network that establishes a unified framework for fine-grained cross-modal understanding in remote sensing imagery. This network first follows the principle of maximizing cross-modal information to dynamically select the visual representations most semantically aligned with the language from CLIP’s hierarchical features, thereby strengthening cross-modal correspondence under image domain shifts. It then performs adaptive multi-scale aggregation and alignment to integrate linguistic cues into spatially diverse visual contexts, enabling precise feature fusion across varying object scales. Finally, a dynamic rotation correction decoder with differentiable affine transformation was designed to refine segmentation by compensating for orientation diversity and geometric distortions. Extensive experiments verify that CD2FSAN consistently outperforms existing methods in segmentation accuracy, validating the effectiveness of its core components while maintaining competitive computational efficiency. These results demonstrate the framework’s strong capability to bridge the cross-modal gap between language and remote sensing imagery, highlighting its potential for advancing semantic understanding in vision–language remote sensing tasks.

1. Introduction

Referring Remote Sensing Image Segmentation (RRSIS) [1,2] is a novel task that combines remote sensing image analysis [3] with language expression. Unlike traditional segmentation methods [4,5] with known and fixed category labels, RRSIS performs open-domain segmentation based on free-form textual descriptions. This enables precise identification of specific targets in complex aerial scenes, making it valuable for applications such as urban planning [6], disaster assessment [7], environmental monitoring [8], precision agriculture [9], and land cover classification [10]. Recent advances in RRSIS have demonstrated promising results by extending techniques from Referring Image Segmentation (RIS), but the unique characteristics of aerial imagery, such as diverse spatial scales and arbitrary object orientations, pose additional challenges. These issues reflect a deeper scientific challenge in RRSIS: achieving consistent cross-modal alignment between heterogeneous visual and linguistic domains under large variations in scale, orientation, and semantics. Addressing this problem is not only critical for improving segmentation accuracy in remote sensing, but also holds broader significance for advancing multimodal representation learning under image domain shift.
Current RRSIS methods primarily follow two representative cross-modal alignment paradigms. The first is the encode-then-decode paradigm [11,12], illustrated in Figure 1a, where visual and textual features are independently encoded and fused during decoding. Although this approach is modular, it often suffers from weak cross-modal interaction, leading to suboptimal segmentation, particularly for remote sensing scenes with small, rotated, or scale-diverse targets. As shown in Figure 1a, it fails to accurately delineate arbitrarily oriented objects and frequently produces false positives or incomplete masks for small targets, reflecting its limited adaptability to the complexities of remote sensing imagery. The second is the language-aware vision encoder paradigm [2,13,14], as shown in Figure 1b, which embeds linguistic features into visual representations during encoding via cross-attention. This enables early vision-language interaction and supports a lightweight convolutional decoder design, improving segmentation accuracy. However, repeated cross-attention layers incur high computational cost and hinder parallelization, especially on large-scale, high-resolution remote sensing datasets. Moreover, the model’s limited capacity to capture fine-grained multi-scale features makes it less effective for small or densely distributed targets. Furthermore, both existing paradigms typically adopt isolated visual and textual encoders, preventing them from leveraging the shared multimodal priors learned by vision-language models such as CLIP [15]. While recent works like CRIS [16] have shown the benefits of integrating CLIP into natural image segmentation, directly applying CLIP-based models to remote sensing scenarios remains challenging. This difficulty arises not from inherent flaws in CLIP itself but from a significant domain gap, because CLIP is pretrained on natural image–text pairs that differ substantially from remote sensing data in both visual appearance and semantic structure. Remote sensing imagery is characterized by abstract textures, top-down perspectives, and domain-specific object categories that rarely appear in the pretraining data of CLIP. As a result, the cross-modal representations extracted by CLIP often suffer from distributional mismatch, which reduces alignment accuracy, particularly for small, rotated, or densely distributed targets.
To address the limitations of existing methods, we propose CD2FSAN (CLIP-Driven Dynamic Feature Selection and Alignment Network), a single-stage framework specifically tailored for RRSIS. It jointly enhances cross-modal alignment, multi-scale adaptability, and geometry-aware decoding in complex aerial scenes, enabling fine-grained understanding of spatial–linguistic relationships within remote sensing imagery. The framework comprises three task-aware modules that collaboratively address the core challenges of RRSIS. The Dynamic Feature Selection Mechanism (DFSM) maximizes cross-modal information to adaptively select language-consistent features from CLIP’s hierarchical representations, enhancing alignment robustness under domain shifts. The Multi-scale Aggregation and Alignment Module (MAAM) refines multi-scale feature fusion through asymmetric and dilated convolutions, improving small-object localization while maintaining computational efficiency. The Dynamic Rotation Correction Decoder (DRCD) introduces a differentiable affine transformation–based kernel steering mechanism to align receptive fields with object orientations, mitigating edge misalignment and enabling precise segmentation of rotated or irregular targets. Collectively, these modules form a unified architecture that directly addresses scale variation, rotation diversity, and cross-modal misalignment in RRSIS.
The main contributions of this paper are as follows:
  • We propose CD2FSAN (CLIP-Driven Dynamic Feature Selection and Alignment Network), a single-stage, CLIP-tailored framework for RRSIS that jointly optimizes cross-modal alignment, small-object localization, and geometry-aware decoding in complex aerial scenes.
  • To enhance cross-modal alignment and fine-grained segmentation, we design an integrated visual–language alignment and geometry-aware decoding framework. The DFSM adaptively selects language-consistent features from CLIP’s hierarchy based on the principle of maximizing cross-modal information, thereby strengthening cross-modal correspondence under domain shifts. The MAAM constructs scale-consistent representations for small-object localization and multi-scale fusion. The DRCD employs differentiable affine transformations to align receptive fields with object orientations, enabling precise segmentation of rotated or irregular targets.
  • Performance evaluations on three public RRSIS benchmarks demonstrate that CD2FSAN achieves state-of-the-art segmentation accuracy with competitive computational efficiency. Validation under complex remote sensing scenarios further confirms that the proposed framework maintains robust cross-modal alignment and fine-grained segmentation performance even in scenes with large scale variations, object rotations, and background interference, highlighting its practical value for real-world vision–language remote sensing applications.

2. Related Work

2.1. Referring Image Segmentation

Referring Image Segmentation (RIS) aims to localize and segment the region in an image described by a natural language expression. This task requires fine-grained semantic alignment between vision and language and has attracted increasing attention. Early RIS methods primarily relied on convolutional [17] and recurrent architectures [18] that independently encoded visual and textual inputs, followed by simple fusion operations such as element-wise multiplication or concatenation. While these approaches established the task’s foundation, their separate encoding schemes led to weak cross-modal interaction and coarse segmentation boundaries. To overcome these limitations, subsequent works introduced more sophisticated fusion and reasoning mechanisms. KWANet [19] emphasized key word awareness to assign higher importance to discriminative words in the query; CMSA [20] incorporated a cross-modal self-attention module to enhance spatial–semantic correspondence between text and visual features; BRINet [21] developed a bidirectional reasoning network to jointly infer relationships from both modalities; and LTS [14] established a strong two-stage pipeline that first localizes and then segments the target region for improved precision. Collectively, these models greatly advanced visual–language interaction and segmentation quality in natural images.
With the rise of Transformers, RIS methods began leveraging joint multimodal encoding to improve feature interaction. MDETR [22] and VLT [11] proposed unified decoder-based fusion strategies. Among them, LAVT [13] has become a widely used baseline, particularly for remote sensing tasks. It employs a Swin Transformer backbone [23] and injects language-guided attention at multiple encoder stages, which enhances the alignment between linguistic expressions and salient visual regions. While its hierarchical design facilitates cross-scale fusion, LAVT tends to focus on prominent or contextually obvious targets, and often struggles with small, cluttered, or densely distributed objects due to the absence of explicit spatial modeling or geometry-aware mechanisms. ReSTR [24] and CRIS [16] adopt dual-branch Transformer encoders followed by multimodal fusion, while PolyFormer [25] and SeqTR [26] redefine RIS as sequence prediction over boundary points. Other methods like GRES [27] and CGFormer [28] apply query-based proposal matching for region grounding. Collectively, these works demonstrate how Transformer-based architectures have reshaped RIS through more expressive and flexible cross-modal reasoning. Despite this progress, most existing RIS models are tailored to natural images, where targets are typically salient, well-aligned, and semantically coherent with human language. In contrast, remote sensing imagery presents unique challenges: varied spatial resolution, cluttered backgrounds, and frequent presence of small or rotated targets. Such characteristics often degrade the performance of standard RIS models when applied to Referring Remote Sensing Image Segmentation (RRSIS). In this work, we build upon RIS advancements and propose a CLIP-guided framework with dynamic feature selection based on cross-modal information maximization and rotation-aware decoding, aiming to improve cross-modal alignment and segmentation performance under remote sensing conditions.

2.2. Referring Remote Sensing Image Segmentation

Referring Remote Sensing Image Segmentation (RRSIS) [1] is a newly emerged multimodal task that segments specific regions in aerial imagery based on natural language expressions. Compared with traditional semantic segmentation, RRSIS enables more flexible human-computer interaction but poses unique challenges due to the high resolution, rotation, and scale diversity in remote sensing scenes [3]. To advance research in this domain, Yuan et al. introduced the RRSIS task alongside the RefSegRS dataset [1], which comprises over 4000 image-text-mask triplets. They further proposed the LGCE module, built upon the LAVT framework, to enhance multi-scale visual-linguistic fusion. Despite its foundational contributions, RefSegRS exhibits notable limitations, including ambiguous boundaries between instances and classes, as well as a lack of linguistic diversity. For example, expressions such as “road” are often annotated with masks that encompass all road regions in the image, thereby blurring the distinction between referring segmentation and open-vocabulary semantic segmentation [29] or GRES [27]. To address these limitations, Liu et al. proposed RRSIS-D [2], a large-scale benchmark with 17,402 triplets across 20 categories and 7 attributes. Featuring fine-grained, rotated, and small objects with high-quality semi-automatic annotations, it has become the most widely adopted dataset in RRSIS research. Built on this dataset, RMSIN was proposed to improve hierarchical alignment through spatial- and scale-aware interaction. More recently, RISBench [30] extended dataset diversity by introducing over 52,000 samples with more varied object categories and complex language structures. CroBIM was introduced as a strong baseline, emphasizing bidirectional interaction and spatial reasoning. However, as RISBench is newly released and currently in preprint, its adoption remains limited. Each dataset presents distinct strengths and trade-offs. RefSegRS provides basic coverage with coarse annotations, RRSIS-D emphasizes precision and rotation awareness, while RISBench focuses on linguistic diversity and generalization. We conduct comprehensive experiments across all three to ensure a robust and multidimensional evaluation of our proposed model.
Alongside these benchmarks, recent methods such as FIANet [31], MAFN [32], and RSRefSeg [33] have pushed the field forward. FIANet enhances vision–language alignment by disentangling object and positional cues; MAFN adopts correlation-guided multi-scale fusion to handle rotation and scale variance; and RSRefSeg leverages CLIP for visual–linguistic feature extraction and employs SAM for prompt-driven segmentation, demonstrating the potential of foundation models in remote sensing. Despite these advances, most existing RRSIS methods address each challenge in isolation. For instance, FIANet [31] primarily focuses on improving cross-modal alignment, MAFN [32] emphasizes scale variation modeling, and RMSIN [2] partially alleviates geometric distortion through rotation-augmented features. However, a unified framework that jointly handles alignment, scale, and orientation variation is still lacking. Moreover, while RSRefSeg highlights the feasibility of incorporating foundation models into RRSIS, the adaptation of vision–language models such as CLIP to this domain is still in its early exploratory stage. RSRefSeg employs CLIP merely as a partially trainable feature encoder, without leveraging advanced adaptation strategies, such as prompt tuning or adapter-based fine-tuning, that have recently shown effectiveness in achieving efficient domain transfer across diverse image modalities.
In contrast, our framework fully exploits CLIP’s hierarchical representations and introduces task-aware alignment mechanisms to achieve robust cross-modal adaptation under the substantial domain gap between natural and remote sensing imagery. Building upon advances in RIS and RRSIS, we propose CD2FSAN, a CLIP-tailored single-stage architecture that unifies mutual-information–driven dynamic feature selection, a multi-scale aggregation and alignment module, and a dynamic rotation correction decoder with differentiable affine transformations. These components jointly strengthen cross-modal alignment, enhance small-object localization, and improve rotation-robust segmentation under complex remote sensing conditions.

3. Materials and Methods

3.1. Overview

As illustrated in Figure 2, we propose CD2FSAN, a CLIP-driven segmentation framework tailored for RRSIS to address the challenges of semantic misalignment, spatial heterogeneity, and geometric distortion in aerial imagery.
At the front end of the architecture, the CLIP visual encoder is enhanced through a dynamic feature selection mechanism based on mutual information maximization. During the visual encoding stage, cross-modal similarity is computed between sentence-level text features F T and CLIP’s hierarchical intermediate visual embeddings for subsequent cross-modal aggregation and alignment. This process yields an adaptive feature pyramid composed of low-level F V 1 , mid-level F V 2 , and high-level F V 3 representations, facilitating early-stage alignment between language and vision cues while preserving both fine-grained and abstract semantics. To enhance spatial awareness and facilitate effective multi-scale fusion, the MAAM first consolidates the hierarchical visual features F V 1 , F V 2 , F V 3 into an intermediate representation F F . Then, this module introduces a hybrid alignment strategy that integrates Image Multi-scale Convolution (IMC) and Text Multi-scale Convolution (TMC). IMC employs directional and dilated convolutions to capture diverse spatial patterns within remote sensing imagery, while TMC applies scale-adaptive 1D convolutions to extract hierarchical linguistic structures from word-level embeddings F W . Through joint self-attention and cross-attention mechanisms, MAAM produces an aligned representation F A that encodes both spatial detail and semantic consistency across modalities. The final prediction is generated by the DRCD, which is designed to address orientation variability, a key obstacle in remote sensing segmentation. Based on the input F A , the decoder predicts sample-specific rotation angles θ and dynamically transforms convolutional kernels via differentiable affine operations. This rotation-aware process aligns the convolutional receptive fields with the dominant object orientations, resulting in a robust pose-adaptive decoding pipeline. The output features are progressively refined through a top-down pathway to produce the final segmentation mask M, with particular efficacy in capturing arbitrarily rotated or densely packed targets.

3.2. Image and Text Feature Encoding

Image Encoder. We adopt the CLIP-pretrained Vision Transformer (ViT-B) as our image encoder to extract hierarchical visual representations. For an input image $I \in \mathbb{R}^{H \times W \times 3}$, we first partition it into non-overlapping patches of a fixed size $P \times P$, and each patch is projected into a token embedding. This yields a patch sequence $I_p \in \mathbb{R}^{h \times w \times c}$, where $(h, w) = (H/P, W/P)$ and $c$ denotes the embedding dimension. The patch embeddings are then processed through a 12-layer Transformer encoder following the ViT-B architecture. We denote the output of the $i$-th Transformer block as $F_I^{(i)} \in \mathbb{R}^{h \times w \times c}$, for $i = 1, \dots, 12$, capturing progressively enriched visual features across layers.
Text Encoder. For language representation, we utilize the CLIP text encoder to process the input referring expression. The expression is tokenized and prepended with a special [SOS] token and appended with an [EOS] token to indicate sequence boundaries. The encoder outputs a sequence of contextualized word embeddings denoted as F W , representing fine-grained linguistic cues. In addition, the final embedding corresponding to the [EOS] token is extracted as the sentence-level representation F T , which encapsulates the global semantics of the entire expression.
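For clarity, the hierarchical feature extraction described above can be sketched in a few lines of PyTorch. The snippet simply iterates over the twelve ViT-B blocks and stores every intermediate output $F_I^{(i)}$; the function name and the assumption that the blocks are exposed as an `nn.ModuleList` are illustrative, not the authors’ implementation.

```python
import torch
import torch.nn as nn

def collect_block_outputs(vit_blocks: nn.ModuleList, patch_tokens: torch.Tensor):
    """Run patch tokens through each Transformer block and keep every
    intermediate output F_I^(i), i = 1..12 (hypothetical helper)."""
    feats = []
    x = patch_tokens              # (B, h*w, c) patch token sequence
    for block in vit_blocks:      # 12 blocks for ViT-B
        x = block(x)
        feats.append(x)           # store F_I^(i)
    return feats
```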

3.3. Dynamic Feature Selection Mechanism

As illustrated in Figure 3, different layers of the CLIP image encoder exhibit distinct attention distributions across semantic regions in remote sensing images, reflecting varying levels of abstraction and spatial granularity. Despite this, existing CLIP-based referring segmentation models commonly rely solely on the final-layer visual features for cross-modal interaction, thereby overlooking the rich and complementary semantic cues embedded in intermediate layers. Motivated by insights from SegFormer [34] and related hierarchical fusion frameworks [35,36], we initially explored a simple approach by randomly selecting two intermediate layers and fusing them with final-layer features through a pyramid fusion strategy. While this method occasionally yielded marginal improvements, experimental results showed that the performance gains were inconsistent across samples and configurations. In particular, we observed that the segmentation mIoU did not exhibit a clear advantage over using the final layer alone, and the results were highly sensitive to the number of randomly chosen layers. To systematically study this phenomenon, we conducted comprehensive experiments and visualization analyses on the CLIP visual layers under the same image-text pairings. These findings suggest that layer choice should not be random but language-query-adaptive. We therefore replace heuristic fusion with a mutual-information-inspired dynamic selection that casts layer selection as cross-modal information maximization. Concretely, given the sentence-level embedding $F_T$ and CLIP’s hierarchical visual features, we compute a cross-modal cosine similarity between $F_T$ and each $F_I^{(i)}$ as a tractable surrogate for cross-modal information, rank the layers, and forward the top-ranked language-consistent features to subsequent alignment. This per-sample selection exposes textual cues early in the encoder and reduces the natural-to-remote-sensing domain bias observed when relying solely on the final CLIP layer, yielding more stable alignment and stronger downstream segmentation.
Specifically, we extract visual features from layers 4 to 12 of the CLIP image encoder, denoted as $F_I^{(i)} \in \mathbb{R}^{h \times w \times c}$, where $i = 4, 5, \dots, 12$. For each layer, we use the global visual feature $F_I^{(i)}$ from the CLIP encoder, which is a pooled representation of the entire image. The global visual feature for the $i$-th layer is mapped to the same dimension as the text features using a shared linear transformation to obtain $V^{(i)}$, where $i = 4, 5, \dots, 11$:
$V^{(i)} = \mathrm{Linear}\big(F_I^{(i)}\big) \in \mathbb{R}^{d}$
The similarity score between the $i$-th visual layer and the text is defined as
$\mathrm{Score}_i = V^{(i)} \odot F_T$
where $\odot$ denotes element-wise multiplication. These similarity scores reflect the alignment between each layer’s global visual semantics and the textual description. All $\mathrm{Score}_i$ values are concatenated and passed through a lightweight selection network to produce a refined score vector:
$S = \Phi_{\mathrm{sel}}\big([\mathrm{Score}_1, \dots, \mathrm{Score}_{11}]\big) \in \mathbb{R}^{11}$
The selection network $\Phi_{\mathrm{sel}}$ plays a crucial role in identifying the most relevant feature layers based on semantic similarity scores. In this work, we employ a simple yet effective combination of linear layers followed by a softmax activation as the selection mechanism. We select the two highest semantic similarity scores from the vector $S$ and record the corresponding indices, with the constraint that the lower index precedes the higher one. We adopt top-2 as a practical and stable setting rather than a theoretical optimum, leaving dynamic selection of the number of layers to future work.
$k, m = \underset{i \neq j}{\arg\max}\,(S_i, S_j), \quad \text{with } k < m$
These indices, $k$ and $m$, are then used to directly select the corresponding feature layers $F_I^{(k)}$ and $F_I^{(m)}$ from the visual feature layers:
$F_{V1} = F_I^{(k)}, \quad F_{V2} = F_I^{(m)}, \quad k < m$
Here, $F_I^{(k)}$ and $F_I^{(m)}$ represent the visual feature layers selected based on the indices $k$ and $m$, which correspond to the highest similarity scores in $S$. We further include the final-layer output of the CLIP encoder as the high-level visual feature:
$F_{V3} = F_I^{(12)}$
This feature $F_{V3}$ captures high-level global semantics of the image. Initially, all three features, $F_{V1}$, $F_{V2}$, and $F_{V3}$, have the same spatial resolution; to enhance feature expressiveness, we apply downsampling to introduce different spatial resolutions. As a result, these features share the same channel dimension $C$ but differ in spatial resolution:
$F_{V3} \in \mathbb{R}^{H \times W \times C}, \quad F_{V2} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times C}, \quad F_{V1} \in \mathbb{R}^{\frac{H}{4} \times \frac{W}{4} \times C}$
This multi-scale representation enhances spatial granularity and contextual diversity, benefiting subsequent cross-modal alignment.
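A minimal PyTorch sketch of the selection step is given below, assuming eight candidate layers (CLIP layers 4–11), a 768-dimensional visual width, a 512-dimensional text embedding, and mean pooling as the global visual feature; these dimensions and the single-sample index gathering are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class DynamicFeatureSelection(nn.Module):
    """Sketch of DFSM: score candidate CLIP layers against the sentence
    embedding and keep the two most language-consistent ones."""

    def __init__(self, vis_dim=768, txt_dim=512, num_layers=8):
        super().__init__()
        self.proj = nn.Linear(vis_dim, txt_dim)           # shared projection to V^(i)
        self.selector = nn.Sequential(                    # lightweight Phi_sel
            nn.Linear(num_layers * txt_dim, num_layers),
            nn.Softmax(dim=-1),
        )

    def forward(self, layer_feats, f_t):
        # layer_feats: list of 8 tensors (B, N, vis_dim) from CLIP layers 4-11
        # f_t: sentence-level text feature (B, txt_dim)
        pooled = [f.mean(dim=1) for f in layer_feats]          # global visual feature per layer
        scores = [self.proj(p) * f_t for p in pooled]          # element-wise similarity Score_i
        s = self.selector(torch.cat(scores, dim=-1))           # refined score vector S
        top2 = s.topk(2, dim=-1).indices.sort(dim=-1).values   # indices with k < m
        # per-sample gather of the two selected layers (batch size 1 shown for brevity);
        # index 0..7 corresponds to CLIP layers 4..11
        k, m = top2[0, 0].item(), top2[0, 1].item()
        return layer_feats[k], layer_feats[m], s
```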

3.4. Multi-Scale Aggregation and Alignment Module

In the visual-linguistic fusion and alignment stage, we begin by fusing the visual features $F_{V1}$, $F_{V2}$, and $F_{V3}$, obtained in the previous stage, with the global sentence-level textual features $F_T$ through multi-level feature aggregation. This aggregation produces the initial combined representation $F_F \in \mathbb{R}^{H \times W \times C}$. To improve cross-modal alignment between the visual and textual modalities, we introduce a dual-attention mechanism, which includes both a multi-scale self-attention module and a cross-attention module. This dual-attention mechanism refines $F_F$, enhancing the alignment accuracy.
Unlike natural images, remote sensing imagery exhibits significant scale variations and often contains numerous small objects, which are frequently overshadowed by larger, more dominant features. This phenomenon makes it difficult for traditional models to accurately align these small objects with their corresponding textual descriptions. Moreover, the complexity of remote sensing scenes can lead to false positives during small object recognition, especially when background elements share similar visual features. To address these challenges, we suggest a multi-scale self-attention and cross-attention mechanism, designed to capture cross-scale correlations within the visual features and improve the representation of fine-grained objects. This mechanism, inspired by recent advances in remote sensing segmentation and referring image comprehension, allows the fused features F F to interact with the textual embedding F W across multiple scales. This interaction facilitates robust semantic alignment between the visual and textual modalities. The computational architecture for this multi-scale self-attention and cross-attention mechanism is illustrated in Figure 4a.
Specifically, given the fused multimodal features $F_F$ at three hierarchical levels, we apply multi-scale self-attention through the IMC module to enhance visual encoding. As illustrated in Figure 4b, the IMC adopts a lightweight and effective architecture composed of three parallel convolutional branches with complementary inductive biases tailored to remote sensing imagery. Each input feature map $F_F$ is first processed by a $1 \times 1$ convolution to adjust the channel dimension and is then passed through multiple convolutional branches that extract complementary semantic patterns. The outputs of these branches are aggregated to form the final representation, enabling the model to capture multi-scale contextual dependencies within a unified structure. Each branch in the IMC module employs distinct kernel configurations to capture diverse contextual information. The dilated branch enlarges the receptive field and encodes large-scale spatially sparse structures such as aprons and industrial compounds. The asymmetric $1 \times 3$ and $3 \times 1$ branches emphasize elongated and directional structures such as roads, rivers, and runways. The standard $3 \times 3$ branch preserves fine local details and boundaries and alleviates the over-smoothing of small objects. To further improve computational efficiency, all convolutions are implemented in a depthwise separable form, which significantly reduces computation while maintaining strong representational capacity. This design provides complementary spatial cues that strengthen cross-scale representation and enhance discriminability for subsequent cross-modal alignment. The overall computation of the IMC module is formalized as follows:
$F_F' = \Phi_{\mathrm{Conv}_{1 \times 1}}(F_F)$
$\mathrm{branch}_1 = \hat{\Phi}_{\mathrm{Conv}_{3 \times 3}}(F_F')$
$\mathrm{branch}_2 = \Phi_{\mathrm{Conv}_{3 \times 3}}\big(\Phi_{\mathrm{Conv}_{3 \times 1}}\big(\Phi_{\mathrm{Conv}_{1 \times 3}}(F_F')\big)\big)$
$\mathrm{branch}_3 = \Phi_{\mathrm{Conv}_{3 \times 3}}\big(\Phi_{\mathrm{Conv}_{1 \times 3}}\big(\Phi_{\mathrm{Conv}_{3 \times 1}}(F_F')\big)\big)$
$F_{MSA} = \Phi_{\mathrm{Conv}_{1 \times 1}}\big(\mathrm{Cat}(\mathrm{branch}_1, \mathrm{branch}_2, \mathrm{branch}_3)\big) \oplus F_F'$
where $\mathrm{branch}_1$, $\mathrm{branch}_2$, and $\mathrm{branch}_3$ are the outputs of the three branches, each consisting of different convolution operations. Specifically, $\Phi_{\mathrm{Conv}_{1 \times 1}}$ and $\Phi_{\mathrm{Conv}_{3 \times 3}}$ are standard depthwise separable convolutions, while $\Phi_{\mathrm{Conv}_{3 \times 1}}$ and $\Phi_{\mathrm{Conv}_{1 \times 3}}$ refer to directional convolutions with different kernel shapes. Additionally, $\hat{\Phi}_{\mathrm{Conv}_{3 \times 3}}$ represents a dilated convolution with a dilation rate of 5. The operator $\mathrm{Cat}(\cdot)$ indicates channel-wise concatenation, and $\oplus$ denotes element-wise addition for residual enhancement. The use of directional convolutions (e.g., $1 \times 3$ and $3 \times 1$) improves the extraction of structured edge information, which is particularly beneficial for detecting small or elongated objects in remote sensing imagery. Moreover, the combination of multi-branch design and dilation allows the module to capture both fine-grained details and broader spatial context, thereby significantly improving the quality and discriminability of the visual representations.
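The three-branch computation can be sketched as follows in PyTorch. The channel width, the placement of the dilation-5 convolution on the first branch, and the "same"-padding helper are assumptions made for illustration; only the overall branch structure follows the equations above.

```python
import torch
import torch.nn as nn

def dsconv(ch, kernel, dilation=1):
    """Depthwise-separable conv that keeps the spatial size ('same' padding)."""
    kh, kw = kernel
    pad = (dilation * (kh - 1) // 2, dilation * (kw - 1) // 2)
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel, padding=pad, dilation=dilation, groups=ch),
        nn.Conv2d(ch, ch, 1),
    )

class IMC(nn.Module):
    """Sketch of the Image Multi-scale Convolution block (channel count assumed)."""
    def __init__(self, ch=256):
        super().__init__()
        self.reduce = nn.Conv2d(ch, ch, 1)
        self.branch1 = dsconv(ch, (3, 3), dilation=5)    # large receptive field (dilated)
        self.branch2 = nn.Sequential(dsconv(ch, (1, 3)), dsconv(ch, (3, 1)), dsconv(ch, (3, 3)))
        self.branch3 = nn.Sequential(dsconv(ch, (3, 1)), dsconv(ch, (1, 3)), dsconv(ch, (3, 3)))
        self.fuse = nn.Conv2d(3 * ch, ch, 1)

    def forward(self, f):                    # f: (B, ch, H, W) fused features F_F
        f = self.reduce(f)
        out = torch.cat([self.branch1(f), self.branch2(f), self.branch3(f)], dim=1)
        return self.fuse(out) + f            # residual enhancement (element-wise add)
```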
The enhanced visual features are subsequently aligned with their corresponding text embeddings through a cross-attention mechanism. Meanwhile, to strengthen the multi-scale representation of language features, we introduce a novel Text Multi-scale Convolution (TMC) module. The architecture of TMC is shown in Figure 4c. Building upon the principles of the IMC, TMC employs convolutions with varying receptive fields to capture textual information across multiple semantic scales. However, in contrast to IMC, which utilises 2D convolutions to process spatial visual features, TMC leverages 1D convolutions specifically designed for sequential data, ensuring seamless compatibility with the textual feature structure. This innovative design enhances sequence-level encoding, while simultaneously maintaining tight alignment with the visual modality, facilitating richer cross-modal interaction and improving the overall feature fusion. Given the input textual embedding denoted as F W , TMC generates a multi-scale enriched textual feature represented as
$F_{W1} = \Phi_{\mathrm{Conv}_{1 \times 1}}(F_W)$
$F_{W2} = \Phi_{\mathrm{Conv}_{3 \times 3}}\big(\mathrm{Cat}(F_W, F_{W1})\big)$
$F_W' = \Phi_{\mathrm{Conv}_{3 \times 3}}\big(\mathrm{Cat}(F_{W1}, F_{W2})\big) \oplus F_{W1}$
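A corresponding 1D sketch of TMC is shown below; the embedding width and the concatenation-based channel bookkeeping are assumptions, and $F_W$ is laid out as (batch, channels, length) for `nn.Conv1d`.

```python
import torch
import torch.nn as nn

class TMC(nn.Module):
    """Sketch of the Text Multi-scale Convolution block using 1D convolutions
    over word embeddings (channel sizes are assumptions)."""
    def __init__(self, dim=512):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size=1)
        self.conv2 = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)
        self.conv3 = nn.Conv1d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, f_w):                    # f_w: (B, dim, L) word embeddings F_W
        f1 = self.conv1(f_w)
        f2 = self.conv2(torch.cat([f_w, f1], dim=1))
        return self.conv3(torch.cat([f1, f2], dim=1)) + f1   # residual on F_W1
```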
Finally, the refined visual features $F_F$ and the enhanced multi-scale text features $F_W'$ are integrated through a cross-attention module, producing the aligned multimodal representation $F_A$. This process facilitates stronger semantic interaction and improves the localization of fine-grained objects based on referring expressions.

3.5. Dynamic Rotation Correction Decoder

To address the challenges posed by the wide range of object orientations in remote sensing imagery, we propose a novel decoder module, termed the Dynamic Rotation Correction Decoder (DRCD). Unlike traditional convolutional decoders that utilise spatially fixed kernels, DRCD explicitly models rotation variance by dynamically generating orientation-aligned filters for each input sample. This allows the decoder to better adapt to pose diversity and improve segmentation accuracy for rotated or skewed targets. The decoder design is presented in Figure 2d.
The core component of DRCD is the Dynamic Rotation Correction (DRC) mechanism. For each input sample $b$, given the aligned multimodal representation $F_A^{(b)} \in \mathbb{R}^{C \times H \times W}$, a lightweight routing network predicts a set of sample-specific rotation angles $\{\theta_i^{(b)}\}_{i=1}^{n}$, which specify the orientation of each of the $n$ rotation bases, and gating weights $\{\lambda_i^{(b)}\}_{i=1}^{n}$, which control the contribution of each rotated base kernel to the final output. Here, $n$ denotes the number of rotation bases used for the transformation. The predicted rotation angles are used to rotate each base kernel $W_i \in \mathbb{R}^{C \times k \times k}$ using a differentiable affine transformation:
$W_i^{(b)} = \mathrm{Rot}_{\theta}\big(W_i, \theta_i^{(b)}\big), \quad i = 1, \dots, n,$
where $\mathrm{Rot}_{\theta}(\cdot)$ denotes a bilinear grid-sampling rotation operator. The resulting rotated kernels are aggregated via a weighted summation, with the gating weights $\lambda_i^{(b)}$ determining the contribution of each rotated kernel to the final filter:
$W^{(b)} = \sum_{i=1}^{n} \lambda_i^{(b)} \cdot W_i^{(b)}.$
Finally, the reparameterized filters $W^{(b)}$ are applied through grouped convolution to produce the segmentation-enhanced feature maps:
$F_M^{(b)} = W^{(b)} \ast F_A^{(b)}.$
This design effectively enhances the model’s ability to handle the multi-orientation characteristics of targets in remote sensing images. By dynamically adjusting the convolution kernel based on the predicted angles, DRC can better align with the actual orientation of objects, leading to more accurate feature extraction and mask generation. The structural guidance in DRC ensures that the model focuses precisely on the target region, reducing the dispersion of attention and improving the clarity and accuracy of boundary details. This results in more precise segmentation masks, particularly for objects with complex shapes and orientations. Additionally, the replacement of a portion of the convolutional layers with DRC helps to reduce redundant feature learning, making the model more efficient and effective in capturing the essential features of remote sensing images.
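The differentiable kernel rotation can be realized with `affine_grid` and `grid_sample`, as sketched below for a single sample. The routing network that predicts $\theta_i^{(b)}$ and $\lambda_i^{(b)}$ and the grouped-convolution arrangement are omitted; this is an illustrative sketch of the rotation and weighted aggregation, not the exact decoder.

```python
import torch
import torch.nn.functional as F

def rotate_kernel(weight, theta):
    """Rotate a conv kernel (C_out, C_in, k, k) by angle theta (0-dim tensor,
    radians) with a differentiable bilinear grid sample."""
    c_out, c_in, k, _ = weight.shape
    cos, sin = torch.cos(theta), torch.sin(theta)
    # 2x3 affine matrix describing a rotation about the kernel centre
    mat = torch.stack([torch.stack([cos, -sin, torch.zeros_like(cos)]),
                       torch.stack([sin,  cos, torch.zeros_like(cos)])]).unsqueeze(0)
    grid = F.affine_grid(mat, size=(1, 1, k, k), align_corners=False)
    flat = weight.reshape(c_out * c_in, 1, k, k)
    rotated = F.grid_sample(flat, grid.expand(c_out * c_in, -1, -1, -1),
                            mode="bilinear", align_corners=False)
    return rotated.reshape(c_out, c_in, k, k)

def drc_forward(f_a, bases, thetas, lambdas):
    """Per-sample DRC step: rotate each base kernel, blend with gating weights,
    and convolve the aligned features f_a (C_in, H, W); grouping omitted."""
    w = sum(lam * rotate_kernel(wb, th) for wb, th, lam in zip(bases, thetas, lambdas))
    return F.conv2d(f_a.unsqueeze(0), w, padding=w.shape[-1] // 2)
```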
For mask supervision, we use an equal-weight combination of focal loss and Dice loss. Let $M \in \{0, 1\}^{H \times W}$ denote the ground-truth mask and $\hat{M} \in [0, 1]^{H \times W}$ the predicted probability map. The overall objective is
$\mathcal{L} = \mathcal{L}_{\mathrm{focal}}(M, \hat{M}) + \mathcal{L}_{\mathrm{Dice}}(M, \hat{M}).$
This hybrid supervision is well suited to RRSIS, where object regions are typically small, sparse, and highly imbalanced with the background. The focal loss mitigates this imbalance by focusing the optimization on hard positive pixels, while the Dice loss promotes stable optimization of region overlap, improving segmentation consistency under varying scales and orientations.
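A compact sketch of this equal-weight objective is given below; the focal parameters $\alpha$ and $\gamma$ use common defaults since the paper does not state its exact values, and the prediction is assumed to already be a probability map.

```python
import torch
import torch.nn.functional as F

def focal_dice_loss(pred, target, alpha=0.25, gamma=2.0, eps=1e-6):
    """Equal-weight focal + Dice loss on a predicted probability map."""
    # focal term: down-weight easy pixels, focus on hard (mostly foreground) pixels
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    p_t = target * pred + (1 - target) * (1 - pred)
    a_t = target * alpha + (1 - target) * (1 - alpha)
    focal = (a_t * (1 - p_t) ** gamma * bce).mean()
    # Dice term: directly optimize region overlap
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)
    return focal + dice
```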

4. Results

4.1. Dataset and Implementation Details

In this study, we evaluate the effectiveness of the proposed method using three publicly available remote sensing datasets: RefSegRS [1], RRSIS-D [2], and RISBench [30]. These recently introduced datasets have significantly contributed to the progress of the Referring Remote Sensing Image Segmentation (RRSIS) task.
  • RefSegRS. The dataset comprises 4420 image–text–label triplets drawn from 285 scenes. The training, validation, and test splits contain 2172, 431, and 1817 triplets, respectively, corresponding to 151, 31, and 103 scenes. Fourteen categories (for example, road, vehicle, car, van, and building) and five attributes are annotated. Images are 512 × 512 pixels at a ground sampling distance (GSD) of 0.13 m.
  • RRSIS-D. This benchmark contains 17,402 triplets of images, segmentation masks, and referring expressions. The training, validation, and test sets include 12,181, 1740, and 3481 triplets, respectively. It covers 20 semantic categories (for example, airplane, golf field, expressway service area, baseball field, and stadium) and seven attributes. Images are 800 × 800 pixels, with GSD ranging from 0.5 m to 30 m.
  • RISBench. The dataset includes 52,472 image–language–label triplets. The training, validation, and test partitions contain 26,300, 10,013, and 16,159 triplets, respectively. It features 26 categories with eight attributes. All images are resized to 512 × 512 pixels, with GSD ranging from 0.1 m to 30 m.
We employ a CLIP-pretrained ViT-B as the visual encoder and a Transformer as the language encoder. Images are resized to 480 × 480 pixels, and expressions are limited to 22 tokens (including [SOS] and [EOS]) on RefSegRS, RRSIS-D, and RISBench. Training is conducted for 40 epochs using the Adam optimizer with an initial learning rate of $5 \times 10^{-5}$, a batch size of 8, and two RTX 4090 GPUs. All inputs are normalized with CLIP’s mean and variance. No explicit data augmentation is applied, in order to remain consistent with CLIP’s pretraining distribution. To reduce overfitting on relatively small remote sensing datasets, the CLIP encoders use one-tenth of the base learning rate applied to task-specific modules. Automatic mixed precision is enabled, and the best model is selected based on validation mIoU.
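The differential learning-rate schedule can be expressed with two Adam parameter groups, as sketched below; the "clip" name prefix used to separate encoder parameters from task-specific modules is an assumption for illustration.

```python
import torch

def build_optimizer(model, base_lr=5e-5):
    """Two parameter groups: CLIP encoders at one tenth of the base learning
    rate, task-specific modules at the full rate (attribute names assumed)."""
    clip_params, task_params = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (clip_params if name.startswith("clip") else task_params).append(p)
    return torch.optim.Adam([
        {"params": clip_params, "lr": base_lr * 0.1},
        {"params": task_params, "lr": base_lr},
    ])
```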
Following prior studies, we use the following evaluation metrics:
  • Overall Intersection over Union (oIoU): This metric is calculated as the ratio of the cumulative intersection area to the cumulative union area across all test samples, with an emphasis on larger objects.
  • Mean Intersection over Union (mIoU): mIoU is computed by averaging the IoU values between predicted masks and ground truth annotations for each test sample, treating both small and large objects equally.
  • Precision@X: Precision@X measures the percentage of test samples for which the IoU between the predicted result and the ground truth exceeds a threshold $X \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$. This metric evaluates the model’s accuracy at specific IoU thresholds, reflecting its performance in object localization.
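All three metrics can be computed jointly from per-sample binary masks, as in the NumPy sketch below (function and variable names are illustrative).

```python
import numpy as np

def rrsis_metrics(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """oIoU, mIoU and Precision@X over binary prediction/ground-truth mask pairs."""
    inter_sum, union_sum, ious = 0, 0, []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        inter_sum += inter
        union_sum += union
        ious.append(inter / union if union > 0 else 1.0)
    ious = np.asarray(ious)
    return {
        "oIoU": inter_sum / union_sum,        # cumulative ratio, favours large objects
        "mIoU": ious.mean(),                  # per-sample average
        **{f"Pr@{t}": (ious > t).mean() for t in thresholds},
    }
```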

4.2. Comparisons with Other Methods

To ensure a fair and comprehensive comparison, we adopt the experimental results reported in the original publications, supplemented by those reproduced in subsequent studies and publicly available benchmark papers. Our evaluation includes methods specifically tailored for remote sensing imagery, such as LGCE [1], RMSIN [2], FIANet [31], and CroBIM [30], as well as general-purpose referring image segmentation approaches like LAVT [13], CrossVLT [37], and CRIS [16]. Although the latter are not explicitly designed for remote sensing scenarios, they represent the state of the art in referring image segmentation and thus serve as valuable baselines. As summarized in Table 1, RRSIS remains in its early stage, as existing datasets differ considerably in annotation scale, image sources, target categories, object size distributions, and linguistic expression styles. Unlike referring natural image segmentation, which benefits from unified benchmarks such as RefCOCO and RefCOCOg, the remote sensing domain still lacks standardized datasets, leading to inconsistent evaluation protocols and variable model performance.
Despite these challenges, CD2FSAN demonstrates consistently strong performance across three public benchmarks. It achieves state-of-the-art oIoU on both validation and test sets, surpassing previous best results by 1.53%, 0.77%, and 0.81% on the RefSegRS, RRSIS-D, and RISBench validation sets, and by 1.60%, 0.64%, and 0.69% on the corresponding test sets. It also records the highest mIoU on RRSIS-D and RISBench, highlighting its stable generalization ability under different data conditions.
Compared with existing approaches, LAVT and RMSIN rely on Swin-based hierarchical early fusion, which enhances token-level interactions but overlooks orientation modelling. CRIS incorporates CLIP into referring image segmentation yet performs feature fusion and decoding without rotation awareness. In contrast, CD2FSAN integrates three complementary components: dynamic CLIP-layer selection that strengthens language-consistent visual cues, multi-scale aggregation and alignment that improves small-object sensitivity, and rotation-aware decoding that handles diverse object orientations. On the RRSIS-D test set, this combination delivers an improvement of 0.64% in oIoU and 1.17% in mIoU over the strong RMSIN baseline and surpasses CRIS by 15.68% in oIoU, confirming more stable and robust gains across heterogeneous remote sensing scenes. These results validate the effectiveness of CD2FSAN as a unified and robust architecture for referring remote sensing image segmentation. The superior performance arises from the synergistic contribution of three modules: a dynamic feature selection mechanism for semantically aligned visual extraction, a multi-scale aggregation and alignment module for robust cross-scale modelling, and a dynamic rotation correction decoder for enhanced orientation-aware segmentation. Together, these components effectively address cross-modal alignment, geometric deformation, and fine-grained object delineation in remote sensing imagery.
On RefSegRS, CD2FSAN is competitive but slightly trails FIANet in mIoU because many annotations are class-level (e.g., all road segments labeled “road”), yielding large and coarse masks that reward broad semantic coverage over precise localization. FIANet benefits from this setting by capturing spatially extensive objects, and its Swin Transformer with fewer parameters may further aid adaptation to small datasets. By contrast, CD2FSAN targets small and arbitrarily oriented objects via multi-scale fusion and a dynamic rotation convolutional decoder, making it better suited to instance-level referring segmentation that requires precise localization. Consistent with this, CD2FSAN surpasses FIANet in both oIoU and mIoU across the other datasets, indicating stronger generalization to small, rotated, and structurally complex targets. Future work could integrate region-aware linguistic cues and stronger visual pretraining to improve mIoU under class-level supervision without compromising instance-level strengths.
To comprehensively evaluate the effectiveness of our method, we conducted more experiments on the RRSIS-D validation set and presented additional evaluation metrics. The specific results are shown in Table 2. CD2FSAN achieves state-of-the-art performance in both oIoU and mIoU, reaching 79.04% and 66.47%, respectively, outperforming all existing baselines. These results confirm the model’s strong ability to balance instance-level localization and category-level segmentation consistency. In terms of Precision@X, our method also achieves the best results at X = 0.5, X = 0.6, and X = 0.7, demonstrating robust localization under moderate confidence thresholds. Nevertheless, at higher thresholds (X = 0.8 and X = 0.9), the performance of our model is slightly inferior to that of early-fusion methods such as LAVT and LGCE. This suggests that CD2FSAN, which employs single-stage cross-modal feature alignment and a CLIP-based visual encoder, is less confident in assigning extremely high-probability predictions to foreground pixels. In contrast, models like LAVT integrate language features into the visual backbone across multiple stages and benefit from hierarchical token-wise refinement via Swin Transformer encoders. Such architectures may retain more precise spatial structure and support stronger confidence calibration.
Nevertheless, CD2FSAN markedly outperforms CRIS, which also uses CLIP-based visual encoding, highlighting the benefits of our dynamic feature selection, multi-scale feature alignment, and dynamic rotation correction decoder. The remaining gap at high thresholds likely stems from the limited dense spatial granularity of CLIP features trained for global alignment. Recent advances mitigate this via segmentation-oriented prompts or pixel-aware backbones (for example, SAM decoders). We will explore incorporating auxiliary pixel-level prompts into the decoder to sharpen confidence discrimination for fine-scale foreground prediction. Importantly, CD2FSAN remains superior on the most representative metrics, oIoU and mIoU, consistently producing masks with accurate contours and spatial coverage, which underscores its overall effectiveness for remote-sensing image-text segmentation.
Table 3 presents a class-wise analysis on RRSIS-D. Beyond dataset-level metrics, CD2FSAN attains the best score in most land-cover categories, for example, golf fields, baseball fields, vehicles and basketball courts, and achieves the highest average mIoU of 69.98 % , exceeding the second-best FIANet by 3.20 percentage points. As shown in Table 3, the model’s performance varies by category. Objects with sparse boundaries and heavy clutter, such as bridges and golf fields, remain challenging, whereas more homogeneous regions such as buildings are easier. These trends are consistent with the inductive biases introduced by MAAM and IMC. The dilated path models large and spatially sparse context, the asymmetric 1 × 3 and 3 × 1 paths emphasize elongated and directional structures, and the standard 3 × 3 path preserves boundaries of small objects. Together with TMC, which contributes language-scale priors, and DRCD, which handles dominant orientations, the model attains larger margins on small, slender or oriented categories; for illustration, golf fields improve by 7.15 percentage points and bridges by 5.59 percentage points over the second-best method in Table 3. Performance is relatively lower under heavy clutter or fine-grained confusion, as in harbors and overpasses, which suggests that future work is needed on stronger instance and boundary modeling and on more local grounding cues.
We also evaluate performance as a function of object size, using the classification standard defined by the RRSIS-D dataset, where instances are categorized based on their mask coverage.
$\theta = \frac{|M|}{H \times W}$
With Small defined as $\theta \le 0.20$, Medium as $0.20 < \theta \le 0.60$, and Large as $\theta > 0.60$, CD2FSAN consistently outperforms all competing methods across all object-size categories, as shown in Figure 5. It achieves oIoU scores of 49.52, 73.56, and 83.80 for small, medium, and large objects, respectively, surpassing the strongest baselines by 1.77, 1.70, and 1.47 points. Similarly, CD2FSAN attains mIoU scores of 37.11, 61.94, and 70.62, exceeding the best competitors by 1.18, 1.55, and 1.09 points in the same order. The improvement on small objects is mainly attributed to the MAAM module, which employs asymmetric, dilated, and depthwise separable convolutions to capture fine-grained spatial details at low computational cost. Meanwhile, the DFSM module enhances early language-guided grounding by dynamically selecting semantically aligned layers from CLIP rather than relying solely on the final layer, thereby improving cross-modal alignment and localization accuracy without compromising performance on larger objects.
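For reference, the size binning defined above amounts to a simple coverage-ratio rule on the ground-truth mask, e.g.:

```python
def size_bin(mask):
    """Assign a ground-truth binary mask (NumPy array) to the RRSIS-D size bins
    via its coverage ratio theta = |M| / (H * W)."""
    theta = mask.sum() / mask.size
    if theta <= 0.20:
        return "small"
    return "medium" if theta <= 0.60 else "large"
```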

4.3. Ablation Study

We perform ablation studies on the RRSIS-D validation subset and report Pr@0.5–0.9 and mIoU over seven variants grouped into three settings (Table 4a–g).
Baseline model. As shown in Table 4a, the baseline uses the final-layer visual features of CLIP, fuses them with the global language feature via element-wise multiplication, and decodes the resulting representation into a mask.
Effect of the dynamic feature selection mechanism. We conduct ablation on layer selection by randomly sampling two intermediate CLIP layers (from layers 4–11) and fusing them with the final layer (12) instead of using only the last-layer features as in the baseline (Table 4b,c). Here, RFS ( i , j , 12 ) denotes random feature selection, where the visual features from the CLIP ViT-B layers i and j are fused with those from the final layer 12, and the layer indices correspond to the twelve Transformer blocks of the CLIP image encoder. This random fusion strategy does not consistently outperform the baseline, as the intermediate layers contribute unevenly to referential understanding. When the selected layers are not semantically aligned with the expression, they introduce noise and degrade multimodal fusion performance (Table 4b). In contrast, DFSM (Table 4d) automatically identifies the visual layers that contain the most referentially relevant semantics and selectively fuses them. This data-driven selection consistently improves both cross-modal alignment and segmentation accuracy, validating DFSM’s effectiveness in leveraging informative CLIP layers.
Effect of the multi-scale aggregation and alignment module. After DFSM (Table 4d), we compare two alignment strategies on the selected multilevel visual features: (e) a conventional transformer using self-attention and cross-attention, and (f) the proposed MAAM, which augments this design with hierarchical multi-scale self-attention and cross-attention (Table 4e,f). Both improve segmentation over (d); the standard Transformer raises mIoU by 9.69% on the validation set and 10.96% on the test set. MAAM yields further gains of 4.56% on validation and 3.37% on test relative to the standard Transformer, demonstrating stronger cross-scale semantic modeling and spatial-detail alignment. The advantage of MAAM comes from two components: the IMC block that uses multibranch convolutions with varied kernel sizes, dilated convolutions for expanded receptive fields, and asymmetric kernels for directional cues to better represent small and scale-varying objects; and a TMC block that enriches multi-scale language representations to align with visual features across semantic levels. Their integration strengthens pixel language correspondence and yields superior mIoU and related metrics, addressing remote sensing challenges such as large-scale variation and numerous small targets.
Impact of the dynamic rotation correction decoder. As shown in Table 4g, integrating DRC into the decoder increases mIoU by 1.60% on the validation set and 2.82% on the test set relative to the preceding configuration. DRC dynamically orients convolutional kernels using predicted angles, improving feature extraction and mask generation, aligning the operations with actual object orientations, and suppressing redundant responses. These effects produce sharper, more accurate boundaries and boost overall segmentation accuracy, particularly for multi-oriented and complex-shaped targets common in remote-sensing imagery. We further examine the impact of the number of rotation bases (n) in the DRCD.
As shown in Table 5, the performance of DRCD fluctuates slightly as the number of rotation bases n varies. The best results are obtained when n = 3 , achieving an oIoU of 79.04% and an mIoU of 66.47%. Increasing the number of bases beyond three leads to marginal declines, likely due to redundant angular representations and higher computational cost. These findings suggest that using three rotation bases offers the best balance between geometric flexibility and efficiency, providing stable and accurate orientation-aware decoding without introducing unnecessary complexity.

4.4. Visualization and Qualitative Analysis

Figure 6 compares CD2FSAN with two representative baselines, RMSIN and CRIS, on RRSIS-D. RMSIN is a state-of-the-art method tailored to referring remote-sensing image segmentation with public code and pretrained weights, and CRIS is a CLIP-based framework for general referring image segmentation. As the field’s most widely used benchmark, RRSIS-D spans varied object sizes and spatial contexts, making it a challenging and representative testbed. Across diverse image-text pairs, CD2FSAN produces more accurate and structurally coherent masks than both baselines, with sharper edges and cleaner boundaries, and gains are especially pronounced for small objects, rotated targets, and densely cluttered scenes typical of aerial imagery. Relative to RMSIN and CRIS, these improvements reflect complementary module design. RMSIN relies on task-specific modules without strong pretrained vision-language alignment and thus struggles with fine-grained distinctions and small or subtly referred objects. CRIS, although CLIP-based, is designed for natural images and lacks rotation-aware mechanisms and explicit multi-scale spatial reasoning, leading to coarse masks under diverse orientations or crowded layouts. CD2FSAN addresses these issues with three specialized designs. DFSM filters intermediate visual features according to the referring expression to suppress irrelevant background, the MAAM strengthens cross-scale correspondence, and the DRCD improves orientation-aware decoding.
As illustrated in Figure 7, these components enable higher-fidelity segmentation of small and rotated objects and cleaner boundaries in complex scenes, improving semantic grounding, spatial precision, and generalization to real-world remote sensing data. Figure 7 presents qualitative results on the test set in three representative scenarios: (1) salient and visually dominant targets, (2) small and spatially compact objects, and (3) rotated objects, which reflect common challenges in referring remote-sensing segmentation and isolate the contribution of each module. (1) Without the Dynamic Feature Selection mechanism based on cross-modal information maximization, baselines can roughly localize salient targets but exhibit diffuse attention and unclear boundaries, indicating imprecise visual grounding. DFSM selects intermediate visual features that align with the referring expression, sharpening attention and boundaries and yielding more accurate masks. In scenes with small objects or many similar candidates, its effect is less consistent, but it still narrows attention to semantically plausible regions and provides a strong basis for later refinement. (2) For small or fine-grained targets, accuracy is limited by downsampling and loss of spatial detail. The MAAM aggregates features across resolutions and aligns them with hierarchical attention, increasing sensitivity to subtle structures and boundaries. This improves segmentation of small buildings, aircraft, and scattered infrastructure and helps concentrate attention when many similar objects co-occur. (3) With rotated instances the spatial reasoning is critical. The DRCD introduces rotation-aware context modeling. It uses adaptive receptive fields to focus on the correctly oriented target, suppresses background noise, and produces sharper masks with cleaner boundaries, especially in dense or overlapping layouts. Overall, these examples show that each added component yields more focused attention, crisper boundaries, and stronger spatial-linguistic grounding, supporting the generalization and practical effectiveness of CD2FSAN.
Despite the strong performance of the CD2FSAN model across various datasets, there are still certain failure cases that reveal its limitations. Figure 8 illustrates these failure cases, where the model’s segmentation results are compared to the ground truth annotations. For instance, in the first case, the phrase “An overpass is on the upper right of the train station on the upper left” caused confusion due to the similarity between the overpass and the surrounding urban features. The model struggled to distinguish the target from the complex, texture-rich background, which highlights the challenge of segmenting targets in densely structured urban environments. A potential solution to this issue is enhancing context awareness and multi-scale feature fusion to better differentiate targets from similar background elements. In another case, the problem stems from annotation and instruction ambiguity rather than model deficiency. The expression “a black vehicle” is underspecified because several black vehicles are present, yet the ground truth marks only one instance without disambiguating cues. Under such conditions, the correct system behavior is to flag the instruction as ambiguous or abstain from a unique prediction, rather than returning an arbitrary mask. Our current pipeline selects the highest confidence candidate, which is then penalized by the single-instance annotation. Future work will incorporate an ambiguity detector and a no-unique-referent option, and we recommend dataset revisions that tag ambiguous expressions or provide multi-instance masks. Additionally, the phrase “The vehicle is at the bottom on the left” resulted in the model segmenting the wrong vehicle. The misunderstanding arose because the model interpreted the phrase as “left bottom” instead of the intended “bottom left,” demonstrating the model’s limitation in accurately processing complex spatial relationships and multiple positional references. A possible improvement would be to incorporate graph-based models or enhanced spatial reasoning techniques to better understand and differentiate spatial positions in referring expressions. Lastly, in the case of the phrase “A green and brown golf field,” the model segmented the golf field, but due to inaccurate ground truth annotations, the segmentation was suboptimal. The significant surface variation of the golf field led to inconsistent labeling, which affected the model’s ability to accurately segment the target. This issue stems from the dataset’s labeling inconsistencies, especially for objects with varying surface textures. Future improvements should focus on refining and standardizing the annotations, particularly for objects with heterogeneous surface features, to ensure more accurate segmentation.

4.5. Efficiency and Complexity Analysis

We first analyze the computational efficiency of individual modules within CD2FSAN. As shown in Table 6, the overall computational cost gradually increases with the integration of DFSM, MAAM, and DRCD. Specifically, DFSM raises FLOPs from 88.54 G to 97.27 G and slightly decreases FPS to 16.82 due to additional similarity computation and sorting operations, while improving segmentation accuracy. MAAM further increases FLOPs to 120.44 G because of the asymmetric and dilated convolutions used for multi-scale aggregation. DRCD introduces the largest cost, with 148.20 G FLOPs and an FPS of 14.96, as its rotation-aware decoding requires multiple orientation transformations. Despite these incremental costs, CD2FSAN maintains reasonable efficiency, demonstrating a balanced trade-off between computational complexity and segmentation performance.
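For reference, FPS figures of the kind reported in Table 6 are typically obtained with a warm-up-then-time loop such as the one sketched below; the batch size, input resolution, and synchronization protocol are our assumptions about a standard measurement setup rather than a description of the authors' exact procedure (FLOPs are usually measured separately with a profiler such as fvcore or thop).

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, image, text, warmup=10, iters=50):
    """Average forward-pass throughput in frames per second (illustrative protocol)."""
    device = next(model.parameters()).device
    image, text = image.to(device), text.to(device)
    for _ in range(warmup):               # warm-up to stabilize kernels and caches
        model(image, text)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(image, text)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

# Usage sketch (model, image tensor, and text tokens are placeholders for a real RRSIS model):
# fps = measure_fps(model, torch.randn(1, 3, 480, 480), token_ids)
# print(f"{fps:.2f} FPS")
```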
We then compare the overall efficiency of CD2FSAN with leading referring segmentation models on the RRSIS-D dataset. Figure 9 summarizes FLOPs, parameter counts, FPS, and the corresponding accuracy. CD2FSAN lies in the top-right region of the accuracy–efficiency plot, achieving both the highest oIoU (78.43%) and the fastest inference speed (14.96 FPS) among the compared methods, while its model size remains moderate. Most of the computational cost stems from the CLIP-based visual–linguistic similarity computation and the rotation decoding in DRCD. Nevertheless, our design performs multimodal alignment in both the encoder and decoder stages with fewer parameters than multi-stage cross-attention models such as LAVT and CrossVLT, yielding a favorable balance between speed, complexity, and segmentation accuracy.
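A plot in the style of Figure 9 can be produced with a standard bubble chart; the sketch below places CD2FSAN using the values reported above (14.96 FPS, 78.43% oIoU, 148.20 G FLOPs, 197.54 M parameters), while entries for the compared models are left as commented placeholders to be filled in from their respective papers.

```python
import matplotlib.pyplot as plt

# (name, FPS, oIoU %, FLOPs in G, params in M); CD2FSAN values come from this paper,
# the commented row is a placeholder for a compared method.
models = [
    ("CD2FSAN", 14.96, 78.43, 148.20, 197.54),
    # ("LAVT", fps, oiou, flops, params),  # fill in from the corresponding paper
]

fig, ax = plt.subplots(figsize=(5, 4))
for name, fps, oiou, flops, params in models:
    ax.scatter(fps, oiou, s=flops * 3, c=[params], cmap="viridis",
               vmin=100, vmax=250, alpha=0.7, edgecolors="k")
    ax.annotate(name, (fps, oiou), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Inference speed (FPS)")
ax.set_ylabel("oIoU on RRSIS-D test (%)")
ax.set_title("Efficiency-accuracy trade-off (sketch of Figure 9)")
plt.tight_layout()
plt.show()
```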

5. Conclusions

This paper presents CD2FSAN, a novel framework for RRSIS that jointly enhances cross-modal alignment, scale adaptability, and rotation robustness. The framework integrates three collaborative modules: the DFSM performs dynamic feature selection based on cross-modal information maximization, the MAAM ensures scale-consistent representation, and the DRCD performs geometry-aware decoding. Extensive experiments on RefSegRS, RRSIS-D, and RISBench demonstrate that CD2FSAN consistently achieves state-of-the-art performance across all test sets, surpassing previous oIoU scores by 1.60%, 0.64%, and 0.69%, respectively. On the RRSIS-D dataset, the framework shows notable advantages in small-object segmentation, achieving up to 1.77 points higher oIoU in the small-size bin, mainly attributed to the multi-scale aggregation and alignment design. Ablation and visualization analyses further confirm the complementary effects of the proposed modules and the framework's robustness to scale variation, rotation diversity, and cross-modal misalignment. These results collectively validate the effectiveness of CD2FSAN and highlight its potential as a robust and generalizable solution for vision–language understanding in remote sensing. Looking ahead, we will focus on improving fine-grained referring segmentation at higher precision thresholds and on incorporating more advanced vision–language foundation models to enrich pixel-level representations and capture deeper linguistic context.

Author Contributions

Conceptualization, Q.L. and Y.X.; methodology, Q.L. and J.Z.; software, Q.L.; validation, Q.L.; formal analysis, Q.L.; resources, Y.X.; data curation, Q.L.; writing—original draft preparation, Q.L.; writing—review and editing, Q.L., Y.X., J.Z., Y.G., Y.W., J.J. and X.L.; supervision, Y.X.; project administration, Y.X.; funding acquisition, Y.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Provincial Natural Science Foundation of Hunan under Grant 2023JJ30082.

Data Availability Statement

The datasets analyzed in this study are publicly available from third-party sources: RRSIS-D at https://drive.google.com/drive/folders/1Xqi3Am2Vgm4a5tHqiV9tfaqKNovcuK3A (accessed on 15 September 2025); RefSegRS at https://huggingface.co/datasets/JessicaYuan/RefSegRS (accessed on 15 September 2025); and RISBench at https://github.com/HIT-SIRS/CroBIM (accessed on 15 September 2025). No new data were created in this work. Our code used to reproduce the experiments will be made publicly available upon acceptance at https://github.com/luqianqi/CD2FSAN.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yuan, Z.; Mou, L.; Hua, Y.; Zhu, X.X. RRSIS: Referring Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  2. Liu, S.; Ma, Y.; Zhang, X.; Wang, H.; Ji, J.; Sun, X.; Ji, R. Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 26648–26658. [Google Scholar] [CrossRef]
  3. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
  4. Cheng, J.; Deng, C.; Su, Y.; An, Z.; Wang, Q. Methods and datasets on semantic segmentation for Unmanned Aerial Vehicle remote sensing images: A review. ISPRS J. Photogramm. Remote Sens. 2024, 211, 1–34. [Google Scholar] [CrossRef]
  5. Yuan, X.; Shi, J.; Gu, L. A review of deep learning methods for semantic segmentation of remote sensing imagery. Expert Syst. Appl. 2021, 169, 114417. [Google Scholar] [CrossRef]
  6. Duan, L.; Lafarge, F. Towards Large-Scale City Reconstruction from Satellites. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 89–104. [Google Scholar]
  7. Abid, S.K.; Chan, S.W.; Sulaiman, N.; Bhatti, U.; Nazir, U. Present and Future of Artificial Intelligence in Disaster Management. In Proceedings of the 2023 International Conference on Engineering Management of Communication and Technology (EMCTECH), Vienna, Austria, 16–18 October 2023; pp. 1–7. [Google Scholar] [CrossRef]
  8. Zhao, W.; Lyu, R.; Zhang, J.; Pang, J.; Zhang, J. A fast hybrid approach for continuous land cover change monitoring and semantic segmentation using satellite time series. Int. J. Appl. Earth Obs. Geoinf. 2024, 134, 104222. [Google Scholar] [CrossRef]
  9. Weiss, M.; Jacob, F.; Duveiller, G. Remote sensing for agricultural applications: A meta-review. Remote Sens. Environ. 2020, 236, 111402. [Google Scholar] [CrossRef]
  10. Ji, R.; Tan, K.; Wang, X.; Tang, S.; Sun, J.; Niu, C.; Pan, C. PatchOut: A novel patch-free approach based on a transformer-CNN hybrid framework for fine-grained land-cover classification on large-scale airborne hyperspectral images. Int. J. Appl. Earth Obs. Geoinf. 2025, 138, 104457. [Google Scholar] [CrossRef]
  11. Ding, H.; Liu, C.; Wang, S.; Jiang, X. Vision-Language Transformer and Query Generation for Referring Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 16321–16330. [Google Scholar]
  12. Yu, L.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Bansal, M.; Berg, T.L. MAttNet: Modular Attention Network for Referring Expression Comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  13. Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. LAVT: Language-Aware Vision Transformer for Referring Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 18134–18144. [Google Scholar] [CrossRef]
  14. Jing, Y.; Kong, T.; Wang, W.; Wang, L.; Li, L.; Tan, T. Locate then Segment: A Strong Pipeline for Referring Image Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 9853–9862. [Google Scholar] [CrossRef]
  15. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; Meila, M., Zhang, T., Eds.; PMLR: Birmingham, UK, 2021; Volume 139, pp. 8748–8763. [Google Scholar]
  16. Wang, Z.; Lu, Y.; Li, Q.; Tao, X.; Guo, Y.; Gong, M.; Liu, T. CRIS: CLIP-Driven Referring Image Segmentation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 11676–11685. [Google Scholar] [CrossRef]
  17. Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from Natural Language Expressions. In Proceedings of the Computer Vision–ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 108–124. [Google Scholar]
  18. Li, R.; Li, K.; Kuo, Y.C.; Shu, M.; Qi, X.; Shen, X.; Jia, J. Referring Image Segmentation via Recurrent Refinement Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5745–5753. [Google Scholar] [CrossRef]
  19. Shi, H.; Li, H.; Meng, F.; Wu, Q. Key-Word-Aware Network for Referring Expression Image Segmentation. In Proceedings of the Computer Vision–ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; pp. 38–54. [Google Scholar]
  20. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-Modal Self-Attention Network for Referring Image Segmentation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10494–10503. [Google Scholar] [CrossRef]
  21. Hu, Z.; Feng, G.; Sun, J.; Zhang, L.; Lu, H. Bi-Directional Relationship Inferring Network for Referring Image Segmentation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 4423–4432. [Google Scholar] [CrossRef]
  22. Kamath, A.; Singh, M.; LeCun, Y.; Synnaeve, G.; Misra, I.; Carion, N. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 1760–1770. [Google Scholar] [CrossRef]
  23. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  24. Kim, N.; Kim, D.; Kwak, S.; Lan, C.; Zeng, W. ReSTR: Convolution-free Referring Image Segmentation Using Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–20 June 2022; pp. 18124–18133. [Google Scholar] [CrossRef]
  25. Liu, J.; Ding, H.; Cai, Z.; Zhang, Y.; Kumar Satzoda, R.; Mahadevan, V.; Manmatha, R. PolyFormer: Referring Image Segmentation as Sequential Polygon Generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 18653–18663. [Google Scholar] [CrossRef]
  26. Zhu, C.; Zhou, Y.; Shen, Y.; Luo, G.; Pan, X.; Lin, M.; Chen, C.; Cao, L.; Sun, X.; Ji, R. SeqTR: A Simple Yet Universal Network for Visual Grounding. In Proceedings of the Computer Vision–ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; pp. 598–615. [Google Scholar]
  27. Liu, C.; Ding, H.; Jiang, X. GRES: Generalized Referring Expression Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–23 June 2023; pp. 23592–23601. [Google Scholar] [CrossRef]
  28. Quan, W.; Deng, P.; Wang, K.; Yan, D.M. CGFormer: ViT-Based Network for Identifying Computer-Generated Images with Token Labeling. IEEE Trans. Inf. Forensics Secur. 2024, 19, 235–250. [Google Scholar] [CrossRef]
  29. Wu, J.; Li, X.; Xu, S.; Yuan, H.; Ding, H.; Yang, Y.; Li, X.; Zhang, J.; Tong, Y.; Jiang, X.; et al. Towards Open Vocabulary Learning: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5092–5113. [Google Scholar] [CrossRef] [PubMed]
  30. Dong, Z.; Sun, Y.; Liu, T.; Zuo, W.; Gu, Y. Cross-Modal Bidirectional Interaction Model for Referring Remote Sensing Image Segmentation. arXiv 2025, arXiv:2410.08613. [Google Scholar] [CrossRef]
  31. Lei, S.; Xiao, X.; Zhang, T.; Li, H.C.; Shi, Z.; Zhu, Q. Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–11. [Google Scholar] [CrossRef]
  32. Shi, L.; Zhang, J. Multimodal-Aware Fusion Network for Referring Remote Sensing Image Segmentation. IEEE Geosci. Remote Sens. Lett. 2025, 22, 1–5. [Google Scholar] [CrossRef]
  33. Chen, K.; Zhang, J.; Liu, C.; Zou, Z.; Shi, Z. RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models. arXiv 2025, arXiv:2501.06809. [Google Scholar] [CrossRef]
  34. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Curran Associates, Inc.: New York, NY, USA, 2021; Volume 34, pp. 12077–12090. [Google Scholar]
  35. Li, Y.; Li, Z.Y.; Zeng, Q.; Hou, Q.; Cheng, M.M. Cascade-CLIP: Cascaded vision-language embeddings alignment for zero-shot semantic segmentation. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024. [Google Scholar]
  36. Ruan, Z.; Pu, N.; Chen, J.; Gao, S.; Guo, Y.; Kong, Q.; Xie, Y.; Wei, Y. Few-shot event-based action recognition. Neural Netw. 2025, 191, 107750. [Google Scholar] [CrossRef]
  37. Cho, Y.; Yu, H.; Kang, S.J. Cross-Aware Early Fusion with Stage-Divided Vision and Language Transformer Encoders for Referring Image Segmentation. IEEE Trans. Multimed. 2024, 26, 5823–5833. [Google Scholar] [CrossRef]
Figure 1. Illustration of three cross-modal alignment paradigms for RRSIS.
Figure 2. The overall architecture of the proposed CD2FSAN framework and detailed structures of its three core modules. (a) The end-to-end pipeline of CD2FSAN, which integrates cross-modal semantic priors from CLIP encoders with specialized modules for remote sensing segmentation. (b) The DFSM Based on Cross-Modal Information Maximization, which adaptively selects and organizes hierarchical visual features from multiple levels of the CLIP vision encoder based on their semantic relevance to the input expression, thereby enhancing the expressiveness and discriminability of visual representations for downstream fusion. (c) The MAAM aggregates hierarchical features and performs cross-modal alignment through self-attention and scale-aware convolutional operations applied to both image and text modalities. (d) The DRCD employs learnable rotation correction to generate orientation-adaptive features for accurate segmentation of arbitrarily oriented targets.
Figure 3. Illustration of the dynamic feature selection behavior of the proposed DFSM. Given the same remote sensing image but different referring expressions, DFSM dynamically selects different CLIP visual encoder layers for cross-modal fusion. The selected layers exhibit higher semantic alignment with the linguistic input, enabling more context-aware and accurate segmentation. This observation supports the motivation for dynamic, expression-dependent feature selection rather than relying solely on final-layer features.
Figure 4. The computational architecture design of the proposed multi-scale alignment module. The module integrates both multi-scale self-attention and cross-attention mechanisms to enhance the semantic alignment between visual features and linguistic embeddings.
Figure 5. Size-wise oIoU and mIoU on RRSIS-D. Instances are binned by mask coverage θ following the dataset paper: Small (θ ≤ 0.20), Medium (0.20 < θ ≤ 0.60), and Large (θ > 0.60). Bars compare LAVT, RMSIN, LGCE, FIANet, and CD2FSAN.
Figure 6. Visualization results of our CD2FSAN and the RMSIN and CRIS models. Our model is able to predict more accurate masks.
Figure 7. Qualitative examples of the proposed components. (a) The baseline model. (b) The baseline model with DFSM. (c) The baseline model with DFSM+MAAM. (d) The CD2FSAN (with DFSM+MAAM+DRCD).
Figure 8. Failure cases of CD2FSAN in remote sensing image segmentation.
Figure 9. Efficiency–accuracy trade-off on RRSIS-D. Models are positioned by FPS (abscissa) and oIoU (ordinate); bubble area scales with FLOPs and colour maps to parameter count.
Table 1. Comparison of mIoU (%) and oIoU (%) across methods on the RefSegRS, RRSIS-D, and RISBench datasets. Red indicates the best result and blue indicates the second-best result within each column.
| Metric | Model | Publication | Visual Encoder | Text Encoder | RefSegRS Val | RefSegRS Test | RRSIS-D Val | RRSIS-D Test | RISBench Val | RISBench Test |
|---|---|---|---|---|---|---|---|---|---|---|
| oIoU | RRN [18] | CVPR-18 | ResNet-101 | LSTM | 69.24 | 65.06 | 66.53 | 66.43 | 47.28 | 49.67 |
| oIoU | BRINet [21] | CVPR-20 | ResNet-101 | LSTM | 61.59 | 58.22 | 70.73 | 69.68 | 46.27 | 48.73 |
| oIoU | ETRIS [20] | ICCV-23 | ResNet-101 | CLIP | 72.89 | 65.96 | 72.75 | 71.06 | 64.09 | 67.61 |
| oIoU | CRIS [16] | CVPR-22 | CLIP | CLIP | 72.14 | 65.87 | 70.98 | 70.46 | 66.26 | 69.11 |
| oIoU | CrossVLT [37] | TMM-23 | Swin-B | BERT | 76.12 | 69.73 | 76.25 | 75.48 | 69.77 | 74.33 |
| oIoU | LAVT [13] | CVPR-22 | Swin-B | BERT | 78.50 | 71.86 | 76.27 | 76.16 | 69.39 | 74.15 |
| oIoU | LGCE [1] | TGRS-24 | Swin-B | BERT | 83.56 | 76.81 | 76.68 | 76.34 | 68.81 | 73.87 |
| oIoU | RMSIN [2] | CVPR-24 | Swin-B | BERT | 74.40 | 68.31 | 78.27 | 77.79 | 69.51 | 74.09 |
| oIoU | FIANet [31] | TGRS-25 | Swin-B | BERT | 85.51 | 78.28 | 76.77 | 76.05 | - | - |
| oIoU | CroBIM [30] | Arxiv-25 | Swin-B | BERT | 78.85 | 72.30 | 76.24 | 76.37 | 69.08 | 73.61 |
| oIoU | CD2FSAN | - | CLIP | CLIP | 87.04 | 79.88 | 79.04 | 78.43 | 70.32 | 74.84 |
| mIoU | RRN [18] | CVPR-18 | ResNet-101 | LSTM | 50.81 | 41.88 | 46.06 | 45.64 | 42.65 | 43.18 |
| mIoU | BRINet [21] | CVPR-20 | ResNet-101 | LSTM | 38.73 | 31.51 | 51.41 | 49.45 | 41.54 | 42.91 |
| mIoU | ETRIS [20] | ICCV-23 | ResNet-101 | CLIP | 54.03 | 43.11 | 55.21 | 54.21 | 51.13 | 53.06 |
| mIoU | CRIS [16] | CVPR-22 | CLIP | CLIP | 53.74 | 43.26 | 50.75 | 49.69 | 53.64 | 55.18 |
| mIoU | CrossVLT [37] | TMM-23 | Swin-B | BERT | 55.27 | 42.81 | 59.87 | 58.48 | 61.54 | 62.84 |
| mIoU | LAVT [13] | CVPR-22 | Swin-B | BERT | 61.53 | 47.40 | 57.72 | 56.82 | 60.45 | 61.93 |
| mIoU | LGCE [1] | TGRS-24 | Swin-B | BERT | 72.51 | 59.96 | 60.16 | 59.37 | 60.44 | 62.13 |
| mIoU | RMSIN [2] | CVPR-24 | Swin-B | BERT | 54.24 | 42.63 | 65.10 | 64.20 | 61.78 | 63.07 |
| mIoU | FIANet [31] | TGRS-25 | Swin-B | BERT | 80.61 | 68.63 | 62.99 | 63.64 | - | - |
| mIoU | CroBIM [30] | Arxiv-25 | Swin-B | BERT | 65.79 | 52.69 | 63.99 | 64.24 | 67.52 | 67.32 |
| mIoU | CD2FSAN | - | CLIP | CLIP | 76.95 | 66.96 | 66.47 | 65.37 | 68.36 | 69.74 |
Table 2. Performance comparison on the RRSIS-D validation set. The bold result is the optimal result.
| Method | oIoU | mIoU | Pr@0.5 | Pr@0.6 | Pr@0.7 | Pr@0.8 | Pr@0.9 |
|---|---|---|---|---|---|---|---|
| LAVT [13] | 76.27 | 57.72 | 65.23 | 58.79 | 50.29 | 40.11 | 23.05 |
| CRIS [16] | 70.98 | 50.75 | 56.44 | 47.87 | 39.77 | 29.31 | 11.84 |
| LGCE [1] | 76.68 | 60.16 | 68.10 | 60.61 | 51.45 | 42.34 | 23.85 |
| FIANet [31] | 76.77 | 62.99 | 74.20 | 66.15 | 54.08 | 41.27 | 22.30 |
| RMSIN [2] | 78.27 | 65.10 | 68.39 | 61.72 | 52.24 | 41.44 | 23.16 |
| CroBIM [30] | 76.24 | 63.99 | 74.20 | 66.15 | 54.08 | 41.38 | 22.30 |
| CD2FSAN | 79.04 | 66.47 | 78.28 | 70.11 | 56.78 | 41.38 | 20.57 |
Table 3. Per-class mIoU (%) on the RRSIS-D validation set. Average is the unweighted mean across listed classes. The bold result is the optimal result.
| Category | LAVT [13] | RMSIN [2] | LGCE [1] | FIANet [31] | CD2FSAN |
|---|---|---|---|---|---|
| Airport | 66.44 | 68.08 | 68.11 | 68.66 | 68.61 |
| Golf field | 56.53 | 56.11 | 56.43 | 57.07 | 64.22 |
| Expressway service area | 76.08 | 76.68 | 71.19 | 77.35 | 72.31 |
| Baseball field | 68.56 | 66.93 | 70.93 | 70.44 | 88.43 |
| Stadium | 81.77 | 83.09 | 84.90 | 84.87 | 84.43 |
| Ground track field | 81.84 | 81.91 | 82.54 | 82.00 | 83.06 |
| Storage tank | 71.33 | 73.65 | 73.33 | 76.99 | 74.89 |
| Basketball court | 70.71 | 72.26 | 74.37 | 74.86 | 88.43 |
| Chimney | 65.54 | 68.42 | 68.44 | 68.41 | 79.85 |
| Tennis court | 74.98 | 76.68 | 75.63 | 78.48 | 72.88 |
| Overpass | 66.17 | 70.14 | 67.67 | 70.01 | 65.63 |
| Train station | 57.02 | 62.67 | 58.19 | 61.30 | 68.32 |
| Ship | 63.47 | 64.64 | 63.48 | 65.96 | 68.32 |
| Expressway toll station | 63.01 | 65.71 | 61.63 | 64.82 | 72.32 |
| Dam | 61.61 | 68.70 | 64.54 | 71.31 | 66.11 |
| Harbor | 60.05 | 60.40 | 60.47 | 62.03 | 57.77 |
| Bridge | 30.48 | 36.74 | 34.24 | 37.94 | 43.53 |
| Vehicle | 42.60 | 47.63 | 43.12 | 49.66 | 49.72 |
| Windmill | 35.32 | 41.99 | 40.76 | 46.72 | 63.76 |
| Average | 62.82 | 65.39 | 64.21 | 66.78 | 69.98 |
Table 4. Ablation results on RRSIS-D (Validation/Test). RFS indicates random feature selection, where RFS(i, j, 12) fuses two randomly selected intermediate layers i and j with the final layer 12 of the CLIP ViT-B encoder. The bold result is the optimal result.
| Method | Val P@0.5 | Val P@0.6 | Val P@0.7 | Val P@0.8 | Val P@0.9 | Val mIoU | Test P@0.5 | Test P@0.6 | Test P@0.7 | Test P@0.8 | Test P@0.9 | Test mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) Baseline | 45.98 | 35.92 | 26.26 | 15.86 | 4.48 | 42.59 | 39.73 | 29.93 | 22.01 | 12.41 | 4.14 | 38.58 |
| (b) (a) + RFS (4,7,12) | 44.19 | 34.02 | 24.71 | 14.65 | 3.67 | 41.75 | 38.67 | 28.51 | 18.10 | 10.34 | 2.12 | 37.39 |
| (c) (a) + RFS (6,9,12) | 50.97 | 40.45 | 31.43 | 19.31 | 6.14 | 46.21 | 45.51 | 36.09 | 26.61 | 15.86 | 4.31 | 43.66 |
| (d) (a) + DFSM | 57.82 | 51.32 | 42.53 | 32.30 | 15.40 | 50.62 | 52.24 | 42.53 | 31.15 | 12.53 | 5.74 | 48.22 |
| (e) (d) + Transformer | 71.84 | 62.29 | 51.21 | 37.59 | 17.30 | 60.31 | 71.65 | 61.96 | 50.47 | 35.94 | 18.04 | 59.18 |
| (f) (d) + MAAM | 76.32 | 66.60 | 54.82 | 40.80 | 19.71 | 64.87 | 72.39 | 62.37 | 50.83 | 36.11 | 17.81 | 62.55 |
| (g) (f) + DRCD (full) | 78.28 | 70.11 | 56.78 | 41.38 | 20.57 | 66.47 | 73.14 | 63.46 | 51.19 | 36.37 | 19.42 | 65.37 |
Table 5. Ablation study on the number of rotation bases n in DRCD on the RRSIS-D validation set. The bold result is the optimal result.
| Rotation Bases (n) | oIoU (%) | mIoU (%) | Pr@0.5 | Pr@0.7 | Pr@0.9 |
|---|---|---|---|---|---|
| n = 1 | 78.61 | 66.13 | 78.07 | 56.10 | 20.19 |
| n = 2 | 77.99 | 65.19 | 76.69 | 55.91 | 19.89 |
| n = 3 | 79.04 | 66.47 | 78.28 | 56.78 | 20.57 |
| n = 4 | 78.27 | 65.46 | 78.24 | 56.44 | 20.36 |
Table 6. Module-wise efficiency analysis on the RRSIS-D validation set.
| Model | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|
| Baseline | 183.84 | 88.54 | 18.23 |
| +DFSM | 186.28 | 97.27 | 16.82 |
| +MAAM | 192.09 | 120.44 | 15.32 |
| +DRCD (Full Model) | 197.54 | 148.20 | 14.96 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
