Article

RSAM: Vision-Language Two-Way Guidance for Referring Remote Sensing Image Segmentation

Collaborative Sensing Laboratory, Electronic Information School, Wuhan University, Luoyu Road 129, Wuhan 430072, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(24), 3960; https://doi.org/10.3390/rs17243960
Submission received: 6 October 2025 / Revised: 4 December 2025 / Accepted: 5 December 2025 / Published: 8 December 2025

Highlights

What are the main findings?
  • A novel benchmark dataset, RISORS, containing 36,697 high-quality instruction–mask pairs, is constructed to advance referring remote sensing image segmentation (RRSIS) research.
  • The proposed Referring-SAM (RSAM) framework, featuring a Two-Way Guidance Module (TWGM) and a Multimodal Mask Decoder (MMMD), achieves state-of-the-art segmentation performance, especially for small and diverse targets.
What is the implication of the main finding?
  • The RISORS dataset and the RSAM framework provide a foundation and a baseline for future research in integrated vision-language analysis for referring remote sensing image segmentation.
  • The design of TWGM and MMMD demonstrates an effective approach for achieving robust cross-modal alignment and precise text-prompted segmentation in complex remote sensing scenarios.

Abstract

Referring remote sensing image segmentation (RRSIS) aims to accurately segment target objects in remote sensing images based on natural language instructions. Despite its growing relevance, progress in this field is constrained by limited datasets and weak cross-modal alignment. To support RRSIS research, we construct referring image segmentation in optical remote sensing (RISORS), a large-scale benchmark containing 36,697 instruction–mask pairs. RISORS provides diverse and high-quality samples that enable comprehensive experiments in remote sensing contexts. Building on this foundation, we propose Referring-SAM (RSAM), a novel framework that extends Segment Anything Model 2 to support text-prompted segmentation. RSAM integrates a Two-Way Guidance Module (TWGM) and a Multimodal Mask Decoder (MMMD). TWGM facilitates a two-way guidance mechanism that mutually refines image and text features, with positional encodings incorporated across all attention layers to significantly enhance relational reasoning. MMMD effectively separates textual prompts from spatial prompts, improving segmentation accuracy in complex multimodal settings. Extensive experiments on RISORS, as well as the RefSegRS and RRSIS-D datasets, demonstrate that RSAM achieves state-of-the-art performance, particularly in segmenting small and diverse targets. Ablation studies further validate the individual contributions of TWGM and MMMD. This work provides a solid foundation for further developments in integrated vision-language analysis within remote sensing applications.

1. Introduction

Referring remote sensing image segmentation (RRSIS) aims to accurately segment target objects in remote sensing images based on natural language instructions [1]. In contrast to traditional segmentation algorithms [2,3], RRSIS leverages linguistic expressions as instructions, significantly reducing the complexity of identifying and localizing specified targets in diverse geospatial scenes. This capability supports critical applications such as land use classification [4,5], urban infrastructure development [6,7], and maritime monitoring [8,9].
Prior to the rise of deep learning, referring remote sensing image segmentation was an exceedingly challenging task, resulting in limited research efforts. In recent years, the substantial advancements in deep learning, particularly with convolutional neural networks (CNNs) [10] and transformers [11], have enabled the widespread adoption of these techniques in various remote sensing image interpretation applications, including change detection [12,13], scene classification [14,15], cross-modal retrieval [16], and visual localization [17]. Concurrently, deep learning architectures such as U-Net [18], DeepLabv3+ [19], and SegFormer [20] have significantly advanced the field of semantic segmentation. In the domain of promptable image segmentation, the Segment Anything Model (SAM) [21] stands out as a prominent example, serving as the foundation for numerous subsequent studies.
Referring remote sensing image segmentation was first introduced by Yuan et al. in 2024 [1], and remains in its early stages of research, facing many challenges. First, it continues to grapple with the heterogeneity between visual and textual modalities and the imbalance of information content, which complicate feature alignment and information exchange. Previous works [22,23] have predominantly focused on the image branch, integrating textual features into the image feature extraction process. However, in practical applications, the deep involvement of text features in image feature extraction impedes the reuse of image features when referencing different textual prompts, resulting in additional computational overhead. Moreover, these approaches neglect the ambiguity and redundancy inherent in textual instructions. Second, RRSIS research remains constrained by data scarcity, which hinders its advancement. Previous datasets [1,22] suffer from limitations, including insufficient sample diversity, simplistic textual instructions, and imperfect mask annotations. These deficiencies ultimately undermine the stability and generalization capability of deep learning models in practical applications.
To advance research in this emerging area, we present a new large-scale benchmark dataset for RRSIS, named referring image segmentation in optical remote sensing (RISORS). RISORS consists of 36,697 human-annotated description–mask pairs, each capturing complex natural language references to visual targets. Compared to prior datasets [1,22], RISORS offers broader sample diversity, richer linguistic instructions, and higher-quality pixel-level annotations, thereby supporting the development and evaluation of more robust and generalizable RRSIS models.
Building upon this foundation, we introduce Referring-SAM (RSAM), a novel framework designed to enhance referring image segmentation in remote sensing contexts. RSAM extends the Segment Anything Model 2 by integrating two key components: a Two-Way Guidance Module (TWGM) and a Multimodal Mask Decoder (MMMD). TWGM enables deep bidirectional interaction between visual and textual modalities, using visual features to refine semantic representations and linguistic cues to guide visual attention. MMMD enhances segmentation accuracy by distinguishing between textual and spatial prompts during mask decoding.
RSAM comprises four main components: an image encoder, a text encoder, TWGM, and MMMD. The image encoder is based on Hiera [24], a hierarchical vision transformer capable of extracting multi-scale visual features. Text encoding is handled by BERT [25], which generates robust language representations. TWGM is implemented as a stack of Two-Way Guidance Transformer layers that align cross-modal features at multiple levels. MMMD receives text prompts alongside conventional input types (points, boxes, masks) to produce precise segmentation masks, fully enabling the referring capability.
Extensive experiments on RISORS, RefSegRS, and RRSIS-D demonstrate that RSAM achieves state-of-the-art performance, particularly in scenarios involving small or complex targets. Ablation studies confirm the effectiveness of both TWGM and MMMD.
The main contributions of our work can be summarized as follows.
  • We construct RISORS, a novel dataset for the RRSIS task. It provides precise pixel-level masks for each instance and contains a total of 36,697 text description–mask pairs. This dataset will be released as an open-source resource for the research community.
  • We propose a Two-Way Guidance Module (TWGM). It enables synchronous updates of image and text features, allowing them to mutually guide each other. Meanwhile, it incorporates position embeddings into every attention layer, which is critical for enhancing the model’s ability to understand relationships between objects.
  • We introduce RSAM, a variant of SAM 2 equipped with a Multimodal Mask Decoder, enabling linguistic interpretation and precise referring segmentation of remote sensing images. Experimental results show our method achieves state-of-the-art segmentation precision.

2. Related Work

2.1. Referring Image Segmentation on Natural Images

Referring image segmentation aims to label the pixels of object instances in images or videos based on natural language expressions. Hu et al. [26] initially proposed a CNN-LSTM framework to extract visual and linguistic features. RMI [27] employs a multimodal LSTM to encode text–image–spatial interactions and generate coarse localization masks, which are subsequently refined through a unidirectional LSTM network. Furthermore, RRN [28] matches each word with each pixel for initial mask generation, then applies recursive optimization to iteratively refine the segmentation for high-quality results. However, the CNN-LSTM framework is unable to capture complex linguistic and image features due to the limited local receptive field of CNNs.
To precisely capture the intricate relationships between visual and textual modalities, recent works have extensively leveraged attention mechanisms. CMSA [29] adaptively focuses on key words and image regions to efficiently capture long-range dependencies. CGAN [30] integrates cascade-grouped attention with instance-level attention loss, enabling layer-by-layer inference to effectively distinguish visual instances and enhance the alignment between textual descriptions and image regions. Considering the problems of modality gaps, BRINet [31] uses a visual-guided language attention module to extract adaptive linguistic context and filter out irrelevant regions. BUSNet [32] is a one-stage framework that uses a bottom-up approach to align text with visual regions and a bidirectional attention module to integrate multi-level features for enhanced segmentation.
Vision transformers (ViTs) [33] have been widely applied to a range of visual tasks. Among them, the Swin Transformer [34], a representative hierarchical ViT model, not only preserves the fine details of small objects but also operates with exceptional efficiency. LAVT [35] employs a multi-level design within the Swin Transformer backbone, forming a hierarchical approach to the visual structure of language perception. Liu et al. [36] design a Multi-Modal Mutual Attention module to empower the generic attention mechanism with feature-fusing functionality.

2.2. Referring Remote Sensing Image Segmentation

Recently, referring remote sensing image segmentation has emerged as a novel research area within remote sensing image interpretation. RefSegRS [1] is the first benchmark dataset in the RRSIS field, laying a foundational framework for research, after which Liu et al. [22] introduced the more diverse and extensive RRSIS-D dataset and evaluated mainstream RIS methods on it. Yuan et al. [1] adopted the LAVT framework as the baseline and designed the LGCE module to address the challenges associated with recognizing small and sparsely distributed objects in remote sensing imagery. RMSIN [22] fully exploits both intra-scale and inter-scale interactions to alleviate issues arising from the diverse spatial scales and orientations inherent in remote sensing imagery. DANet [37] integrates an explicit alignment strategy to narrow the inter-domain affinity distribution and thereby reduce domain discrepancies, and incorporates a reliable agent alignment module to enhance multi-modality awareness in the predictor while mitigating the effects of deceptive noise interference. FIANet [23] introduces a Fine-grained Image–text Alignment Module that decouples referring expressions into object and spatial components, along with a Text-aware Multiscale Enhancement Module to handle varying object scales.

2.3. SAM with Text Prompt

The Segment Anything Model (SAM) proposed by Kirillov et al. [21] accepts point, bounding box, and mask prompts, while exhibiting robust zero-shot generalization capabilities. Subsequently, SAM 2 [38] expands upon SAM by incorporating video segmentation capabilities, achieving improvements in both segmentation accuracy and processing speed. SAM 2 has demonstrated significant success in image segmentation through its innovative architecture and diverse dataset training, offering the fine-grained precision and cross-domain generalization capabilities essential for our approach.
Extensive research has been conducted based on the SAM/SAM 2 architectures. Diab et al. [39] implemented a semantic segmentation pipeline for remote sensing imagery that simply combines SAM with Grounding DINO [40], where Grounding DINO generates object bounding boxes from text prompts that are subsequently fed into SAM. Notably, their approach does not modify or enhance the SAM architecture itself, limiting potential performance improvements for domain-specific applications. Many works [41,42] leverage Multi-modal Large Language Models (MLLMs, e.g., LLaVA [43]) for multi-modal information acquisition and fusion to generate segmentation prompt tokens, which are subsequently fed into SAM as sparse prompts (similar to point prompts) for segmentation tasks. However, such approaches rely heavily on the capability of the MLLM to generate segmentation prompt tokens. Furthermore, by treating textual prompts at the same low-dimensional level as sparse prompts such as points and bounding boxes, these methods fail to fully utilize the rich information inherently contained within textual descriptions.

3. Materials and Methods

3.1. Datasets

In the experiment, we use the constructed RISORS and two public referring remote sensing image segmentation datasets, RefSegRS [1] and RRSIS-D [22], to evaluate the effectiveness of the proposed method.

3.1.1. Construction Procedure of RISORS

We utilized the images, text instructions, and object bounding boxes of DIOR-RSVG [44] as our original data source, which contains 17,402 remote sensing images and 38,320 instances. Building upon this foundation, we implemented a semi-automated annotation pipeline to efficiently construct a dataset for referring remote sensing image segmentation. In this pipeline, all instance masks are generated by sam2.1-hiera-large [38] from manual spatial prompts, avoiding manual pixel-level editing and ensuring a unified annotation standard. The final human refinement and filtering step ensures that the resulting high-quality segmentation masks accurately correspond to their referring expressions.
The specific construction procedure is presented below and depicted in Figure 1 for clarity.
(1)
For each instance, using the original bounding boxes of the targets as prompts, sam2.1-hiera-large generates three different pixel-level initial masks.
(2)
These masks undergo manual screening to select effective ones; if all three are invalid, manual re-guidance is performed (a sketch of this prompting loop is given after this list). Re-guidance involves adding point prompts to any of the previously generated masks: positive points indicate areas to retain, and negative points indicate areas to discard. Sam2.1-hiera-large then takes the manually added points, the original bounding box, and the previously generated mask as prompts to generate new masks. This manual re-guidance process can be executed multiple times until an effective mask is obtained. If repeated re-guidance attempts still fail to produce a valid mask, the instance is discarded.
(3)
All images and masks are uniformly resized to 800 × 800 by cropping or padding with black borders. Annotations are stored in SA-1B format [21]. The annotations preserve the original text instructions and bounding boxes, masks, and all manually added prompt points. Additional instance information is also maintained, as detailed in Table 1. The masks are stored using Run-Length Encoding (RLE) for efficient representation.
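For illustration, a minimal sketch of the box- and point-prompted mask generation in steps (1) and (2) is given below. It assumes the SAM2ImagePredictor interface of the released SAM 2 code; the checkpoint paths, box coordinates, and guidance points are placeholders rather than values from RISORS.

```python
import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Paths and config names are illustrative; adjust to the local SAM 2 installation.
model = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
predictor = SAM2ImagePredictor(model)

image = np.array(Image.open("example.jpg").convert("RGB"))
predictor.set_image(image)

# Step (1): three candidate masks from the original bounding box prompt.
box = np.array([120, 80, 360, 290])                     # [x1, y1, x2, y2] from DIOR-RSVG
masks, ious, low_res_logits = predictor.predict(box=box, multimask_output=True)

# Step (2): manual re-guidance -- add positive (1) / negative (0) points and reuse
# the best previously generated low-resolution mask logits as an extra prompt.
points = np.array([[200.0, 150.0], [340.0, 270.0]])
labels = np.array([1, 0])                               # 1 = area to retain, 0 = area to discard
best = int(np.argmax(ious))
masks, ious, low_res_logits = predictor.predict(
    point_coords=points,
    point_labels=labels,
    box=box,
    mask_input=low_res_logits[best:best + 1],           # (1, 256, 256) mask logits
    multimask_output=True,
)
```

This prompting loop is repeated until a valid mask is selected or the instance is discarded.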

3.1.2. Data Characteristics and Analysis of RISORS

Through human-guided annotation and careful screening, we ultimately obtained 36,697 valid instances from 16,642 remote sensing images. The dataset was randomly divided into training, validation, and test sets. To ensure there was no overlap between these three sets, all instances derived from the same original image were assigned to the same set. The statistics for each set are presented in Table 2.
The distribution of the number of instances contained in each image is shown in Figure 2. The dataset comprises 19 distinct categories, with the distribution of instances across classes shown in Figure 3. The mean instance area is 21,564 pixels, with a minimum of 60 and a maximum of 452,917. This considerable range in instance sizes presents notable challenges for RRSIS. The spatial distribution of instances is illustrated in Figure 4, while Figure 5 presents the area distribution across different categories. Variations in instance sizes are evident among the different categories. In addition, playgrounds, overpasses, bridges, stadiums, and expressway toll stations exhibit distinct multimodal distributions in their area measurements. The multi-scale characteristics of remote sensing images are particularly prominent. The mean number of human-guided points per instance is 3.08. The minimum description length is 3 words, the maximum is 22 words, and the average is 7.45 words.

3.1.3. Comparison of Datasets

The three datasets, RISORS, RefSegRS, and RRSIS-D, exhibit certain differences, and their statistical characteristics are summarized in Table 3. The sample counts of the training, validation, and test sets for the three datasets are presented in Table 4.
RefSegRS contains only 4420 samples, and its text instructions are brief and general, with an average length of 3.09 words per description. Most text prompts correspond to multiple target instances. Representative samples from RefSegRS are shown in Figure 6.
RRSIS-D comprises 17,402 samples, while RISORS contains 36,697 samples. Their textual instructions are detailed and specific, often referring to particular objects and incorporating semantic information such as spatial relationships, size comparisons, and color attributes. Representative samples from RRSIS-D and RISORS are shown in Figure 7. The average text length in RRSIS-D is 6.80, whereas RISORS has a slightly longer average text length of 7.45. RISORS standardizes all images to 800 × 800 pixels by cropping or padding with black borders, without altering the original aspect ratio. In contrast, although most images in RRSIS-D are also sized at 800 × 800 pixels, a small portion exhibit variability in their dimensions. Additionally, the masks in RISORS exhibit more precise boundaries compared to those in RRSIS-D.

3.2. Proposed Method

3.2.1. Overall Architecture and Formulation

The schematic diagram of RSAM is illustrated in Figure 8.
The architecture comprises five principal components: image encoder, text encoder, Two-Way Guidance Module, prompt encoder, and Multimodal Mask Decoder.
(1) The image encoder adopts a hierarchical structure to extract multi-scale features from the images. This accommodates the multi-scale characteristics of objects in remote sensing imagery and provides essential information for reconstructing high-resolution segmentation results. The encoder utilizes Hiera [24] to extract visual features, preserving features from stage 1 through 4:
$$\{ V^{(1)}, V^{(2)}, V^{(3)}, V^{(4)} \} = \mathrm{Hiera}(I)$$
where $I \in \mathbb{R}^{H \times W \times 3}$ represents the input remote sensing image, and $V^{(s)}$ represents the image feature from stage $s$.
The shallow image features $V^{(1)}$ and $V^{(2)}$ are employed in the Multimodal Mask Decoder to facilitate the generation of accurate high-resolution masks. A feature pyramid network (FPN) [45] is utilized to fuse the deep image features $V^{(3)}$ and $V^{(4)}$, enhancing multi-scale feature representation capabilities:
$$V_d = \mathrm{FPN}(V^{(3)}, V^{(4)})$$
where $V_d \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times d}$ represents the deep multi-scale image features and $d$ is the feature dimension. Subsequently, $V_d$ is fed into the Two-Way Guidance Module.
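The internal structure of this FPN is not detailed here; the snippet below sketches one common realization (1 × 1 lateral projections plus top-down upsampling) for fusing the stage-3 and stage-4 Hiera features into the deep feature map, with illustrative channel sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepFeatureFPN(nn.Module):
    """Fuse stage-3 and stage-4 Hiera features into the deep feature map V_d (sketch)."""

    def __init__(self, c3: int, c4: int, d: int = 256):
        super().__init__()
        self.lat3 = nn.Conv2d(c3, d, kernel_size=1)   # lateral projection of V(3)
        self.lat4 = nn.Conv2d(c4, d, kernel_size=1)   # lateral projection of V(4)
        self.out = nn.Conv2d(d, d, kernel_size=3, padding=1)

    def forward(self, v3: torch.Tensor, v4: torch.Tensor) -> torch.Tensor:
        # v3: (B, c3, H/16, W/16), v4: (B, c4, H/32, W/32)
        top = F.interpolate(self.lat4(v4), size=v3.shape[-2:], mode="nearest")
        return self.out(self.lat3(v3) + top)          # V_d: (B, d, H/16, W/16)
```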
(2) For the text encoder, a pretrained BERT model is adopted, with a global feature token embedded within the sequence:
$$T = \mathrm{BERT}(\{ [\mathrm{GT}], T_e \})$$
where $T_e$ denotes the text embeddings, $[\mathrm{GT}]$ is the global feature token, $T \in \mathbb{R}^{l \times d}$ represents the text features, and $l$ is the length of the textual sequence.
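The exact mechanism for injecting the global feature token is not spelled out here; the sketch below assumes a learnable [GT] embedding prepended to the BERT word embeddings (using the Hugging Face transformers API), with position and token-type embeddings added by BERT as usual. The checkpoint name is likewise an assumption.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TextEncoder(nn.Module):
    """BERT text encoder with a learnable global token [GT] prepended to the sequence (sketch)."""

    def __init__(self, name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        d = self.bert.config.hidden_size
        self.global_token = nn.Parameter(torch.zeros(1, 1, d))    # [GT]

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        word_emb = self.bert.embeddings.word_embeddings(input_ids)           # T_e (word embeddings)
        gt = self.global_token.expand(word_emb.size(0), -1, -1)
        seq = torch.cat([gt, word_emb], dim=1)                               # {[GT], T_e}
        mask = torch.cat([torch.ones_like(attention_mask[:, :1]), attention_mask], dim=1)
        out = self.bert(inputs_embeds=seq, attention_mask=mask)              # BERT adds position embeddings internally
        return out.last_hidden_state                                         # T: (B, l, d)
```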
(3) The Two-Way Guidance Module (TWGM) employs attention mechanisms to simultaneously update image and text features, allowing images to enrich textual information while the text guides the image to focus on target regions. Additionally, positional encoding is incorporated into each attention unit within the TWGM to enhance the model’s ability to capture spatial relationships among objects. This process facilitates comprehensive interaction between image features and text features.
$$V_d, T_G = \mathrm{TWGM}(V_d, T)$$
where $V_d \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times d}$ represents the final deep visual features, and $T_G \in \mathbb{R}^{1 \times d}$ represents the text global feature.
The TWGM consists of N stacked two-way guidance transformer (TWGT) layers that process the image features, text features, and positional embeddings of both modalities, outputting fused image and text representations. More details are in Section 3.2.2.
(4) The prompt encoder remains consistent with SAM 2’s implementation, encoding the coordinates of points and bounding boxes as sparse prompt embeddings, while encoding masks as dense prompt embeddings:
$$P_s = h(P, B)$$
$$P_d = g(M)$$
where $P_s$ and $P_d$ represent the sparse and dense prompt embeddings, respectively, $h$ is the function that generates sparse embeddings, and $g$ the function that generates dense embeddings.
When point or bounding box prompts are absent, the sparse prompt embeddings are populated with a learnable "empty" vector (denoted as $v_{se} \in \mathbb{R}^{d}$) that semantically encodes the no-prompt condition. Analogously, a trainable null embedding $v_{de} \in \mathbb{R}^{d}$ is allocated to the dense prompt embeddings when mask prompts are unavailable. This initialization protocol ensures stable feature space occupancy across all prompt modalities.
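As a small illustration of this fallback, the snippet below keeps one learnable sparse token and one learnable dense embedding that are broadcast whenever the corresponding prompt type is missing; the shapes follow the SAM 2 convention and are assumptions.

```python
import torch
import torch.nn as nn

class NullPromptEmbeddings(nn.Module):
    """Learnable placeholders used when sparse (point/box) or dense (mask) prompts are absent (sketch)."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.empty_sparse = nn.Parameter(torch.zeros(1, 1, d))      # v_se: stands in for point/box tokens
        self.empty_dense = nn.Parameter(torch.zeros(1, d, 1, 1))    # v_de: stands in for the mask embedding

    def forward(self, batch: int, h: int, w: int):
        p_s = self.empty_sparse.expand(batch, -1, -1)               # (B, 1, d)
        p_d = self.empty_dense.expand(batch, -1, h, w)              # (B, d, H/16, W/16)
        return p_s, p_d
```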
(5) Finally, the Multimodal Mask Decoder facilitates interaction between image features, text prompts, and other prompts, while incorporating shallow-level image features to reconstruct high-resolution masks:
$$M = \mathrm{MMMD}(V_d, T_G, P_s, P_d, V_s)$$
where $V_s = \{ V^{(1)}, V^{(2)} \}$ represents the shallow image features, and $M \in \mathbb{R}^{H \times W}$ represents the predicted mask. The relevant details are in Section 3.2.3.

3.2.2. Two-Way Guidance Transformer Module

Initially, multimodal interaction typically involved simply concatenating the two modalities and extracting common features with basic operations such as fully connected layers [46,47]. To facilitate more comprehensive feature interaction, numerous studies [1,22,23,48] have implemented image and text information exchange during the feature extraction stage. This interaction approach deeply intertwines the processes of image feature extraction and text feature acquisition. Although effective, in practical applications we often need to match a text passage against multiple images (or an image against multiple text passages). In such scenarios, feature extraction must be executed for every image–text pair, without the ability to reuse image or text features, significantly increasing computational demands.
To facilitate comprehensive cross-modal information interaction while maintaining maximally independent image and text feature extractors, we propose Two-Way Guidance Module (TWGM). The schematic diagram of the TWGM is shown in Figure 9.
This module is constructed by stacking multiple two-way guidance transformer (TWGT) layers, followed by a final text-to-image cross-attention [11] operation to extract global textual prompt embedding. It can be expressed as follows:
$$(v_i, l_i) = \begin{cases} (V_d, T), & i = 0 \\ \mathrm{TWGT}(v_{i-1}, l_{i-1}), & i = 1, \ldots, N \end{cases}$$
$$T = \mathrm{CA}(v_N, l_N)$$
$$T_G = T[0]$$
$$V_d = v_N$$
where $v_i \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times d}$ and $l_i \in \mathbb{R}^{l \times d}$ represent the visual and text features of the $i$th two-way guidance transformer layer, $[0]$ represents the extraction of the token at the first position, which is the global feature token, $V_d \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times d}$ represents the final deep visual features, and $T_G \in \mathbb{R}^{1 \times d}$ represents the text global feature.
The TWGT layer implements a dedicated mechanism for bidirectional cross-modal guidance between the visual and textual modalities. Specifically, it employs self-attention (SA) [11] and a multi-layer perceptron (MLP) for intra-modal feature refinement, and cross-attention (CA) for inter-modal feature exchange. The self-attention, cross-attention, and MLP are each followed by a residual connection [49] and normalization. This can be expressed as follows:
$$f_{i+1} = \mathrm{Norm}(f_i + \mathrm{Func}(f_i))$$
where $\mathrm{Func}(\cdot)$ denotes self-attention, cross-attention, or the MLP.
In the context of referring image segmentation tasks, the spatial orientation of the target and the relative relationships between objects are critical information. To maximize the preservation of positional information, we systematically integrate learnable positional embeddings at each stage of TWGT. Inspired by DETR [50], we add positional embeddings only to the queries and keys in the self-attention and cross-attention modules, while the values remain unaffected by positional embeddings. This can be expressed as follows:
$$f_{i+1} = \mathrm{SA}(f_i + \mathrm{PE}_f,\ f_i + \mathrm{PE}_f,\ f_i)$$
$$f_{i+1} = \mathrm{CA}(f_i + \mathrm{PE}_f,\ f'_i + \mathrm{PE}_{f'},\ f'_i)$$
where $\mathrm{PE}_f$ represents the positional embedding of $f_i$, and $f'_i$ represents the features of the other modality (with positional embedding $\mathrm{PE}_{f'}$).
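A minimal sketch of one TWGT layer is given below, built from standard multi-head attention: positional embeddings are added to queries and keys only, values remain free of positional information, and the two modalities are refined intra-modally and then exchanged through cross-attention, each step followed by a residual connection and normalization. The number of heads, the MLP ratio, and the exact update order are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TWGTLayer(nn.Module):
    """One two-way guidance transformer layer (sketch): parallel update of image and text features."""

    def __init__(self, d: int = 256, heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.sa_v = nn.MultiheadAttention(d, heads, batch_first=True)
        self.sa_t = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ca_v = nn.MultiheadAttention(d, heads, batch_first=True)   # image queries, text keys/values
        self.ca_t = nn.MultiheadAttention(d, heads, batch_first=True)   # text queries, image keys/values
        self.mlp_v = nn.Sequential(nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d))
        self.mlp_t = nn.Sequential(nn.Linear(d, mlp_ratio * d), nn.GELU(), nn.Linear(mlp_ratio * d, d))
        self.norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(6)])

    @staticmethod
    def _res(norm, x, out):
        return norm(x + out)                # residual connection followed by normalization

    def forward(self, v, t, pe_v, pe_t):
        # v: (B, HW', d) flattened image tokens, t: (B, l, d) text tokens; PEs added to queries/keys only
        v = self._res(self.norms[0], v, self.sa_v(v + pe_v, v + pe_v, v, need_weights=False)[0])
        t = self._res(self.norms[1], t, self.sa_t(t + pe_t, t + pe_t, t, need_weights=False)[0])
        v_new = self._res(self.norms[2], v, self.ca_v(v + pe_v, t + pe_t, t, need_weights=False)[0])
        t_new = self._res(self.norms[3], t, self.ca_t(t + pe_t, v + pe_v, v, need_weights=False)[0])
        v_new = self._res(self.norms[4], v_new, self.mlp_v(v_new))
        t_new = self._res(self.norms[5], t_new, self.mlp_t(t_new))
        return v_new, t_new
```

Stacking N such layers and applying the final text-to-image cross-attention, then taking the token at the first position, yields the refined deep visual features and the text global feature described above.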

3.2.3. Multimodal Mask Decoder

We propose a Multimodal Mask Decoder (MMMD) specifically designed for SAM 2 to enable referring image segmentation. This module effectively distinguishes the influence paths of text prompts, sparse prompts (points and bounding boxes), and dense prompts (masks), while fully retaining SAM 2’s original segmentation capabilities. The architectural details are depicted in Figure 10.
The specific decoding process is described in detail below.
(1) The MMMD module maintains a set of learnable output embeddings $E_m = \{ t_{\mathrm{obj}}, t_{\mathrm{iou}}, t_{\mathrm{mask}} \} \in \mathbb{R}^{3 \times d}$. The global textual prompt embedding $T_G$ undergoes element-wise addition with $t_{\mathrm{mask}}$, after which all output embeddings are integrated with the sparse prompt embeddings via channel-wise concatenation, forming the composite decoding tokens:
$$E = \mathrm{Concat}\big( t_{\mathrm{obj}},\ t_{\mathrm{iou}},\ t_{\mathrm{mask}} + T_G,\ P_s \big)$$
(2) The deep image features undergo element-wise addition with the dense prompt embeddings, then jointly enter a two-way transformer module [38] with the image positional embeddings and the decoding tokens. In the two-way transformer, the image features continuously condition the decoding tokens via spatial–semantic projection; meanwhile, the decoding tokens reciprocally refine the image features.
(3) The image features $v \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times d}$ output by the two-way transformer are integrated with the shallow-layer image features through fractionally strided convolution [51] operations:
$$v = \mathrm{GELU}(\mathrm{LN}(\mathrm{DC}(v) + V^{(2)}))$$
$$V = \mathrm{GELU}(\mathrm{DC}(v) + V^{(1)})$$
where $\mathrm{DC}$ denotes fractionally strided convolution, $\mathrm{LN}$ denotes layer normalization [52], and $V \in \mathbb{R}^{H \times W \times d}$.
This multi-scale feature fusion strategy enables the preservation of high-resolution spatial details throughout the network architecture, thereby facilitating the reconstruction of high-fidelity mask predictions with precise boundary localization.
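One possible realization of this upscaling-and-fusion step is sketched below with two transposed (fractionally strided) convolutions; the channel-wise normalization stand-in, the channel sizes, and the relative resolutions assumed for the shallow features are illustrative.

```python
import torch
import torch.nn as nn

class MaskFeatureUpscaler(nn.Module):
    """Upsample decoded image features and fuse them with shallow Hiera features (sketch)."""

    def __init__(self, d: int = 256):
        super().__init__()
        self.dc1 = nn.ConvTranspose2d(d, d, kernel_size=2, stride=2)   # first fractionally strided conv
        self.dc2 = nn.ConvTranspose2d(d, d, kernel_size=2, stride=2)   # second fractionally strided conv
        self.ln = nn.GroupNorm(1, d)     # channel-wise normalization stand-in for LN on (B, C, H, W)
        self.act = nn.GELU()

    def forward(self, v, v_stage2, v_stage1):
        # v: (B, d, h, w); v_stage2: (B, d, 2h, 2w); v_stage1: (B, d, 4h, 4w)
        x = self.act(self.ln(self.dc1(v) + v_stage2))   # fuse with stage-2 shallow features
        return self.act(self.dc2(x) + v_stage1)         # fuse with stage-1 shallow features
```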
(4) The three output embeddings are extracted from the decoding tokens processed by the two-way transformer. The object score embedding $t_{\mathrm{obj}}$ is processed through an MLP to predict the target existence probability. The IoU embedding $t_{\mathrm{iou}}$ is transformed via an MLP into the predicted intersection-over-union score between the segmentation mask and the ground truth. The mask decoding embedding $t_{\mathrm{mask}}$ is projected through a hypernetwork [53] to generate the final decoding vector:
$$t_m = \mathrm{Hypernetwork}(t_{\mathrm{mask}})$$
where $t_m \in \mathbb{R}^{d}$ represents the decoding vector.
The resultant decoding vector is then multiplied with the image feature to produce the predicted mask, formally expressed as
$$M = \sigma(V \cdot t_m)$$
where $\sigma(\cdot)$ denotes the sigmoid activation function for probabilistic mask generation.
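The mask head can be summarized with the short sketch below, assuming the hypernetwork is a small MLP (as in SAM); the depth and activation are illustrative.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Project the mask token to a decoding vector and correlate it with the image features (sketch)."""

    def __init__(self, d: int = 256, depth: int = 3):
        super().__init__()
        layers = []
        for _ in range(depth - 1):
            layers += [nn.Linear(d, d), nn.ReLU()]
        layers.append(nn.Linear(d, d))
        self.hypernetwork = nn.Sequential(*layers)

    def forward(self, t_mask: torch.Tensor, v_up: torch.Tensor) -> torch.Tensor:
        # t_mask: (B, d) mask decoding embedding; v_up: (B, d, H', W') upscaled image features
        t_m = self.hypernetwork(t_mask)                          # decoding vector t_m
        logits = torch.einsum("bd,bdhw->bhw", t_m, v_up)         # inner product with image features
        return torch.sigmoid(logits)                             # probabilistic mask
```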

3.2.4. Objective Function

The learning objective integrates three loss functions to jointly optimize segmentation accuracy, prediction confidence, and mask quality estimation:
$$\mathcal{L} = \mathcal{L}_{\mathrm{mask}} + \mathcal{L}_{\mathrm{obj}} + \mathcal{L}_{\mathrm{iou}}$$
Mask prediction loss combines sigmoid focal loss [54] and dice loss [55] to address class imbalance and boundary refinement:
$$\mathcal{L}_{\mathrm{mask}} = \lambda_1 \cdot \mathcal{L}_{\mathrm{focal}} + \lambda_2 \cdot \mathcal{L}_{\mathrm{dice}}$$
where $\lambda_1$ and $\lambda_2$ are weighting coefficients.
Sigmoid Focal Loss resolves foreground-background pixel imbalance by suppressing dominant-class gradients:
$$p_t = y p + (1 - y)(1 - p)$$
$$\alpha_t = y \alpha + (1 - y)(1 - \alpha)$$
$$\mathcal{L}_{\mathrm{focal}} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where the label $y$ is 1 if the pixel belongs to the target object and 0 otherwise, $p$ is the predicted foreground probability, $\alpha$ adjusts the weights of positive and negative pixels, and $\gamma$ reduces the weights of easy-to-classify pixels so that training focuses on difficult-to-classify pixels.
The Dice loss is insensitive to the absolute number of target pixels and is therefore suitable for optimizing the segmentation of small targets:
$$\mathcal{L}_{\mathrm{dice}} = 1 - \frac{2 \sum y p + \epsilon}{\sum p + \sum y + \epsilon}$$
where the sums run over all pixels and $\epsilon$ is a smoothing term used to avoid a zero denominator.
The object score, representing the probability of target presence in the image, is optimized using binary cross-entropy loss:
$$p_t = y_{\mathrm{obj}}\, p_{\mathrm{obj}} + (1 - y_{\mathrm{obj}})(1 - p_{\mathrm{obj}})$$
$$\mathcal{L}_{\mathrm{obj}} = -\log p_t$$
where the label $y_{\mathrm{obj}}$ is 1 if the target object is present in the image and 0 otherwise, and $p_{\mathrm{obj}}$ is the predicted object score.
The predicted IoU values are refined through L1 regression loss:
$$\mathcal{L}_{\mathrm{iou}} = \left| y_{\mathrm{iou}} - p_{\mathrm{iou}} \right|$$
where $y_{\mathrm{iou}}$ represents the IoU between the ground truth mask and the predicted mask.
To harmonize the interplay between different loss components, weighting coefficients are introduced, yielding the composite objective function:
$$\mathcal{L}_c = \lambda_1 \cdot \mathcal{L}_{\mathrm{focal}} + \lambda_2 \cdot \mathcal{L}_{\mathrm{dice}} + \lambda_3 \cdot \mathcal{L}_{\mathrm{obj}} + \lambda_4 \cdot \mathcal{L}_{\mathrm{iou}}$$
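A compact sketch of this composite objective is shown below, reusing torchvision's sigmoid focal loss and the weighting coefficients (20, 1, 1, 1) reported in the experimental setup; the focal $\alpha$/$\gamma$ values, the Dice smoothing term, and the on-the-fly computation of the target IoU are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def composite_loss(mask_logits, mask_gt, obj_logit, obj_gt, iou_pred,
                   lambdas=(20.0, 1.0, 1.0, 1.0), alpha=0.25, gamma=2.0, eps=1.0):
    """Composite objective: focal + dice for the mask, BCE for the object score, L1 for the IoU (sketch)."""
    l1, l2, l3, l4 = lambdas
    # Sigmoid focal loss over pixels (mask_logits, mask_gt: (B, H, W), float 0/1 targets).
    loss_focal = sigmoid_focal_loss(mask_logits, mask_gt, alpha=alpha, gamma=gamma, reduction="mean")
    # Dice loss with a smoothing term to avoid a zero denominator.
    p = torch.sigmoid(mask_logits)
    inter = (p * mask_gt).sum(dim=(-2, -1))
    loss_dice = (1.0 - (2 * inter + eps) / (p.sum(dim=(-2, -1)) + mask_gt.sum(dim=(-2, -1)) + eps)).mean()
    # Object-presence binary cross-entropy (obj_logit, obj_gt: (B,), float 0/1 targets).
    loss_obj = F.binary_cross_entropy_with_logits(obj_logit, obj_gt)
    # L1 regression of the predicted IoU towards the IoU of the binarized prediction and the ground truth.
    with torch.no_grad():
        pred_bin = (p > 0.5).float()
        inter_b = (pred_bin * mask_gt).sum(dim=(-2, -1))
        union_b = pred_bin.sum(dim=(-2, -1)) + mask_gt.sum(dim=(-2, -1)) - inter_b
        iou_gt = inter_b / union_b.clamp(min=1.0)
    loss_iou = F.l1_loss(iou_pred, iou_gt)
    return l1 * loss_focal + l2 * loss_dice + l3 * loss_obj + l4 * loss_iou
```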

3.3. Experimental Setup

All experiments in this study were implemented using the PyTorch 2.6.0 [56] framework and executed on four NVIDIA A100 GPUs. In the training phase, AdamW [57] is adopted to optimize the model. In the TWGM, the number of stacked TWGT layers is empirically set to 2. The weighting coefficients $\lambda_1$–$\lambda_4$ of the losses are set to 20, 1, 1, and 1, respectively. The image encoder is initialized with the corresponding pretrained weights of sam2.1-hiera-base-plus, while the text encoder employs BERT with the bert-base pretrained weights. The parameters of the TWGM and MMMD are randomly initialized. Notably, only the parameters of the text encoder are kept frozen during training and are not updated. During RSAM training, a subset of samples is randomly selected to receive multiple positive and negative guidance points, which facilitate faster learning of the spatial extent represented by the text prompts. These points are generated randomly based on the comparison between the model's current predicted masks and the corresponding ground truth masks.
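The exact sampling policy for these corrective points is not specified above; the snippet below illustrates one plausible scheme that draws positive points from target pixels the current prediction misses and negative points from spurious foreground pixels.

```python
import torch

def sample_corrective_points(pred_mask: torch.Tensor, gt_mask: torch.Tensor, num_points: int = 2):
    """Sample positive/negative guidance points by comparing a predicted mask with the ground truth (sketch)."""
    pred = pred_mask > 0.5
    gt = gt_mask > 0.5
    false_neg = (gt & ~pred).nonzero(as_tuple=False)    # missed target pixels -> positive points
    false_pos = (pred & ~gt).nonzero(as_tuple=False)    # spurious foreground pixels -> negative points
    points, labels = [], []
    for pool, label in ((false_neg, 1), (false_pos, 0)):
        if len(pool) == 0:
            continue
        idx = torch.randint(len(pool), (num_points,))
        for yx in pool[idx]:
            points.append((int(yx[1]), int(yx[0])))     # store as (x, y)
            labels.append(label)
    return points, labels
```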

3.3.1. Evaluation Metrics

Following previous works [1,22,23], we adopt these metrics: overall IoU (oIoU), mean IoU (mIoU), and Average Precision at IoU thresholds from 0.5 to 0.9 (P@[0.5:0.9]). IoU (Intersection over Union) represents the ratio of the intersection to the union between predicted mask and ground truth:
$$\mathrm{IoU}^{(i)} = \frac{\left| M^{(i)} \cap M_{\mathrm{gt}}^{(i)} \right|}{\left| M^{(i)} \cup M_{\mathrm{gt}}^{(i)} \right|}$$
Overall IoU is calculated as the sum of intersections for all instances divided by the sum of unions, which can be computed as
$$\mathrm{oIoU} = \frac{\sum_i \left| M^{(i)} \cap M_{\mathrm{gt}}^{(i)} \right|}{\sum_i \left| M^{(i)} \cup M_{\mathrm{gt}}^{(i)} \right|}$$
In overall IoU, larger objects are assigned greater weights, whereas in mean IoU, all objects are weighted equally regardless of their size.
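For concreteness, the metrics can be computed as in the sketch below over per-instance binary masks; the threshold list mirrors P@[0.5:0.9].

```python
import numpy as np

def evaluate_rrsis(pred_masks, gt_masks, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Compute oIoU, mIoU, and precision at IoU thresholds for lists of binary masks (sketch)."""
    inters, unions, ious = [], [], []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        inters.append(inter)
        unions.append(union)
        ious.append(inter / union if union > 0 else 0.0)
    ious = np.asarray(ious)
    metrics = {
        "oIoU": float(np.sum(inters) / max(np.sum(unions), 1)),   # larger objects weigh more
        "mIoU": float(ious.mean()),                                # every instance weighs equally
    }
    for t in thresholds:
        metrics[f"P@{t}"] = float((ious > t).mean())
    return metrics
```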

3.3.2. Comparative Experiment

We conducted experimental comparisons between the proposed RSAM method and existing open-source approaches on the datasets. Among them, LGCE [1], RMSIN [22] and FIANet [23] are prompt-based image segmentation models specifically designed for remote sensing imagery. We re-implemented the RMSIN and FIANet methods and performed further evaluations on the RISORS dataset. The performance of other methods is cited from the results reported in [22,23].
To ensure fair comparative analysis across methodologies, we maintained consistent computational environments for all experimental configurations. It is critical to note that RSAM operates solely on the text instructions during testing; no additional prompts of any kind are utilized. Consistent with the experimental settings in [23], we train the models for 60 epochs on RefSegRS and 40 epochs on RRSIS-D, with a batch size of 4 per GPU. For the RISORS dataset, training proceeded for 60 epochs with a batch size of 2 per GPU. In the experiments with the RRSIS-D and RISORS datasets, black rectangular masks were randomly applied to the images as data augmentation.

3.3.3. Ablation Studies

To further evaluate the effectiveness of the proposed Two-Way Guidance Module (TWGM) and the Multimodal Mask Decoder (MMMD) modules, we conducted a series of ablation experiments on the RISORS dataset. Specifically, we compare four settings: the baseline model, baseline with MMMD, baseline with TWGM, and baseline with both modules.

3.3.4. Comparison of Different Model Settings

We compared two image encoder sizes: hiera-base-plus and hiera-large. In addition, we investigated the effect of freezing the pretrained parameters, namely those of the image and text encoders, on the segmentation performance evaluated on the RISORS dataset.

3.3.5. Multimodal Prompt Capability

To assess whether point prompts remain effective in RSAM, we introduced random positive or negative point prompts based on segmentation results generated solely from text prompts during the testing phase.

4. Results

4.1. Comparisons with Other Methods

(1) Results on the RefSegRS Dataset: Table 5 presents a quantitative comparison of different methods on the RefSegRS dataset. Our approach achieves the second-best performance in terms of oIoU, while significantly outperforming other methods across all remaining metrics. Furthermore, at higher IoU thresholds of 0.8 and 0.9, our precision scores exhibit a multiple-fold increase, reflecting the enhanced fine-grained segmentation capability of our approach.
Figure 11 illustrates the prediction masks of FIANet, RMSIN, and our method on several examples. In the figure, black represents correctly predicted background areas, white represents correctly predicted target areas, red indicates missed detections, and green denotes false positives. The prediction results demonstrate that our method has certain advantages in segmentation accuracy, particularly showing greater precision in segmenting small objects.
(2) Results on the RRSIS-D Dataset: Table 6 presents a quantitative comparison of different methods on the RRSIS-D dataset. Our approach achieves superior performance over all compared methods on all evaluation metrics. We also computed the mIoU for segmentation results across different target categories, as summarized in Table 7, where the categories are ranked in ascending order according to their average area. The results show that our method has advantages in segmentation accuracy for almost all object categories, with particularly notable benefits for small objects such as vehicles and wind turbines.
Figure 12 presents the prediction results of FIANet, RMSIN, and our method on several examples. The results show that our method can interpret semantic references more accurately and demonstrates a clear advantage in understanding the relationships between objects, enabling more accurate identification of the correct targets.
(3) Results on the RISORS Dataset: Table 8 presents a quantitative comparison of different methods on the RISORS dataset. Our method achieves the best performance across all metrics. Although the improvement in oIoU is marginal, the mIoU increases by more than 5.65%. Notably, our approach exhibits a substantial improvement in precision at the higher IoU thresholds of 0.8 and 0.9, with gains exceeding 10%. These results demonstrate that our method has advantages in the prompted segmentation of small objects and yields higher segmentation accuracy compared to previous approaches. Table 9 shows the mIoU of segmentation results for different categories of targets, where the categories are ranked in ascending order according to their average area. Our method obtains superior performance in all categories, with particularly significant improvements observed for small objects.
Figure 13 presents the prediction results of FIANet, RMSIN, and our method on several examples. The results indicate that our method performs better in capturing the detailed edges of the targets and provides more accurate localization of the referred objects. At the same time, we are able to exclude partial occlusions, making the masks focus more precisely on the target itself.
(4) Model Complexity Comparison: The model complexity comparison is presented in Table 10. Our proposed RSAM utilizes 194 M total parameters and 86 M trainable parameters, significantly fewer trainable parameters than RMSIN (240 M total/trainable) and FIANet (251 M total/trainable). The higher computational cost of RSAM (280 G FLOPs) compared to RMSIN (204 G) and FIANet (210 G) is partly attributable to its larger input resolution (1024 × 1024 vs. 480 × 480), representing approximately four times more pixels per input sample.

4.2. Ablation Studies

As shown in Table 11, adding the MMMD module to the baseline brings a noticeable improvement, increasing the mIoU from 69.73% to 71.27%. The inclusion of the TWGM module also leads to clear gains, raising mIoU to 71.71% and significantly enhancing precision, particularly at higher IoU thresholds, such as P@0.9 (from 43.04% to 45.49%). When both TWGM and MMMD are employed, the performance is further boosted, achieving the highest mIoU of 72.80% as well as an oIoU of 81.84%.
Furthermore, as depicted in Figure 14, the incorporation of the TWGM enables the model to better comprehend inter-object relationships, leading to more accurate localization of the referred target. Simultaneously, the MMMD allows for a deeper understanding of the textual instruction, which contributes to the generation of a more precise segmentation mask.

4.3. Comparison of Different Model Settings

The experimental results are presented in Table 12. They indicate that, on the RISORS dataset, using hiera-base-plus as the image encoder leads to better performance. Additionally, freezing the text encoder achieves superior results. Furthermore, we compared the performance of different image encoder sizes on the RefSegRS and RRSIS-D datasets; these results are presented in Table 13. On the RRSIS-D dataset, the findings are consistent with those on RISORS, with hiera-base-plus achieving better performance. However, an opposite trend is observed on RefSegRS.

4.4. Multimodal Prompt Capability

The experimental results are shown in Table 14. The results indicate that adding point prompts can significantly improve segmentation accuracy. Specifically, after adding the first point prompt, the mIoU increases by approximately 15.6%, and the oIoU increases by 9.3%. Notably, over 90% of instance segmentation IoU values reach 0.7.

5. Discussion

The comparative experimental results demonstrate the superior performance of our proposed RSAM in referring remote sensing image segmentation. Specifically, RSAM excels at comprehending spatial relationships among objects, enabling precise localization of the target indicated by the text, and it achieves higher segmentation accuracy, particularly for small objects.
To highlight the advantages of our constructed dataset, we compare the performance of several representative methods on three datasets. Table 15 presents the performance of RMSIN, FIANet, and our proposed method on RefSegRS, RRSIS-D, and our RISORS dataset, evaluated by oIoU and mIoU metrics. Across all three methods, the RISORS dataset consistently yields higher performance metrics compared to the other two datasets. For instance, the RSAM method achieves 81.84% oIoU and 72.80% mIoU on RISORS, surpassing its performance on RefSegRS (80.38% oIoU, 67.15% mIoU) and RRSIS-D (79.59% oIoU, 67.04% mIoU). Similar performance trends are observed for the other evaluated methods. These results indicate that the RISORS dataset, potentially due to its higher annotation quality and greater scene diversity, facilitates improved segmentation accuracy and enables the evaluated methods to attain superior performance.
The ablation studies validate the individual and complementary contributions of the TWGM and MMMD. The TWGM strengthens cross-modal feature guidance, leading to improved relational reasoning and accurate target localization. In parallel, the MMMD refines language understanding and utilizes shallow image features to produce higher-precision masks, notably for small objects. The strongest performance is achieved when both modules are combined, demonstrating a clear synergistic effect.
In the experiment comparing different model settings, we attribute the observations in Table 12 to the following reasons: the amount of textual data in RISORS is far less than that used in BERT pretraining, so retaining the pretrained parameters facilitates better extraction of textual features. Meanwhile, the number of images in the RISORS dataset is sufficient for training the hiera-base-plus encoder, but insufficient for adequately training the hiera-large encoder. It is noteworthy that although the RISORS dataset was built using the larger image encoder (hiera-large), the experimental results on it favor the smaller hiera-base-plus version. This observation underscores the domain adaptation challenge, indicating that encoders trained on natural images are not readily suitable for remote sensing imagery. Although the exact reasons for the opposite trend observed on RefSegRS in Table 13 remain unclear, we note that this dataset differs significantly from the others in two key aspects: it contains multi-target scenarios and has a substantially smaller scale, which may be contributing factors.
In the Multimodal Prompt Capability experiments, we demonstrate that RSAM retains its ability to comprehend spatial prompts. Furthermore, since RSAM is equipped with an interface to accept previously generated masks as input, it inherently possesses the potential for multi-turn dialogue. However, due to the lack of a suitable dataset containing such interactive sequences, this capability remains unexplored in our current work. Investigating this multi-turn dialogue functionality presents a promising direction for future research.

6. Conclusions

In this article, we propose a novel model for referring image segmentation in remote sensing imagery that extends the interactive segmentation capabilities of SAM 2 to support text-prompted segmentation. Considering the multidimensional richness of image features and the flexible diversity of textual instructions, we introduce the Two-Way Guidance Module (TWGM) to facilitate comprehensive interaction between image and text features. This module effectively leverages image features to guide the extraction of correct semantics from textual cues while simultaneously using text features to direct the image features’ attention toward relevant target regions. Furthermore, recognizing the pivotal role of text prompts among all input cues, we design a Multimodal Mask Decoder (MMMD) that equips the model with the ability to handle multimodal referring image segmentation. The MMMD distinctly separates text prompts from other sparse cues such as points and boxes, granting text prompts a dominant influence during the decoding process and, thereby, enhancing the accuracy of segmentation results referred by textual input. Given that the field of referring remote sensing image segmentation (RRSIS) is still in its nascent stages with limited publicly available datasets, we also contribute an open benchmark dataset RISORS. This dataset contains multi-category instances while retaining the guidance cues generated during the human-guided annotation process, presenting potential for advancing research and applications.
Remote sensing images differ significantly from natural images, and referring image segmentation in remote sensing still poses many challenges that require further investigation. Notably, all existing methods struggle with segmenting objects characterized by highly complex shapes, such as "harbors", which are typically narrow and possess intricate geometric structures. Future work should be directed toward extending and validating model performance in this direction.

Author Contributions

Conceptualization, Z.Z. and H.C.; methodology, Z.Z. and H.C.; software, Z.Z.; validation, Z.Z. and H.C.; formal analysis, Z.Z. and B.H.; resources, F.P.; data curation, B.H.; writing—original draft preparation, Z.Z.; writing—review and editing, X.X., B.H. and F.P.; visualization, Z.Z.; supervision, X.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62271356.

Data Availability Statement

Data are available upon request.

Acknowledgments

The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University. The authors would like to thank Bingjian Shi, Jiehao Xue, Ke Sun, Xinjie Xiong, Zhe Li, Haoting Sun, Xin Yao, Chenbo Liang, and Ruiyu Zhao for their invaluable contributions to the RISORS dataset annotation and development. Their dedication and efforts have greatly supported this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yuan, Z.; Mou, L.; Hua, Y.; Zhu, X.X. Rrsis: Referring Remote Sensing Image Segmentation. arXiv 2024, arXiv:2306.08625. [Google Scholar] [CrossRef]
  2. Mikeš, S.; Haindl, M.; Scarpa, G.; Gaetano, R. Benchmarking of Remote Sensing Segmentation Methods. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 2240–2248. [Google Scholar] [CrossRef]
  3. Ramos, L.T.; Sappa, A.D. Multispectral semantic segmentation for land cover classification: An overview. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 14295–14336. [Google Scholar] [CrossRef]
  4. Yang, S.; Song, F.; Jeon, G.; Sun, R. Scene changes understanding framework based on graph convolutional networks and Swin transformer blocks for monitoring LCLU using high-resolution remote sensing images. Remote Sens. 2022, 14, 3709. [Google Scholar] [CrossRef]
  5. Li, J.; Hong, D.; Gao, L.; Yao, J.; Zheng, K.; Zhang, B.; Chanussot, J. Deep learning in multimodal remote sensing data fusion: A comprehensive review. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102926. [Google Scholar] [CrossRef]
  6. Li, F.; Yigitcanlar, T.; Nepal, M.; Nguyen, K.; Dur, F. Machine learning and remote sensing integration for leveraging urban sustainability: A review and framework. Sustain. Cities Soc. 2023, 96, 104653. [Google Scholar] [CrossRef]
  7. Gagliardi, V.; Tosti, F.; Bianchini Ciampoli, L.; Battagliere, M.L.; D’Amato, L.; Alani, A.M.; Benedetto, A. Satellite remote sensing and non-destructive testing methods for transport infrastructure monitoring: Advances, challenges and perspectives. Remote Sens. 2023, 15, 418. [Google Scholar] [CrossRef]
  8. Jiang, J.; Fu, X.; Qin, R.; Wang, X.; Ma, Z. High-speed lightweight ship detection algorithm based on YOLO-v4 for three-channels RGB SAR image. Remote Sens. 2021, 13, 1909. [Google Scholar] [CrossRef]
  9. Durlik, I.; Miller, T.; Kostecka, E.; Tuński, T. Artificial Intelligence in Maritime Transportation: A Comprehensive Review of Safety and Risk Management Applications. Appl. Sci. 2024, 14, 8420. [Google Scholar] [CrossRef]
  10. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  12. Jiang, S.; Lin, H.; Ren, H.; Hu, Z.; Weng, L.; Xia, M. Mdanet: A high-resolution city change detection network based on difference and attention mechanisms under multi-scale feature fusion. Remote Sens. 2024, 16, 1387. [Google Scholar] [CrossRef]
  13. Huang, Y.; Li, X.; Du, Z.; Shen, H. Spatiotemporal enhancement and interlevel fusion network for remote sensing images change detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5609414. [Google Scholar] [CrossRef]
  14. Qiu, C.; Zhang, X.; Tong, X.; Guan, N.; Yi, X.; Yang, K.; Zhu, J.; Yu, A. Few-shot remote sensing image scene classification: Recent advances, new baselines, and future trends. ISPRS J. Photogramm. Remote Sens. 2024, 209, 368–382. [Google Scholar] [CrossRef]
  15. Huo, Y.; Gang, S.; Guan, C. FCIHMRT: Feature cross-layer interaction hybrid method based on Res2Net and transformer for remote sensing scene classification. Electronics 2023, 12, 4362. [Google Scholar] [CrossRef]
  16. He, Y.; Xu, X.; Chen, H.; Li, J.; Pu, F. Visual Global-Salient Guided Network for Remote Sensing Image-Text Retrieval. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5641814. [Google Scholar] [CrossRef]
  17. Yang, L.; Xu, Y.; Yuan, C.; Liu, W.; Li, B.; Hu, W. Improving visual grounding with visual-linguistic verification and iterative reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 9499–9508. [Google Scholar]
  18. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  19. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  20. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  21. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
  22. Liu, S.; Ma, Y.; Zhang, X.; Wang, H.; Ji, J.; Sun, X.; Ji, R. Rotated multi-scale interaction network for referring remote sensing image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 26658–26668. [Google Scholar]
  23. Lei, S.; Xiao, X.; Zhang, T.; Li, H.C.; Shi, Z.; Zhu, Q. Exploring fine-grained image-text alignment for referring remote sensing image segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 63, 5604611. [Google Scholar] [CrossRef]
  24. Ryali, C.; Hu, Y.T.; Bolya, D.; Wei, C.; Fan, H.; Huang, P.Y.; Aggarwal, V.; Chowdhury, A.; Poursaeed, O.; Hoffman, J.; et al. Hiera: A hierarchical vision transformer without the bells-and-whistles. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 29441–29454. [Google Scholar]
  25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  26. Hu, R.; Rohrbach, M.; Darrell, T. Segmentation from natural language expressions. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 108–124. [Google Scholar]
  27. Liu, C.; Lin, Z.; Shen, X.; Yang, J.; Lu, X.; Yuille, A. Recurrent multimodal interaction for referring image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1271–1280. [Google Scholar]
  28. Li, R.; Li, K.; Kuo, Y.C.; Shu, M.; Qi, X.; Shen, X.; Jia, J. Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5745–5753. [Google Scholar]
  29. Ye, L.; Rochan, M.; Liu, Z.; Wang, Y. Cross-modal self-attention network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 10502–10511. [Google Scholar]
  30. Luo, G.; Zhou, Y.; Ji, R.; Sun, X.; Su, J.; Lin, C.W.; Tian, Q. Cascade grouped attention network for referring expression segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, Virtual, 12–16 October 2020; pp. 1274–1282. [Google Scholar]
  31. Hu, Z.; Feng, G.; Sun, J.; Zhang, L.; Lu, H. Bi-directional relationship inferring network for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4424–4433. [Google Scholar]
  32. Yang, S.; Xia, M.; Li, G.; Zhou, H.Y.; Yu, Y. Bottom-up shift and reasoning for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 11266–11275. [Google Scholar]
  33. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  34. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  35. Yang, Z.; Wang, J.; Tang, Y.; Chen, K.; Zhao, H.; Torr, P.H. Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 18155–18165. [Google Scholar]
  36. Liu, C.; Ding, H.; Zhang, Y.; Jiang, X. Multi-Modal Mutual Attention and Iterative Interaction for Referring Image Segmentation. IEEE Trans. Image Process. 2023, 32, 3054–3065. [Google Scholar] [CrossRef]
  37. Pan, Y.; Sun, R.; Wang, Y.; Zhang, T.; Zhang, Y. Rethinking the Implicit Optimization Paradigm with Dual Alignments for Referring Remote Sensing Image Segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, Australia, 28 October–1 November 2024; pp. 2031–2040. [Google Scholar]
  38. Ravi, N.; Gabeur, V.; Hu, Y.T.; Hu, R.; Ryali, C.; Ma, T.; Khedr, H.; Rädle, R.; Rolland, C.; Gustafson, L.; et al. Sam 2: Segment anything in images and videos. arXiv 2024, arXiv:2408.00714. [Google Scholar]
  39. Diab, M.; Kolokoussis, P.; Brovelli, M.A. Optimizing zero-shot text-based segmentation of remote sensing imagery using SAM and Grounding DINO. Artif. Intell. Geosci. 2025, 6, 100105. [Google Scholar] [CrossRef]
  40. Liu, S.; Zeng, Z.; Ren, T.; Li, F.; Zhang, H.; Yang, J.; Jiang, Q.; Li, C.; Yang, J.; Su, H.; et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 38–55. [Google Scholar]
  41. Yuan, H.; Li, X.; Zhang, T.; Huang, Z.; Xu, S.; Ji, S.; Tong, Y.; Qi, L.; Feng, J.; Yang, M.H. Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos. arXiv 2025, arXiv:2501.04001. [Google Scholar]
  42. Lai, X.; Tian, Z.; Chen, Y.; Li, Y.; Yuan, Y.; Liu, S.; Jia, J. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 9579–9589. [Google Scholar]
  43. Liu, H.; Li, C.; Wu, Q.; Lee, Y.J. Visual instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 34892–34916. [Google Scholar]
  44. Zhan, Y.; Xiong, Z.; Yuan, Y. Rsvg: Exploring data and models for visual grounding on remote sensing data. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5604513. [Google Scholar] [CrossRef]
  45. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  46. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar]
  47. Chen, J.; Zhang, L.; Wang, Q.; Bai, C.; Kpalma, K. Intra-modal constraint loss for image-text retrieval. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; pp. 4023–4027. [Google Scholar]
  48. Chng, Y.X.; Zheng, H.; Han, Y.; Qiu, X.; Huang, G. Mask grounding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 26573–26583. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  50. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  51. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3431–3440. [Google Scholar]
  52. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  53. Ha, D.; Dai, A.; Le, Q.V. Hypernetworks. arXiv 2016, arXiv:1609.09106. [Google Scholar]
  54. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  55. Milletari, F.; Navab, N.; Ahmadi, S.A. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings of the 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016; pp. 565–571. [Google Scholar]
  56. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32. [Google Scholar]
  57. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  58. Hui, T.; Liu, S.; Huang, S.; Li, G.; Yu, S.; Zhang, F.; Han, J. Linguistic structure guided context modeling for referring image segmentation. In Proceedings of the European Conference on Computer Vision, Virtual, 23–28 August 2020; pp. 59–75. [Google Scholar]
  59. Huang, S.; Hui, T.; Liu, S.; Li, G.; Wei, Y.; Han, J.; Liu, L.; Li, B. Referring image segmentation via cross-modal progressive comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10488–10497. [Google Scholar]
  60. Liu, S.; Hui, T.; Huang, S.; Wei, Y.; Li, B.; Li, G. Cross-modal progressive comprehension for referring segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4761–4775. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Flowchart of the main dataset construction procedure.
Figure 2. Distribution of the number of instances contained in each image in RISORS.
Figure 3. The number of instances per category in RISORS.
Figure 4. Distribution of instance areas in RISORS. The area axis is shown on a logarithmic scale.
Figure 5. Distribution of the number of pixels occupied by instances of each category in RISORS.
Figure 6. Samples from RefSegRS.
Figure 7. Samples from RRSIS-D and RISORS.
Figure 8. Overall architecture of our proposed RSAM. The Image Encoder first processes remote sensing images to extract multi-scale features. The Two-Way Guidance Module then performs bidirectional guidance between deep image features and text features. Finally, the Multimodal Mask Decoder integrates the image features, the global text features, and spatial prompt embeddings (if provided), together with shallow image features, to reconstruct high-resolution masks.
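For readers who prefer code to diagrams, the sketch below caricatures the Figure 8 data flow in a few dozen lines of PyTorch. It is a hypothetical, heavily simplified stand-in: the module internals, names, and dimensions are placeholders and do not reproduce the actual RSAM implementation or the SAM 2 backbone.

```python
# A minimal, hypothetical sketch of the Figure 8 data flow; all module names,
# shapes, and internals are placeholders, not the released RSAM code.
import torch
import torch.nn as nn

class RSAMSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Stand-in for the SAM 2 image encoder (patchify + project).
        self.image_encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # Projects text-encoder token embeddings into the shared dimension.
        self.text_proj = nn.Linear(768, dim)
        # Stand-in for the two-way guidance between image and text tokens.
        self.twgm = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        # Stand-in for the multimodal mask decoder head.
        self.mask_head = nn.Conv2d(dim, 1, kernel_size=1)

    def forward(self, image, text_tokens):
        feat = self.image_encoder(image)            # (B, C, H/16, W/16)
        b, c, h, w = feat.shape
        img_seq = feat.flatten(2).transpose(1, 2)   # (B, HW, C)
        txt = self.text_proj(text_tokens)           # (B, L, C)
        guided = self.twgm(img_seq, txt)            # image tokens guided by text
        guided = guided.transpose(1, 2).reshape(b, c, h, w)
        logits = self.mask_head(guided)             # low-resolution mask logits
        return nn.functional.interpolate(
            logits, size=image.shape[-2:], mode="bilinear", align_corners=False
        )

# Usage with dummy tensors:
model = RSAMSketch()
mask_logits = model(torch.randn(1, 3, 256, 256), torch.randn(1, 20, 768))
```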
Figure 9. Schematic diagram of the TWGM. It is constructed by stacking multiple TWGT layers, followed by a cross-attention operation that extracts the global textual prompt embedding.
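One plausible reading of a single guidance step in Figure 9 is a pair of cross-attention passes, one per direction, with positional encodings added to the queries and keys of every attention layer. The toy layer below illustrates that idea only; it is not the paper's TWGT layer and omits feed-forward blocks and other details.

```python
# Hypothetical illustration of one bidirectional guidance step: text tokens
# attend to image tokens and image tokens attend back to text, with positional
# encodings added to queries and keys. Not the actual TWGT implementation.
import torch
import torch.nn as nn

class TwoWayGuidanceLayerSketch(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_txt = nn.LayerNorm(dim)
        self.norm_img = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens, img_pe, txt_pe):
        # Text queries gather visual evidence (positional encodings on q/k only).
        t, _ = self.txt_to_img(txt_tokens + txt_pe, img_tokens + img_pe, img_tokens)
        txt_tokens = self.norm_txt(txt_tokens + t)
        # Image queries are refined by the updated text in the reverse direction.
        i, _ = self.img_to_txt(img_tokens + img_pe, txt_tokens + txt_pe, txt_tokens)
        img_tokens = self.norm_img(img_tokens + i)
        return img_tokens, txt_tokens
```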
Figure 10. Schematic diagram of the Multimodal Mask Decoder (MMMD).
Figure 11. Qualitative test results on several examples from the RefSegRS test set. Black indicates correctly predicted background areas, white indicates correctly predicted target areas, red denotes missed detections, and green represents false positives.
Figure 12. Qualitative test results on several examples from the RRSIS-D test set. Black indicates correctly predicted background areas, white indicates correctly predicted target areas, red denotes missed detections, and green represents false positives.
Figure 13. Qualitative test results on several examples from the RISORS test set. Black indicates correctly predicted background areas, white indicates correctly predicted target areas, red denotes missed detections, and green represents false positives.
Figure 14. Ablation study results on several examples from the RISORS test set. Black indicates correctly predicted background areas, white indicates correctly predicted target areas, red denotes missed detections, and green represents false positives.
Table 1. Specific information for the annotation of an instance in RISORS.
Field Name | Description
id | The ID of the instance
name | The category to which the instance belongs
description | A text instruction referring to the instance
area | The number of pixels occupied by the mask
bbox | The bounding box of the instance
point_coords | Coordinates of points added during manual re-guidance
point_labels | Marks whether each point is positive or negative
segmentation | The mask in RLE format
predicted_iou | The IoU predicted by SAM when generating the mask; it should not be treated as ground truth during training
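The annotation fields in Table 1 follow a COCO-style, RLE-based layout. As a minimal illustration (not actual RISORS data; the category, description, and all values below are made up), a single instance record could be assembled and its mask decoded with pycocotools as follows:

```python
# A minimal sketch of a RISORS-style instance record, assuming COCO-style RLE
# as described in Table 1; all field values here are illustrative only.
import numpy as np
from pycocotools import mask as mask_utils

# Build a toy binary mask and encode it to RLE, standing in for the stored
# "segmentation" field of one instance.
toy_mask = np.zeros((800, 800), dtype=np.uint8)
toy_mask[96:144, 512:576] = 1
rle = mask_utils.encode(np.asfortranarray(toy_mask))

instance = {
    "id": 12345,
    "name": "storage tank",
    "description": "the white storage tank next to the harbor",
    "area": int(mask_utils.area(rle)),        # pixels covered by the mask
    "bbox": mask_utils.toBbox(rle).tolist(),  # [x, y, w, h]
    "point_coords": [[544, 120]],             # manual re-guidance points
    "point_labels": [1],                      # 1 = positive, 0 = negative
    "segmentation": rle,                      # RLE-encoded mask
    "predicted_iou": 0.93,                    # SAM's own estimate, not ground truth
}

decoded = mask_utils.decode(instance["segmentation"])  # back to an 800x800 array
assert decoded.sum() == instance["area"]
```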
Table 2. Statistics for Each Set in RISORS.
Set | Images | Instances
Training | 11,132 | 23,304
Validation | 1886 | 4708
Test | 3624 | 8685
Total | 16,642 | 36,697
Table 3. Statistics for each dataset.
Dataset | Images | Instances | Ave. Words | Image Size
RefSegRS | 4420 | 4420 | 3.09 | 512 × 512
RRSIS-D | 17,402 | 17,402 | 6.80 | 800 × 800 *
RISORS | 16,642 | 36,697 | 7.45 | 800 × 800
* Images in the RRSIS-D dataset are approximately, but not exactly, 800 × 800 pixels.
Table 4. Sample counts for each split of the three datasets.
Dataset | Train | Val | Test | Total
RefSegRS | 2172 | 431 | 1817 | 4420
RRSIS-D | 12,181 | 1740 | 3481 | 17,402
RISORS | 23,304 | 4708 | 8685 | 36,697
Table 5. Comparison on RefSegRS Dataset.
Methods | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | oIoU | mIoU
LSTM-CNN [26] | 15.69 | 10.57 | 5.17 | 1.10 | 0.28 | 53.83 | 24.76
ConvLSTM [28] | 31.21 | 23.39 | 15.30 | 7.59 | 1.10 | 66.12 | 43.34
CMSA [29] | 28.07 | 20.25 | 12.71 | 5.61 | 0.83 | 64.53 | 41.47
BRINet [31] | 22.56 | 15.74 | 9.85 | 3.52 | 0.50 | 60.16 | 32.87
LAVT [35] | 70.23 | 55.53 | 30.05 | 14.42 | 4.07 | 76.21 | 57.30
LGCE [1] | 76.55 | 67.03 | 44.85 | 19.04 | 5.67 | 77.62 | 61.90
RMSIN [22] | 73.25 | 59.88 | 34.07 | 12.27 | 2.37 | 72.64 | 59.07
FIANet [23] | 81.29 | 69.73 | 48.60 | 20.20 | 3.63 | 75.85 | 64.12
RSAM (Ours) | 80.77 | 72.86 | 61.76 | 39.40 | 8.63 | 76.21 | 68.48
Note: The best results are bold; the second-best results are underlined.
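The metric columns in Tables 5, 6, and 8 follow common referring-segmentation practice: P@X is the percentage of test expressions whose predicted mask exceeds an IoU of X with the ground truth, mIoU averages the per-sample IoU, and oIoU accumulates intersection and union over the whole test set before dividing. A minimal NumPy sketch of these metrics, assuming lists of binary masks:

```python
# Minimal sketch of the common RRSIS metrics, assuming binary numpy masks.
import numpy as np

def evaluate(pred_masks, gt_masks, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    ious, total_inter, total_union = [], 0, 0
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
        total_inter += inter
        total_union += union
    ious = np.array(ious)
    precision_at = {f"P@{t}": 100.0 * (ious > t).mean() for t in thresholds}
    return {
        **precision_at,
        "oIoU": 100.0 * total_inter / total_union,  # dataset-level IoU
        "mIoU": 100.0 * ious.mean(),                # mean per-sample IoU
    }
```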
Table 6. Comparison on RRSIS-D Dataset.
Methods | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | oIoU | mIoU
RRN [28] | 51.07 | 42.11 | 32.77 | 21.57 | 6.37 | 66.43 | 45.64
CMSA [29] | 55.32 | 46.45 | 37.43 | 25.39 | 8.15 | 69.39 | 48.53
LSCM [58] | 56.02 | 46.25 | 37.70 | 25.28 | 8.27 | 69.05 | 49.92
CMPC [59] | 55.83 | 47.40 | 36.94 | 25.45 | 9.19 | 69.22 | 49.24
BRINet [31] | 56.90 | 48.77 | 39.12 | 27.03 | 8.73 | 69.88 | 49.65
CMPC+ [60] | 57.65 | 47.51 | 36.97 | 24.33 | 7.78 | 68.64 | 50.24
LAVT [35] | 66.93 | 60.99 | 51.71 | 39.79 | 23.99 | 76.58 | 59.05
LGCE [1] | 69.41 | 63.06 | 53.46 | 41.22 | 24.27 | 76.24 | 61.02
RMSIN [22] | 72.34 | 64.72 | 52.60 | 39.39 | 21.00 | 76.21 | 62.62
FIANet [23] | 73.97 | 67.19 | 55.93 | 41.71 | 23.50 | 76.88 | 63.41
RSAM (Ours) | 78.85 | 73.54 | 63.61 | 48.51 | 29.85 | 78.25 | 67.58
Note: The best results are bold; the second-best results are underlined.
Table 7. Results on Each Category on RRSIS-D Dataset.
Category | RMSIN | FIANet | RSAM
vehicle | 46.08 | 46.84 | 54.16
windmill | 58.54 | 59.54 | 63.24
expressway toll station | 68.00 | 70.95 | 75.20
airplane | 67.33 | 69.21 | 72.14
tennis court | 68.52 | 69.87 | 74.48
bridge | 48.42 | 48.03 | 53.63
storage tank | 72.38 | 75.53 | 78.33
harbor | 34.85 | 35.53 | 37.01
basketball court | 67.35 | 66.39 | 68.76
ship | 63.22 | 61.62 | 69.20
baseball field | 83.92 | 85.24 | 85.30
dam | 60.17 | 62.93 | 65.54
train station | 61.88 | 63.79 | 65.28
overpass | 60.26 | 61.49 | 63.57
chimney | 77.97 | 77.11 | 81.53
airport | 56.81 | 56.04 | 60.74
expressway service area | 69.58 | 70.75 | 72.78
ground track field | 74.86 | 75.04 | 81.74
golf field | 75.52 | 76.80 | 77.21
stadium | 82.10 | 83.66 | 81.85
average | 64.89 | 65.82 | 69.08
Note: The best results are bold.
Table 8. Comparison on RISORS Dataset.
Methods | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | oIoU | mIoU
RMSIN [22] | 75.46 | 70.71 | 63.37 | 51.77 | 32.84 | 80.38 | 67.15
FIANet [23] | 75.19 | 70.39 | 63.50 | 52.46 | 33.70 | 79.59 | 67.04
RSAM (Ours) | 80.17 | 77.24 | 72.61 | 64.87 | 46.80 | 81.84 | 72.80
Note: The best results are bold; the second-best results are underlined.
Table 9. Results on Each Category on RISORS Dataset.
Category | RMSIN | FIANet | RSAM
windmill | 66.99 | 65.67 | 75.12
vehicle | 55.85 | 56.77 | 65.61
expressway toll station | 77.14 | 77.01 | 82.05
airplane | 71.00 | 72.08 | 75.42
harbor | 38.23 | 38.32 | 42.54
bridge | 45.46 | 47.38 | 57.65
tennis court | 75.25 | 75.43 | 79.63
storage tank | 77.06 | 75.81 | 78.43
ship | 69.21 | 68.17 | 75.79
baseball field | 80.96 | 80.41 | 84.31
basketball court | 73.45 | 74.97 | 77.51
dam | 63.10 | 61.66 | 65.36
airport | 65.55 | 65.93 | 67.81
train station | 65.49 | 66.55 | 69.48
overpass | 59.24 | 59.00 | 63.97
chimney | 78.06 | 75.85 | 79.90
expressway service area | 75.18 | 74.56 | 77.74
ground track field | 75.11 | 75.38 | 79.34
stadium | 85.18 | 85.64 | 87.01
average | 68.45 | 68.24 | 72.88
Note: The best results are bold.
Table 10. Comparison of Parameters and FLOPs.
Methods | Total Params | Trainable Params | FLOPs
RMSIN | 240 M | 240 M | 204 G
FIANet | 251 M | 251 M | 210 G
RSAM | 194 M | 86 M | 280 G
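The gap between total and trainable parameters for RSAM in Table 10 comes from frozen components. Parameter counts of this kind can be reproduced for any PyTorch module as sketched below; FLOPs additionally require a profiler (e.g., fvcore or ptflops) and are not shown here.

```python
# Sketch for reporting total vs. trainable parameter counts of an nn.Module.
import torch.nn as nn

def count_parameters(model: nn.Module):
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Example with a toy model in which one submodule is frozen.
model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))
for p in model[0].parameters():
    p.requires_grad_(False)
total, trainable = count_parameters(model)
print(f"total={total / 1e6:.2f} M, trainable={trainable / 1e6:.2f} M")
```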
Table 11. Ablation Studies on RISORS Dataset.
TWGM | MMMD | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | oIoU | mIoU
 |  | 77.00 | 73.77 | 68.97 | 60.80 | 43.04 | 78.68 | 69.73
✓ |  | 78.60 | 75.19 | 70.48 | 62.39 | 43.92 | 80.09 | 71.27
 | ✓ | 79.25 | 76.03 | 71.32 | 63.19 | 45.49 | 80.31 | 71.71
✓ | ✓ | 80.17 | 77.24 | 72.61 | 64.87 | 46.80 | 81.84 | 72.80
Table 12. Comparison of Different Model Settings on the RISORS Dataset.
Module | Text Encoder | Image Encoder | Total Params | Trainable Params | oIoU | mIoU
base+ |  |  | 194 M | 194 M | 81.76 | 72.75
base+ | * |  | 194 M | 85.7 M | 81.84 | 72.80
base+ | * | * | 194 M | 16.6 M | 78.30 | 67.88
large |  |  | 338 M | 338 M | 80.44 | 70.70
large | * |  | 338 M | 229 M | 80.70 | 72.22
large | * | * | 338 M | 16.6 M | 77.60 | 67.19
* denotes that the module is frozen and the parameters are not updated in training.
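In the fully frozen rows of Table 12, only about 16.6 M parameters of the guidance and decoding modules receive gradient updates. A typical way to realize such settings, assuming hypothetical attribute names for the two encoders, is to disable gradients on the frozen modules and hand only the remaining parameters to the optimizer:

```python
# Hypothetical sketch of the frozen-encoder settings in Table 12: the marked
# encoders stop receiving gradients and the optimizer sees only the rest.
# Attribute names (text_encoder, image_encoder) are placeholders.
import torch

def freeze(module):
    for p in module.parameters():
        p.requires_grad_(False)

def build_optimizer(model, freeze_text=True, freeze_image=True, lr=1e-4):
    if freeze_text:
        freeze(model.text_encoder)
    if freeze_image:
        freeze(model.image_encoder)
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=0.01)
```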
Table 13. Comparison of Different Model Sizes on Three Datasets. Each cell reports Base+ / Large.
Dataset | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | oIoU | mIoU
RefSegRS | 80.77 / 83.79 | 72.86 / 76.65 | 61.76 / 66.92 | 39.40 / 48.63 | 8.63 / 13.90 | 76.21 / 76.66 | 68.48 / 71.65
RRSIS-D | 78.85 / 75.29 | 73.54 / 71.13 | 63.61 / 61.83 | 48.51 / 47.50 | 29.85 / 28.65 | 78.25 / 74.26 | 67.58 / 64.85
RISORS | 80.17 / 79.14 | 77.24 / 76.17 | 72.61 / 72.09 | 64.87 / 64.70 | 46.80 / 46.17 | 81.84 / 80.70 | 72.80 / 72.22
Table 14. The impact of point prompts on RISORS dataset.
Number of Points | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | oIoU | mIoU
0 | 80.17 | 77.24 | 72.61 | 64.87 | 46.80 | 81.84 | 72.80
1 | 97.34 | 95.35 | 91.78 | 83.90 | 61.54 | 91.21 | 88.38
2 | 98.54 | 97.16 | 94.76 | 88.14 | 65.93 | 93.09 | 90.04
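One simple way to simulate the one or two user-provided points referenced in Table 14 is to take an interior point near the center of the ground-truth mask and pass it as a positive point alongside the text instruction. The sketch below illustrates this idea only; it is not necessarily the protocol used here, and the predict interface in the commented usage is assumed rather than RSAM's actual API.

```python
# Hypothetical sketch of simulating a single positive point prompt from the
# ground-truth mask; `model.predict` below is an assumed interface.
import numpy as np

def sample_positive_point(gt_mask: np.ndarray):
    ys, xs = np.nonzero(gt_mask)
    # Pick the foreground pixel closest to the mask's center of mass.
    cy, cx = ys.mean(), xs.mean()
    i = np.argmin((ys - cy) ** 2 + (xs - cx) ** 2)
    return [int(xs[i]), int(ys[i])]   # (x, y), as in the point_coords field

# point = sample_positive_point(gt_mask)
# mask = model.predict(image, text="the ship near the pier",
#                      point_coords=[point], point_labels=[1])
```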
Table 15. Comparison of different methods on three datasets. Each cell reports oIoU / mIoU.
Dataset | RMSIN | FIANet | RSAM
RefSegRS | 72.64 / 59.07 | 75.85 / 64.12 | 76.21 / 68.48
RRSIS-D | 76.21 / 62.62 | 76.88 / 63.41 | 78.25 / 67.58
RISORS | 80.38 / 67.15 | 79.59 / 67.04 | 81.84 / 72.80
Note: The best results are bold.
