This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation

1
National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing, Beijing Institute of Technology, Beijing 100081, China
2
Innovative Equipment Research Institute of Beijing Institute of Technology in Sichuan Tianfu New Area, Chengdu 610213, China
3
School of Integrated Circuits, Tsinghua University, Beijing 100084, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(7), 1015; https://doi.org/10.3390/rs18071015 (registering DOI)
Submission received: 13 February 2026 / Revised: 23 March 2026 / Accepted: 25 March 2026 / Published: 28 March 2026

Abstract

Referring Remote Sensing Image Segmentation (RRSIS) is a representative multimodal understanding task in remote sensing that segments designated targets from remote sensing images according to free-form natural language descriptions. However, complex remote sensing characteristics, such as cluttered backgrounds, large scale variations, small scattered targets, and repetitive textures, lead to unstable visual grounding and, in turn, spatial grounding drift, resulting in inaccurate segmentation. Existing approaches typically perform implicit visual–linguistic fusion across encoding and decoding stages, entangling spatial grounding with mask refinement. This tightly coupled formulation lacks explicit structural constraints and is prone to cross-modal ambiguity, especially in complex remote sensing layouts. To address these limitations, we propose a Structurally consistent and Grounding-aware Stagewise Reasoning Framework (SGSRF) that follows a grounding-first, segmentation-second paradigm. The framework decomposes inference into three cascaded stages with progressively imposed structural constraints. First, Cross-modal Consistency Refinement (CCR) lays the foundation for stable spatial grounding by enhancing visual–textual structural alignment via CLIP-based features and Structural Consistency Regularization (SCR), producing well-aligned multimodal representations and reliable grounding cues. Second, Grounding-aware Prompt Generation (GPG) bridges grounding and segmentation by converting the aligned representations into complementary sparse and dense prompts, which serve as explicit grounding guidance for the segmentation model. Third, Grounding Modulated Segmentation (GMS) leverages the Segment Anything Model (SAM) to generate fine-grained mask predictions under the joint guidance of prompts and grounding cues, improving spatial grounding stability and robustness to background interference and scale variation.
Extensive experiments on three remote sensing benchmarks, namely RefSegRS, RRSIS-D, and RISBench, demonstrate that SGSRF achieves state-of-the-art performance. The proposed stagewise paradigm integrates structural alignment, explicit grounding, and prompt-driven segmentation into a unified framework, providing a practical and robust solution for RRSIS in real-world Earth observation applications.
Keywords: remote sensing; referring segmentation; multimodal understanding; spatial grounding; segment anything model
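
The grounding-first, segmentation-second idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so every function name, the cosine-similarity grounding map, the 0.5 threshold, and the flood-fill "segmenter" are assumptions chosen only to show how a grounding map might be turned into one sparse prompt (a point) and one dense prompt (a coarse mask) before a promptable segmenter is applied. In the paper, CLIP-based features and SAM play the roles that the hand-written features and flood fill play here.

```python
import math
from typing import List, Tuple

Grid = List[List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def grounding_map(patch_feats: List[List[List[float]]], text_feat: List[float]) -> Grid:
    """Stage 1 stand-in (hypothetical): per-patch similarity to the text feature."""
    return [[cosine(p, text_feat) for p in row] for row in patch_feats]


def make_prompts(gmap: Grid, threshold: float = 0.5) -> Tuple[Tuple[int, int], List[List[int]]]:
    """Stage 2 stand-in (hypothetical): sparse prompt = peak location,
    dense prompt = thresholded grounding map."""
    cells = [(r, c) for r in range(len(gmap)) for c in range(len(gmap[0]))]
    point = max(cells, key=lambda rc: gmap[rc[0]][rc[1]])
    dense = [[1 if v >= threshold else 0 for v in row] for row in gmap]
    return point, dense


def segment(point: Tuple[int, int], dense: List[List[int]]) -> List[List[int]]:
    """Stage 3 stand-in (hypothetical): keep only the 4-connected region of the
    dense prompt that contains the point prompt. A real system would call a
    promptable segmenter such as SAM here instead of a flood fill."""
    h, w = len(dense), len(dense[0])
    mask = [[0] * w for _ in range(h)]
    stack = [point]
    while stack:
        r, c = stack.pop()
        if 0 <= r < h and 0 <= c < w and dense[r][c] and not mask[r][c]:
            mask[r][c] = 1
            stack.extend([(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)])
    return mask


# Toy 3x3 "image": the target occupies the top-left two patches, and a
# similar-looking distractor sits at the bottom-left (repetitive texture).
patch_feats = [
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
    [[0.8, 0.2], [0.0, 1.0], [0.0, 1.0]],
]
text_feat = [1.0, 0.0]

gmap = grounding_map(patch_feats, text_feat)
point, dense = make_prompts(gmap)
mask = segment(point, dense)
```

In this toy case the dense prompt alone still contains the distractor patch, while combining it with the point prompt keeps only the region connected to the grounding peak. That mirrors, in a very reduced form, why the abstract describes the sparse and dense prompts as complementary.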

Share and Cite

MDPI and ACS Style

Dong, S.; Xie, J.; Chen, L.; Chen, H.; Qi, B.; Ge, Y. Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation. Remote Sens. 2026, 18, 1015. https://doi.org/10.3390/rs18071015

AMA Style

Dong S, Xie J, Chen L, Chen H, Qi B, Ge Y. Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation. Remote Sensing. 2026; 18(7):1015. https://doi.org/10.3390/rs18071015

Chicago/Turabian Style

Dong, Shan, Jianlin Xie, Liang Chen, He Chen, Baogui Qi, and Yunqiu Ge. 2026. "Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation" Remote Sensing 18, no. 7: 1015. https://doi.org/10.3390/rs18071015

APA Style

Dong, S., Xie, J., Chen, L., Chen, H., Qi, B., & Ge, Y. (2026). Structurally Consistent and Grounding-Aware Stagewise Reasoning for Referring Remote Sensing Image Segmentation. Remote Sensing, 18(7), 1015. https://doi.org/10.3390/rs18071015

