Article
Peer-Review Record

MSMamba: A Multi-Semantic Mamba Framework for Referring Remote Sensing Image Segmentation

Remote Sens. 2026, 18(12), 1949; https://doi.org/10.3390/rs18121949 (registering DOI)
by Tianxiang Zhang 1,2, Junbai Li 1,2, Yanqiang Feng 1,2, Zhaokun Wen 1,2, Li Liu 1,2 and Jiangyun Li 1,2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 16 March 2026 / Revised: 6 June 2026 / Accepted: 9 June 2026 / Published: 12 June 2026
(This article belongs to the Section AI Remote Sensing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The VTFB and MSFD modules of the proposed MSMamba are reasonable and innovative. However, the efficiency comparison and the description of the methods have significant shortcomings. Detailed suggestions are as follows:
(1) Add comparisons of inference speed (FPS) and GPU memory usage against similar methods to Table 5;
(2) Clearly explain or annotate the sources of the baseline results in Tables 1-4;
(3) Describe the specific operations of CSA, the WLSP preprocessing workflow, and the BERT alignment method in slightly more detail;
(4) Supplement the SS2D ablation experiments. Additionally, it is suggested to add grouped IoU comparisons based on target size.
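
As an illustration of the grouped IoU comparison suggested in (4), the following is a minimal sketch of how mean IoU could be reported per target-size group. It is not taken from the manuscript: the size thresholds (1% and 10% of the image area), the group names, and the function names `mask_iou`, `size_group`, and `grouped_miou` are assumptions chosen for illustration only.

```python
from collections import defaultdict

# Hypothetical thresholds on the fraction of the image covered by the target.
SMALL_MAX = 0.01
MEDIUM_MAX = 0.10


def mask_iou(pred, gt):
    """IoU between two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def size_group(gt):
    """Assign a ground-truth mask to a size group by its area fraction."""
    total = sum(len(row) for row in gt)
    frac = sum(sum(row) for row in gt) / total
    if frac < SMALL_MAX:
        return "small"
    if frac < MEDIUM_MAX:
        return "medium"
    return "large"


def grouped_miou(samples):
    """Mean IoU per size group over (pred, gt) mask pairs."""
    scores = defaultdict(list)
    for pred, gt in samples:
        scores[size_group(gt)].append(mask_iou(pred, gt))
    return {name: sum(vals) / len(vals) for name, vals in scores.items()}
```

Grouping by the ground-truth area rather than the predicted area keeps the group assignment independent of the model being evaluated, so the same samples fall into the same group for every compared method.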

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

This paper proposes a method for Referring Remote Sensing Image Segmentation (RRSIS), which is a relatively new area. The topic is quite interesting. The authors compare their results with other methods. The method and results are clear. The conclusion shows the contribution of the work.

The fonts in the figures are inconsistent, for example in Figs. 1 and 2. Please unify them.

For Fig. 2, the reviewer suggests adding space to list the full names of L, L’, GL-CA, VTFB, WLSP, and so on. Even though the authors have introduced them in the text, the reviewer still suggests adding an explanation directly in the figure for easier reading.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The paper aims to improve referring remote sensing image segmentation, where the goal is to identify and segment the exact object or region in an aerial or satellite image based on a natural-language description. The authors correctly identify the specific challenges of processing remote sensing images, which usually show large areas, contain objects of different scales and similar context, and require substantial computational resources when attention-based and cross-based models are used. The paper proposes the MSMamba framework, which represents a valid alternative. The paper has a clear motivation, a technically sound structure, and a good experimental evaluation. However, several aspects should be improved before acceptance.

Specifically, in Section 3.3.1, the Word-Level Semantic Processor uses the spaCy natural language processing library to parse the expression and extract attribute-related words. spaCy analyzes a sentence and identifies the useful words, i.e., the most important words in the user input, so that the correct object can be identified and an appropriate segmentation mask produced. This is a very important part of the proposed framework. The authors recognize the problem of spaCy failure and implement a pre-defined mitigation strategy, but there is no information about how often failure occurs. There is also no analysis of how the model performs when attribute extraction is incorrect, i.e., how effective the mitigation strategy is. This needs to be explained in more detail, since it concerns direct input from the user.

Four different datasets were used, but the comparison of different methods was done only within one dataset. Since the methods used for evaluation overlap considerably across the four datasets, additional explanation of why specific methods were used for specific datasets and not for others would be beneficial. Also, a cross-dataset generalization experiment, for example training on one dataset and testing on another, could show the robustness of the model to various input data.

In Section 4.3, Performance Comparison, the fairness of the baseline comparison needs additional clarification, since the results come from different sources. Some are reimplemented, some are taken from the original papers, and some are adopted from RMSIN, but it is nowhere clearly stated which results come from which source. This creates confusion.

In Section 4.3.5, Computational Efficiency Analysis, only parameter size and FLOPs are used to demonstrate the effectiveness of MSMamba. For a more comprehensive analysis, it would be good to also report other computational efficiency information, such as inference time or GPU memory consumption, in comparison with the other models. Also, the authors should explain why the analysis covers only the RefSegRS dataset.
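
To make the requested inference-time measurement concrete, the following is a minimal, framework-agnostic sketch of a latency and throughput benchmark with warm-up passes. It does not reflect the authors' setup: `measure_latency` and `dummy_model` are hypothetical names, and the callable stands in for a real model's forward pass. On a GPU, one would typically also synchronize the device before reading the clock (for example `torch.cuda.synchronize()` in PyTorch) and read peak memory with a call such as `torch.cuda.max_memory_allocated()`.

```python
import time


def measure_latency(infer, inputs, warmup=3, repeats=10):
    """Time `infer` over `inputs`.

    Returns (mean seconds per sample, samples per second).
    """
    # Warm-up passes are run first and excluded from the timing.
    for _ in range(warmup):
        for x in inputs:
            infer(x)

    start = time.perf_counter()
    for _ in range(repeats):
        for x in inputs:
            infer(x)
    elapsed = time.perf_counter() - start

    n = repeats * len(inputs)
    return elapsed / n, n / elapsed


def dummy_model(x):
    """Hypothetical stand-in for a segmentation model's forward pass."""
    return sum(v * v for v in x)


if __name__ == "__main__":
    batch = [list(range(1000)) for _ in range(4)]
    latency, fps = measure_latency(dummy_model, batch)
    print(f"latency: {latency * 1e3:.3f} ms/sample, throughput: {fps:.1f} samples/s")
```

Reporting latency under the same input resolution, batch size, and hardware for every compared model is what makes such numbers comparable across methods.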

In Figure 2 (Overview of MSMamba for referring remote sensing image segmentation), the input text appears to be "A gray large dam", but the output, i.e., the referred-object mask, shows a segmented cruise ship.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

Comments and Suggestions for Authors

This paper proposes MSMamba, a Mamba-based framework for referring remote sensing image segmentation. The approach uses a Multi-scale Fusion Decoder (MSFD) for cross-scale segmentation refinement, a Visual-Text Fine-grained Block (VTFB) for attribute-aware language grounding, and a VMamba-style visual encoder. The work addresses a significant and pressing problem: grounding natural-language expressions in complicated remote sensing scenes, where targets may be small, visually ambiguous, or described by lengthy attribute-rich expressions. The authors claim significant improvements in high-IoU precision, mIoU, and high-resolution segmentation settings and present impressive results on four public benchmarks. However, I have the following comments:

  1. How is the alignment between the appropriate image region and the referring text trained? In particular, is the alignment acquired simply indirectly through the final segmentation losses, or does MSMamba employ any explicit text-image alignment supervision, such as token-pixel matching, contrastive learning, or phrase-region correspondence loss?
  2. I assume that this Mamba architecture is trained under a supervised segmentation setting, where the input consists of a remote sensing image and a referring text expression, and the output is the segmented target area. However, the paper should clarify more explicitly how the relationship between the text and the image region is learned during training. From the current description, the final prediction appears to be supervised only by the ground-truth binary mask using Dice loss and BCEWithLogits loss. If this is the case, the text-image alignment may be learned only indirectly through the segmentation objective. The authors should clearly state whether there is any explicit supervision for text-image correspondence, such as token-pixel alignment, phrase-region matching, contrastive learning, or auxiliary grounding loss.
  3. In this context, "broadcast" seems to refer to the spatial repetition of the same attribute-level text feature vector across all locations of the image feature map, ensuring that each visual location receives the same linguistic bridge information. The paper should make clear whether this is a learnt spatial conditioning mechanism or a straightforward tensor expansion/repetition operation. If the vector is merely repeated identically across all pixels, the authors should explain how spatially specific text-image grounding is accomplished.
  4. Is the gate in the Selective Kernel Scale-Aware Gate (SKSAG) learnable? Please explain how w is generated: is it obtained from selective-kernel attention, computed using fixed rules, or predicted by learnable convolutional layers? Additionally, the authors should indicate whether the gate is channel-wise, spatial-wise, or both. They should also include an ablation that illustrates the impact of replacing the learnable gate with a simple addition or concatenation.
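
The distinction raised in the comments on "broadcast" and gating can be illustrated with a small sketch. It is not the authors' VTFB or SKSAG: the function names are hypothetical, and the gate here is a fixed dot-product sigmoid with no learned parameters, used only to contrast a spatially uniform text offset with a location-dependent modulation.

```python
import math


def broadcast(text_vec, height, width):
    """Repeat one text vector at every spatial location (pure tensor expansion)."""
    return [[list(text_vec) for _ in range(width)] for _ in range(height)]


def add_fusion(visual, text_vec):
    """Add the broadcast text vector to each location: the same offset everywhere."""
    text_map = broadcast(text_vec, len(visual), len(visual[0]))
    return [
        [[v + t for v, t in zip(px, tpx)] for px, tpx in zip(vrow, trow)]
        for vrow, trow in zip(visual, text_map)
    ]


def gated_fusion(visual, text_vec):
    """Scale each location by sigmoid(<visual, text>): the gate differs per location."""
    out = []
    for row in visual:
        out_row = []
        for px in row:
            score = sum(v * t for v, t in zip(px, text_vec))
            gate = 1.0 / (1.0 + math.exp(-score))
            out_row.append([gate * v for v in px])
        out.append(out_row)
    return out
```

With plain addition, every location is shifted by the identical text vector, so any spatial selectivity has to come from later layers. With the gate, locations whose visual features agree with the text vector are kept and the others are suppressed, which is one simple way spatially specific conditioning can arise.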

Author Response

Please see the attachment.

Author Response File: Author Response.docx
