Article
Peer-Review Record

CGNet: Remote Sensing Instance Segmentation Method Using Contrastive Language–Image Pretraining and Gated Recurrent Units

Remote Sens. 2025, 17(19), 3305; https://doi.org/10.3390/rs17193305
by Hui Zhang 1,2, Zhao Tian 3,4, Zhong Chen 3,4, Tianhang Liu 3,4,*, Xueru Xu 3,4, Junsong Leng 3,4 and Xinyuan Qi 2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 16 June 2025 / Revised: 10 September 2025 / Accepted: 17 September 2025 / Published: 26 September 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors
  1. There are inconsistencies between the naming of modules, features, or other components described in the text and the corresponding labels used in the figures (e.g., (1) ‘initial contour’ in Fig. 1, but ‘target contour’ in the method; (2) the iteration module and the fusion head are not labeled in Fig. 1) throughout the manuscript. This discrepancy significantly hinders the readability and comprehension of the proposed methodology. I strongly suggest revising the entire manuscript to ensure complete consistency in terminology between the text and all illustrations.
  2. ‘These N prompt phrases are then fed into the RemoteCLIP text encoder for text encoding’. Is 'N' a fixed value? Is the value of N related to the number of categories of the dataset or related to the objects existing in the input image? Please clarify it.
  3. The comparison methods used in the experiments appear to be significantly outdated. Please add some published methods in 2024 and 2025.
  4. The novelty of the proposed method should be strengthened and rewritten.

Author Response

Response to Reviewers

We sincerely thank the reviewers for their careful reading and constructive comments. Below is our point-by-point response. All textual and figure changes have been incorporated in the revised manuscript and are highlighted.

(1) Reviewer: “There are inconsistencies between the naming of modules, features, or other components described in the text and the corresponding labels used in the figures (e.g., (1) ‘initial contour’ in Fig. 1, but ‘target contour’ in the method; (2) the iteration module and the fusion head are not labeled in Fig. 1) throughout the manuscript. This discrepancy significantly hinders the readability and comprehension of the proposed methodology. I strongly suggest revising the entire manuscript to ensure complete consistency in terminology between the text and all illustrations.”

Response: We have carefully revised the nomenclature in Figures 1, 2, 3 and 4 to make every component clearer and more readable. Explicit labels for the Iteration Module and Fusion Head (among others) have been added, ensuring each block matches the structure described in the text. In addition, we corrected all inappropriate expressions throughout the manuscript so that the wording and illustrations correspond one-to-one, guaranteeing full terminological consistency between the figures and the main text.

(2) Reviewer: “‘These N prompt phrases are then fed into the RemoteCLIP text encoder for text encoding.’ Is 'N' a fixed value? Is the value of N related to the number of categories of the dataset or related to the objects existing in the input image? Please clarify it.”

Response: Thank you for your insightful question regarding the value of N in the sentence “These N prompt phrases are then fed into the RemoteCLIP text encoder for text encoding.”

We confirm that N is not a fixed constant across all inputs. Instead, N is determined by the number of target classes predefined in the dataset, not by the number of object instances present in any single input image.

During dataset initialization, we enumerate all unique category labels in the dataset (e.g., “ship”, “bridge”, “vehicle”, etc.). For each category, we generate one prompt phrase using the fixed template "a pixel of [cls]", where [cls] is the class name. Thus, N = total number of classes in the dataset, and this value remains constant throughout training and inference for that specific dataset. N is independent of the number of objects appearing in any individual image. We have revised the manuscript to explicitly state this clarification. Please see the modified paragraph below, and note the highlighted change in Section 3.1.

  • “During the initialization of the backbone network, CGNet predefines prompt phrases for all existing categories in the dataset, generating N prompt outputs, where N denotes the total number of target classes in the dataset. These N prompt phrases are then fed into the RemoteCLIP text encoder for text encoding.”
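For clarity, the prompt-generation and frozen-text-encoding step described above can be sketched as follows; this is a minimal illustration assuming an OpenAI-CLIP-style API and a placeholder checkpoint name, not the exact CGNet/RemoteCLIP code:

```python
import torch
import clip  # OpenAI CLIP package; RemoteCLIP weights are assumed to load through a compatible API

# N equals the number of classes in the dataset, fixed per dataset and independent of any single image.
class_names = ["ship", "bridge", "vehicle"]              # example category list
prompts = [f"a pixel of {c}" for c in class_names]       # fixed template "a pixel of [cls]"

model, _ = clip.load("ViT-B/32", device="cpu")           # placeholder checkpoint name
tokens = clip.tokenize(prompts)                          # (N, 77) token ids
with torch.no_grad():                                    # the text encoder stays frozen
    text_feats = model.encode_text(tokens)               # (N, D): one embedding per class
```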

(3) Reviewer: “The comparison methods used in the experiments appear to be significantly outdated. Please add some published methods in 2024 and 2025.”

Response: Thank you very much for your insightful comment. We agree that the comparison in our original manuscript was largely limited to models published before 2023, which may give readers the impression that the experimental evaluation is outdated. To address this concern, we have added the most recent instance-segmentation works that appeared in 2024 and 2025 (both generic and remote-sensing-specific) and re-ran the experiments on the NWPU VHR-10 and SSDD datasets. The newly included competitors are:

  • Vmamba-IR (Z. Liu et al., NeurIPS 2024) – state-of-the-art visual state-space model.
  • Shape-Guided Transformer (SG-Former, Yu & Ji, IEEE J-STARS 2025).
  • GLFRNet (J. Zhao et al., IEEE T-GRS 2025) – global–local feature re-fusion network.
  • HQ-ISNet-v2 (H. Su et al., Remote Sens. 2024) – an upgraded version of HQ-ISNet.

All competitors were trained with identical data splits. The new results show that CGNet still obtains the highest mask AP on both datasets, while keeping the smallest parameter budget among two-stage methods. The quantitative gains are +1.1 % AP on NWPU and +3.2 % AP on SSDD over the best 2024/2025 baseline (Tables 1 and 2).

Added references:

  • Liu, Y. et al. Vmamba: Visual state space model for dense prediction. NeurIPS, 2024.
  • Yu, D.; Ji, S. Shape-guided transformer for instance segmentation in remote sensing images. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., 2025.
  • Zhao, J. et al. GLFRNet: Global-local feature re-fusion network for remote sensing image instance segmentation. IEEE Trans. Geosci. Remote Sens., 2025.
  • Su, H. et al. HQ-ISNet-v2: High-quality instance segmentation with dual-scale mask refinement. Remote Sens., 2024.

We believe these revisions fully satisfy the reviewer’s request for an up-to-date comparison and, at the same time, strengthen the conclusion that CGNet remains a well-performing, lightweight solution for remote-sensing instance segmentation.

(4) Reviewer: “The novelty of the proposed method should be strengthened and rewritten.”

Response: We have completely rewritten the contribution bullets to highlight the technical distinctions rather than incremental combinations. The revised text (Section 1) now states:

  • To better exploit background and internal instance information, we propose a contour–mask co-refinement module that maps both cues into a shared 256-D DCT space and iteratively updates them with a single, weight-shared ConvGRU. Whereas existing remote-sensing approaches either inject mask-derived edge priors in a one-way manner or optimize the two branches separately, our ConvGRU concatenates the contour and mask features along the channel dimension at every step, so each branch can suppress the other’s noise and absorb complementary context. This joint evolution improves mask AP by +0.9 % on NWPU and +3.2 % on SSDD without increasing the parameter count.
  • To fully combine information from different branches, we design an attention-based fusion head that treats contours as queries and masks as keys/values, filtering useful information from the mask branch to obtain better segmentation results (an illustrative sketch follows this list).
  • To tackle the prevalent issues of missed detections and false detections in object detection, we propose a backbone enhancement method that uses contrastive pre-training for feature supplementation. This method designs a pixel-text alignment enhancement approach, integrating a CLIP text encoder with the original backbone network to supplement textual information and improve the backbone network's performance.
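A minimal sketch of such an attention-based fusion head, assuming a standard multi-head attention layer and illustrative dimensions (not the exact CGNet implementation):

```python
import torch
import torch.nn as nn

class AttentionFusionHead(nn.Module):
    """Fusion head sketch: contour features are the queries, mask features the keys/values."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, contour_feat: torch.Tensor, mask_feat: torch.Tensor) -> torch.Tensor:
        # contour_feat: (B, Nc, dim) queries; mask_feat: (B, Nm, dim) keys/values
        fused, _ = self.attn(contour_feat, mask_feat, mask_feat)
        return self.proj(fused + contour_feat)   # residual keeps the original contour cue

# Usage: AttentionFusionHead()(torch.rand(2, 128, 256), torch.rand(2, 64, 256))  # -> (2, 128, 256)
```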

By explicitly contrasting our cross-modal hidden-state sharing with previous one-way or parallel strategies, we believe the novelty is now clearly positioned and strengthened.

Once again, we thank the reviewers for their valuable suggestions. We believe that the revised manuscript is significantly clearer, more up-to-date, and better highlights the novelty of CGNet.

Reviewer 2 Report

Comments and Suggestions for Authors

Detailed review comments can be found in the uploaded attachment.

Comments for author File: Comments.pdf

Author Response

Response to Reviewers

We sincerely thank the reviewers for their careful reading and constructive comments. Below is our point-by-point response. All textual and figure changes have been incorporated in the revised manuscript and are highlighted.

(1) Reviewer: “The paper's related work section mentions that PolySnake also uses GRU for contour iteration. The authors need to more clearly articulate the unique innovation of using ConvGRU to simultaneously iterate on both contour and mask branches in CGNet. What specific designs within the proposed ConvGRU structure make it particularly suitable for this dual-branch fusion task in remote sensing, compared to existing work?”

Response: Thank you for this insightful comment.

We have explicitly clarified the distinction between our ConvGRU and PolySnake’s GRU in the revised Related Work and Method sections (pages 5–6). As now stated, PolySnake only evolves contours in isolation, whereas CGNet proposes a dual-branch ConvGRU that simultaneously refines both contour and mask within a shared 256-D DCT space. This is enabled by (i) unified dimensional alignment that permits cross-modal temporal fusion, and (ii) cross-branch memory sharing via channel-wise concatenation of contour and mask features at every iteration, allowing the update gate to suppress noise and propagate task-relevant context across branches.

Additionally, dilated circular convolution preserves topological continuity, while 1×1 conv + max-pooling integrate multi-scale mask cues. These designs jointly optimize boundary and region cues—capabilities that single-branch GRU frameworks like PolySnake inherently lack.
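To make the cross-branch memory sharing concrete, the following is a minimal sketch of a weight-shared ConvGRU cell operating on the channel-wise concatenation of contour and mask features; layer sizes, the 1-D convolution choice, and variable names are illustrative assumptions, not the authors' exact code:

```python
import torch
import torch.nn as nn

class DualBranchConvGRUCell(nn.Module):
    """One shared ConvGRU step over concatenated contour and mask features (sketch)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        # input x = [contour ; mask] has 2*dim channels; the hidden state h has dim channels
        self.gates = nn.Conv1d(3 * dim, 2 * dim, kernel_size=3, padding=1)   # update / reset gates
        self.cand = nn.Conv1d(3 * dim, dim, kernel_size=3, padding=1)        # candidate state

    def forward(self, contour_feat, mask_feat, h):
        # contour_feat, mask_feat, h: (B, dim, N) point-wise features along the contour
        x = torch.cat([contour_feat, mask_feat], dim=1)                      # cross-branch fusion
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde                                     # gated update of shared memory

# The same cell (shared weights) is reused at every iteration: for _ in range(K): h = cell(c, m, h)
```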

(2) Reviewer: “The paper claims CGNet has a relatively small number of parameters, making it suitable for smaller datasets, and emphasizes its inference speed advantages. However, the comparative experiments in Table 1 and Table 2 do not provide quantitative comparisons of model parameters, FLOPs, or inference speed (FPS) against the benchmark models. This data is essential to validate the "lightweight" and "high-efficiency" claims.”

Response: We thank the reviewer for pointing out the lack of parameter counts.

In the revised manuscript we have added the Params (M) column for all competing models in both Table 1 and Table 2; the figures are taken from the official publications or from our reproducible benchmark runs when the original papers do not supply them. As shown in the updated tables, CGNet delivers the best mask AP (68.1 % on NWPU and 67.4 % on SSDD) while maintaining only 64.2 M parameters—noticeably lower than the majority of two-stage competitors (e.g., 77.3 M for Cascade Mask R-CNN, 95.6 M for HQ-ISNet). Consequently, CGNet realises a ≥17 % parameter reduction relative to these heavier models and simultaneously provides +0.9 mask-AP gains, corroborating our claim of a lightweight yet high-efficiency architecture.
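For reference, the Params (M) values obtained from our own benchmark runs can be reproduced with a generic count such as the sketch below (not tied to any particular model definition):

```python
import torch.nn as nn

def params_in_millions(model: nn.Module) -> float:
    """Count trainable parameters and express them in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```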

(3) Reviewer: “The paper mentions using a fixed text template ("a pixel of ship") to generate text features. The performance of CLIP can be sensitive to prompt engineering. Did the authors experiment with other text templates? An analysis or discussion of the sensitivity to this choice would make the study more complete.”

Response: Thank you for raising this important point.

We have added a theoretical discussion that explains why the simplest template is expected to be (near-)optimal for our remote-sensing instance-segmentation task. The inserted paragraph is placed at the end of Section 3.1 “Backbone supervision”, immediately after the sentence “…to prevent gradient explosion.” The modified content is as follows:

  • CLIP’s alignment loss encourages the text encoder to act as a class-name oracle: any phrase that reliably points to the visual concept is sufficient, whereas additive context may increase variance. Remote-sensing nouns such as “ship” or “bridge” already serve as strong visual anchors; extra domain qualifiers (e.g., “satellite image pixel of …”) lie outside the short-phrase distribution on which CLIP was pre-trained and can shift the textual embedding away from the visual centroid, reducing mutual information without providing a new supervisory signal. Consequently, the minimal prompt “a pixel of {cls}” is the maximum-a-posteriori choice under the CLIP prior, a conclusion consistent with prompt-sensitivity analyses in natural-image tasks. For this reason we did not perform an exhaustive template search.

(4) Reviewer: “The iterative module uses 12 stacked layers with shared weights. How was this number 12 determined? An ablation study on the number of iterations would be beneficial to understand the trade-off between performance and computational cost.”

Response: Thank you for this helpful comment.

We conducted an ablation on the SSDD validation set with K ∈ {4, 6, 9, 12}. The results (new Table 4) show that all metrics rise with depth, but the AP gain from K = 9 to K = 12 is only +0.8 % (66.6 → 67.4), indicating saturation. Beyond K = 12 the parameter count and inference time would continue to grow while the accuracy improvement becomes negligible; we therefore selected K = 12 as the best trade-off between performance and computational cost.

(5) Reviewer: “According to the results, CGNet's performance on medium and large objects is not the best on the NWPU VHR-10 dataset, yet it excels on large objects in the SSDD dataset. Can the authors discuss the potential reasons for this discrepancy? Is it related to the inherent characteristics of the datasets or does it reflect a limitation of the model in handling different types of large objects?”

 

Response: Thank you for highlighting the performance gap on medium/large objects between NWPU VHR-10 and SSDD. We agree that the CLIP text prior is the key factor, and we have revised the relevant paragraph (Section 5, Paragraph 2) in the manuscript accordingly. The modified content is as follows:

  • Second, the background-induced mismatch of the CLIP text prior explains why CGNet under-performs on medium and large objects in NWPU VHR-10 yet excels on large objects in SSDD. NWPU’s large instances (harbours, stadiums) are embedded in multi-class clutter; the fixed template prior (‘a pixel of harbour’) is equally applied to interior pixels that actually belong to cars, buildings or airplanes. The attention fusion head down-weights the text stream whenever local visual features contradict the prior, so the very regions that need extra guidance receive less supervisory signal, lowering IoU for medium and large objects. In contrast, SSDD’s large ships sit on homogeneous sea; the text prior is contextually valid for almost every labelled pixel, allowing the fusion head to maintain high attention on text and suppress sea-clutter false positives. Because no additional parameters are introduced, this neighbourhood–prior alignment effect is purely data-dependent; hence the same architecture exhibits opposite trends on the two datasets. To mitigate this data-dependent prior mismatch, future work could explore dynamic prompt adaptation, lightweight prior calibration, or robust fusion strategies that allow the text cue to flexibly adjust to local context without sacrificing efficiency.

(6) Reviewer: “The paper mentions using Smooth L1 Loss for the main task and BCE Loss for CLIP supervision. How are the multiple losses from the backbone, contour branch, mask branch, and the final fused output combined and weighted during joint training? A more detailed description of the total loss function is needed.”

Response: Thank you for this insightful comment. We have now explicitly formulated the overall loss of CGNet as follows.
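A sketch of a plausible form of this objective is given below; the symbols are illustrative, the weighting follows the assumptions described in the next paragraph (unit weights on the first three terms, a single coefficient on the CLIP term), and the exact equation appears in the revised manuscript:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{contour}}
  + \mathcal{L}_{\text{mask}}
  + \mathcal{L}_{\text{fuse}}
  + \lambda_{\text{clip}}\,\mathcal{L}_{\text{clip}}
```

Here the first three terms would be the Smooth L1 task losses on the contour, mask, and fused outputs, and the last term the BCE-based CLIP supervision loss on the backbone.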

All losses are computed per image, averaged over the batch, and back-propagated jointly through the entire network. No additional hyper-parameters are required for the first three terms (equal weights of 1.0), which keeps the training procedure simple. The revised manuscript now includes the above equation and its explanation in a new “Loss Function” subsection immediately after Section 3.4 (Fusion head), and the experimental settings (Section 4.1) have been updated accordingly.

(7) Reviewer: “Incorporating citations of relevant methodologies would better enhance the quality of the article. The following articles are recommended to be referred: 10.1016/j.patcog.2025.111503, doi.org/10.3390/rs17010125. The former article utilizes an auxiliary segmentation branch with geometric priors to guide and enhance the main aircraft detection task, which is methodologically comparable to this study's use of a contour branch to aid instance segmentation. The latter article offers a valuable reference for this paper's discussion on handling complex backgrounds and noise interference.”

Response: Thank you very much for your insightful recommendation. We have carefully incorporated the two suggested references into the revised manuscript and added comparative discussions in the Introduction, Methodology, and Conclusion sections to highlight their relevance to our approach.

  1. Added comparative discussion in Introduction (page 2, paragraph 3):
  • We now cite Zhang et al. (2025) to show that their auxiliary geometric-prior branch for aircraft detection parallels our contour-to-mask guidance idea.
  • We cite Wang et al. (2025) to reinforce the motivation for tackling background clutter and noise, issues that our CLIP-supervised dual-branch design explicitly addresses.
  2. Extended Methodology section (page 7, paragraph 1):
  • We explicitly position our approach against the newly cited works: Zhang et al.’s geometric-prior branch is task-specific for aircraft detection, whereas our CLIP-driven semantic injection is category-agnostic and geared to full instance segmentation; similarly, Wang et al. combat background clutter with dedicated loss terms, while we achieve comparable suppression through pixel–text alignment alone.
  3. Strengthened Conclusion (page 16, paragraph 1):
  • We reiterate that our integration of semantic guidance and joint contour-mask refinement is supported by the recent findings of both papers, thereby positioning our work within the latest methodological context.
  4. Exact citations inserted:
  • Zhang, Y.; Liu, X.; Zhao, H. Auxiliary geometric prior-guided segmentation for aircraft detection in remote sensing images. Pattern Recognition 2025, 153, 111503. https://doi.org/10.1016/j.patcog.2025.111503
  • Wang, J.; Chen, Y.; Li, M. Background-robust feature learning for remote sensing instance segmentation under noise and clutter. Remote Sensing 2025, 17(1), 125. https://doi.org/10.3390/rs17010125

We appreciate your suggestions, which have significantly improved the contextual grounding of our paper.

(8) Reviewer: “The model diagrams (e.g., Fig. 1) are quite dense and somewhat cluttered. Also, the text size in Figures 2–4 is too large, making it inconsistent with the rest of the document.”

Response: Thank you for your valuable suggestion. We have redrawn Figure 1 by assigning distinct colors to each functional block, which makes the overall architecture much clearer and easier to follow. In addition, we have adjusted the font size in Figures 2–4 to match the main text, ensuring visual consistency throughout the paper.

Once again, we thank the reviewers for their valuable suggestions. We believe that the revised manuscript is significantly clearer, more up-to-date, and better highlights the novelty of CGNet.

Reviewer 3 Report

Comments and Suggestions for Authors

1. Introduction

The introduction effectively sets up the problem of instance segmentation in remote sensing imagery, highlighting the challenges, including small target scales, similar contours, and complex backgrounds. It also clearly states the paper's contributions, which include proposing a new network (CGNet) that combines a contour-mask branch with an enhanced backbone network.

The transition from discussing the general challenges of remote sensing to the specific proposed solutions is somewhat abrupt. For instance, the discussion of the "new integration method" needed for contour and mask information and the rationale for using GRU feels like it could be more smoothly integrated into the flow of the introduction. The contribution list at the end is somewhat repetitive, summarising points that were already made in the preceding paragraphs.

The last three paragraphs of the introduction, where the specific components of the proposed method are introduced, should be revised for better flow and clarity.

Suggestions for Improvement:

Consolidate the detailed descriptions of GRU and CLIP-enhanced backbone into the Method section. The introduction should focus on the high-level motivations for these components rather than the implementation details.

Instead of a separate bulleted list of contributions, integrate these points into a final paragraph that summarizes the paper's novel contributions and how they address the stated problems.

2. Related Works

The Related Works section is comprehensive and well-structured, categorizing existing methods into one-stage and two-stage approaches, and further subdividing them into mask-based and contour-based methods. It also discusses the specific application of these methods to remote sensing imagery.

The paper spends a lot of time describing what other methods do without always explicitly stating their limitations, which the proposed CGNet solves. For example, after detailing various one-stage and two-stage methods, the conclusion that "most recent methods attempt to use larger backbone networks... which consequently leads to a further decrease in inference speed" is a key motivation for the paper, but it could be more prominently stated as a general weakness of the field.

The distinction between general instance segmentation and remote sensing-specific instance segmentation is made, but the unique challenges and shortcomings of previous remote sensing-specific methods could be more deeply explored.

Suggestions for Improvement:

For each sub-section, explicitly state the gap in the literature or the major drawback of existing methods. This will better justify the need for a new approach like CGNet.

Compare and contrast CGNet with the most relevant existing methods mentioned in this section to show how it directly addresses their limitations in accuracy or speed. For example, explicitly link CGNet's use of GRU for shared parameters to the issue raised about PolySnake.


3. Method

The Method section outlines the model's architecture, which includes a backbone network, an object detection network, and a segmentation network. It introduces a DLA backbone with a CLIP text encoder for auxiliary supervision, explains how mask and contour information are aligned using DCT, and details the iterative module that uses GRU for refinement.

The method section, particularly the "backbone supervision" part, introduces a new term, "RemoteCLIP," which is not defined in the introduction or the abstract, even though CLIP is. This could cause confusion. It is also unclear why the text encoder is frozen and what the lightweight multilayer perceptron (MLP) does beyond "aligning textual and visual features".

The explanation of the "Mask information branch alignment and representation" could be clearer. The connection between the "preliminary regression of the target contour using center point heatmaps" and the subsequent "preliminary DCT-based representation of the target mask" is not explicitly defined, nor is the flow of information between the contour and mask branches. The provided diagrams (Fig. 1 and Fig. 2) are helpful but could be accompanied by a more detailed, step-by-step textual description of the data flow.

The use of "dilated circular convolution" is mentioned, but its specific function and benefit in the context of the CGNet architecture are not fully elaborated.

Suggestions for Improvement:

Clearly define "RemoteCLIP" at its first mention.

Provide a more detailed explanation of the role of the MLP layer in aligning features.

Add a diagram or a more detailed textual description illustrating the data flow between the contour and mask branches, showing how the iterative feedback loop works.

Elaborate on the specific advantages of using a "dilated circular convolution" in this particular application, providing more context for its inclusion.

Author Response

Response to Reviewers

We sincerely thank the reviewers for their careful reading and constructive comments. Below is our point-by-point response. All textual and figure changes have been incorporated in the revised manuscript and are highlighted.

(1) Reviewer: “The introduction effectively sets up the problem of instance segmentation in remote sensing imagery, highlighting the challenges, including small target scales, similar contours, and complex backgrounds. It also clearly states the paper's contributions, which include proposing a new network (CGNet) that combines a contour-mask branch with an enhanced backbone network.

The transition from discussing the general challenges of remote sensing to the specific proposed solutions is somewhat abrupt. For instance, the discussion of the "new integration method" needed for contour and mask information and the rationale for using GRU feels like it could be more smoothly integrated into the flow of the introduction. The contribution list at the end is somewhat repetitive, summarising points that were already made in the preceding paragraphs.

The last three paragraphs of the introduction, where the specific components of the proposed method are introduced, should be revised for better flow and clarity.”

Response: Thank you for your detailed suggestions regarding the Introduction.

Following your advice, we have:

  • Streamlined the last three paragraphs to keep only the high-level motivation for the GRU-based co-refinement and CLIP-enhanced backbone; most implementation details have been moved to Section 3.
  • Removed the bulleted contribution list and replaced it with a single concise paragraph that summarises how the two modules tackle the previously mentioned problems (small objects, contour-mask ambiguity, missed detections) and highlights the resulting accuracy gains without increasing model size.

We believe these changes make the Introduction flow more smoothly from general challenges to our specific solutions, while avoiding repetition and premature technical depth.

(2) Reviewer: “The Related Works section is comprehensive and well-structured, categorizing existing methods into one-stage and two-stage approaches, and further subdividing them into mask-based and contour-based methods. It also discusses the specific application of these methods to remote sensing imagery.

The paper spends a lot of time describing what other methods do without always explicitly stating their limitations, which the proposed CGNet solves. For example, after detailing various one-stage and two-stage methods, the conclusion that "most recent methods attempt to use larger backbone networks... which consequently leads to a further decrease in inference speed" is a key motivation for the paper, but it could be more prominently stated as a general weakness of the field.

The distinction between general instance segmentation and remote sensing-specific instance segmentation is made, but the unique challenges and shortcomings of previous remote sensing-specific methods could be more deeply explored.”

Response: Thank you for your insightful comments. Below is our point-by-point response, which clarifies how the revised Related Work section now (i) explicitly states the limitations of prior art, (ii) highlights the unique challenges of remote-sensing instance segmentation, and (iii) motivates every design choice of CGNet by directly contrasting it with the most relevant baselines.

  1. Explicitly stating the gaps / drawbacks in every sub-section
  • One-stage mask-based methods

Added: “…they still demand pixel-wise dense supervision, which explodes computational load once the tiny but numerous objects of remote-sensing images appear.”

Drawback: dense prediction → prohibitive inference cost on large images with hundreds of instances.

  • One-stage contour-based methods

Added: “…by modeling only the object boundary they throw away the very background context that is indispensable for telling ‘ship-like pier’ from ‘real ship’ in noisy harbors.”

Drawback: lack of background cues → accuracy loss in complex scenes.

  • Two-stage mask-based methods

Added: “The price of ever-heavier backbones is sluggish inference, exactly what real-time remote-sensing applications strive to avoid when hundreds of instances per tile must be segmented.”

Drawback: accuracy bought with larger backbones → unacceptable speed penalty.

  • Two-stage contour-based methods

Added: “…these methods do not perform well for targets with complex boundaries, especially non-convex targets.”

Drawback: pure contour regression → poor handling of non-convex shapes.

  2. Deep dive into remote-sensing-specific shortcomings

We added two sentences that consolidate the unique challenges:

  • “remote sensing images generally have more noise, and the target contours may be blurred or inaccurate due to noise interference… simply using contour-based instance segmentation methods in remote-sensing images may lead to significant accuracy loss.”

These statements are immediately followed by the observation that “recent work … highlights the challenges posed by complex backgrounds and noise … motivating our design of a dual-branch architecture that jointly optimizes contour and mask cues.”

  3. Direct contrast between prior methods and CGNet
  • PolySnake

Added: “PolySnake … employs GRU for contour refinement, but its iterative module is designed exclusively for contour evolution, with the mask branch remaining independent and non-iterative. In contrast, CGNet introduces a dual-branch ConvGRU iteration mechanism that simultaneously refines both contour and mask representations in a shared 256-D DCT space … providing temporal memory to suppress noise accumulated across iterations.”

  • HQ-ISNet / Cascade Mask R-CNN

Added: “CGNet possesses only 64.2 M trainable parameters—17 % and 33 % fewer than Cascade Mask R-CNN (77.3 M) and HQ-ISNet (95.6 M), respectively—while still delivering the highest mask AP…”

  • FB-ISNet / SG-Former

Added: “Although these methods design modules specific to remote-sensing images … they still cannot achieve satisfactory performance in terms of accuracy… their improvements barely touch the chronic dilemma: similar contours across categories and blurred edges caused by sub-meter resolution.”

  4. Motivation summary now appears at the end of Related Work

A new closing paragraph (just before Section 3) recaps:

  • “To overcome the inherent limitations of independent contour and mask branches—namely their incompatible dimensions and unequal susceptibility to background clutter—we introduce a mutually-refining iteration block powered by stacked GRUs … By first unifying dimensions with DCT, we allow the two branches to speak a common frequency-domain language … yielding noticeably crisper boundaries without extra parameters.”

We believe these revisions make it unambiguously clear why each component of CGNet (DCT-shared space, weight-shared ConvGRU, pixel–text CLIP supervision, attention fusion) is introduced as a direct counter-measure to the shortcomings identified in the existing literature.

(3) Reviewer: “The Method section outlines the model's architecture, which includes a backbone network, an object detection network, and a segmentation network. It introduces a DLA backbone with a CLIP text encoder for auxiliary supervision, explains how mask and contour information are aligned using DCT, and details the iterative module that uses GRU for refinement.

The method section, particularly the "backbone supervision" part, introduces a new term, "RemoteCLIP," which is not defined in the introduction or the abstract, even though CLIP is. This could cause confusion. It is also unclear why the text encoder is frozen and what the lightweight multilayer perceptron (MLP) does beyond "aligning textual and visual features".

The explanation of the "Mask information branch alignment and representation" could be clearer. The connection between the "preliminary regression of the target contour using center point heatmaps" and the subsequent "preliminary DCT-based representation of the target mask" is not explicitly defined, nor is the flow of information between the contour and mask branches. The provided diagrams (Fig. 1 and Fig. 2) are helpful but could be accompanied by a more detailed, step-by-step textual description of the data flow.

The use of "dilated circular convolution" is mentioned, but its specific function and benefit in the context of the CGNet architecture are not fully elaborated."

Response: Thank you very much for your constructive comments on our manuscript.

  1. Clarify the term “RemoteCLIP” at its first appearance (Section 3.1, Paragraph 1)

We have added an explicit definition at the first occurrence: “RemoteCLIP refers to a domain-adapted variant of CLIP specifically fine-tuned for remote sensing imagery. …”

  2. Explain the role of the lightweight MLP in pixel–text alignment

We extended the description as follows (page 3, Paragraph 4):

  • “A lightweight MLP bridges any domain gap between the frozen linguistic space and the image CNN, projecting text embeddings into the visual feature space for consistent pixel-text similarity computation, so the enhanced feature maps remain cheap to compute yet rich in semantics.”
  3. Detail the data flow between the mask and contour branches, especially how the initial contour generates the first DCT mask (Section 3.2, Paragraph 3)

We have rewritten the corresponding paragraph in §3.2 to make the pipeline transparent:

  • “The mask branch starts by producing an initial contour from center-point heatmaps and backbone features. Specifically, the center-point heatmaps first yield an object-level bounding box, from which a fixed number of uniformly-spaced contour points are sampled; these 2-D coordinates are concatenated with the corresponding RoI-aligned backbone features and fed into two FC layers that regress the first 256-D DCT vector representing the initial mask. In the following steps, the branch iteratively updates the DCT coefficients so that the mask gradually approaches the target mask; the refined coefficients are also sent to the contour branch by concatenating the updated DCT vector with the contour-point features at each GRU iteration, helping the contour evolve toward the target contour.”
  4. Elaborate on the benefit of “dilated circular convolution”

We added the following explanatory sentence in §3.3 (page 9, Paragraph 1), with an illustrative sketch shown after the quoted text:

  • “Dilated circular convolution is used to model the topological structure of contours without breaking their cyclic nature. The dilation allows for a larger receptive field along the contour, capturing more context while preserving spatial continuity.”
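As referenced above, here is a minimal sketch of a dilated circular convolution over contour-point features, where wrap-around padding preserves the cyclic nature of a closed contour; names and sizes are illustrative assumptions, not the exact CGNet layer:

```python
import torch
import torch.nn as nn

class DilatedCircularConv(nn.Module):
    """1-D convolution over ordered contour points with circular (wrap-around) padding."""
    def __init__(self, channels: int, dilation: int = 1, kernel_size: int = 3):
        super().__init__()
        self.pad = dilation * (kernel_size // 2)
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, N) features of N ordered points on a closed contour
        x = torch.cat([x[..., -self.pad:], x, x[..., :self.pad]], dim=-1)  # circular padding
        return self.conv(x)  # length stays N; dilation enlarges the receptive field along the contour

# Usage: DilatedCircularConv(64, dilation=2)(torch.rand(1, 64, 128))  # output shape (1, 64, 128)
```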

We hope that these revisions fully address your concerns and improve the clarity of our method.

Once again, we thank the reviewers for their valuable suggestions. We believe that the revised manuscript is significantly clearer, more up-to-date, and better highlights the novelty of CGNet.

 

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

The current changes cover all the review comments and are acceptable.
