Article
Peer-Review Record

Dynamic Mutual Adversarial Learning for Semi-Supervised Semantic Segmentation of Underwater Images with Limited and Noisy Annotations

J. Mar. Sci. Eng. 2025, 13(12), 2334; https://doi.org/10.3390/jmse13122334
by Han Chen 1, Ming Li 2, Yancheng Liu 1, Jingchun Zhou 3, Xianping Fu 3, Siyuan Liu 1,* and Fei Richard Yu 2
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 15 October 2025 / Revised: 5 November 2025 / Accepted: 26 November 2025 / Published: 8 December 2025
(This article belongs to the Section Ocean Engineering)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper addresses an important problem: semantic segmentation of underwater images with limited and noisy annotations. The proposed DMAS framework is novel, combining adversarial pre-training with dynamic mutual learning. Experimental validation on DUT and SUIM datasets is comprehensive and shows clear improvements over several baselines. The methodology is described in detail, with algorithmic steps and mathematical formulations provided.

My main comment concerns the noisy annotations claim. The experimental setup does not use noisy ground-truth labels; it validates the method's success with limited data, not with noisy initial annotations. I think the authors must either conduct new experiments on a benchmark with known label noise or revise the paper's claims, title, and abstract to focus solely on the "limited annotations" aspect.

Dynamic Mutual Learning is poorly explained and needs revising. Algorithm 1 is a bit confusing. In the semi-supervised training part, L_pre takes X_all, but in the text L_pre takes X_l. Figure 1 does not fully match the explanations in the text either.

Some minor comments on improving the presentation of the paper

-Revise the abstract and introduction for conciseness and clarity.
-Improve figure readability (especially framework diagrams).
-Provide sensitivity analysis for key hyperparameters.
-Add comparisons with more recent semi‑supervised segmentation methods.
-Discuss computational complexity and training stability in more detail.
-Proofread thoroughly for grammar and style.

Author Response

Comments 1: My main comment concerns the noisy annotations claim. The experimental setup does not use noisy ground-truth labels; it validates the method's success with limited data, not with noisy initial annotations. I think the authors must either conduct new experiments on a benchmark with known label noise or revise the paper's claims, title, and abstract to focus solely on the "limited annotations" aspect.

Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have conducted new experiments involving noisy annotations by randomly introducing label noise (including partial label missing and incomplete labels) into the training set to verify the method’s adaptability to noisy initial annotations, rather than revising the paper’s claims, title, or abstract. This change can be found in the revised manuscript in Section 4.1 "Dataset" on page 9, paragraph 2, lines 305-307. “[To verify the method’s adaptability to noisy annotations, label noise including partial label missing and incomplete labels was randomly introduced into the training set.]”
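For illustration of the noise-injection procedure described above, a minimal Python sketch is given below. The function name and the drop_prob, erode_frac, and ignore_index parameters are assumptions made only for this sketch and are not taken from our implementation.

```python
import numpy as np

def inject_label_noise(mask, ignore_index=255, drop_prob=0.3, erode_frac=0.2, rng=None):
    """Illustrative label-noise injection for an (H, W) integer class mask.

    Two noise types loosely matching the description in the response:
    - partial label missing: with probability drop_prob, every pixel of a class
      is set to ignore_index, so the annotation for that class disappears;
    - incomplete labels: otherwise, a random fraction erode_frac of the class's
      pixels is set to ignore_index, leaving holes in the annotation.
    """
    rng = rng or np.random.default_rng()
    noisy = mask.copy()
    for cls in np.unique(mask):
        if cls == ignore_index:
            continue
        cls_idx = np.flatnonzero(noisy == cls)
        if rng.random() < drop_prob:
            noisy.flat[cls_idx] = ignore_index            # partial label missing
        else:
            drop = rng.random(cls_idx.size) < erode_frac
            noisy.flat[cls_idx[drop]] = ignore_index      # incomplete labels
    return noisy
```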

 

Comments 2: Dynamic Mutual Learning is poorly explained and needs revising. Algorithm 1 is a bit confusing. In the semi-supervised training part, L_pre takes X_all, but in the text L_pre takes X_l. Figure 1 does not fully match the explanations in the text either.

Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have revised the explanation of Dynamic Mutual Learning (DML) as follows. First, we clearly define the four components of datasets A and B (initial labeled data, self-generated pseudo-labeled data, cross-supervised pseudo-labels, and remaining unlabeled data). Second, we add Algorithm 2 to explicitly describe the dual-model mutual optimization process of DML. Third, we resolve the contradiction between L_pre's input in the text and in Algorithm 1 by explaining that L_pre uses X_all (I_l ∪ I_u) instead of only X_l, so as to leverage the discriminator's probability evaluation for all data. Finally, we establish the connection between Figure 1 and the text by specifying that the "Re-labeled Datasets A/B" in Figure 1 correspond to the training data in the DML stage and that the "Mutual Training" module corresponds to the model update logic using each other's pseudo-labels. These changes can be found in the revised manuscript in Section 3.3 "Dynamic Mutual Learning" on pages 7-8, paragraphs 1-2 (lines 227-253) for DML's dataset composition and the Figure 1 connection, Algorithm 2 on page 8 (lines 233-251) for the detailed DML process, and Section 3.3.1 "Dynamic Mutual Iterative Framework" on page 8, paragraph 1 (lines 240-253) for resolving the L_pre input contradiction. “[Dataset i ∈ {A, B} consists of four components: 1) initial labeled data I_l; 2) re-labeled pseudo-labeled data I_{ps}^i generated by the model itself in previous iterations; 3) cross-supervised pseudo-labels y_{ps}^j iteratively updated from the other model S^j(·) (j ∈ {A, B}, j ≠ i); 4) remaining unlabeled data I_u^i (after excluding samples used for I_{ps}^i). This composition aligns with the two-stage workflow of Figure 1, where Datasets A and B respectively correspond to the "Re-labeled Datasets A/B" output from the adversarial pre-training stage and participate in mutual training between Models A and B.]” “[This framework is consistent with Algorithm 1 and extends it to dual-model mutual optimization, as detailed in Algorithm 2. Algorithm 2 explicitly reflects the dual-model mutual optimization logic of DML: Steps 1–2 complete the update of S^B(·) using pseudo-labels from S^A(·), while Steps 3–4 symmetrically update S^A(·) using pseudo-labels from S^B(·).]”
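To make the dual-model mutual optimization easier to follow, a minimal PyTorch-style sketch of one mutual-training round is shown below. The loader, optimizer, and function names are placeholders assumed for illustration; the sketch is not our Algorithm 2, but it mirrors the update order described there (model A's pseudo-labels supervise model B, then the roles are swapped).

```python
import torch

def dml_round(model_a, model_b, labeled_loader, unlabeled_loader,
              opt_a, opt_b, criterion, device="cuda"):
    """Sketch of one round of dual-model mutual training (illustrative only)."""
    for (x_l, y_l), x_u in zip(labeled_loader, unlabeled_loader):
        x_l, y_l, x_u = x_l.to(device), y_l.to(device), x_u.to(device)

        # cross-supervised pseudo-labels; no gradients flow through the "teacher"
        with torch.no_grad():
            y_ps_a = model_a(x_u).argmax(dim=1)   # pseudo-labels from model A
            y_ps_b = model_b(x_u).argmax(dim=1)   # pseudo-labels from model B

        # update model B with the initial labels and model A's pseudo-labels
        loss_b = criterion(model_b(x_l), y_l) + criterion(model_b(x_u), y_ps_a)
        opt_b.zero_grad(); loss_b.backward(); opt_b.step()

        # symmetric update of model A with model B's pseudo-labels
        loss_a = criterion(model_a(x_l), y_l) + criterion(model_a(x_u), y_ps_b)
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()
```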

Reviewer 2 Report

Comments and Suggestions for Authors

The DMAS framework is a smart piece of work. I particularly like how you've set up the dual-model architecture where each model essentially keeps the other honest; it's a natural extension of mutual learning, but the way you integrate adversarial pre-training with confidence-based pseudo-label selection feels fresh. The experimental setup covers the bases well: good choice of datasets, testing across multiple label ratios shows real data efficiency, and you've compared against relevant baselines. The dynamic reweighting loss is probably the strongest technical contribution here: downweighting pixels where the models disagree heavily makes complete sense for filtering out garbage pseudo-labels, and your ablation studies back this up with real gains. The math is clean, implementation details look reproducible, and you've done your homework on related work.

A few things worth addressing: you mention instability in the mutual learning convergence but don't really dig into it. Can you provide some convergence curves or analysis of what's happening during training? How sensitive is this to initialization choices? Also, while I appreciate the honesty about limitations, the discussion of failure cases feels a bit thin - the visual results show improvements but there's clearly still room for growth on concealed objects. More importantly, please consider releasing code and models. This kind of work is way more useful to the community if people can actually build on it rather than reimplementing from scratch. On the future work front, you mention multi-modal fusion and multi-scale features, which makes sense, but I'm also curious how much headroom you'd get from just swapping in a more modern backbone. The datasets and ratios you tested are fine, but have you looked at how this performs with really noisy labels versus just sparse ones? Overall this is strong work that addresses a real problem, just needs a bit more depth on the practical aspects.

Author Response

Comments 1: you mention instability in the mutual learning convergence but don't really dig into it. Can you provide some convergence curves or analysis of what's happening during training? How sensitive is this to initialization choices?

Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have added convergence curves, training loss fluctuation trends, and mIoU fluctuation distributions under different initialization schemes to analyze the convergence instability of mutual learning, verifying that DMAS converges faster, exhibits smaller loss fluctuations, and is less sensitive to initialization choices than the single-model baseline. This change can be found in the revised manuscript in Section 4.5 "Ablation Study" on page 15, Figure 8 and its corresponding description (lines 467-472). “[To further elaborate on framework stability, evidence from controlled experiments on the DUT dataset is visualized in Fig. 8. The figure includes training loss fluctuation trends and mPA and mIoU fluctuation distributions across different schemes, reflecting DMAS's superior stability on DUT. DMAS stabilizes earlier and no longer improves with additional iterations, with its convergence speed outperforming the single-model baseline.]”

 

Comments 2: while I appreciate the honesty about limitations, the discussion of failure cases feels a bit thin - the visual results show improvements but there's clearly still room for growth on concealed objects.

Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have supplemented the discussion of failure cases by specifically analyzing the segmentation challenges of concealed objects (e.g., starfish obscured by aquatic plants, human divers hidden behind reefs) and low-contrast targets, and by explaining the underlying reasons (feature overlap leading to low discriminator confidence, and small feature gradients amplifying model divergence), making the discussion more detailed. This change can be found in the revised manuscript in Section 4.4.2 "Qualitative Analysis" on page 12, paragraphs 2-3 (lines 369-383). “[Specifically, for concealed targets (starfish obscured by aquatic plants in the first row of Fig. 2, human divers hidden behind reefs in the tenth row of Fig. 3), DMAS only achieves relatively low coverage of occluded regions. This is attributed to feature overlap between occluders and targets, which causes the discriminator D to generate probability maps with low confidence, leading the dynamic reweighting loss L_DR to excessively downweight valid pixels and suppress the learning of occluded target features. For low-contrast targets (e.g., in Fig. 3), the boundary pixel error rate is relatively high, resulting from small feature gradients that amplify divergence between the two segmentation models S^A and S^B (i.e., ŷ_{ps}^A ≠ ŷ_{ps}^B), prompting the dynamic mutual learning mechanism to reduce the weight of boundary pixels and limit the acquisition of fine-grained boundary features.]”
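As a reading aid, a minimal sketch of the pixel-wise downweighting idea behind L_DR is given below. The construction of the confidence map from the discriminator output or inter-model agreement is an assumption made for illustration; the functions are sketches in the spirit of L_DR, not the exact loss in the manuscript.

```python
import torch
import torch.nn.functional as F

def agreement_weight(probs_a, probs_b):
    """Hypothetical weight map: pixels where the two models agree get weight 1, else 0."""
    return (probs_a.argmax(dim=1) == probs_b.argmax(dim=1)).float()

def dynamic_reweighted_ce(logits, pseudo_labels, confidence, ignore_index=255):
    """Sketch of a pixel-wise reweighted cross-entropy in the spirit of L_DR.

    confidence is a (B, H, W) map in [0, 1], assumed to combine the discriminator's
    probability output and/or inter-model agreement; low-confidence (likely
    erroneous) pixels contribute less to the loss.
    """
    per_pixel = F.cross_entropy(logits, pseudo_labels,
                                ignore_index=ignore_index, reduction="none")
    valid = (pseudo_labels != ignore_index).float()
    return (per_pixel * confidence * valid).sum() / valid.sum().clamp(min=1.0)
```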

 

Comments 3: More importantly, please consider releasing code and models. This kind of work is way more useful to the community if people can actually build on it rather than reimplementing from scratch.

Response 3: Thank you for pointing this out. We agree with this comment. Therefore, we plan to release the code and pre-trained models of the DMAS framework on a public code repository (e.g., GitHub) after the paper is accepted, to facilitate reproducibility and further development by the community. This statement has been added in the revised manuscript in Section 5 "Conclusion" on page 17, paragraph 2 (lines 538-540). “[To promote the reproducibility and application of this work, the code and pre-trained models of the DMAS framework will be released on a public GitHub repository after the paper is formally accepted.]”

 

Comments 4: The datasets and ratios you tested are fine, but have you looked at how this performs with really noisy labels versus just sparse ones?

Response 4: Thank you for pointing this out. We agree with this comment. Therefore, we have added experiments with noisy labels (partial label missing and incomplete labels) in the training set to verify the method's performance under both sparse and noisy label scenarios; the results show that DMAS maintains stable performance even with noisy labels, complementing the previous sparse-label experiments. This change can be found in the revised manuscript in Section 4.1 "Dataset" on page 9, paragraph 2 (lines 305-307) and Section 4.4.1 "Quantitative Analysis" on page 11, paragraph 2 (lines 340-343). “[To verify the method's adaptability to noisy annotations, label noise including partial label missing and incomplete labels was randomly introduced into the training set.]” “[Notably, DMAS maintains a consistently high performance regardless of reduced training sample sizes or the presence of label noise. For instance, at a labeled ratio of 0.125 with noise, DMAS still achieves mIoU and mPA metrics comparable to those of fully supervised methods without noise, confirming its robustness to both sparse and noisy labels.]”

Reviewer 3 Report

Comments and Suggestions for Authors

The paper introduces a new scheme called DMAS, which is used to address the challenges of limited and noisy annotations in underwater image semantic segmentation. The framework consists of a dual-model configuration that allows for cross-checking and correction of mislabeling, and a dynamic reweighting loss (L_DR) that improves precision by assigning lower weights to potentially erroneous pixels. The proposed scheme's performance is supported by both quantitative and qualitative results on the DUT and SUIM datasets, with metrics comparable to or better than fully supervised and other semi-supervised methods. However, some points should be addressed: the paper lacks essential detail in the ablation study, which should clarify the contribution of each component and provide a more detailed discussion of the stability of the DMAS framework during training, given the interdependence among the models. The paper also mentions the delineation of the confidence map, which is facilitated by adversarial networks and the discriminator's ability to differentiate between real and predicted label maps. The authors could also address the complexity of their scheme.

Author Response

Comments 1: The paper lacks essential detail in the ablation study, which should clarify the contribution of each component and provide a more detailed discussion of the stability of the DMAS framework during training, given the interdependence among the models.

Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have designed four controlled variants (Baseline, Baseline+AP, Baseline+DML, Full DMAS) in the ablation study to quantify the performance contribution of each component: adversarial pre-training (AP), dynamic mutual learning (DML), and the dynamic reweighting loss (L_DR). We have also supplemented the discussion on framework stability by analyzing how AP reduces DML's error detection burden and how L_DR mitigates the instability caused by model interdependence, supported by the convergence curves in Figure 8. These changes can be found in the revised manuscript in Section 4.5 "Ablation Study" on pages 15-16, Table 1 and its corresponding analysis (lines 440-466) for the component contributions, and Section 4.6 "Discussion" on page 16, paragraph 2 (lines 496-500) for the stability discussion. “[Table 1. Performance Contribution of DMAS Core Components on DUT and SUIM Datasets (Labeled Ratio = 0.125). Values in parentheses indicate absolute performance gain relative to the Baseline; Avg. Gain represents the average of mIoU and mPA gains across both datasets.]” “[First, Adversarial Pre-training (AP) contributes a modest but critical 1.8% average gain, which comes from the discriminator's ability to distinguish real label maps from model predictions, reducing noise in initial pseudo-labels and laying a reliable foundation for subsequent semi-supervised learning on DUT. Second, Dynamic Mutual Learning (DML) drives the most significant individual gain of 3.4% on average, as inter-model divergence effectively identifies pseudo-label errors that single-model self-training cannot detect, enabling more efficient utilization of unlabeled DUT data. Third, the dynamic reweighting loss L_DR adds a further 1.4% average gain to Full DMAS, as its pixel-wise weight adjustment suppresses residual errors from DML on DUT, confirming its role as a “fine-grained error filter.” Notably, Full DMAS exhibits a synergistic effect on DUT: its 6.6% average gain equals the sum of individual component gains, which is attributed to AP's high-quality pseudo-labels reducing DML's error detection burden and L_DR mitigating instability caused by inter-model interdependence, leading to super-additive performance improvement.]”

 

Comments 2: The paper also mentions the delineation of the confidence map, which is facilitated by adversarial networks and the discriminator's ability to differentiate between real and predicted label maps. The authors could also address the complexity of their scheme.

Response 2: Thank you for pointing this out. We agree with this comment. Therefore, we have supplemented the detailed delineation process of the confidence map and quantified the computational complexity of DMAS (trainable parameters, training time, inference speed) by comparing it with the single-model baseline, explaining why the increase in complexity is justified. These changes can be found in the revised manuscript in Section 3.2.2 "Semi-supervised Training" on pages 6-7, paragraphs 2-3 (lines 189-210) for the confidence map delineation, and Section 4.5 "Ablation Study" on page 16, Table 2 and its corresponding analysis (lines 473-486) for the complexity discussion. “[The confidence map serves as a pixel-level reliability filter in the adversarial pre-training stage, directly linking the discriminator's probability evaluation to pseudo-label refinement. Its generation strictly follows the GAN interaction logic between the generator S^i(·) and discriminator D(·), and complies with established threshold and function definitions. After fully supervised training, the discriminator outputs a pixel-wise probability map for any input image. Each pixel value in this map represents the similarity between the corresponding segmentation result and real label maps; a higher value indicates a lower prediction error risk. This probability map undergoes pixel-wise normalization and is used directly as the raw input for confidence map generation, with no additional processing required. To convert the continuous probability map into a binary reliability indicator, a threshold T_semi = 0.2 and an indicator function are adopted.]” “[Table 2. Computational Complexity Comparison Between DMAS and Single-Model Baseline. DMAS has 85.6 million trainable parameters, approximately twice the baseline's 42.0 million; this increase stems from the dual segmentation networks and the fully convolutional discriminator required for adversarial pre-training. DMAS requires 28 hours of training, longer than the baseline's 15 hours, due to the additional computational load from its multi-component design. However, this overhead is justified by its performance gains: the synergistic AP+DML mechanism optimizes unlabeled data utilization, avoiding redundant iterations common in single-model pseudo-label training. For inference, which is critical for real-time underwater robot perception, DMAS achieves 18.3 FPS on 512×512 images, slightly lower than the baseline's 19.5 FPS but still meeting the 15 FPS practical requirement for underwater perception systems.]”
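To illustrate the thresholding step described above, a minimal sketch follows. The discriminator interface (a per-pixel score map passed through a sigmoid) is an assumption made for this sketch; T_semi = 0.2 is the value stated in the response.

```python
import torch

def confidence_mask(discriminator, seg_probs, t_semi=0.2):
    """Sketch of confidence-map generation via the indicator 1[p > T_semi].

    The discriminator is assumed to map a predicted label map (B, C, H, W) to a
    per-pixel score; higher scores mean the prediction looks more like a real
    annotation. Thresholding at t_semi yields the binary reliability mask used
    to select trustworthy pseudo-label pixels.
    """
    with torch.no_grad():
        prob_map = torch.sigmoid(discriminator(seg_probs))  # (B, 1, H, W), values in [0, 1]
    return (prob_map > t_semi).float()                       # binary confidence map
```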

Round 2

Reviewer 3 Report

Comments and Suggestions for Authors

This version is much better. 
