Review Reports
- Wenxuan Chen 1,
- Yongliang Wei 1,* and
- Xiangyi Chen 2
Reviewer 1: Anonymous; Reviewer 2: Simona Moldovanu; Reviewer 3: Zobeir Raisi
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The accurate identification of whitecap coverage in videos under dynamic marine conditions is a challenging task. This study achieved precise segmentation of whitecap coverage in complex marine environments by constructing the dual-end enhanced EMA-SE-ResUNet model, which integrates the ResNet-50 residual network with a U-Net encoder-decoder architecture. It can provide highly reliable technical support for studies on air-sea flux quantification and marine aerosol generation. The introduction provides sufficient background, the research design is appropriate, the methods are adequately described, the results are clearly presented, and the conclusions are supported by the results. However, the following specific problems should be addressed.
(1) Line 140: This paragraph should be deleted. Please check.
(2) Line 161-162: This description contradicts the data in Table 1, which indicates that the EMA module has the best performance rather than the SENet module. Please check.
(3) The evaluation metrics in Table 1 and 2 are not marked with ↑ and ↓. Please check.
(4) The indicator of “FLOPS” in Table 1-4 should be corrected to “FLOPs”. Please check.
(5) Some parameters in Formula 5 are not explained clearly. Please check.
(6) There are some inconsistencies between the data descriptions in the text and the data in the table. Please correct them. Such as “57.72%” in line 363, “1.348M” in line 365, “72.33%” in line 369.
(7) In the title of Figure 5, it is stated that the second column represents ground truth labels, but the column of ground truth labels is missing in the figure.
(8) The subfigures in Figure 8 lack numbered labels, such as (a)-(e).
Comments for author File:
Comments.pdf
Author Response
Response to Reviewer #1
Comments 1: Line 140: This paragraph should be deleted. Please check.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have removed the paragraph at Line 140 as suggested.
Comments 2: Lines 161–162: This description contradicts the data in Table 1, which indicates that the EMA module has the best performance rather than the SENet module. Please check.
Response 2: Thank you for pointing this out. We agree with this comment. This has been revised to accurately reflect that the EMA module achieves the best performance, consistent with the results in Table 1.
Comments 3: The evaluation metrics in Table 1 and Table 2 are not marked with ↑ (higher is better) or ↓ (lower is better). Please check.
Response 3: Thank you for pointing this out. We agree with this comment. Therefore, we have added directional indicators (↑/↓) to all relevant evaluation metrics in Tables 1 and 2 to clarify the interpretation of each metric.
Comments 4: The indicator “FLOPS” in Tables 1–4 should be corrected to “FLOPs”. Please check.
Response 4: Thank you for pointing this out. We agree with this comment. Therefore, we have corrected the term to “FLOPs” (with proper capitalization and pluralization) throughout Tables 1 to 4.
Comments 5: Some parameters in Formula 5 are not explained clearly. Please check.
Response 5: Thank you for pointing this out. We agree with this comment. Therefore, we have provided explicit definitions for the missing parameters in Equation (5) in the revised manuscript.
Comments 6: There are inconsistencies between the data descriptions in the text and the values reported in the tables. Please correct them. Examples include “57.72%” in Line 363, “1.348M” in Line 365, and “72.33%” in Line 369.
Response 6: Thank you for pointing this out. We agree with this comment. Therefore, we have cross-checked all numerical values in the main text against the corresponding tables and corrected the inconsistencies at the specified locations.
Comments 7: In the title of Figure 5, it is stated that the second column represents ground truth labels, but this column is missing in the figure.
Response 7: Thank you for pointing this out. We agree with this comment. Therefore, we have revised the caption of Figure 5 to remove the incorrect reference to a ground truth column.
Comments 8: The subfigures in Figure 8 lack numbered labels, such as (a)–(e).
Response 8: Thank you for pointing this out. We agree with this comment. Therefore, we have added clear subfigure labels (a) through (e) to Figure 8 to improve readability and facilitate reference in the text.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors are asked to implement the following minor and major observations to improve the manuscript. These revisions will enhance the clarity and overall quality of the work.
Please explain the acronyms that first appear in the text, e.g., “EMA-SE-ResUNet.”
Please insert as the last paragraph in the introduction section the main contributions and objectives of the study. The main contributions of this study include the development of a novel architecture that enhances feature extraction and representation, along with a comprehensive evaluation of its performance against existing models, aiming to provide more accurate and efficient segmentation results.
In section 2.1, Data, the authors are asked to perform a complete description of the dataset; it can include the number of images for each subset: training, testing, and validation. This thorough description will help future researchers replicate the study and build upon its findings.
In the section “2.2. Dataset Generation,” it is unclear what set = 9:1:1 means.
In the “2.3. Evaluation Metrics” section the accuracy metric must be added; also, in the results section the values must be added. Clarifying these points will not only strengthen the methodology section but also improve the overall transparency of the research process.
In the “3.4. Training Environment” section, the version of the programming environment and the libraries used must be added. By providing these additional details, other researchers will be able to replicate the study and comprehend the context of the experiments. Furthermore, including this information will enhance the credibility of the findings and support future advancements in the field.
The comparison with state-of-the-art papers must be performed. This comparison will not only highlight the strengths and weaknesses of the current research but also provide a benchmark for future studies.
The limitations of the proposed study are missed. Addressing these limitations is crucial for a comprehensive understanding of the research's implications.
The Conclusion section must be added. Incorporating a well-structured conclusion will summarize the key findings and reinforce the significance of the research. Additionally, it will guide readers in understanding the broader impact of the study on ongoing developments in the field.
Author Response
Response to Reviewer #2
Comments 1: Please explain the acronyms that first appear in the text, e.g., “EMA-SE-ResUNet.”
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have ensured that all acronyms, including "EMA-SE-ResUNet", are fully expanded upon their first appearance in the manuscript.
Comments 2: Please insert as the last paragraph in the introduction section the main contributions and objectives of the study. The main contributions of this study include the development of a novel architecture that enhances feature extraction and representation, along with a comprehensive evaluation of its performance against existing models, aiming to provide more accurate and efficient segmentation results.
Response 2: Thank you for this suggestion. We agree with this comment. Therefore, we have inserted a new paragraph at the end of the Introduction section that lists the main contributions and objectives in itemized form.
Comments 3: In section 2.1, Data, the authors are asked to perform a complete description of the dataset; it can include the number of images for each subset: training, testing, and validation. This thorough description will help future researchers replicate the study and build upon its findings.
Response 3: Thank you for this comment. We agree and have accordingly revised the manuscript. A more detailed description of the dataset, including the exact number of images allocated to the training, validation, and test subsets, has been provided in Section 2.2 (“Dataset Generation”). [The exact counts for each subset are now explicitly stated in the revised manuscript.]
Comments 4: In the section “2.2. Dataset Generation,” it is unclear what “set = 9:1:1” means.
Response 4: Thank you for pointing this out. We agree with this comment. Therefore, we have clarified the meaning of "set = 9:1:1" by explicitly stating the exact counts for each subset in the revised manuscript to avoid any confusion.
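For full clarity, the ratio applies to the complete image set. A minimal sketch of such a 9:1:1 partition is given below; the function name, random seed, and file-list handling are illustrative placeholders rather than our exact preprocessing code.

```python
# Illustrative sketch only: partitioning an image list in a 9:1:1 ratio into
# training, validation, and test subsets. Paths, seed, and rounding are
# hypothetical, not the pipeline used in the study.
import random

def split_9_1_1(image_paths, seed=42):
    """Shuffle and partition a list of image paths in a 9:1:1 ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = round(n * 9 / 11)   # 9 parts out of 11
    n_val = round(n * 1 / 11)     # 1 part out of 11
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]  # remaining ~1 part
    return train, val, test

# For a 1,100-image dataset this would correspond to roughly 900/100/100 images.
```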
Comments 5: In the “2.3. Evaluation Metrics” section the accuracy metric must be added; also, in the results section the values must be added. Clarifying these points will not only strengthen the methodology section but also improve the overall transparency of the research process.
Response 5: We thank the reviewer for raising this important point. While accuracy is a commonly used metric, we respectfully note that it is not informative in the context of our task due to extreme class imbalance: whitecap pixels typically constitute less than 0.02% of the total image area. In such scenarios, a model that predicts all pixels as background (i.e., fails to detect any whitecaps) can still achieve an accuracy exceeding 99.5%, which is misleading and does not reflect actual segmentation performance.
Therefore, following best practices in imbalanced segmentation tasks, we prioritize metrics that are robust to class imbalance—namely, Intersection over Union for the whitecap class (IoUW), F1-score for the whitecap class (F1W), and the newly introduced Pixel Absolute Error (PAE), which directly quantifies the deviation in whitecap coverage estimation. These metrics align closely with the scientific goal of our study and provide a more meaningful assessment of model performance.
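For illustration, the sketch below shows how whitecap-class IoU and F1 can be computed from binary masks, together with a coverage-difference quantity standing in for PAE; the exact PAE formulation is the one defined in the manuscript, and this code is a generic example rather than our evaluation implementation.

```python
# Minimal sketch (not the paper's evaluation code): whitecap-class IoU and F1
# from binary masks, plus an illustrative PAE-like quantity defined here as the
# absolute difference in whitecap coverage fraction.
import numpy as np

def whitecap_metrics(pred, gt):
    """pred, gt: arrays of the same shape; True/1 marks a whitecap pixel."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou_w = tp / (tp + fp + fn + 1e-12)          # IoU for the whitecap class
    f1_w = 2 * tp / (2 * tp + fp + fn + 1e-12)   # F1 for the whitecap class
    pae = abs(pred.mean() - gt.mean())           # coverage-based absolute error
    return iou_w, f1_w, pae
```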
Comments 6: In the “3.4. Training Environment” section, the version of the programming environment and the libraries used must be added. By providing these additional details, other researchers will be able to replicate the study and comprehend the context of the experiments. Furthermore, including this information will enhance the credibility of the findings and support future advancements in the field.
Response 6: Thank you for this valuable suggestion. We agree and have accordingly added the specific software environment details: Python 3.10.15 and PyTorch 2.3.1+cu121. As recommended by another reviewer, this information has been moved to the beginning of the “Results and Analysis” section to improve manuscript flow and experimental reproducibility.
Comments 7: The comparison with state-of-the-art papers must be performed. This comparison will not only highlight the strengths and weaknesses of the current research but also provide a benchmark for future studies.
Response 7: We fully agree with the importance of benchmarking against state-of-the-art methods. However, to the best of our knowledge, there are currently few publicly available deep learning–based methods specifically designed for oceanic whitecap segmentation in the literature. Existing approaches primarily rely on classical image processing techniques (e.g., adaptive thresholding, optical flow, or infrared-visible fusion), which are often tailored to specific imaging conditions (e.g., lighting, sensor type, or viewing angle) and rarely accompanied by open-source code or sufficient implementation details for fair comparison.
Given these limitations, we have benchmarked our proposed EMA-SE-ResUNet against widely adopted and representative semantic segmentation architectures—including U-Net, DeepLabv3+, HRNet, and PSPNet—under identical experimental settings. This allows for a rigorous and fair evaluation of our model’s advantages within the general deep learning framework. We acknowledge this as a current gap in the field and plan to incorporate comparisons with emerging whitecap-specific methods as the research community develops standardized benchmarks and open datasets.
Comments 8: The limitations of the proposed study are missed. Addressing these limitations is crucial for a comprehensive understanding of the research's implications.
Response 8: Thank you for pointing this out. We agree and have added a limitation analysis before Figure 6, discussing representative failure cases such as undetected sparse whitecaps.
Comments 9: The Conclusion section must be added. Incorporating a well-structured conclusion will summarize the key findings and reinforce the significance of the research. Additionally, it will guide readers in understanding the broader impact of the study on ongoing developments in the field.
Response 9: We have removed the summary subsection and created a dedicated Conclusion section as Section 6. [Section structure reorganized, and a new Section 6: Conclusions added.]
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper introduces a synergistic integration of spatial (EMA) and channel (SENet) attention, which enhances edge clarity and suppresses background noise, a novel adaptation for marine imagery. The paper systematically evaluates multiple attention modules (EMA, SENet, CBAM, BAM, SIMAM, etc.) and provides detailed ablation studies for placement optimization. It also establishes a curated 1,100-sample dataset with expert annotations from real shipborne video.
Comments:
- A related work section is necessary in the paper that provides a review of recent works and the background theory of the utilized modules.
- Authors should provide, at the end of the introduction, the contributions of the paper in itemized form.
- It would be good if the authors provide more example images of the dataset in Figure 1, especially focusing on the challenges of the dataset, or compare with previous datasets if applicable.
- Dataset size (1,100 images) is relatively small for deep learning; generalization to other sea states or lighting conditions may be limited. Authors should incorporate datasets from various sensors (drone, satellite, fixed platforms) and environmental conditions to validate cross-domain robustness, and integrate environmental variables (wind speed, turbulence, etc.) as auxiliary inputs to improve physical interpretability.
- Only shipborne data are used — no cross-validation with aerial or satellite datasets, which limits generalizability.
- It is better to move Section 2.3, Evaluation Metrics, into the results section.
- Please also move "3. Model and Training Parameters" into the results section under implementation details, together with the model parameters.
- The paper is not well organized or structured; it is better to use a standard structure commonly used in the community. For example: introduction, related work, method (architectures, contributions, equations), results (implementation details, evaluation metrics, datasets, quantitative results, qualitative results, ablation study of the contributed modules, limitation and discussion, and future work), conclusion.
- The paper lacks confidence intervals or statistical significance testing for reported gains in IoU and F1-score.
- “10 fps” is mentioned but without detailed latency or hardware configuration benchmarks.
- Highlight/Bold the best performance in the tables.
- What are the main contributions of the paper in Figure 2? Highlight them with a different color.
- Provide quantitative runtime analysis (latency, throughput, energy efficiency) under different hardware configurations.
- Authors should include standard deviations, boxplots, or paired significance tests (e.g., t-tests) to confirm that performance improvements are statistically meaningful.
- Although ablations are thorough, visualization of where and why certain modules improve attention maps would strengthen interpretability. Please provide some qualitative results of the proposed or contributed modules' attention maps.
- Cite the methods in Table 3. Report the FPS in the table.
- Please remove the 5.1. Summary and create a conclusion section for that.
- Show some qualitative failure cases of the models.
Comments on the Quality of English Language
- Minor grammatical revisions and stylistic tightening would improve readability for an international audience.
Author Response
Response to Reviewer #3
Comment 1: A related work section is necessary in the paper that provides a review of recent works and the background theory of the utilized modules.
Response 1: Thank you for pointing this out. We agree with this comment. Therefore, we have added a dedicated Related Work section summarizing recent studies and briefly introducing the theoretical background of the modules (EMA, SENet) used in this work. [A related work section has been inserted between the original Sections 1 and 2.]
Comment 2: Authors should provide, at the end of the introduction, the contributions of the paper in itemized form.
Response 2: Agree. We have, accordingly, revised the end of the Introduction section to list the contributions of this study in itemized form. [Revised at the end of the first section, Introduction.]
Comment 3: It would be good if the authors provide more example images of the dataset in Figure 1, especially focusing on the challenges of the dataset, or compare with previous datasets if applicable.
Response 3: Thank you for this suggestion. We have added two additional representative examples to Figure 1 to better highlight dataset challenges. Currently, no comparable public datasets are available. [Figure 1 updated.]
Comment 4: Dataset size (1,100 images) is relatively small for deep learning; generalization to other sea states or lighting conditions may be limited. Authors should incorporate datasets from various sensors (drone, satellite, fixed platforms) and environmental conditions to validate cross-domain robustness, and integrate environmental variables (wind speed, turbulence, etc.) as auxiliary inputs to improve physical interpretability.
Response 4: We appreciate the reviewer's valid concern regarding dataset size. To mitigate potential limitations in generalization, we employed extensive data augmentation strategies during training, which effectively expanded the diversity and size of the training data. Moreover, our dataset was collected across multiple time periods and encompasses a range of natural illumination conditions, enhancing model robustness. Future work will integrate multi-sensor data and environmental variables.
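As a generic illustration of mask-consistent augmentation for segmentation, a minimal sketch is given below; the specific transforms and parameter ranges are hypothetical examples, not the exact strategies used in our training pipeline.

```python
# Illustrative only: applying the same geometric transforms to an image and its
# label mask, with photometric jitter applied to the image alone. Parameters
# are placeholder values, not the settings used in the study.
import random
import torchvision.transforms.functional as TF

def augment_pair(image, mask):
    """Augment an image/mask pair consistently for semantic segmentation."""
    if random.random() < 0.5:                      # random horizontal flip
        image, mask = TF.hflip(image), TF.hflip(mask)
    angle = random.uniform(-10, 10)                # small random rotation
    image = TF.rotate(image, angle)
    mask = TF.rotate(mask, angle)
    image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))  # image only
    return image, mask
```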
Comment 5: Only shipborne data are used — no cross-validation with aerial or satellite datasets, which limits generalizability.
Response 5: We thank the reviewer for raising this important point. The dataset was collected under shipborne conditions, which differ substantially from aerial or satellite imagery; therefore, this study focuses on shipborne sea-surface segmentation. Future work will extend the framework to aerial and satellite datasets to enhance generalizability.
Comment 6: It is better to move Section 2.3, Evaluation Metrics, into the results section.
Response 6: We thank the reviewer for raising this important point. Evaluation metrics are essential for both training and analysis; hence, they remain in the Data and Methods section, where they are first introduced.
Comment 7: Please also move the “3. Model and Training Parameters” into the results section under implementation details.
Response 7: Agree. We have moved the “Model and Training Parameters” section to the beginning of the Results and Analysis section as Implementation Details. [The section has been moved.]
Comment 8: The paper is not well organized or structured; it is better to use a standard structure commonly used in the community. For example: introduction, related work, method (architectures, contributions, equations), results (implementation details, evaluation metrics, datasets, quantitative results, qualitative results, ablation study of the contributed modules, limitation and discussion, and future work), conclusion.
Response 8: We agree with this comment. The manuscript has been fully reorganized to follow the standard structure: Introduction → Related Work → Data and Methods → Model and Training Parameters → Results and Analysis (Discussion) → Conclusions. [The structure has been revised throughout the document.]
Comment 9: The paper lacks confidence intervals or statistical significance testing for reported gains in IoU and F1-score.
Response 9: Thank you for this suggestion. We have added standard deviations for IoU, F1, and PAE in Tables 3 and 4, and included paired t-test results showing statistically significant improvements (p < 0.05). [Tables 3 and 4 have been updated, and a brief explanation of the t-test has been added after Table 4.]
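For reference, such a paired t-test can be computed with SciPy as sketched below; the score lists are placeholder values for illustration only, not the results reported in the tables.

```python
# Sketch of a paired t-test on per-run IoU scores. The values below are
# hypothetical placeholders, not the paper's measurements.
from scipy import stats

baseline_iou = [0.71, 0.72, 0.70, 0.73, 0.71]   # repeated-run scores (placeholder)
proposed_iou = [0.74, 0.75, 0.73, 0.76, 0.74]

t_stat, p_value = stats.ttest_rel(proposed_iou, baseline_iou)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")    # p < 0.05 indicates significance
```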
Comment 10: “10 fps” is mentioned but without detailed latency or hardware configuration benchmarks.
Response 10: We have now included FPS values under our experimental setup in Tables 3 and 4, and hardware configuration details are provided in the Model and Training Parameters section. [Tables 3 and 4 have been updated.]
Comment 11: Highlight/Bold the best performance in the tables.
Response 11: Agree. We have boldfaced all best-performing results in Tables 1–4. [All relevant tables updated.]
Comment 12: What are the main contributions of the paper in Figure 2, highlight them with a different color.
Response 12: We have revised Figure 2 to highlight the EMA and SENet modules in red and bold to emphasize the main contributions. [Figure 2 updated.]
Comment 13: Provide quantitative runtime analysis (latency, throughput, energy efficiency) under different hardware configurations.
Response 13: Due to hardware limitations, we are unable to conduct runtime analysis across multiple hardware configurations at this time. FPS values under our single experimental setup are reported in Tables 3 and 4, with the corresponding hardware details given in the Model and Training Parameters section.
Comment 14: Authors should include standard deviations, boxplots, or paired significance tests (e.g., t-tests) to confirm that performance improvements are statistically meaningful.
Response 14: We agree. We have added standard deviations and paired t-test results after Table 4, confirming that the improvements are statistically significant (p < 0.05). [A brief explanation of the t-test has been added after Table 4.]
Comment 15: Although ablations are thorough, visualization of where and why certain modules improve attention maps would strengthen interpretability.
Response 15: Thank you for your insightful suggestion. In response, we clarify that systematic ablation studies were conducted at four key positions, with quantitative results summarized in Tables 1 and 2. To enhance interpretability, we provide both theoretical analysis explaining how specific modules refine attention mechanisms and visualizations of the resulting attention maps in Figures 5 and 6, which illustrate where and why these improvements occur.
Comment 16: Cite the methods in Table 3. Report the FPS in the table.
Response 16: Agree. Both Table 3 and Table 4 now include FPS values and proper citations for all referenced methods. [Tables 3 and 4 have been updated.]
Comment 17: Please remove the 5.1 Summary and create a conclusion section for that.
Response 17: We have removed the summary subsection and created a dedicated Conclusion section as Section 6. [Section structure reorganized, and a new Section 6: Conclusions added.]
Comment 18: Show some qualitative failure cases of the models.
Response 18: We have added a limitation analysis before Figure 6, discussing undetected sparse whitecaps (e.g., upper-left of Group D) as representative failure cases. [A brief description of the model's partial limitations has been added before Figure 6.]
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The paper was significantly improved.
Author Response
Thank you for your reply; we have received your feedback.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors addressed the majority of my comments; only the following comments need to be considered:
- First, the authors should use more recent citations, e.g., from 2024-2025.
- It would also be good to see results in the table from the YOLO world models.
- Cite the models in Table 3.
Author Response
Comment 1: First, the authors should use more recent citations, e.g., from 2024-2025.
Response 1: Thank you for pointing this out. We have revised the manuscript by replacing several outdated references and adding newly published studies from 2024–2025, such as [11], [13], [27], and [28].
Comment 2: It would also be good to see results in the table from the YOLO world models.
Response 2: Thank you for your valuable suggestion. In our early experiments, we also attempted to employ the YOLO framework (e.g., YOLOv8 and YOLOv10) for sea surface feature detection. However, the YOLO models showed limited accuracy in delineating the edges of complex objects, as their outputs are primarily rectangular bounding boxes designed for instance-level object detection. Even with the addition of a mask branch (e.g., YOLOv8-seg), the generated masks are typically upsampled from relatively coarse feature maps, leading to less precise boundary representations. Therefore, our comparative evaluation focused on pixel-level segmentation models such as U-Net, DeepLabv3+, PSPNet, and HRNet, which are more consistent with the task characteristics and objectives of this study.
Comment 3: Cite the models in Table 3.
Response 3: Thank you for this suggestion. At the beginning of Section 5.2, we have added descriptions of the models presented in Table 3 and incorporated the corresponding citations for these models.
Author Response File:
Author Response.pdf