Peer-Review Record

CNN–Transformer-Based Model for Maritime Blurred Target Recognition

Electronics 2025, 14(17), 3354; https://doi.org/10.3390/electronics14173354
by Tianyu Huang, Chao Pan *, Jin Liu and Zhiwei Kang
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 26 July 2025 / Revised: 21 August 2025 / Accepted: 22 August 2025 / Published: 23 August 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a custom Convolutional Neural Network (CNN) combined with an enhanced Transformer to accurately and efficiently recognize targets in blurred maritime images. However, the following issues should be addressed before the manuscript is published:

1) While results show the methods work, the paper often lacks a deep discussion on why specific mathematical formulations were chosen over others, relying more on empirical validation than on theoretical grounding.
2) The text is highly repetitive, with the same claims and descriptions appearing in multiple sections, which bloats the manuscript and tires the reader.
3) The manuscript is extremely dense and difficult to read due to the overwhelming number of acronyms and the lack of intuitive explanations. The quality of the writing does not match the quality of the research.
4) Improve the resolution of Figures 5–9.

Author Response

Comment 1: While results show the methods work, the paper often lacks a deep discussion on why specific mathematical formulations were chosen over others, relying more on empirical validation than on theoretical grounding.

Response 1: In the revised version, we have added theoretical analyses of the key formulas of the LGC, GCG, and FIPE modules.

 

Comment 2: The text is highly repetitive, with the same claims and descriptions appearing in multiple sections, which bloats the manuscript and tires the reader.

Response 2: In the revised manuscript, we have reorganized the structure of the Introduction section by removing redundant technical descriptions and consolidating repetitive discussions about CNN and Transformer to enhance clarity.

 

Comment 3: The manuscript is extremely dense and difficult to read due to the overwhelming number of acronyms and the lack of intuitive explanations. The quality of the writing does not match the quality of the research.

Response 3: In the revised version, we have removed less important abbreviations such as SW-MSA, retaining only a few core module abbreviations like MSG-ARB. We have also explained the functions of the modules used and written out the full names of these modules to enhance reader comprehension.

 

Comment 4: Improve the resolution of Figures 5–9.

Response 4: In the revised version, we have improved the resolution of Figures 5–9.

Reviewer 2 Report

Comments and Suggestions for Authors

Some weaknesses must be addressed before publication.

  1. The manuscript does not clearly differentiate DyT-Net from other recent CNN-Transformer hybrid models beyond naming its modules. The claim of "first demonstration of de-normalization effectiveness" lacks strong evidence or comparison to similar normalization-free methods.
  2. The network combines multiple modules (MSG-ARB, FIPE, DyT, CBAM, DGU, etc.) without sufficient ablation studies to show the unique contribution of each part in final performance.
  3. The manuscript introduces many module names and acronyms (MSG-ARB, DyT, FIPE, FAU, MSFU, DGU, CBAM) which makes reading cumbersome and hard to follow.
  4. Several figures (e.g., Fig. 1, Fig. 2) are schematic and lack detailed visual examples of feature maps or attention patterns that would demonstrate the claimed advantages. Some sections are verbose with redundant descriptions, while mathematical formulations (Eqs. 1–10) are underexplained.
  5. Some works about detection should be cited in this paper to make this submission more comprehensive, such as 10.1109/TPAMI.2024.3511621.
Comments on the Quality of English Language

I recommend major revisions to enhance the quality of this manuscript. Additional details and explanations would greatly improve the manuscript.

Author Response

Comment 1: The manuscript does not clearly differentiate DyT-Net from other recent CNN-Transformer hybrid models beyond naming its modules. The claim of "first demonstration of de-normalization effectiveness" lacks strong evidence or comparison to similar normalization-free methods.

Response 1: In the revised version, we clarify that the DyT module is a dynamic function that replaces the normalization layers in Transformer architectures, rather than a network architecture in itself. We have also rephrased the claim from "the first empirical validation of denormalization's efficacy" to "the DyT module was integrated into a dual-branch model, and its effectiveness in blurred-image recognition tasks has been empirically validated."
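
For readers unfamiliar with the technique, the sketch below illustrates the published Dynamic Tanh (DyT) formulation this response refers to, DyT(x) = γ · tanh(αx) + β; the class name, initialization value, and PyTorch framing are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: a normalization-free drop-in replacement for LayerNorm.

    Illustrative sketch of DyT(x) = gamma * tanh(alpha * x) + beta, with a
    learnable scalar alpha and per-channel affine parameters; details of
    the authors' actual module may differ.
    """

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(alpha_init * torch.ones(1))  # learnable squash scale
        self.gamma = nn.Parameter(torch.ones(dim))             # per-channel gain
        self.beta = nn.Parameter(torch.zeros(dim))             # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squash activations element-wise instead of computing the
        # per-token statistics that LayerNorm would require.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```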

 

Comment 2: The network combines multiple modules (MSG-ARB, FIPE, DyT, CBAM, DGU, etc.) without sufficient ablation studies to show the unique contribution of each part in final performance.

Response 2: In the revised version, Table 1 presents the impact of the DyT module and the fusion module on final performance, Table 7 shows the contribution of the different unit combinations within the fusion module, and Table 9 reports a cascade analysis of the four key modules.

 

Comment 3: The manuscript introduces many module names and acronyms (MSG-ARB, DyT, FIPE, FAU, MSFU, DGU, CBAM) which makes reading cumbersome and hard to follow.

Response 3: In the revised version, we have removed less important abbreviations such as SW-MSA, retaining only a few core module abbreviations like MSG-ARB. We have also explained the functions of the modules used and written out the full names of these modules to enhance reader comprehension.

 

Comment 4: Several figures (e.g., Fig. 1, Fig. 2) are schematic and lack detailed visual examples of feature maps or attention patterns that would demonstrate the claimed advantages. Some sections are verbose with redundant descriptions, while mathematical formulations (Eqs. 1–10) are underexplained.

Response 4: In the revised version, the revised Figure 11 presents the feature maps generated by the dual-branch outputs, the fused feature maps from the integration module, and the final output images used for recognition. A new Figure 12 has been added to display attention weight heatmaps, providing a visual representation of the model's focus areas. We have thoroughly reviewed the entire paper, consolidated several paragraphs in the Introduction, and streamlined the overview content in the Methodology section. Detailed explanations of the key formulas for LGC, GCG, and FIPE have been supplemented, including their physical significance.
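
As context for readers, attention heatmaps such as those described for the new Figure 12 are commonly produced by reshaping patch-level attention weights to the token grid, upsampling them to image resolution, and overlaying them semi-transparently. The sketch below shows this generic recipe; all names, shapes, and the 14×14 grid size are illustrative assumptions rather than details of the authors' code.

```python
import numpy as np
import matplotlib.pyplot as plt

def overlay_attention(image: np.ndarray, attn: np.ndarray, grid: int = 14) -> None:
    """Overlay a patch-level attention map on an input image.

    `attn` is assumed to be a flat vector of grid*grid attention
    weights for the patch tokens (shapes here are illustrative).
    """
    h, w = image.shape[:2]
    heat = attn.reshape(grid, grid)
    heat = (heat - heat.min()) / (np.ptp(heat) + 1e-8)  # normalize to [0, 1]
    # Nearest-neighbour upsampling to (approximately) image resolution.
    heat = np.kron(heat, np.ones((h // grid, w // grid)))
    plt.imshow(image)
    plt.imshow(heat, cmap="jet", alpha=0.4)  # semi-transparent overlay
    plt.axis("off")
    plt.show()
```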

 

Comment 5: Some works about detection should be cited in this paper to make this submission more comprehensive, such as 10.1109/TPAMI.2024.3511621.

Response 5: In the revised version, we have cited additional relevant detection references.

Reviewer 3 Report

Comments and Suggestions for Authors

The work is clear and well written. To improve the experimental part, I recommend adding more images for a qualitative analysis of the results and presenting the metrics and graphs also in terms of computation time and not only epochs.

Author Response

Comment 1: To improve the experimental part, I recommend adding more images for a qualitative analysis of the results and presenting the metrics and graphs also in terms of computation time and not only epochs.

Response 1: In the revised version, the revised Figure 11 presents the feature maps generated by the dual-branch outputs, the fused feature maps from the integration module, and the final output images used for recognition. A new Figure 12 has been added to display attention weight heatmaps, providing a visual representation of the model's focus areas. Section 3.9 has been supplemented with computation-time data.
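
Since the reviewer asked for metrics reported against computation time rather than epochs alone, the sketch below shows one standard way such wall-clock measurements are taken, with warm-up runs and GPU synchronization. The function name, input shape, and run counts are illustrative assumptions, not the protocol actually used in Section 3.9.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model: torch.nn.Module,
                    input_shape=(1, 3, 224, 224),
                    runs: int = 100,
                    device: str = "cuda") -> float:
    """Average per-forward-pass wall-clock time in seconds."""
    model.eval().to(device)
    x = torch.randn(*input_shape, device=device)
    for _ in range(10):               # warm-up to exclude one-off setup costs
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()      # flush queued GPU kernels before timing
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs
```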

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors
  • It is better to have the paper proofread.
  • There are inconsistencies in abbreviations and terminology (e.g., the spelling of the term accuracy).
  • A table of abbreviations in alphabetical order should be included.

Author Response

Comment 1: It is better to have the paper proofread.

Response 1: In the revised version, we conducted grammatical proofreading of the paper, modified certain terms to better align with academic conventions, simplified some complex sentences, standardized the initial definitions of all abbreviations, and ensured consistent formatting throughout the document.

 

Comment 2: There are inconsistencies in abbreviations and terminology (e.g., the spelling of the term accuracy).

Response 2: In the revised version, we standardized abbreviations and terminology to eliminate inconsistencies.

 

Comment 3: A table of abbreviations in alphabetical order should be included.

Response 3: In the revised version, we have added a table of abbreviations (please see line 742 of the revised manuscript).

Reviewer 2 Report

Comments and Suggestions for Authors

No more comments.

Author Response

Comment 1: Does the introduction provide sufficient background and include all relevant references?

Response 1: In the revised version, we have provided relevant background materials and included pertinent reference sources.

Comment 2: Are the methods adequately described?

Response 2: In the Methods section, we describe the methods employed.

Comment 3: Are the conclusions supported by the results?

Response 3: Our conclusions are supported by the experimental results.

Thank you for your thoughtful and constructive suggestions on our manuscript. Your valuable comments have significantly enhanced the quality and clarity of our research.
