Peer-Review Record

Hybrid Architecture Based on CNN and Transformer for Strip Steel Surface Defect Classification

Electronics 2022, 11(8), 1200; https://doi.org/10.3390/electronics11081200
by Shunfeng Li 1, Chunxue Wu 1,* and Naixue Xiong 2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 16 March 2022 / Revised: 7 April 2022 / Accepted: 8 April 2022 / Published: 9 April 2022
(This article belongs to the Section Artificial Intelligence)

Round 1

Reviewer 1 Report

  1. The authors are recommended to have the manuscript proofread for grammar and spelling mistakes by a native speaker.
  2. It would be interesting to verify the results for training sets of 70% and 75% (see the split sketch below).
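
A minimal sketch of how the requested 70% / 75% splits could be checked, assuming the images and labels are already loaded as arrays. The array shapes, the six-class labels, and the stratified split are illustrative assumptions, not details taken from the paper.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the defect images and their labels.
    X = np.random.rand(1800, 200 * 200)      # e.g., 1800 flattened grayscale images
    y = np.random.randint(0, 6, size=1800)   # six defect classes (assumed)

    for train_frac in (0.70, 0.75):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=train_frac, stratify=y, random_state=0)
        # A call to the authors' training pipeline would go here.
        print(f"train={train_frac:.0%}: {len(X_tr)} train / {len(X_te)} test samples")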

Author Response

Dear Reviewer,

We thank you for your valuable comments. We have carefully studied these comments and revised our manuscript. Please see the attachment.

Thank you and best regards.

Yours sincerely,

Chunxue Wu

Author Response File: Author Response.pdf

Reviewer 2 Report

* (lines 26-28) I suggest emphasizing why a CNN is preferable to other types of deep learning models (e.g., RNN, ANN). For example, a CNN learns local patterns and captures promising semantic information, and it is also known to be efficient (i.e., fewer parameters) compared to other types. When you mention this, you may refer to some relevant previous studies, such as those below.
    (1) M. Jeon et al., "Compact and Accurate Scene Text Detector," 2020.
    (2) Thang Vu et al., "Fast and Efficient Image Quality Enhancement via Desubpixel Convolutional Neural Networks," 2018.

* (lines 33-35) "self-attention mechanisms ignore local information in the early stages..." I don't agree with this sentence. Could you provide a reference that supports it? The self-attention mechanism allows the Transformer "encoder" to grasp the whole given sentence by looking at all the words (context), without losing local information. Please note that the Transformer architecture includes an "encoder" and a "decoder"; I don't see any encoder-decoder architecture in this paper, so Transformer "encoder" layer (or BERT layer) would be the correct term here. In any case, as mentioned in the first comment, the CNN is known to be effective at capturing local patterns, while the Transformer "encoder" layer is good at understanding context but is heavy (i.e., has a lot of parameters). So, I think the combination of CNN and Transformer "encoder" proposed in this paper can be explained as follows: the CNN converts the input image into a compact representation (which prevents the entire model from becoming too large), and the Transformer "encoder" layers find higher-level patterns in that compact representation (a sketch of this pipeline follows these comments). What do you think?

* (lines 94-110) The authors need to find more relevant articles on hybrid architectures of CNN and BERT (or Transformer encoder). For example, the paper below suggests a BERT + CNN architecture, a hybrid approach in the opposite order to that of this paper. It would be nice if the authors provided a theoretical comparison between this related architecture and the proposed model.
    - Changai He et al., "Using Convolutional Neural Network with BERT for Intent Determination," 2019.
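
To make the suggested explanation concrete, here is a minimal sketch (in PyTorch) of the CNN-to-Transformer-encoder pipeline described in the second comment. It is an illustrative assumption rather than the authors' exact model: the layer sizes, six-class head, and mean-pooling readout are placeholders, and positional embeddings are omitted for brevity. The final parameter count also speaks to the efficiency point raised in the first comment.

    import torch
    import torch.nn as nn

    class CNNTransformerSketch(nn.Module):
        """Hypothetical CNN stem + Transformer-encoder classifier (not the authors' model)."""
        def __init__(self, num_classes=6, embed_dim=256, depth=4, heads=8):
            super().__init__()
            # CNN stem: captures local patterns and downsamples the image into a
            # compact feature map, keeping the overall parameter count modest.
            self.cnn = nn.Sequential(
                nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, embed_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            )
            # Transformer *encoder* layers (no decoder): model global context over
            # the compact representation. Positional embeddings omitted for brevity.
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, x):                      # x: (B, 3, H, W)
            f = self.cnn(x)                        # (B, C, H/4, W/4) compact features
            tokens = f.flatten(2).transpose(1, 2)  # one token per spatial location
            z = self.encoder(tokens)               # global context over local features
            return self.head(z.mean(dim=1))        # pool tokens, then classify

    model = CNNTransformerSketch()
    print(sum(p.numel() for p in model.parameters()))  # rough size check
    print(model(torch.randn(2, 3, 64, 64)).shape)      # torch.Size([2, 6])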

Author Response

Dear Reviewer,

Thank you very much for your time involved in reviewing the manuscript and your very encouraging comments on the merits. We have revised the manuscript based on your valuable comments. Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The abstract is quite descriptive; it could be completed by adding the numerical results obtained with the proposed architecture.

In the keywords, "Transformer network" is more proper than using only "Transformer."

In the Related Works section, some hybrid methods using CNNs as feature extractors with traditional SVM classifiers should be explicitly included. Indeed, machine learning and deep learning approaches should be presented separately to highlight their advantages and drawbacks.
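
For context, a minimal sketch of the CNN-features-plus-SVM hybrid this comment refers to; the random placeholder features stand in for the penultimate-layer activations of a pretrained CNN, and the shapes and split are illustrative assumptions.

    import numpy as np
    from sklearn.svm import SVC

    # Placeholder features: in the hybrid methods mentioned above, these would
    # come from a pretrained CNN's penultimate layer, not random numbers.
    features = np.random.rand(300, 512)
    labels = np.random.randint(0, 6, size=300)

    # A traditional SVM classifier is fitted on the extracted features.
    clf = SVC(kernel="rbf").fit(features[:240], labels[:240])
    print("held-out accuracy:", clf.score(features[240:], labels[240:]))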

Include a dedicated table containing all the hyperparameters used in training the proposed CNN architecture.

All the functional blocks should be labeled in Figure 4.

In Figure 5, it would be nice to include the six categories' names explicitly, e.g., Cracks (CR), ...

Please include the ROC curve in the evaluation (Section 5.2).
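
A minimal sketch of how the requested ROC curves could be computed one-vs-rest, assuming the classifier outputs per-class probabilities (e.g., softmax scores); the function name, shapes, and dummy data are illustrative, not taken from the paper.

    import numpy as np
    from sklearn.metrics import roc_curve, auc
    from sklearn.preprocessing import label_binarize

    def one_vs_rest_roc(y_true, y_score, num_classes=6):
        """y_true: (N,) integer labels; y_score: (N, num_classes) probabilities."""
        y_bin = label_binarize(y_true, classes=list(range(num_classes)))
        curves = {}
        for c in range(num_classes):
            fpr, tpr, _ = roc_curve(y_bin[:, c], y_score[:, c])
            curves[c] = (fpr, tpr, auc(fpr, tpr))  # per-class ROC points and AUC
        return curves

    # Dummy example: labels covering all six classes, random "softmax" scores.
    y_true = np.tile(np.arange(6), 17)[:100]
    scores = np.random.rand(100, 6)
    scores /= scores.sum(axis=1, keepdims=True)
    print({c: round(a, 3) for c, (_, _, a) in one_vs_rest_roc(y_true, scores).items()})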

Figure 9 could use a better point of view (frontal-parallel).

The caption of Table 1 is partially occluded.

A discussion subsection must be included before giving the conclusions. 

The conclusions are a bit shallow; please complete this section by highlighting the contributions of this interesting work.

Author Response

Dear Reviewer,

Thank you very much for your time involved in reviewing the manuscript and your very encouraging comments on the merits. We have revised the manuscript based on your valuable comments. Please see the attachment.

Author Response File: Author Response.pdf
