Article
Peer-Review Record

A Study on the Super Resolution Combining Spatial Attention and Channel Attention

Appl. Sci. 2023, 13(6), 3408; https://doi.org/10.3390/app13063408
by Dongwoo Lee 1, Kyeongseok Jang 1, Soo Young Cho 2, Seunghyun Lee 3 and Kwangchul Son 2,*
Submission received: 15 February 2023 / Revised: 1 March 2023 / Accepted: 3 March 2023 / Published: 7 March 2023
(This article belongs to the Special Issue AI-Based Image Processing)

Round 1

Reviewer 1 Report

The paper proposes a deep learning architecture for image super-resolution using an attention mechanism to address the shortcomings of conventional CNN-based approaches. For this purpose, attention mechanisms for both the channel and spatial domains are proposed. The subject is interesting and the provided experimental results are promising. There are, however, a few things that need to be addressed in the manuscript.
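For context, channel attention typically gates feature-map channels using globally pooled statistics, while spatial attention gates pixel locations. Below is a minimal numpy sketch of this general idea; it is not the authors' implementation, and the pooling and sigmoid-gating choices are illustrative assumptions:

```python
import numpy as np

def channel_attention(x):
    # x: feature map of shape (C, H, W).
    # Squeeze spatial dims with global average pooling, then gate each channel.
    weights = x.mean(axis=(1, 2))                 # (C,)
    weights = 1.0 / (1.0 + np.exp(-weights))      # sigmoid gate in (0, 1)
    return x * weights[:, None, None]

def spatial_attention(x):
    # Pool across channels to get a per-pixel statistic, then gate each pixel.
    pooled = x.mean(axis=0)                       # (H, W)
    gate = 1.0 / (1.0 + np.exp(-pooled))          # sigmoid gate in (0, 1)
    return x * gate[None, :, :]

x = np.random.randn(8, 4, 4)
y = spatial_attention(channel_attention(x))       # same shape as x
```

Because each gate lies in (0, 1), every feature is attenuated rather than amplified in this sketch; real designs learn the gating weights.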

1-     Please explain the role of H1 convolution operation in eq. 10. According to the text, it is supposed to calculate the correlation between pixels in the feature map. It would be better to provide an equation for this operation.

2-     In section 3.4 (Upsampling) It has been stated that checkerboard pattern arises out of redundant calculations in deconvolution process. In reality, it arises due to the insufficient stride and can be fixed by using a much smaller stride compared to the kernel. Please expand this section to provide more explanation of eq. 16

3-     Image resolution using attention mechanism has been considered before as well. It is advisable to cite some references related to those works. E.g.

a.      Zhu, H.; Xie, C.; Fei, Y.; Tao, H. Attention Mechanisms in CNN-Based Single Image Super-Resolution: A Brief Review and a New Perspective. Electronics 2021, 10, 1187. https://doi.org/10.3390/electronics10101187

b.      S. Fang, S. Meng, Y. Cao, J. Zhang and W. Shi, "Adaptive Channel Attention and Feature Super-Resolution for Remote Sensing Images Spatiotemporal Fusion," 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 2021, pp. 2572-2575, doi: 10.1109/IGARSS47720.2021.9555093.

Author Response

Thank you for reviewing our paper.
We have checked all the comments and revised the manuscript accordingly.

 

Response to Reviewer 1 Comments

Point 1: Please explain the role of H1 convolution operation in eq. 10. According to the text, it is supposed to calculate the correlation between pixels in the feature map. It would be better to provide an equation for this operation.

Response 1: The role of the H1 convolution is explained on page 8.

-(8p) Convolution is a process of aggregating surrounding pixels through a kernel and can therefore exploit spatial information. H1 calculates the relation between pixels in the feature map.
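As a concrete illustration of how a convolution mixes information between neighboring pixels, here is a minimal numpy sketch of a "valid" 2D convolution with a 3×3 averaging kernel. This is illustrative only and does not reproduce the paper's H1 operator:

```python
import numpy as np

def conv2d_valid(img, kernel):
    # Each output pixel is a weighted sum over the kernel-sized neighborhood,
    # which is how convolution relates each pixel to its surroundings.
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
avg3 = np.full((3, 3), 1.0 / 9.0)   # 3x3 averaging kernel
out = conv2d_valid(img, avg3)       # shape (2, 2); each entry is a local mean
```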

 

Point 2: In section 3.4 (Upsampling) It has been stated that checkerboard pattern arises out of redundant calculations in deconvolution process. In reality, it arises due to the insufficient stride and can be fixed by using a much smaller stride compared to the kernel. Please expand this section to provide more explanation of eq. 16

Response 2: We explained the disadvantages of deconvolution and described sub-pixel convolution in detail.

-(10p) The traditional upsampling method using deconvolution can suffer from the checkerboard artifact, which is commonly mitigated by using small strides with large kernel sizes. However, this approach requires a significant number of parameters and computations, which can degrade performance. The proposed network instead expands feature maps using Sub-pixel Convolution, which avoids the need for a large number of parameters and computations.
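Sub-pixel convolution (pixel shuffle) rearranges groups of channels into spatial positions, so each output pixel comes from exactly one input channel and no overlapping deconvolution strides are involved. A minimal numpy sketch of the rearrangement, following the usual (C·r², H, W) → (C, H·r, W·r) convention (illustrative, not the authors' code):

```python
import numpy as np

def pixel_shuffle(x, r):
    # x: (C*r*r, H, W) -> (C, H*r, W*r).
    # Channel index c*r*r + i*r + j maps to output position (h*r + i, w*r + j).
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # (C, r, r, H, W)
    x = x.transpose(0, 3, 1, 4, 2)    # (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

x = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)  # r=2, so 4 channels -> 1
y = pixel_shuffle(x, 2)                                 # shape (1, 4, 4)
```

Each 2×2 block of the output interleaves one value from each of the four input channels, which is why the upsampling produces no checkerboard overlap.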

 

Point 3: Image resolution using attention mechanism has been considered before as well. It is advisable to cite some references related to those works.

Response 3: Added reference using Attention Mechanism.

-(2p) Recently, attention-based super-resolution methods that emphasize such high-frequency features have been proposed [10–12].

 

Author Response File: Author Response.docx

Reviewer 2 Report

The paper entitled “A Study on the Super Resolution Combining Spatial Attention and Channel Attention” proposed Single Image Super Resolution using an attention mechanism that emphasizes high-frequency features and a feature extraction process with different depths. The manuscript is well organized but not well written, and its contribution is almost enough to publish in Applied Sciences. However, some of the ideas need to be thoroughly clarified:

1.    It is strongly recommended that the text be carefully checked and its problems should be fixed.

2.    It is needed to compare the proposed method with the state-of-the-art methods (years 2021 and 2022).

3.    The Related Works section is weak and more papers should be reviewed in this section.

4.    Explain the “seq2seq” in line 111.

5.    In line 159, in the sentence “Fig. 6 is the process of sup-pixel convolution, a is the input feature map, b is the…”, you refer to a and b in Fig. 6, but there are no a and b in this figure.

6.    Some typos in the text:

a.     Page 1, Line 40, “.” Should be after references. “After the codebook is generated, the …”, should be “After the codebook generation, the …”.

b.    Page 1, Line 43, “.” Should be after references in “neighboring pixel values.[3] The second…”.

c.     Page 1, Line 44, “.” Should be after references in “through deep learning-based CNN [4]. For interpolation …”, Check all text for such mistakes.

d.    Page 2, Line 82, in the sentence “Fig1 is a ResNet structure consisting of 152 convolution layers …”, “Fig1” should be “Fig. 1”.

e.    Page 3, Line 91, correct the sentence “(a)is the existing neural network and (b) is the residual…

f.      And more and more…

Comments for author File: Comments.docx

Author Response

Thank you for reviewing our paper.
We have checked all the comments and revised the manuscript accordingly.

 

Response to Reviewer 2 Comments

Point 1: It is strongly recommended that the text be carefully checked and its problems should be fixed.

Response 1: The descriptions and expressions in the manuscript have been corrected.

 

Point 2: It is needed to compare the proposed method with the state-of-the-art methods (years 2021 and 2022).

Response 2: The proposed method and recent state-of-the-art methods differ in their training procedures, which makes a direct comparison difficult.

 

Point 3: The Related Works section is weak and more papers should be reviewed in this section.

Response 3: The descriptions and expressions in the manuscript have been corrected.

 

Point 4: Explain the “seq2seq” in line 111.

Response 4: The roles and ideas of seq2seq were further explained.

-(3p) Seq2Seq is a model used in natural language processing tasks such as machine translation; it uses an encoder-decoder architecture to transform one sequence into another [16]. However, when the encoder compresses the input into a fixed-size vector, information can be lost, and the vanishing-gradient problem can occur when the input is long. Attention can be used effectively to address these issues during training.
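The attention mechanism referred to here can be sketched as scaled dot-product attention, in which each decoder step forms a weighted sum over all encoder states rather than relying on a single fixed-size vector. A minimal numpy sketch (the shapes and scaling are standard assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: each query attends over all keys,
    # so no information is squeezed through one fixed-size vector.
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (Tq, Tk)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

q = np.random.randn(3, 8)   # 3 decoder steps (queries)
k = np.random.randn(5, 8)   # 5 encoder states (keys)
v = np.random.randn(5, 8)   # encoder states (values)
out, w = attention(q, k, v) # out: (3, 8), w: (3, 5)
```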

 

Point 5: In line 159, in the sentence “Fig. 6 is the process of sup-pixel convolution, a is the input feature map, b is the…”, you mentioned to a and b in Fig.6, but there is no a and b in this figure.

Response 5: Added labels (a), (b), and (c) to Fig. 6, along with explanations.

-(5p) Figure 6. Spatial Attention Module: (a) Input Feature Map; (b) the feature map resulting from the convolution; (c) feature map rearranged from the convolution feature map.

 

Point 6: Some typos in the text:

Response 6: Checked for typos and errors and corrected them.

Author Response File: Author Response.docx

Reviewer 3 Report

This paper describes the single image Super Resolution using an attention mechanism that emphasizes high-frequency features and a feature extraction process with different depths.

The detailed comments are as follows. Please consider them for further revision.

1. Page 5, Sub-section 2.3, formula (6) is written incorrectly.

2. Page 6, Section 3,"Shallow Feature Extraction" and "Deep Feature Extraction" in Figure 7 can be described in more detail.

3. Page 9, in formula (14), why do we choose to use 3 attention blocks instead of other number of attention blocks in Shallow Feature Extraction. It can be explained and proved by experiment.

4. Page 10, in formula (16), parameter Ilr is not introduced.

Author Response

Thank you for reviewing our paper.
We have checked all the comments and revised the manuscript accordingly.

Response to Reviewer 3 Comments

Point 1: Page 5, Sub-section 2.3, formula (6) is written incorrectly.

Response 1: Corrected the incorrect formula.

-(5p)

Point 2: Page 6, Section 3,"Shallow Feature Extraction" and "Deep Feature Extraction" in Figure 7 can be described in more detail.

Response 2: Added description of shallow feature extraction and deep feature extraction.

-(5-6p) To enable effective training of deep networks, an Attention Block combining CSBlock and a Skip Connection was applied. The Attention Block emphasizes features within both the Shallow Feature Extraction and Deep Feature Extraction structures, and features are reused through the connections. In this way, both shallow features containing shape information and deep features with emphasized high-frequency components are extracted from the limited information in the low-resolution input.

 

Point 3: Page 9, in formula (14), why do we choose to use 3 attention blocks instead of other number of attention blocks in Shallow Feature Extraction. It can be explained and proved by experiment

Response 3: An experiment on performance as a function of the number of Attention Blocks was added to explain why three blocks were chosen.

-(11p) Fig. 13 shows the DIV2K validation PSNR after 50 epochs of training as a function of the number of Attention Blocks in the Shallow Feature Extraction and Deep Feature Extraction. As the number of Attention Blocks increased, performance improved, but beyond a certain number the PSNR decreased. Based on this experiment, we applied 3 Attention Blocks in the Shallow Feature Extraction and 10 in the Deep Feature Extraction, which yielded the highest PSNR.
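PSNR, the metric used in this ablation, is computed from the mean squared error against the reference image. A minimal numpy sketch using the standard definition (the peak value of 255 is an assumption for 8-bit images):

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    # PSNR in dB: higher means the reconstruction is closer to the reference.
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 100.0)
noisy = ref + 10.0            # constant error of 10 -> MSE = 100
print(psnr(ref, noisy))       # 10*log10(65025/100) ~ 28.13 dB
```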

 

Point 4: Page 10, in formula (16), parameter Ilr is not introduced

Response 4: Added description of Ilr that was missing.

-(10p) Equation 16 is the equation for extending the feature map: the left-hand side is the extended feature map, UP is a Sub-pixel Convolution, the remaining terms are the 3-depth feature, the 10-depth feature, and the input LR image I_lr, and H is a convolution.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

The text still has some grammatical mistakes.
