Article
Peer-Review Record

DRFENet: An Improved Deep Learning Neural Network via Dilated Skip Convolution for Image Denoising Application

Appl. Sci. 2023, 13(1), 28; https://doi.org/10.3390/app13010028
by Ruizhe Zhong and Qingchuan Zhang *
Reviewer 1: Anonymous
Reviewer 3:
Submission received: 17 October 2022 / Revised: 14 December 2022 / Accepted: 15 December 2022 / Published: 20 December 2022
(This article belongs to the Special Issue Scale Space and Variational Methods in Computer Vision)

Round 1

Reviewer 1 Report

Please see the attachment. 

Comments for author File: Comments.pdf

Author Response

Comment: My major concern about this manuscript is that bias in the reference citations is clearly visible. This should not be practiced. I request the authors to address this issue. The rest is all right.

 

Response: Thank you for the reminder. Our original writing did lack a broad discussion of methods other than CNN-based approaches, which introduced bias into the reference citations. We will discuss other image denoising methods more extensively in the manuscript. The added sections are presented below.

 

The image restoration task can be cast as a linear inverse problem, and diffusion models, the current state-of-the-art generative models, use a forward diffusion process to gradually add noise and transform the empirical data distribution into a simple Gaussian distribution. The core idea is to learn the reverse, "denoising" diffusion that inverts this noising process, which is why these models perform well across multiple image restoration tasks [41,42,43]. At present, most unsupervised methods for noise removal rely on inefficient iterative schemes. To make unsupervised methods more efficient, the Denoising Diffusion Restoration Model (DDRM) [41] was proposed: a general sampling-based solver for linear inverse problems that uses an unconditional or class-conditional diffusion generative model as a learned prior and can effectively reverse significant noise. To address the over-smoothing, mode collapse, and large model footprint issues in the super-resolution domain, SRDiff [43] applied diffusion probabilistic models to single-image super-resolution for the first time and can generate diverse results with rich details. To establish an extensible framework applicable to different problems, [42] proposed a generic denoising Markov model that extends denoising diffusion models to general state spaces and demonstrated the framework's robustness across a range of problems. The Vision Transformer (ViT) borrows the Transformer idea from natural language processing: it divides an image into fixed-size patches and processes the patches much like word tokens. ViT has made great progress in a number of computer vision fields [46]. It is also strong at multi-scale context aggregation, since every layer can access global information about the image.
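The forward noising process described above has a well-known closed form: under a variance schedule, the noised sample at step t can be drawn directly from the clean image. The sketch below illustrates this with a toy example; the linear schedule, array sizes, and step indices are illustrative choices, not taken from the manuscript or from [41,42,43].

```python
import numpy as np

def forward_diffusion(x0, t, betas, seed=None):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    where alpha_bar_t is the cumulative product of (1 - beta_s)."""
    rng = np.random.default_rng(seed)
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

# Linear noise schedule over 1000 steps (a common illustrative choice).
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.ones((8, 8))  # toy "image" with every pixel equal to 1

x_early, _ = forward_diffusion(x0, t=10, betas=betas, seed=0)   # barely perturbed
x_late, _ = forward_diffusion(x0, t=999, betas=betas, seed=0)   # nearly pure Gaussian noise
```

At small t the sample stays close to the data; at the final step alpha_bar is nearly zero, so the sample is essentially a standard Gaussian. A denoising diffusion model is trained to predict eps and thereby run this process in reverse.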
The Swin Transformer [47] enhances multi-scale feature fusion with a shifted-window scheme. However, compared with a CNN, ViT lacks the sliding operation of a convolution kernel over the feature map, so the inductive bias of its self-attention is weaker than that of a CNN [48]. In addition, Transformer features must be flattened into one-dimensional sequences, which introduces a mismatch between images and sequences. Global self-attention in ViT requires each pixel to be compared with every other pixel in the image; in image denoising, however, most of these comparisons involve irrelevant pixels, which makes the computational cost largely redundant. CNNs are therefore better suited to pixel-level tasks than Transformers. For image denoising, the CUR Transformer [48] divides the image into non-overlapping windows and establishes a communication mechanism between them to compensate for these shortcomings. DRFENet combines dilated convolution and skip connections to handle multi-scale fusion more efficiently and to extract finer-grained feature information.
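The redundancy argument above can be made concrete with a back-of-the-envelope count: global self-attention over N pixels costs N^2 pairwise comparisons, while restricting attention to non-overlapping M-by-M windows (as in Swin- or CUR-style models) reduces this to N * M^2. The helper below is a minimal illustrative sketch, not code from any of the cited models, and it assumes the image dimensions are divisible by the window size.

```python
def attention_cost(h, w, window=None):
    """Pairwise-comparison count for self-attention over an h*w image.
    Global attention: every pixel attends to all N = h*w pixels -> N^2.
    Window attention: (N / M^2) windows of M^2 pixels each -> N * M^2."""
    n = h * w
    if window is None:
        return n * n
    return n * window * window

global_cost = attention_cost(256, 256)              # quadratic in pixel count
windowed_cost = attention_cost(256, 256, window=8)  # linear in pixel count
print(global_cost // windowed_cost)                 # prints 1024
```

For a 256x256 image with 8x8 windows, windowed attention needs 1024 times fewer comparisons, which is why window partitioning makes attention tractable at pixel level.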

Author Response File: Author Response.docx

Reviewer 2 Report

The submitted paper, “DRFENet: an improved deep-learning neural network via dilated skip convolution for image denoising application”.

The proposed approach is shown clearly. Issues to be corrected are presented next:

1.      A statistical analysis of the presented results is recommended.

2.      In section “4.3 Experimental results” (Tables 4, 5, and 6), it can clearly be seen that the difference between the proposed method and the methods it is compared against is very small, which suggests that the proposed method may not be statistically better. For this reason, it is recommended to use ANOVA to compare the algorithms.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Ruizhe Zhong and Qingchuan Zhang propose a novel network architecture to tackle the task of image denoising. They introduce a combination of multiple modules, thereby mitigating the negative effects of pooling layers, feature aggregation, and a reduced receptive field.

General comments:

The paper is well structured, reads clearly and the presentation of the methodology and results is consistent and well organized.

My major concern about the presentation of the work is related to the state of the art in this field. Nowadays, diffusion models are able to generate photo-realistic images, thereby outperforming Generative Adversarial Networks. They are not solely used for image generation, but also for related tasks like image inpainting, image denoising, and image super-resolution. When presenting a novel architecture for image denoising, diffusion models need to be considered [1,2,3]. They need at least to be mentioned in the related-works section or, even better, be compared against.

The authors raise limitations of current architectures to underline the needs that gave rise to the proposed architecture. However, these limitations need to be stated more precisely.

Line 65: "The network depth limits the flow of information, and the communication ability between the deep layers and the shallow layers of the model is low."

Why is this a limitation? What are the practical/performance differences if this flow is missing?

Line 70: "Complex image backgrounds naturally hide the noise [14], resulting in the limited effect in extracting the noise features in images. "

It is unclear what the authors are trying to state here - what is meant by "naturally hide the noise"? I could not follow this issue when reading reference [14].

Comments to the network architecture: 

The authors state on line 153: "Studies have shown that dilated convolutions contribute to multi-scale context aggregation." This is correct, but what is the advantage of using dilated convolution, skip connections, and attention mechanisms in comparison to vision transformers, which fulfill the tasks of multi-scale feature matching and attention inherently [4]? When introducing these different modules, the contribution of each module must at least be justified by experiments. The authors tried to approach this issue by presenting a comparison in Table 3, but fail to present the details needed for justification. Which metric is shown here, why is this table called "implementation details", and how many images were used? A more detailed ablation study would help to shed light on the contribution of each of the different modules.

[1] B. Kawar et al. Denoising Diffusion Restoration Models, arXiv, 2022

[2] J. Benton et al. From Denoising Diffusions to Denoising Markov Models. arXiv, 2022

[3] H. Li et al. SRDiff: Single Image Super-Resolution with Diffusion Probabilistic Models. arXiv, 2021

[4] Z. Mao. MFATNet: Multi-Scale Feature Aggregation via Transformer for Remote Sensing Image Change Detection. Remote sensing, vol. 4, 2022

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

The observations have been correctly addressed.

Author Response

Dear Reviewer

Thank you very much for your recognition.

Best
