RDASNet: Image Denoising via a Residual Dense Attention Similarity Network

In recent years, thanks to the performance advantages of convolutional neural networks (CNNs), CNNs have been widely used in image denoising. However, most CNN-based image-denoising models cannot make full use of the redundancy of image data, which limits their expressiveness. We propose a new image-denoising model that extracts the local features of the image through a CNN and focuses on the global information of the image, especially its global similarity details, through an attention similarity module (ASM). Furthermore, dilated convolution is used to enlarge the receptive field so as to better capture global features, and avg-pooling is used in the ASM to smooth and suppress noise, further improving model performance. In addition, global residual learning enhances the flow of information from shallow to deep layers. Extensive experiments show that our proposed model achieves a better image-denoising effect, both quantitatively and visually, and is more suitable for complex blind noise and real images.


Introduction
As an important information carrier, images are widely used in remote sensing, medicine, aerospace, and other fields. However, due to interference from imaging equipment and external factors, images are easily affected by various kinds of noise and become blurred [1], so image denoising is particularly important. In fact, image denoising has always been a fundamental problem in computer vision [2]; its purpose is to recover a clean image from a noisy observation [3]. In general, for a noisy image y, the image-denoising problem can be expressed as y = x + v, where x is the original image and v represents additive white Gaussian noise (AWGN) with standard deviation σ.
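As an illustration of this degradation model, the following sketch (not from the paper's code; the function name is ours) corrupts a clean patch with AWGN at a given σ:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_awgn(x: np.ndarray, sigma: float) -> np.ndarray:
    """Return a noisy observation y = x + v, where v ~ N(0, sigma^2)."""
    v = rng.normal(0.0, sigma, size=x.shape)
    return x + v

x = np.full((64, 64), 128.0)   # a toy "clean" gray patch in [0, 255]
y = add_awgn(x, sigma=25.0)    # noisy observation at noise level 25
print(y.shape)                 # (64, 64); std of (y - x) is close to 25
```

Denoising then amounts to estimating x from y alone.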
From a Bayesian point of view, when the likelihood is known, image prior modeling is a good approach to image denoising, and in the early days, most models were built on image priors. For example, the non-local means (NLM) algorithm [4] estimated the center point of a reference block using the weighted average of self-similar blocks to reduce noise. The block-matching and 3D-filtering (BM3D) algorithm [5] enhanced sparsity through a collaborative transform to achieve image denoising. Weighted nuclear norm minimization (WNNM) [6] used prior information to determine the weights of the nuclear norm for image denoising. Indeed, image-prior-based methods can achieve a good denoising effect. However, they all face two problems [7]: (1) the optimization problem in the test stage is very complex, making the denoising process time-consuming; (2) parameters need to be manually adjusted to obtain a better image-denoising effect.
As evident in models such as AlexNet [8], VGG [9], and ResNet [10], deep learning has flourished, and convolutional neural networks (CNNs) have been widely used in image denoising, achieving better denoising effects; DnCNN [7], for example, used residual learning [10] to build a deep denoising network. The main contributions of this paper are summarized as follows:
1. This paper proposes a new image-denoising framework: the residual dense attention similarity network (RDASNet). Different from existing CNN denoising models, its core is the residual dense attention similarity module (RDASM), which extracts the local features of the image through a CNN and captures the global similarity features of the image through the attention similarity module. This is very effective for dealing with complex noisy images. The proposed model achieves a better denoising effect, both qualitatively and quantitatively, and is thus more suitable for complex noise.

2. Weights are used to represent the similarity of image details. Data redundancy exists in images; that is, the textural details of the whole image are similar. The attention similarity module (ASM) makes full use of the global information of the image: similar details receive similar weights, and key features receive larger weights. Ablation studies also show the effectiveness of the ASM.
3. In the attention mechanism, dilated convolution is used to enlarge the receptive field so as to better extract global similarity information, and it requires fewer parameters. Compared with RDN, the number of parameters in our model increases by only 0.07 M, while PSNR increases by 0.10-0.22 dB.

4. For global pooling in the attention mechanism, we found that an image appears dimmer after avg-pooling, which makes its noise look less obvious; that is, avg-pooling helps smooth and suppress the noise in the image. Ablation studies also show that avg-pooling further improves denoising performance.

Deep CNNs for Image Denoising
Image-prior-based modeling suffers from the two main defects mentioned above. In contrast, a convolutional neural network can automatically extract features, thus reducing computational costs [19,20]. Therefore, deep CNNs are widely used in image denoising.
Zhang et al. [7] first designed a deep CNN for image denoising (DnCNN), improving performance by stacking multiple convolutional layers and using residual learning [10] and batch normalization [21]. DnCNN obtained a better effect than traditional BM3D and is therefore a successful application of CNNs to image denoising. However, DnCNN works well only at a known noise level, and its effect on blind noise is not ideal. To deal with this problem, Zhang et al. [11] designed a fast and flexible denoising network (FFDNet), which takes a trainable noise level map as input so that a single model can handle different noise levels. Furthermore, in order to make full use of the abundant features of all layers, Zhang et al. [12] proposed a very deep residual dense network (RDN) for image super-resolution, in which cascaded residual dense blocks form a contiguous memory mechanism. Moreover, to reduce computational costs, Tian et al. [14] proposed a batch renormalization denoising network (BRDNet), which used batch renormalization [22] to accelerate the convergence of network training and is suitable for denoising on low-configuration hardware devices. Using dilated convolution instead of ordinary convolution can also reduce the number of model parameters. For example, Tian et al. [15] designed ADNet, which uses sparse blocks composed of dilated and ordinary convolutions to improve performance and efficiency; in addition, an attention mechanism is used to extract hidden information. Motivated by these works, and because deep CNNs have shown better performance for image denoising, we also used CNNs for image denoising.

Attention Mechanism and Similarity
Extracting key information in complex environments is very important for image denoising. Furthermore, there is redundant information in the image; specifically, there is a global similarity in image details. A better use of image data redundancy can improve the performance of the model in a complex environment.
The attention mechanism used in this study originated in studies of the human brain [23]; it was then introduced into natural language processing [24] and applied to computer vision [25]. From a mathematical point of view, the attention mechanism provides a pattern of weights for subsequent operations. In a neural network, some network layers calculate weight values for the feature maps, which are then applied back to those feature maps. The attention mechanism can thus be understood as giving more attention (more weight) to the most meaningful parts [26], which is very useful for obtaining key information about the image in complex environments. Jaderberg et al. [27] proposed a spatial transformer network (STNet), which focuses on the spatial information of the image. Hu et al. [28] proposed a squeeze-and-excitation network (SENet) that studies the channel dimension and can adaptively adjust the key features of image channels. Furthermore, ADNet [15] uses only one convolutional layer to capture channel information and guide the training of the CNN. In addition, Woo et al. [29] proposed a plug-and-play convolutional block attention module (CBAM) that extracts global image features in the channel and spatial dimensions, respectively, through max-pooling and avg-pooling. Inspired by these methods, our attention mechanism also includes two dimensions: the channel and spatial dimensions.
Classical image-denoising methods, such as NLM [4] and BM3D [5], make full use of image similarity information: NLM gives large weights to neighborhoods with similar pixels, and BM3D searches for patches similar to a given patch. In CNN-based image denoising, few studies have focused on the global similarity of images. Motivated by this, we used an attention mechanism to represent the similarity of image details.
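To make the similarity idea concrete, here is a toy non-local-means filter for 1D signals, in the spirit of NLM [4]. It is an illustrative sketch, not the original optimized algorithm; all names and parameter defaults are ours.

```python
import numpy as np

def nlm_denoise_1d(signal, patch=3, h=10.0):
    """Toy non-local means: each sample becomes a weighted average of ALL
    samples, weighted by the similarity of their surrounding patches."""
    pad = patch // 2
    s = np.pad(signal, pad, mode="reflect")
    # One patch per sample position.
    patches = np.stack([s[i:i + patch] for i in range(len(signal))])
    out = np.empty(len(signal), dtype=float)
    for i in range(len(signal)):
        d2 = np.sum((patches - patches[i]) ** 2, axis=1)  # patch distances
        w = np.exp(-d2 / (h * h))       # similar patches get large weight
        out[i] = np.sum(w * signal) / np.sum(w)
    return out

flat = np.full(16, 5.0)
print(np.allclose(nlm_denoise_1d(flat), 5.0))  # True: identical patches share weight
```

The ASM described later plays an analogous role inside the network: similar details are assigned similar (learned) weights.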

Proposed RDASNet Denoising Method
In this section, a new image-denoising network, the residual dense attention similarity network (RDASNet), is described, as shown in Figure 1. Firstly, the shallow information of the image is extracted using the preprocessing module (PM), which contains only two convolution layers. The core of our model is the residual dense attention similarity module (RDASM), which consists of the residual dense module (RDM, motivated by RDN [12]) and the attention similarity module (ASM, motivated by CBAM [29], NLM [4], and BM3D [5]). The RDM captures the local features of the image through residual learning and dense layers, while the ASM uses attention to assign similar weights to areas with similar image details (pixels) and gives large weights to the key features of the image. This is useful for image denoising against a complex background. Then, global residual learning is used to enhance the flow of information from shallow to deep layers in the network.

Network Structure
Assume that I_noise and I_denoising represent the noisy input image and the denoised output image, respectively, as shown in Figure 1. Specifically, in the preprocessing module (PM), we use two convolution-and-activation layers, each with 64 kernels of size 3 × 3, to extract the shallow feature map F_pre as follows:

F_pre = H_pre2(H_pre1(I_noise)),

where H_pre1 and H_pre2 denote the convolution and activation operations; we use the leaky rectified linear unit (LReLU [17]) activation function. Then, F_pre is fed into N stacked residual dense attention similarity modules to capture the image features. Through the N RDASMs, we obtain F_B^N:

F_B^N = H_B^N(F_B^{N−1}) = H_B^N(H_B^{N−1}(· · · (H_B^1(F_pre)) · · ·)),

where H_B^d denotes the operations of the d-th RDASM; H_B^d is a non-linear transformation composed of a series of operations such as convolution and LReLU. More details on the RDASM are given in Section 3.2. Then, the output features of all RDASMs are fused, and the output feature F_RDASMs can be obtained:
F_RDASMs = H_F2(H_F1([F_B^1, F_B^2, · · · , F_B^N])),

where [F_B^1, F_B^2, · · · , F_B^N] indicates that the feature maps from the 1st to the N-th RDASM are concatenated, H_F1 denotes a 1 × 1 convolution that controls the number of output channels, and H_F2 denotes a 3 × 3 convolution that improves the expressive ability of the model.
Finally, we use global residual learning to enhance the flow of information from shallow to deep layers and obtain the output feature map F_out:

F_out = F_pre + F_RDASMs.
Then, a 3 × 3 convolution H_conv converts F_out to three channels or one channel (depending on whether the input is a color or a gray image):

I_denoising = H_conv(F_out),

where I_denoising is the final output of our model, that is, the image denoised by RDASNet. In addition, the L1 loss function is used to measure the difference between the denoised output I_denoising and the ground-truth image I_GT. Assuming that the training set has N pairs of images (I_noise^i, I_GT^i) (i = 1, 2, · · · , N), the loss function of RDASNet can be calculated by

L(Θ) = (1/N) Σ_{i=1}^{N} ||F_RDASNet(I_noise^i) − I_GT^i||_1,

where F_RDASNet(·) denotes the predicted output of the model and Θ denotes the model parameters.
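The L1 objective above can be sketched directly (an illustrative NumPy stand-in for the training loss, not the authors' training code):

```python
import numpy as np

def l1_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between denoised outputs and ground truths,
    averaged over all pixels (and over the batch, if one is stacked in)."""
    return float(np.mean(np.abs(pred - gt)))

pred = np.ones((2, 8, 8))   # toy "denoised" batch of two patches
gt = np.zeros((2, 8, 8))    # toy ground truth
print(l1_loss(pred, gt))    # 1.0
```

Compared with L2, the L1 loss penalizes large residuals less aggressively, which is a common choice in restoration networks.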

Residual Dense Attention Similarity Module (RDASM)
Our proposed residual dense attention similarity module is the core of RDASNet; it includes the residual dense module (RDM) and the attention similarity module (ASM), as shown in Figure 2. The RDM is used to obtain the local features of the image, forming dense layers through a series of convolutions and enhancing the representation of the model through residual learning. The ASM is used to obtain key similarity features against a complex background and includes channel attention similarity (CASM) and spatial attention similarity (SASM): CASM focuses on the global image similarity information in the channel dimension, while SASM focuses on it in the spatial dimension. As shown in Figure 2, we obtain the channel attention similarity map M_C ∈ R^{C×1×1} and the spatial attention similarity map M_S ∈ R^{1×H×W}. The attention similarity module can then be described as

F_CASM = M_C(F_B^d) ⊗ F_B^d,
F_SASM = M_S(F_CASM) ⊗ F_CASM,

where F_B^d denotes the input feature map of the d-th RDASM and ⊗ denotes element-wise multiplication. More details on CASM and SASM are given in Sections 3.2.2 and 3.2.3, respectively.
Then, the output F_SASM of the attention similarity module and the serial outputs F_B^{d,i} of the residual dense module are concatenated (more details are provided in Section 3.2.1), and a 1 × 1 convolution H_{1×1} is used to control the number of output channels. The input feature map of the (d+1)-th RDASM can then be obtained by

F_B^{d+1} = F_B^d + H_{1×1}([F_SASM, F_B^{d,1}, · · · , F_B^{d,8}]),

where F_B^{d+1} denotes the output of the d-th RDASM, and the feature map of the current layer is passed backward through local residual learning.

Residual Dense Module (RDM)
The residual dense module is designed with reference to RDN. Each RDM block adopts eight convolution-and-activation layers and achieves contiguous memory by transferring the feature F_B^d of the current layer to each subsequent layer, so as to make full use of the features of each layer, as shown in Figure 2. The output feature map F_B^{d,c} of the c-th Conv layer of the d-th RDASM can be expressed as

F_B^{d,c} = σ(W_B^{d,c} [F_B^{d,0}, F_B^{d,1}, · · · , F_B^{d,c−1}]),

where F_B^{d,i} is the feature map extracted from the i-th (i = 0, 1, · · · , c−1) Conv layer of the d-th RDASM (F_B^{d,0} = F_B^d denotes the block input), W_B^{d,c} is the weight of the c-th Conv layer, and σ denotes the LReLU activation function.
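A minimal sketch of this dense connectivity, with per-pixel 1 × 1 linear maps standing in for the module's 3 × 3 convolutions and random weights in place of trained ones (function names and the growth rate are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def lrelu(x, a=0.05):
    """Leaky ReLU activation."""
    return np.where(x > 0, x, a * x)

def rdm_forward(f_in, layers=8, growth=4):
    """Dense connectivity sketch: layer c sees the concatenation of the
    block input and all earlier layer outputs (contiguous memory)."""
    feats = [f_in]                                    # f_in: (C, H, W)
    for c in range(layers):
        x = np.concatenate(feats, axis=0)             # all previous features
        w = rng.normal(0.0, 0.1, (growth, x.shape[0]))  # 1x1 "conv" weights
        y = lrelu(np.tensordot(w, x, axes=(1, 0)))    # (growth, H, W)
        feats.append(y)
    return np.concatenate(feats, axis=0)

out = rdm_forward(np.zeros((8, 5, 5)))
print(out.shape)  # (40, 5, 5): channels grow to 8 + 8 * 4
```

The linearly growing channel count is why a 1 × 1 convolution is needed afterwards to fuse the concatenated features back to a fixed width.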

Channel Attention Similarity Module (CASM)
CASM uses global avg-pooling on each channel to compress the feature map from C × H × W to C × 1 × 1, so that one value represents one channel and global information embedding is achieved [28], as shown in Figure 3. We use F_B^d ∈ R^{C×H×W} to represent the feature map input to the CASM. The global spatial information is compressed into a channel descriptor z ∈ R^{C×1×1} through global avg-pooling, and the c-th element z_c is calculated by

z_c = H_GAP(f_c) = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} f_c(i, j),

where H_GAP represents global avg-pooling and f_c(i, j) is the value at position (i, j) of feature map F_B^d in channel c. Compared with max-pooling, avg-pooling is beneficial to smooth and suppress noise and achieves a better denoising effect (more details are provided in Section 4.5). Then, through Dilated+LReLU+Dilated, a gating mechanism with a sigmoid activation is used to learn the interrelationships between the channels, and we obtain the channel attention similarity map CAS_c ∈ R^{C×1×1}. It should be noted that here, we use dilated convolution [18] instead of standard convolution, with a dilation rate of 2, for two reasons: (1) dilated convolution enlarges the receptive field, which helps obtain better global similarity information; (2) dilated convolution has fewer parameters.
CAS_c = σ2(H_DConv2(σ1(H_DConv1(z)))),

where σ2 denotes the sigmoid activation function; H_DConv1 ∈ R^{(C/r)×C} and H_DConv2 ∈ R^{C×(C/r)} represent the two dilated convolution layers, respectively; r is the reduction ratio, set to 16 [28], which reduces the number of parameters; and σ1 represents the LReLU activation function after the first dilated convolution layer. CAS_c is the channel attention similarity obtained through the CASM, as shown in Figure 3. In addition, similar channels have similar weights: the two channels shown in dark red in Figure 3 have similar weights w1. Furthermore, CASM gives large weights to key channel features.
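The squeeze-and-excite computation of the CASM can be sketched as follows. Fully connected layers stand in for the paper's dilated convolutions (on a C × 1 × 1 descriptor a convolution reduces to a per-channel linear map), and the weights are random rather than trained; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lrelu(x, a=0.05):
    return np.where(x > 0, x, a * x)

def casm(f, r=16):
    """Channel attention sketch: squeeze with global avg-pooling, excite
    through a bottleneck (C -> C/r -> C), then rescale every channel."""
    C = f.shape[0]
    z = f.mean(axis=(1, 2))                        # (C,) channel descriptor
    w1 = rng.normal(0.0, 0.1, (max(C // r, 1), C))   # stand-in for H_DConv1
    w2 = rng.normal(0.0, 0.1, (C, max(C // r, 1)))   # stand-in for H_DConv2
    cas = sigmoid(w2 @ lrelu(w1 @ z))              # (C,) weights in (0, 1)
    return f * cas[:, None, None]                  # channel-wise rescaling

out = casm(rng.normal(size=(32, 6, 6)))
print(out.shape)  # (32, 6, 6)
```

Because the gate values lie in (0, 1), each channel is attenuated in proportion to its learned importance.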

Spatial Attention Similarity Module (SASM)
SASM also uses global average pooling, compressing the channel dimension from C × H × W to 1 × H × W, as shown in Figure 4. We use F_CASM ∈ R^{C×H×W} to represent the input feature map of the spatial attention module. It is compressed to R^{1×H×W} using global average pooling. Then, through a set of non-linear transformations, we obtain the spatial attention similarity map SAS_S ∈ R^{1×H×W}:

SAS_S = σ2(H_DConv(H_GAP(F_CASM))),

where H_GAP denotes global average pooling along the channel dimension, H_DConv denotes the dilated convolution with a 3 × 3 kernel and a dilation rate of 2, and σ2 denotes the sigmoid activation function.
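A toy version of the SASM, with a naive dilated convolution and a random untrained kernel (names ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_conv2d(img, kernel, rate=2):
    """Naive same-size dilated convolution with zero padding."""
    k = kernel.shape[0]
    pad = rate * (k // 2)
    p = np.pad(img, pad)
    out = np.zeros_like(img)
    H, W = img.shape
    for di in range(k):          # accumulate one dilated tap at a time
        for dj in range(k):
            out += kernel[di, dj] * p[di * rate:di * rate + H,
                                      dj * rate:dj * rate + W]
    return out

def sasm(f, rate=2):
    """Spatial attention sketch: pool channels to one map, run a dilated
    conv to see a wider neighborhood, gate with sigmoid, then rescale."""
    pooled = f.mean(axis=0)                        # (H, W) spatial descriptor
    kernel = rng.normal(0.0, 0.1, (3, 3))          # stand-in for H_DConv
    sas = sigmoid(dilated_conv2d(pooled, kernel, rate))
    return f * sas[None, :, :]                     # spatial-wise rescaling
```

With rate 2, the 3 × 3 kernel covers a 5 × 5 neighborhood at no extra parameter cost, which is the receptive-field argument made above.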

Implementation Details
In the model, a 3 × 3 convolution kernel was used in all cases unless otherwise specified, and a zero-padding strategy was used to keep the size of the image constant. A 1 × 1 convolution kernel was used after each concatenation layer to control the number of output channels. In addition, in the attention similarity module, we used dilated convolution instead of standard convolution to enlarge the receptive field and reduce the number of parameters. Moreover, the network contained a total of 16 residual dense attention similarity modules, with 8 convolution layers in each RDASM.
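The receptive-field benefit of dilation can be checked with the standard formula k_eff = k + (k − 1)(r − 1) for a k × k kernel with dilation rate r (a quick sanity check, not from the paper):

```python
def effective_kernel(k: int, rate: int) -> int:
    """Effective (one-sided) kernel size of a dilated convolution."""
    return k + (k - 1) * (rate - 1)

print(effective_kernel(3, 1))  # 3: a standard 3x3 conv
print(effective_kernel(3, 2))  # 5: the 3x3, rate-2 convs used in the ASM
```

So a rate-2 dilated 3 × 3 kernel sees a 5 × 5 area while keeping the parameter count of a 3 × 3 kernel.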

Experiment Results
In this section, we present the experimental setup of the model, the experimental results, and the corresponding ablation experiments.

Train Datasets
The training datasets were of three types: gray image datasets, color image datasets, and real noisy image datasets. The gray image datasets were used for blind-noise training and consisted of two public datasets, namely the Waterloo Exploration Database [30] and the BSD400 dataset [11,19]. The BSD400 dataset was randomly selected from ImageNet's [31] validation set and stored in PNG format. The Waterloo Exploration Database consists of 4744 natural images in PNG format. The color image datasets included the Waterloo Exploration Database and BSD432 [7]; the BSD432 dataset is derived from the Berkeley Segmentation datasets and contains 432 color images. For real noisy images, the PolyU-Real-World-Noisy-Images dataset [32] was used to train the model; it consists of 100 color images with real noise, obtained from five cameras: Sony A7 II, Nikon D800, Canon 80D, Canon 600D, and Canon 5D Mark II.

Test Datasets
Similarly, the test datasets also included gray image datasets, color image datasets, and real noisy image datasets. The gray image datasets were Set12 and BSD68 [7]; Set12 has 12 gray images, while BSD68 has 68. The color image test datasets included CBSD68, McMaster [33], and Kodak24 [34]; McMaster and Kodak24 contain 18 and 24 color images, respectively. The real noisy image test dataset was cc [35], which contains 15 real noisy images captured at different ISO settings (1600, 3200, and 6400).

Experimental Settings
The main parameters of model training are shown in Table 1. For gray and color images, the patch size was set to 80 × 80; for real noisy images, it was set to 64 × 64. On the gray image, color image, and real noisy image datasets, we trained for 400, 400, and 65 epochs, respectively. In addition, we set the initial learning rate to 1 × 10^−4; it remained unchanged for the first 80% of the epochs and was then multiplied by 0.1 in each subsequent epoch. Furthermore, in each epoch, we obtained a blind noisy patch by adding AWGN with a noise level σ ∈ [0, 75] to the clean patch.
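The schedule and blind-noise sampling described above can be sketched as follows (our reading of the stated settings, not the released training code; function names are ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def learning_rate(epoch, total_epochs=400, base_lr=1e-4):
    """Constant for the first 80% of epochs, then decayed by 0.1 per epoch."""
    boundary = int(total_epochs * 0.8)
    if epoch < boundary:
        return base_lr
    return base_lr * (0.1 ** (epoch - boundary + 1))

def blind_noisy_patch(clean_patch, sigma_max=75.0):
    """Sample a noise level uniformly from [0, sigma_max] and add AWGN."""
    sigma = rng.uniform(0.0, sigma_max)
    noisy = clean_patch + rng.normal(0.0, sigma, clean_patch.shape)
    return noisy, sigma

print(learning_rate(0))    # 1e-4: initial learning rate
print(learning_rate(320))  # 1e-5: first decayed epoch (80% of 400)
```

Sampling σ per patch is what makes the trained model blind: it never knows the noise level at test time.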

RDASNet for Gray Image Denoising
For gray image denoising, we chose several state-of-the-art denoising methods evaluated on the same test datasets, including BM3D [5], DnCNN [7], FFDNet [11], BRDNet [14], ADNet [15], and RDN [13]. BM3D is a denoising method based on image priors; DnCNN, BRDNet, and ADNet are non-blind CNN-based denoising methods, while FFDNet performs blind image denoising. It should be noted that the design of our residual dense module was inspired by RDN, and the noise levels of the RDN test datasets differ from those of the other methods, so we retrained RDN. The PSNR values of BM3D, DnCNN, FFDNet, BRDNet, and ADNet were taken directly from the respective papers, and the SSIM values were recalculated.
Tables 2 and 3 report the PSNR and SSIM results on the Set12 and BSD68 datasets, respectively. In terms of quantitative results, our RDASNet achieved the same or better results than all other methods in most cases; its results were mostly optimal or suboptimal. In particular, under complex noise, our model was superior to all of the most advanced image-denoising methods, mainly because it pays more attention to global image similarity information. Taking the visual results in Figure 5 as an example, the restoration by BM3D was the least ideal, and the other methods, such as FFDNet, BRDNet, and ADNet, all exhibited different degrees of distortion. In contrast, our RDASNet could better alleviate blur and restore more image details.

RDASNet for Color Image Denoising
For color image denoising, we compared RDASNet with CBM3D [5], DnCNN, FFDNet, BRDNet, ADNet, and RDN [13]. Table 4 reports the PSNR and SSIM results on the CBSD68, Kodak24, and McMaster datasets. The quantitative results show that our RDASNet outperformed all the other image-denoising methods on color images, mainly because our model pays more attention to the global information of the image. Furthermore, Figures 7 and 8 show the visualization results.

RDASNet for Real Noisy Image Denoising
For real noisy images, we chose several commonly used image-denoising methods evaluated on the same test dataset, such as CBM3D, WNNM [6], DnCNN, BRDNet, ADNet, and RDN. Table 5 reports the PSNR results on the cc dataset. From Table 5, we can observe that our model still achieved the best denoising effect on real images in terms of the overall mean value. Furthermore, in contrast to the other methods, RDASNet denoised images taken by different camera devices well and could thus better adapt to different devices.

RDASM Design
On the one hand, a CNN extracts image features within a fixed receptive field through convolution operations and pays more attention to the local information of the image. There are two common ways to enlarge the receptive field of a convolution [36]: (1) using a larger convolution kernel, for example, a 5 × 5 or 7 × 7 kernel instead of a 3 × 3 kernel; (2) deepening the network. Both methods lead to a surge in the number of parameters. On the other hand, there is redundant information in the image; that is, some details of the whole image are similar. Classical image-denoising methods, such as NLM [4] and BM3D [5], obtain better performance by exploiting image similarity. The core idea of NLM is that the estimate of the current pixel is obtained as a weighted average of pixels with similar structures in the image. Furthermore, BM3D is a block-matching and 3D-filtering method: during block matching, similar blocks are found and then filtered jointly. However, in image denoising, few CNN models make full use of the global similarity information of the image, which limits their representation ability. Inspired by this, we designed the RDASM.
The RDASM includes the RDM and the ASM. The RDM is formed through residual learning and dense convolution layers, as shown in Figure 2; it focuses on the local information of the image and extracts its features. The ASM consists of the CASM and the SASM and focuses on the global similarity information of the image from the channel and spatial dimensions, respectively, as shown in Figures 3 and 4. The ASM we designed has several distinctive features: (1) it uses an attention mechanism to mine global similarity information: through the CASM or SASM, a channel or spatial attention similarity map can be obtained, in which similar image details have similar weights and key features are given larger weights; (2) dilated convolution is used to enlarge the receptive field so as to better focus on the global information of the image, with fewer parameters than standard convolution; (3) avg-pooling is beneficial to smooth and suppress the noise; more details can be found in Section 4.4.2.
In order to verify the effectiveness of the RDASM, especially of the attention similarity module (ASM), we designed a set of comparative experiments to compare the RDM and the RDASM in image denoising on multiple datasets. The results in Table 6 show that the proposed RDASM achieves a better image-denoising effect, which proves the effectiveness of the proposed structure. In addition, we visualized the proposed RDASM, as shown in Figure 9, where (a1-a3) are the noisy images, (b1-b3) the corresponding heatmaps, and (c1-c3) the denoised images. In heatmap (b2), the marked regions in the lower left corner (a1), the middle (a2), and the upper right corner (a3) of the image have similar details and similar weights, and in the corresponding denoised image it can be seen that the details of regions a1, a2, and a3 are indeed very similar. In addition, the attention mechanism gives more weight to key features (red indicates large weight), and therefore region a2's weight w2 is larger than region a4's weight w4. This shows that, through the RDASM, the model can make full use of the redundant information in the image; the global similarity information, in particular, is very useful in images with complex noisy backgrounds. Figure 9. Heatmap visualization of the proposed RDASM: (a1-a3) are the noisy images, (b1-b3) are the corresponding heatmaps, and (c1-c3) are the corresponding denoised images. The details of regions a1-a3 are similar, and their weights w1, w2, and w3 are similar. Red indicates high weight, and attention gives more weight to key features.

Global Pooling Design
As mentioned in the section describing the RDASM design, it is global pooling that enables the attention mechanism to attend to the global information of the image, so the choice of global pooling is very important.
Here, we made a notable discovery. During an experiment, we accidentally found that after avg-pooling, the color of an image becomes dimmer and the noise appears less obvious, whereas max-pooling is just the opposite, highlighting the noise, as shown in Figure 10. An everyday analogy: in dim light at night it is difficult to notice a small spot on a face, but in a well-lit situation it is easy. We therefore wondered whether average pooling would have a positive effect on the final performance of the model and carried out experiments to verify this conjecture.
We then changed only the global pooling in CBAM, training models with average pooling only, max pooling only, and the original CBAM combination, respectively. To save time, each model was trained for only 200 epochs; the results are shown in Table 7. They confirm that a better image-denoising effect is achieved with avg-pooling; that is, avg-pooling can suppress the noise.
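The smoothing effect of avg-pooling can be reproduced numerically: averaging n noisy values divides the noise standard deviation by √n, while max-pooling keeps the largest (noisiest-looking) value. A small NumPy check (a toy experiment of ours, not the paper's ablation):

```python
import numpy as np

rng = np.random.default_rng(5)

# A flat gray patch corrupted by AWGN (sigma = 25).
clean = np.full((64, 64), 100.0)
noisy = clean + rng.normal(0.0, 25.0, clean.shape)

# Split into non-overlapping 2x2 windows, one per row of `blocks`.
blocks = noisy.reshape(32, 2, 32, 2).swapaxes(1, 2).reshape(-1, 4)
avg_pooled = blocks.mean(axis=1)   # noise std shrinks to ~25 / 2 = 12.5
max_pooled = blocks.max(axis=1)    # biased upward: picks the noisiest value

print(round(float(np.std(avg_pooled)), 1))   # ~12.5: noise suppressed
print(float(np.mean(max_pooled)) > float(np.mean(avg_pooled)))  # True
```

This matches the visual observation above: avg-pooled maps look dimmer and smoother, while max-pooled maps exaggerate noise peaks.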

Complexity Analysis
The testing speed of the model is also an important evaluation index. Thus, Table 8 shows the running times of BM3D, WNNM, DnCNN, FFDNet, BRDNet, RDN, and RDASNet for gray image denoising on images of size 256 × 256 and 512 × 512 at noise level σ = 50. Compared with RDN, our model was faster. In addition, we compared the number of parameters against the PSNR on McMaster (σ = 50), as shown in Figure 11. ADNet and BRDNet are lightweight models designed for resource-constrained situations, so their parameter counts are smaller, but our model performed better. Our model has more parameters overall; however, compared with RDN, it adds only 0.07 M parameters, and, as evident from Tables 2-4, PSNR increased by 0.10-0.20 dB for gray images and by 0.11-0.22 dB for color images. The evaluation was conducted in a PyCharm Community (2021) environment with an Nvidia GeForce RTX 3090 Ti GPU.

Conclusions
In this paper, we proposed a residual dense attention similarity network (RDASNet) for image denoising. The local information of the image is extracted by a CNN, and the global information is extracted by the attention similarity module. Our model obtains shallow features through a preprocessing module; the CNN then attends to local information, while the attention similarity module attends to globally similar information, so as to fully exploit the redundant information of the image. Similar details receive similar weights, and key features receive more weight, making the model more suitable for complex noise. Furthermore, global residual learning is used to enhance the flow of information from shallow to deep layers. Our proposed RDASNet is more suitable for blind noise, complex environments, and real noise.
In the future, we hope to deploy RDASNet on mobile devices. In general, deployment on mobile platforms requires a smaller model. One possible solution is to use depthwise separable convolution instead of traditional convolution to reduce the number of model parameters, or to compress the model through knowledge distillation to design a lightweight RDASNet.