Considering Image Information and Self-Similarity: A Compositional Denoising Network

Recently, convolutional neural networks (CNNs) have been widely used in image denoising, and their performance has been enhanced through residual learning. However, previous research has mostly focused on optimizing the network architecture of CNNs while ignoring the limitations of the commonly used residual learning. This paper identifies two such limitations: the neglect of image information and the lack of effective consideration of image self-similarity. To address them, this paper proposes a compositional denoising network (CDN) containing two sub-paths: the image information path (IIP) and the noise estimation path (NEP). IIP is trained via an image-to-image method to extract image information. NEP, in turn, exploits image self-similarity from the training perspective: a similarity-based training method constrains NEP to output similar estimated noise distributions for different image patches corrupted by a specific kind of noise. Finally, image information and noise distribution information are considered jointly for image denoising. Experimental results indicate that CDN outperforms other CNN-based methods in both synthetic and real-world image denoising, achieving state-of-the-art performance.


Introduction
Image denoising is a commonly studied problem in computer vision and has been shown to be important for medical images [1,2], remote sensing images [3], mobile phone images [4], etc. It aims to restore a corrupted image x to the ground-truth clean image y, which can be modeled as y = x − v, where v is the noise. Both synthetic and real-world noisy images are studied in this paper.
Recently, convolutional neural networks (CNNs) have been popularly adapted to image denoising. Zhang et al. [5] proposed a feed-forward denoising convolutional neural network (DnCNN) with residual learning and batch normalization to remove additive white Gaussian noise (AWGN). Residual learning here means training networks to estimate the noise of the noisy image and then subtracting it to obtain the corresponding clean image. Based on residual learning, CNNs obtained remarkable denoising results, completely exceeding traditional methods such as BM3D [6] and WNNM [7]. Currently, most work has focused on designing more effective network modules to enhance denoising performance. For instance, deeper networks [8-10], wider networks [4,11-14], and attention mechanisms [15-19] have been studied in depth. Additionally, some methods employed variations of convolution, such as deformable convolution [20], to improve image denoising.
Despite achieving high performance in image denoising, the aforementioned methods have not fully addressed the limitations of residual learning: the neglect of image information and the lack of effective consideration of image self-similarity.

Residual Learning
Residual learning was proposed in ResNet [23] to solve the performance degradation problem that arises with increasing network depth. With this learning strategy, the residual network learns a residual mapping for a few stacked layers. Before ResNet, learning a residual mapping had already been adopted in some low-level vision tasks [24,25]. Zhang et al. [5] extended this concept to image denoising, using a single residual unit to predict the residual image instead of many stacked units. Nowadays, residual learning is widely used in most deep denoising networks. However, residual learning alone may not be sufficient to obtain satisfactory denoising results, since image information acquisition is also important. Furthermore, existing residual-learning methods do not consider image self-similarity.

Deep Networks for Image Denoising
Over the years, many methods have been proposed for image denoising, including both traditional and deep-learning-based approaches. In this paper, we focus on the deep-learning-based methods. Zhang et al. [5] proposed a deep convolutional neural network (CNN) that goes beyond traditional Gaussian denoisers by utilizing residual learning to learn a residual mapping between noisy and clean images. Ren et al. [9] further extended the use of residual blocks in image denoising through their DN-ResNet, which incorporates dense connections and residual learning. Additionally, Zhang et al. [10] introduced an effective residual block that improves image denoising performance. Tian et al. [12], on the other hand, introduced batch renormalization to deep CNNs for image denoising, which effectively reduces the impact of different batch sizes on training; their study shows the effectiveness of renormalization compared to previous normalization techniques. These approaches demonstrate the efficacy of increasing network depth for image denoising.
However, the increased depth makes models suffer from vanishing or exploding gradients. Hierarchical networks with wide structures were proposed to alleviate this problem. Tian et al. [12] utilized a two-path network to increase the width of the network and thus obtain more features. They also proposed a dual denoising network (DudeNet) with two paths and further assigned them different functions [11]. Specifically, the top sub-path of DudeNet uses a sparse mechanism to extract global and local features. A non-local hierarchical network (NHNet) [17] used two sub-paths to process different resolutions of the noisy image. For the high-resolution path, it employed a novel upsampling method with a non-local mechanism to obtain effective features. Some U-Net-based networks adopt a three-path structure to improve denoising performance. DHDN [4] replaced the convolution blocks in the original U-Net [26] with dense blocks and obtained better denoising results. MCU-Net [14] added an extra branch of atrous spatial pyramid pooling (ASPP) based on residual dense blocks. The sub-paths of these models extract image features at different resolutions, which are fused at the end of the network for denoising. From a frequency-domain perspective, some methods [13,27] have employed the multi-level wavelet transform in image denoising and achieved high performance. This paper utilizes a hierarchical structure to design the network, with clearly defined sub-path functions: IIP extracts the image information, and NEP estimates the noise distribution.

Network Architecture
The proposed CDN is shown in Figure 1; it consists of three main modules: IIP, NEP, and IDM. Here, C denotes a convolution layer, BN denotes batch normalization [28], PR denotes the parametric rectified linear unit [29], and R denotes the rectified linear unit [30]. Convolution layers in CDN use a 3 × 3 kernel, stride 1, and padding 1. During training, an input noisy image x is divided into four equal patches, x_1, x_2, x_3, and x_4. This splitting operation aims to exploit the similarity of patches to train NEP. We empirically divide the image into four patches for the following reasons: (1) fewer patches do not take advantage of the similarity; and (2) more patches leave fewer noise samples per patch and thus cannot sufficiently estimate the noise distribution. The training method of IIP is image-to-image. Without loss of generality, we choose the first patch x_1 and use y_1 to denote its ground-truth clean image. Finally, the denoised x_1 output by CDN is denoted x̂_1.
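The patch-splitting step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's code; the function name and array layout are our own assumptions, and even image height and width are assumed:

```python
import numpy as np

def split_into_four(img):
    """Split an H x W (or H x W x C) image into four equal quadrants.

    Hypothetical helper illustrating the patch-splitting step used to
    train NEP; assumes H and W are even.
    """
    h, w = img.shape[0] // 2, img.shape[1] // 2
    x1 = img[:h, :w]   # top-left
    x2 = img[:h, w:]   # top-right
    x3 = img[h:, :w]   # bottom-left
    x4 = img[h:, w:]   # bottom-right
    return x1, x2, x3, x4
```

Each quadrant keeps enough pixels to estimate a noise distribution, which is the stated reason for stopping at four patches.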
When testing, the input of CDN is a complete noisy image, and the output is the denoised image.

Image Information Path (IIP)
IIP consists of one convolution layer and seven DBlocks, as shown in Figure 1. It is proposed to extract the image information. During training, the input of IIP is x_1. IIP extracts the image features of x_1, and then the denoised image x_1^c and the noise estimation x_1^n are obtained from these features:

x_1^c, x_1^n = Conv(F(x_1)),

where F(·) denotes the features extracted by IIP and Conv is the convolution layer changing the number of feature channels. x_1^c is used to constrain IIP to extract the image information via the image-to-image training method.
Therefore, y_1 is the optimization target of x_1^c, and SSIM is chosen as the loss function; x_1^n is further processed in IDM. The SSIM loss is as follows:

L_SSIM = 1 − SSIM(x_1^c, y_1) = 1 − ((2 μ_{x_1^c} μ_{y_1} + C_1)(2 σ_{x_1^c y_1} + C_2)) / ((μ_{x_1^c}² + μ_{y_1}² + C_1)(σ_{x_1^c}² + σ_{y_1}² + C_2)),

where μ, σ, and σ_{x_1^c y_1} denote the mean, standard deviation, and covariance, respectively, and C_1 and C_2 are constants that provide stabilization against small denominators. For testing, IIP receives a complete image and outputs its noise based on the image information.
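A minimal single-window version of the SSIM loss can be sketched in NumPy. The paper's implementation presumably uses a windowed SSIM; the constants below follow the common defaults for images in [0, 1] and are an assumption:

```python
import numpy as np

def ssim_loss(pred, target, c1=0.01**2, c2=0.03**2):
    """Global (single-window) SSIM loss between two images in [0, 1].

    A minimal sketch of the SSIM-based loss; c1 and c2 are the usual
    stabilizing constants for a dynamic range of 1.
    """
    mu_p, mu_t = pred.mean(), target.mean()
    var_p, var_t = pred.var(), target.var()
    cov = ((pred - mu_p) * (target - mu_t)).mean()
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / \
           ((mu_p**2 + mu_t**2 + c1) * (var_p + var_t + c2))
    return 1.0 - ssim  # identical images give a loss of 0
```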

Noise Estimation Path (NEP)
The architecture of NEP is similar to that of IIP. During training, x_1, x_2, x_3, and x_4 are fed into NEP in turn, and their estimated noise maps, n_1, n_2, n_3, and n_4, are output. We suggest that n_1, n_2, n_3, and n_4 should have similar distributions when the noise in the input image is of a specific kind. Therefore, the Kullback-Leibler divergence (KLD) is used to evaluate the distance between these noise distributions:

D_KL(P ∥ Q) = Σ_i P(i) log(P(i) / Q(i)),

where P and Q are probability distributions. The sum of these pairwise distances forms the loss function:

L_KLD = Σ_{i≠j} D_KL(n_i ∥ n_j).

By minimizing L_KLD, this similarity-based training method addresses the second limitation of residual learning.
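The similarity-based loss can be illustrated with a small NumPy sketch. Estimating each patch's noise distribution with a shared-range histogram is our own assumption for illustration, not necessarily how NEP represents distributions:

```python
import numpy as np

def kld(p, q, eps=1e-8):
    """Discrete Kullback-Leibler divergence D_KL(p || q) with smoothing."""
    p = p + eps
    q = q + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def similarity_loss(noise_patches, bins=32):
    """Sum of pairwise KLDs between histogram-estimated noise distributions.

    noise_patches: list of estimated noise maps n_i, one per image patch.
    The histogram binning is an illustrative assumption.
    """
    lo = min(n.min() for n in noise_patches)
    hi = max(n.max() for n in noise_patches)
    hists = [np.histogram(n, bins=bins, range=(lo, hi))[0].astype(float)
             for n in noise_patches]
    return sum(kld(hists[i], hists[j])
               for i in range(len(hists))
               for j in range(len(hists)) if i != j)
```

Patches carrying the same kind of noise produce near-identical histograms and a small loss; mismatched noise distributions inflate it.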
For testing, NEP receives a complete noisy image and outputs its noise distribution.

Integration Denoising Module (IDM)
IDM is proposed to integrate the outputs of IIP and NEP. It is a U-Net-based network, shown in Figure 2. DBlock in IDM is the basic feature extraction block, and PixelShuffle is the upsampling method based on efficient sub-pixel convolution [31]. IDM outputs the final estimated noise and then obtains the denoised image x̂_1 via residual subtraction. The L1 loss between x̂_1 and the ground-truth clean image y_1 is used as the loss function:

L_1 = ∥x̂_1 − y_1∥_1.
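Residual subtraction and the L1 loss amount to the following one-liners; this NumPy sketch uses hypothetical helper names for illustration:

```python
import numpy as np

def residual_denoise(noisy, estimated_noise):
    """Residual subtraction: the denoised image is the noisy input
    minus the network's final noise estimate."""
    return noisy - estimated_noise

def l1_loss(pred, target):
    """Mean absolute error between the denoised output and the clean target."""
    return float(np.mean(np.abs(pred - target)))
```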

Training Loss
The loss function used in this paper consists of three equally important components. Firstly, the image information loss L_SSIM is utilized to train IIP to extract image information. Secondly, the noise estimation loss L_KLD is employed to train NEP to estimate image noise accurately. Finally, the overall residual loss L_1 is used to ensure that the whole network outputs the clean image corresponding to an input noisy image. By combining these three components, CDN can be effectively trained to remove image noise. The training loss can be formulated as follows:

L = L_SSIM + L_KLD + L_1.

Synthetic Noise Datasets
DIV2K [32] is commonly used in image processing; it contains 800 images for training, 100 for validation, and 100 for testing. We used the training set of DIV2K to train CDN.
For testing, we used the gray-scale image datasets Set12 [5] and BSD68 [33] and the color-scale image datasets Set5 [34] and Kodak24 [35]. Figure 3 shows images from Set12, which contains C.man, House, Peppers, etc. The images in these datasets are all clean, and their corresponding synthetic noisy images are generated by adding AWGN. We follow the AWGN generation procedure of [5], in which the noise level is determined by the standard deviation σ. Three noise levels, σ = 15, σ = 25, and σ = 50, were chosen to train and test CDN.
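Generating a synthetic noisy image at a given noise level σ can be sketched as follows; this is a NumPy illustration, and the exact generation code of [5] may differ:

```python
import numpy as np

def add_awgn(clean, sigma, seed=None):
    """Add zero-mean AWGN with standard deviation sigma (pixel range 0-255).

    sigma matches the noise levels 15 / 25 / 50 used in the experiments.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=clean.shape)
    return clean + noise  # values may leave [0, 255]; clip before display
```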

Real-World Noise Datasets
Real-world noisy images are obtained directly in natural environments. Here, we used the training set of the Smartphone Image Denoising Dataset (SIDD) sRGB track [36] to train CDN. It contains 160 scene instances captured by five smartphone cameras under different lighting conditions and camera settings. There are two pairs of high-resolution images for each scene instance, and each pair contains one noisy image and its corresponding clean image. In total, 320 pairs of images were used for training. For testing, we used the SIDD validation set and the Darmstadt Noise Dataset (DND) [37]. DND does not provide any training data. It has 50 pairs of images captured by four different consumer cameras for testing. We obtained the PSNR and SSIM results by submitting the denoised images to the official DND website.

Training Setting
CDN is implemented in PyTorch 1.5.1 with Python 3.5 and CUDA 9.2. Experiments were run on NVIDIA Tesla P100 GPUs. We used the Adam [38] optimizer with an initial learning rate of 0.0002 and a weight decay of 0.0001 to minimize the loss function. The learning rate decreases as training progresses. During training, the mini-batch size was set to 64. Data augmentation was applied during training: images were randomly cropped into 128 × 128 patches and flipped horizontally and vertically.

Experimental Results
We evaluated the denoising performance of CDN on synthetic and real-world datasets and compared it with some popular methods.

Evaluation Metrics
Peak signal-to-noise ratio (PSNR) was used as an evaluation metric; it is one of the most common indicators for image processing methods. It measures the level of distortion or error between the original image and the reconstructed image by comparing their pixel values. PSNR is calculated based on the mean squared error (MSE) between the two images:

MSE = (1/N) Σ_{i=1}^{N} (I_i − R_i)²,

where N is the total number of pixels in the image and I_i and R_i are the pixel values of the original and reconstructed images, respectively. PSNR is then computed from the maximum possible pixel value (usually 255 for 8-bit images) and the MSE:

PSNR = 20 · log_10(MAX) − 10 · log_10(MSE), (9)

where MAX is the maximum possible pixel value of the image. A higher PSNR value indicates a lower level of distortion and better image quality. SSIM is also a widely used measure of the similarity between two images. It is designed to evaluate the perceived quality of images by taking their structural information into account. SSIM compares local patterns of pixel intensities in the reference and distorted images and computes a similarity score ranging from 0 to 1. The formulation of SSIM is described in Section 3.1.1. As with PSNR, a higher SSIM value indicates a lower level of distortion and higher image quality.
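The PSNR computation translates directly into code; a NumPy sketch:

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """PSNR in dB: 20*log10(MAX) - 10*log10(MSE).

    Returns infinity for identical images (MSE = 0).
    """
    diff = original.astype(float) - reconstructed.astype(float)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 20 * np.log10(max_val) - 10 * np.log10(mse)
```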
Consistent with most previous studies, we present the PSNR results for synthetic image denoising and both PSNR and SSIM results for real-world image denoising.

AWGN Denoising
Gray-scale images: We first report the training PSNR curve in Figure 4, which shows that CDN was well trained and obtains good denoising results on unseen data from Set12. Tables 1 and 2 show the synthetic gray-scale denoising results of different methods on Set12 and BSD68, respectively. CDN outperforms the other methods at all noise levels on Set12 and at σ = 15 and σ = 25 on BSD68. Additionally, CDN is a hierarchical network, and compared with other hierarchical networks (U-Net [26], DIDN [39], BRDNet [12], and NHNet [17]), it has superior performance. CDN exhibits the most substantial improvement on the Barbara image in Set12, with gains over the second-best method of 0.34 dB at σ = 15, 0.53 dB at σ = 25, and 0.9 dB at σ = 50. As illustrated in Figure 3, Barbara is a highly textured image, which indicates that CDN performs well in preserving image details. Moreover, the superior PSNR results of CDN on other texture-rich images, such as Monarch, Man, and Couple, further support this point. As shown in Figure 5, CDN produces the best visual result in denoising the image Monarch. Overall, these results suggest that CDN is a powerful and effective denoising method with the ability to preserve image details and texture.

Color-scale images:
We also tested CDN on color-scale noisy images. As shown in Table 3, CDN achieves the highest PSNR results on the Kodak24 dataset. The visual comparisons of CDN with other methods on Set5 and Kodak24 are shown in Figures 6 and 7, respectively. The results indicate that CDN outperforms the other methods and can recover cleaner images.

Real-World Image Denoising
While the AWGN task can provide some insight into the effectiveness of a denoising method, its limitation is clear. Real-world noise is more complicated and unpredictable, so evaluating denoising methods on real-world noisy images is more meaningful. To assess CDN's performance in real-world image denoising, we used the SIDD validation set and DND. Table 4 lists denoising results of different methods, where CDN achieves the best PSNR and SSIM results on the SIDD validation set and competitive performance on DND. Figure 8 shows some denoised images of CDN on the SIDD dataset, indicating that CDN successfully removes noise. These results demonstrate that CDN is also useful in denoising real-world images and has practical application value.

Ablation Experiments
The effectiveness of CDN in denoising relies on two key components: IIP, which extracts image information through image-to-image training, and NEP, which estimates noise through image self-similarity training. In this section, we study the contributions of these components in detail and demonstrate their effectiveness in improving denoising results.

Role of IIP
IIP in CDN is crucial for extracting image information, and it is necessary to determine whether this information improves the denoising performance. We first conducted an ablation experiment by removing IIP from a pretrained CDN model, denoted CDN-IIP. This was achieved by replacing the output of IIP with a zero matrix of the same size. Figure 9 shows the denoising results of CDN-IIP. Compared to CDN, the denoised images produced by CDN-IIP are significantly blurred and lack details, demonstrating the importance of image information in restoring image details. We then further evaluated IIP by removing it and retraining CDN, denoted CDN-IIP(R) in Table 5. The results show that removing IIP decreases both PSNR and SSIM, further confirming the effectiveness of IIP. IIP uses the SSIM loss because it provides a comprehensive measure of image similarity based on brightness, contrast, and structure. We compared the denoising results of different loss functions, including the L1 loss and the mean squared error (MSE) loss, and found that the SSIM loss yields better denoising performance, as shown in Table 6.

Role of NEP
CDN-NEP in Figure 9 denotes CDN with NEP cut off. Although the denoised image of CDN-NEP contains sufficient image information, the noise is clearly not removed well. Therefore, the noise distribution estimated by NEP is essential for removing noise. As in the study of IIP, we also report the PSNR and SSIM results of a retrained CDN-NEP. The results of CDN-NEP(R) on Set12 are listed in Table 5; they are significantly lower than those of CDN. This demonstrates that using the estimated noise distribution improves denoising performance.
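The ablation protocol of replacing a pretrained path's output with a zero matrix of the same size can be illustrated with a toy fusion function. The additive fusion below is an assumption for illustration only, not IDM's actual architecture:

```python
import numpy as np

def integrate(iip_out, nep_out, ablate_iip=False, ablate_nep=False):
    """Toy stand-in for fusing the two sub-path outputs.

    A path is 'removed' from the pretrained model by replacing its
    output with zeros of the same shape, leaving everything else intact.
    """
    if ablate_iip:
        iip_out = np.zeros_like(iip_out)
    if ablate_nep:
        nep_out = np.zeros_like(nep_out)
    return iip_out + nep_out  # hypothetical fusion for illustration
```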

Role of Training Methods
CDN addresses the limitations of traditional residual learning by using the image-to-image training method to train IIP and the similarity-based training method to train NEP. Here, we study the effect of these training methods on the denoising performance. CDN-SSIM in Table 5 denotes CDN trained without the image-to-image training method, implemented by removing the SSIM loss during training. Similarly, CDN-KLD denotes CDN trained without the similarity-based training method. CDN-SSIM-KLD denotes CDN trained without either training method, which can be considered ordinary residual learning. The results in Table 5 show that CDN performs comprehensively better than the other variants. In particular, CDN significantly outperforms CDN-SSIM-KLD, indicating that considering image information and self-similarity improves residual learning.
The input image is divided into patches during training, and the number of patches also affects the denoising performance. Table 7 lists the PSNR results for different numbers of patches, showing that too many patches lead to denoising performance degradation. The reason is that many patches imply small patch sizes, so an individual patch cannot contain enough image information. Based on these considerations, four patches were selected in this paper.

Discussion
Deep-learning-based image denoising methods are increasingly popular among researchers due to their ease of implementation and fast processing speed. While most research focuses on improving network architecture, potential limitations of the commonly used residual learning are often neglected. This paper points out two limitations of residual learning and proposes a novel denoising network, CDN, to solve them. We conducted comparison experiments between the proposed method (CDN) and original residual learning (CDN-SSIM-KLD), which demonstrated that our solution significantly improves denoising performance.
IIP and NEP in CDN, together with their training methods, aim to solve the two limitations of residual learning, the neglect of image information and of image self-similarity, respectively. Specifically, IIP is trained to extract image information using an image-to-image approach, while NEP estimates image noise by leveraging image self-similarity. To explicate their functions, we conducted corresponding experiments. Firstly, we observed that removing either IIP or NEP results in a decrease in PSNR and SSIM, which demonstrates their significance in image denoising. Secondly, we visualized the denoised images of CDN-IIP and CDN-NEP. The results revealed that CDN without IIP successfully removes noise but fails to preserve fine image details; the opposite holds for CDN without NEP. These results illustrate that IIP and NEP achieve their expected functions and improve residual learning.
However, there are still some limitations in our work. The same architecture is used for both IIP and NEP, and exploring different architectures could potentially improve the performance of the network. For example, vision-transformer-based image denoising networks have achieved state-of-the-art performance [45,46]. Incorporating a vision transformer as the backbone of IIP or NEP may enhance the denoising performance of CDN. In addition, although SSIM loss shows better results than L1 and MSE for the image information loss, there might be other loss functions that could be more effective. These concerns will be explored in future studies.

Conclusions
This paper introduces a novel denoising network, CDN, which aims to overcome two limitations of residual learning in image denoising. Firstly, residual learning fails to fully consider image information. This is tackled by training IIP in CDN to extract image information through the proposed image-to-image method. Secondly, residual learning does not take image self-similarity into account. To solve this issue, we propose a similarity-based training method to train NEP in CDN to estimate image noise. Consequently, CDN can successfully remove image noise using the extracted image information from IIP and the noise estimation from NEP. Experimental results show that CDN achieves superior performance in both synthetic and real-world image denoising. Besides the high performance, we also discuss the potential limitations of our work. While previous studies primarily focused on improving network architecture, we present a novel perspective that improves the training method for enhanced denoising results. This may encourage further exploration of the effectiveness of training methods in image denoising research.
Data Availability Statement: Datasets used in this paper are open access and are available from: DIV2K is openly available in "NTIRE 2017 challenge on single image super-resolution: Dataset and study", reference number [32]; Set12 is openly available in "Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising", reference number [5]; BSD68 is openly available in "Fields of Experts: a framework for learning image priors", reference number [33]; Set5 is openly available in "Accurate Image Super-Resolution Using Very Deep Convolutional Networks", reference number [34]; Kodak24 is openly available in "Kodak lossless true color image suite: PhotoCD PCD0992" at url: http://r0k.us/graphics/kodak.182 (accessed on 14 June 2023), reference number [35]; SIDD is openly available in "A High-Quality Denoising Dataset for Smartphone Cameras", reference number [36]; DND is openly available in "Benchmarking Denoising Algorithms with Real Photographs", reference number [37]. A preprint has previously been published in arXiv by Zhang et al. [47]. Our code will be released on https://github.com/JiaHongZ/CDN (accessed on 14 June 2023).

Conflicts of Interest:
The authors declare no conflict of interest.