Dense residual Transformer for image denoising

Image denoising is an important low-level computer vision task, which aims to reconstruct a noise-free, high-quality image from a noisy observation. With the development of deep learning, convolutional neural networks (CNNs) have gradually been applied to, and achieved great success in, image denoising, image compression, image enhancement, etc. Recently, the Transformer has become a popular technique that is widely used to tackle computer vision tasks. However, few Transformer-based methods have been proposed for low-level vision tasks. In this paper, we propose a Transformer-based image denoising network named DenSformer. DenSformer consists of three modules: a preprocessing module, a local-global feature extraction module, and a reconstruction module. Specifically, the local-global feature extraction module consists of several Sformer groups, each of which contains several ETransformer layers and a convolution layer, together with a residual connection. These Sformer groups are densely skip-connected to fuse the features of different layers, and they jointly capture the local and global information from the given noisy images. We conduct comprehensive experiments with our model. Experimental results show that our DenSformer achieves improvements over several state-of-the-art methods, on both synthetic and real noise data, in objective and subjective evaluations.


Introduction
The acquisition of images and videos basically depends on digital devices; however, the collection process is often interfered with by various degradation factors. Noise is blended into the images/videos, resulting in a loss of image quality. Image denoising therefore aims to recover a clean, high-quality image from its noisy observation. This problem is obviously ill-posed, because the degradation process is unknown and irreversible. To tackle it, some early denoising methods utilize specific filters, such as mean filters, non-local means (NLM) [1] and 3D transform-domain collaborative filtering (BM3D) [2], to eliminate the possible noise in the given noisy images. In recent years, convolutional neural networks (CNNs) have also been employed for image denoising. How to further improve denoising performance remains an open challenge, especially for real-world scenarios.
With the development of deep learning, many state-of-the-art CNN models have achieved good performance in many computer vision tasks. Thanks to several revolutionary works, CNN-based networks have become the main baseline for image denoising. Classical architectural elements such as residual skip-connections [3] and dense skip-connections [4] are utilized to improve the feature representation ability for image denoising. Nevertheless, CNN-based networks have shown that their ability to fit the noise distribution does not keep improving as the number of layers increases. Technical issues like gradient vanishing have also plagued researchers. Moreover, for real image denoising, the performance on complex noise removal is limited when only the residual learning strategy is utilized.
To tackle the above problems, on the one hand, many works have attempted to employ attention mechanisms, including channel attention and spatial attention, to enhance the representation of local features. On the other hand, global information is also considered an important cue for recovering clean images, so non-local operations are widely utilized in some recent network architectures. In particular, the Transformer [5], originally designed for natural language processing, has been proved a useful tool for capturing global information by exploiting long-range dependencies between pixels. Dosovitskiy et al. propose the ViT network [6], successfully applying the Transformer to computer vision tasks. Since then, various exciting works [7,8,9] have made great efforts to design Transformer-based architectures for specific tasks. However, some technical issues still limit the application of the Transformer. For example, border pixels of an image can hardly be utilized, because their adjacent pixels fall out of range, so the Transformer cannot capture local information well enough. Therefore, how to apply the Transformer, especially to low-level computer vision tasks, is still a challenge.
In this paper, we propose an end-to-end Transformer-based model for image denoising, named Dense residual Transformer (DenSformer). DenSformer is composed of a preprocessing module, a local-global feature extraction module, and a reconstruction module. Specifically, the preprocessing module is exploited to extract shallow features from input images. The local-global feature extraction module consists of several Sformer groups, and each Sformer group includes several ETransformer layers and one convolutional layer. Additionally, residual skip-connections are utilized to assemble these layers. Finally, the reconstruction module is utilized to restore the clean image. We quantitatively compare our DenSformer with other existing denoising methods. Experimental results on different test sets verify the effectiveness of our model in objective and subjective evaluations. Overall, the contributions of this paper can be summarized as:
• We propose an end-to-end Transformer-based network for image denoising, where both Transformer and convolutional layers are utilized to implement the fusion between local and global features.
• We design a residual in residual architecture to assemble multiple Transformers and convolutional layers, to achieve better performance in a deeper network.
• We introduce a depth-wise convolutional layer into Transformer, which is used to preserve the local information in the forward process of Transformer.

CNN-based Image Denoising
With the development of deep learning, many researchers have attempted to design novel denoising models based on convolutional neural networks (CNNs), and most of them have achieved impressive performance improvements. Early CNN-based models are trained on synthetic image data with AWGN. Burger et al. propose to apply a plain feed-forward neural network to the denoising task, achieving results comparable with BM3D [2]. Chen et al. [11] propose a trainable nonlinear reaction diffusion (TNRD) model to remove noise of different levels. Zhang et al. [12] design a network named DnCNN, which demonstrates the effectiveness of residual learning for image denoising, and Lin et al. [38] propose an Adaptive and Overlapped Average Filter (AOAF) to obtain a better noise distribution. Later on, more works focus on the design of CNN-based network architectures [39], including enlarging the receptive field and balancing performance against model size. These models can learn the noise distribution well from the training data, but are not well suited to real noise removal. Zhang et al. [14] further generate a noise map relating the noise level to the noisy image, and demonstrate a spatially-invariant denoising algorithm for real image denoising. MIRNet [15] presents an architecture able to handle multiple image restoration tasks, including image denoising, super-resolution and enhancement, with several new building blocks that extract and utilize feature information at multiple scales. RDN [4] combines DenseNet blocks and global residual learning to implement the fusion of local and global features. Although these models make great efforts to improve real denoising performance, balancing the trade-off between model size and performance still requires much work.

Vision Transformer
Recently, the Transformer [5] has become popular in the computer vision field. The emergence of ViT [6] marks the beginning of applying the Transformer to computer vision. Nevertheless, there are still some weaknesses, such as the requirement for large-scale datasets and high time complexity. Driven by ViT, various Transformer-based models have been proposed for computer vision tasks such as image classification [9], image/video segmentation [21] and object detection [8]. Some recent works have tried to apply the Transformer to image restoration. Wang et al. propose a Detail-Preserving Transformer [44] for light field image super-resolution, which effectively restores detail information of light field images using the Transformer. Chen et al. propose a backbone model named IPT [23], based on the standard Transformer, to handle multiple image degradation tasks. However, IPT is a pre-training model, which means it first needs to be trained on a large-scale dataset, and its image restoration performance is also limited. Wang et al. further propose a U-shaped Transformer network [24], which achieves good performance on image restoration. However, how to apply the Transformer to low-level computer vision tasks is still an open issue.

Proposed Method
In this section, we first describe our designed network architecture. Then, some details of the main components in DenSformer are provided. Finally, we discuss the effects of the residual skip-connection strategies on the fusion between the local and global features.

Network Architecture
As shown in Fig. 1, given a noisy input image I_noisy ∈ R^{H×W×C_in} (H and W are the height and weight of the image, and C_in is the channel number of the given image), the preprocessing module first extracts the shallow features F_0 ∈ R^{H×W×C}, where C is the channel number of the shallow features. This can be represented as

F_0 = H_pre(I_noisy),

where H_pre denotes the preprocessing module.
Next, a local-global feature extraction module is designed to extract the local and global features from F_0, where the features are learned by multiple Transformer blocks (named Sformer blocks) in a residual-in-residual structure. These Transformer blocks are assembled by dense skip-connections, and a long skip-connection is also exploited. So we can further have

F_D = H_TB(F_0),

where H_TB(·) denotes our proposed dense Transformer-based structure, which contains M Sformer blocks. We treat the output F_D as deep features, which are then converted to the high-quality image I_out by combining with F_0 through a long skip-connection:

I_out = H_res(F_D + F_0) + I_noisy = H_DenS(I_noisy),

where H_res denotes the reconstruction module, which is made up of a 3×3 convolutional layer. With dense residual skip-connections, our network can transmit the shallow features directly to the reconstruction module. Besides, we use residual learning to reconstruct the residual between the noisy image and the corresponding clean image; H_DenS is the mathematical representation of our DenSformer.
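The overall pipeline above (shallow features, a deep body with a long skip, and residual reconstruction) can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the Sformer body is replaced by a stand-in convolutional block, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class DenSformerSketch(nn.Module):
    """Illustrative skeleton of the preprocess -> body -> reconstruct flow."""
    def __init__(self, c_in=3, c=64):
        super().__init__()
        self.pre = nn.Conv2d(c_in, c, 3, padding=1)   # H_pre: shallow features F_0
        self.body = nn.Sequential(                    # stand-in for the Sformer groups H_TB
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c, c, 3, padding=1),
        )
        self.rec = nn.Conv2d(c, c_in, 3, padding=1)   # H_res: 3x3 reconstruction conv

    def forward(self, x):
        f0 = self.pre(x)
        fd = self.body(f0) + f0       # long skip-connection: F_D + F_0
        return x + self.rec(fd)       # residual learning: add back the noisy input

noisy = torch.randn(1, 3, 64, 64)
out = DenSformerSketch()(noisy)
print(out.shape)  # torch.Size([1, 3, 64, 64])
```

Note that the output resolution matches the input, since denoising is a dense prediction task with no down-sampling in this sketch.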
To improve the effectiveness of our DenSformer in the image denoising task, we choose the L1 loss to optimize the whole network. The training set {I_noisy^i, I_clean^i}_{i=1}^{N} contains N pairs of noisy images and the corresponding clean images. The training goal of DenSformer is to minimize the L1 loss function [46]:

L(Θ) = (1/N) Σ_{i=1}^{N} ||H_DenS(I_noisy^i) − I_clean^i||_1,

where Θ denotes the parameter set of our network. The loss function is optimized by stochastic gradient descent. More details of our network can be found below.
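The L1 objective above amounts to a mean absolute error between the network output and the clean target. A minimal sketch, with an identity function standing in for the network:

```python
import torch

# Stand-in for H_DenS; any image-to-image model would slot in here.
net = lambda x: x

noisy = torch.rand(4, 3, 32, 32)   # a mini-batch of noisy inputs
clean = torch.rand(4, 3, 32, 32)   # the paired clean targets

# L1 loss: mean of |H_DenS(I_noisy) - I_clean| over the batch
loss = torch.abs(net(noisy) - clean).mean()
```

In practice one would use `torch.nn.L1Loss()` with an optimizer such as Adam; the per-pair absolute differences are averaged exactly as the summation in the equation describes.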

Sformer Block
We now give more details about our proposed Sformer block. As shown in Fig. 2, the ETransformer layer is designed based on the LeWin Transformer [24]. To enhance the learning of local and global information in the Transformer, the 2D input tokens are first reshaped to a 3D feature map along the spatial dimension and then processed by a depth-separable convolution. Instead of being generated by a plain linear projection, some local information can thus be preserved by the convolutional operation. Thereby, Q, K and V are computed as

Q, K, V = H_DS(LN(X)),

where H_DS(·) denotes the depth-wise convolution layer and LN(·) denotes layer normalization. It is noted that the convolutional operation can also decrease the memory consumption when calculating self-attention, since the spatial resolution of the input features can be reduced. Moreover, we perform the self-attention within non-overlapping local windows, which further reduces the computational cost significantly: the complexity drops from quadratic to linear with respect to the image size.
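The token-to-feature-map round trip for the convolutional Q/K/V projection can be sketched as follows. The layer sizes and the exact depth-separable arrangement (depth-wise 3×3 followed by point-wise 1×1) are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Generate Q, K, V with a depth-separable convolution after LayerNorm."""
    def __init__(self, dim=32):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # depth-wise 3x3
        self.to_qkv = nn.Conv2d(dim, dim * 3, 1)                 # point-wise 1x1

    def forward(self, tokens, h, w):
        # tokens: (B, H*W, C) -> LayerNorm -> reshape to (B, C, H, W)
        x = self.norm(tokens)
        x = x.transpose(1, 2).reshape(tokens.size(0), -1, h, w)
        q, k, v = self.to_qkv(self.dw(x)).chunk(3, dim=1)
        # flatten each map back to token form for self-attention
        flat = lambda t: t.flatten(2).transpose(1, 2)
        return flat(q), flat(k), flat(v)

tok = torch.randn(2, 8 * 8, 32)          # 8x8 spatial grid of 32-dim tokens
q, k, v = ConvProjection()(tok, 8, 8)
print(q.shape)  # torch.Size([2, 64, 32])
```

Because the projection is a convolution over the spatial grid, each Q/K/V entry mixes a 3×3 neighborhood, which is how local context survives into the attention computation.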
The MLP in the original Transformer processes each token independently and thus cannot leverage local context; however, adjacent pixels are crucial references for image restoration. To further enhance the local information, we utilize LeFF [24] as a substitute for the MLP in the original Transformer. As shown in Fig. 2(c), a linear layer is first used to increase the feature dimension, so that the output tokens can be reshaped to 2D feature maps. Then a 3×3 depth-wise convolution [29] is applied to capture local information, after which we flatten the feature maps back to tokens, and another linear layer adjusts the channels to match the input dimension of the next ETransformer layer. In this way, the depth-wise convolution in LeFF helps enhance local information extraction while avoiding heavy computation.
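The expand → reshape → depth-wise conv → flatten → project sequence just described can be sketched like this; the hidden dimension and activation choice are assumptions for illustration:

```python
import torch
import torch.nn as nn

class LeFF(nn.Module):
    """Locally-enhanced feed-forward: linear expand, 3x3 depth-wise conv, linear project."""
    def __init__(self, dim=32, hidden=128):
        super().__init__()
        self.up = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.dw = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # depth-wise: per-channel
            nn.GELU(),
        )
        self.down = nn.Linear(hidden, dim)  # back to the ETransformer input width

    def forward(self, tokens, h, w):
        b, n, _ = tokens.shape
        x = self.up(tokens)                          # (B, N, hidden): expand features
        x = x.transpose(1, 2).reshape(b, -1, h, w)   # tokens -> 2D feature maps
        x = self.dw(x)                               # capture local 3x3 context
        x = x.flatten(2).transpose(1, 2)             # feature maps -> tokens
        return self.down(x)                          # (B, N, dim)

out = LeFF()(torch.randn(2, 64, 32), 8, 8)
print(out.shape)  # torch.Size([2, 64, 32])
```

A depth-wise convolution applies one 3×3 kernel per channel, so its cost grows linearly in the channel count, which is why the extra local mixing stays cheap compared with a dense convolution.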
Finally, the computation of our ETransformer layer is represented as

X'_l = W-MSA(LN(X_{l−1})) + X_{l−1},
X_l = LeFF(LN(X'_l)) + X'_l,

where X'_l and X_l are the outputs of the W-MSA module and the LeFF module, respectively, and LN represents layer normalization. On one hand, the convolutional operation in ETransformer layers can decrease the memory consumption when calculating self-attention. On the other hand, local information is extracted and then fused with the learned global information in ETransformer layers. Overall, the Sformer block can be formulated as

F_Di = H_S(F_i) = F_i + H_TC(H_ET(F_i)),

where F_i denotes the input feature, F_Di denotes the output of the i-th Sformer block, H_TC(·) denotes the convolution layer in the Sformer block, H_ET(·) denotes the ETransformer layers, and H_S(·) denotes the whole Sformer block. With Sformer blocks, our network can extract deep features from the input and help reconstruct high-frequency details.
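The two residual equations for one ETransformer layer can be rendered directly in code. Here standard `nn.MultiheadAttention` and a plain MLP stand in for W-MSA and LeFF (the real modules are window-partitioned and convolutional, as described above); everything else follows the pre-norm residual pattern of the equations:

```python
import torch
import torch.nn as nn

class ETransformerLayerSketch(nn.Module):
    """Pre-norm residual layer mirroring X'_l and X_l above (with stand-in modules)."""
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.wmsa = nn.MultiheadAttention(dim, heads, batch_first=True)  # stand-in for W-MSA
        self.leff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                  nn.Linear(dim * 4, dim))               # stand-in for LeFF

    def forward(self, x):
        y = self.ln1(x)
        x = x + self.wmsa(y, y, y)[0]          # X'_l = W-MSA(LN(X_{l-1})) + X_{l-1}
        return x + self.leff(self.ln2(x))      # X_l  = LeFF(LN(X'_l)) + X'_l

x = ETransformerLayerSketch()(torch.randn(2, 64, 32))
print(x.shape)  # torch.Size([2, 64, 32])
```

The Sformer block then stacks several such layers, applies a convolution, and adds the block input back, matching F_Di = F_i + H_TC(H_ET(F_i)).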

Dense Residual Skip-Connection
With the emergence of ResNet [3] and DenseNet [26], skip-connections in neural networks have been proved to make networks more robust and training more stable. Besides, skip-connections also facilitate the fusion of feature information between different layers, especially when crossing multiple layers.
Furthermore, applying dense residual connections between multiple layers can further improve performance. Generally speaking, shallow features mainly contain low-frequency information, while deep features focus on recovering high-frequency information. With a number of long or short skip-connections, the low- and high-frequency information can be aggregated. In DenSformer, convolutional operations and the multi-head self-attention mechanism are exploited to capture the local and global features, respectively. Therefore, we present a dense residual skip-connection scheme to better bridge the local and global features and implement their fusion.
Combining variants of skip-connection schemes, we design a structure named dense residual skip-connection. To fuse the shallow and deep features, we add a skip-connection between every two Sformer groups, and a long skip-connection from the shallow features to the deep features.
Besides, all Sformer groups are densely skip-connected to enhance the feature representation. As shown in Fig. 1, each Sformer group is connected to all the following groups. In this way, shallow features and deep features can be densely fused, where low and high frequency information can be well reconstructed.
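The dense connectivity described above, where each group is fed by all preceding groups and a long skip adds the shallow features back at the end, can be sketched as follows. The groups are stand-in convolutional blocks, and the choice of summation (rather than concatenation) for fusing the incoming connections is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class DenseResidualBody(nn.Module):
    """Each group receives the sum of the shallow features and all prior group outputs."""
    def __init__(self, c=16, n_groups=3):
        super().__init__()
        self.groups = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(c, c, 3, padding=1))
            for _ in range(n_groups)
        )

    def forward(self, f0):
        outs = [f0]
        for g in self.groups:
            outs.append(g(sum(outs)))   # dense: fed by all previous outputs
        return outs[-1] + f0            # long skip from the shallow features

y = DenseResidualBody()(torch.randn(1, 16, 24, 24))
print(y.shape)  # torch.Size([1, 16, 24, 24])
```

Summation keeps the channel count fixed regardless of depth, which is one way such dense fusion can avoid the parameter growth that channel concatenation would incur.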

Experiment
In this section, we demonstrate the quantitative effectiveness of our method on both synthetic and real noisy datasets; moreover, some visual results of denoised images are also provided to evaluate the subjective performance of our models.

Experimental settings
Training Data. We use DIV2K and Flickr2K as our training datasets.
The DIV2K dataset is a high-quality dataset consisting of 800 training images, 100 validation images, and 100 test images. The Flickr2K dataset contains 2,650 high-resolution images. In the synthetic denoising experiments, different levels of AWGN [37] are added to the clean images to generate multiple degraded images. For real image denoising, we adopt the SIDD Medium dataset [29] for training. SIDD used five different mobile phones to capture 30,000 noisy images in different scenes. For the clean images, SIDD removes defective pixels in each image, then aligns the images, and finally generates a "noiseless" real image.
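Synthesizing a degraded training sample from a clean image amounts to adding zero-mean Gaussian noise of a chosen standard deviation. A minimal sketch, assuming 8-bit images handled on the 0-255 scale:

```python
import numpy as np

def add_awgn(clean, sigma, seed=0):
    """Add additive white Gaussian noise of level `sigma` and clip to valid range."""
    rng = np.random.default_rng(seed)
    noisy = clean.astype(np.float64) + rng.normal(0.0, sigma, clean.shape)
    return np.clip(noisy, 0, 255)

clean = np.full((8, 8, 3), 128.0)     # toy uniform-gray "clean" image
noisy = add_awgn(clean, sigma=50)     # one of the levels used in the paper
print(noisy.shape)  # (8, 8, 3)
```

Sampling a fresh noise realization per training iteration (rather than fixing one noisy copy per image) is a common choice, as it effectively enlarges the training set.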

Synthetic Noisy Images
To achieve a fair comparison, the same noise, AWGN with noise levels σ = 30, 50, 70, is added to the high-quality images to obtain noisy images. The final results on synthetic noisy images are listed in Table 1. For synthetic noise removal on gray-scale images, our method outperforms other image denoising methods and achieves the best performance on the Set12 and BSD68 test sets.
For color images, our network obtains a 0.39 dB improvement on Kodak with σ = 70. These results demonstrate the effectiveness of our proposed DenSformer on synthetic noise. It is noted that when the noise level is σ = 70, our network does not always perform well and is worse than RDN in some cases. We conjecture that the dense skip-connection may be less suitable for high noise levels, which we will address in future work.
We also show visual denoising results of different methods. Specifically, the building textures and the chicken feathers, which are hard to distinguish under heavy noise, are presented in Fig. 3 and Fig. 4, respectively. In the subjective results, we find that other denoising methods tend to remove the edge details together with the noise, which makes the results too smooth. BM3D [2] preserves image structure to some degree but fails to remove noise thoroughly; DnCNN [12] and IRCNN [31] over-smooth image edges; FFDNet [14] is slightly better than the former two methods. RDN [4] restores cleaner images, but some textures and tiny details are still lost during the denoising process. The main reason is that these methods are limited in extracting high-frequency features. In contrast, our network is able to reconstruct high-frequency local details as well as clean smooth areas, because the ETransformer in our method can better capture global information of the image, and the dense residual skip-connection scheme can fuse the shallow and deep features to help reconstruction.

Real Noisy Images
To evaluate the denoising performance of our method on real noisy images, we conduct a series of experiments. We adopt the SIDD dataset as our training dataset, which contains 1280 noisy-clean image pairs with a resolution of 256 × 256. The SIDD validation dataset and the DnD dataset are adopted as test datasets.
In the comparative experiments, the quantitative results on the test sets and the parameter comparison are shown in Table 2.

Ablation Study
For the ablation study, we train our DenSformer on DIV2K and Flickr2K and test it on Kodak24. We first compare three variants of skip-connection in our network: dense residual skip-connection, local residual skip-connection, and cross residual skip-connection. Specifically, dense residual skip-connection is the scheme employed in DenSformer. Local residual skip-connection means there are only residual skip-connections between every two Sformer groups, plus a long skip-connection between the input and output. Cross residual skip-connection is the dense residual scheme but without the local residual skip-connections.
To evaluate the impact of ETransformer layers in the Sformer block, we conduct another ablation experiment on the number of ETransformer layers. The results are summarized in Fig. 7. It can be observed that the PSNR is positively correlated with the number of ETransformer layers per Sformer block. Nevertheless, the performance gain gradually saturates as more layers are added, while the number of parameters keeps increasing and makes the model large. To balance the trade-off between parameters and performance, we set the number of ETransformer layers to 4.
Figure 7: Impact of ETransformer layers in Sformer. Results are tested on Kodak with σ = 50.

We also show the effects of different skip-connection schemes in Table 3. It can be observed that dense residual skip-connection yields the best result, outperforming the other two schemes by 0.04 dB and 0.05 dB in PSNR, respectively. Meanwhile, there is only a slight increase in training time, which means our dense residual skip-connection scheme achieves higher PSNR at a small time cost. The effects of different layers and blocks in ETransformer are further examined, showing that the convolution layer in the generation of Q, K and V and the LeFF layer in the feed-forward network help the Transformer better extract and utilize information, performing well across different denoising tasks.

Conclusion
In this paper, we propose a dense residual skip-connection network based on the Transformer, named DenSformer, for image denoising.