A Multi-Scale Feature Extraction-Based Normalized Attention Neural Network for Image Denoising

Due to the rapid development of deep learning and artificial intelligence techniques, denoising via neural networks has drawn great attention for its flexibility and excellent performance. However, in most convolutional network denoising methods, convolution kernels of only a single size are used in each layer, and features of distinct scales are neglected. Moreover, in the convolution operation, all channels are treated equally; the relationships between channels are not considered. In this paper, we propose a multi-scale feature extraction-based normalized attention neural network (MFENANN) for image denoising. In MFENANN, we define a multi-scale feature extraction block to extract and combine features at distinct scales of the noisy image. In addition, we propose a normalized attention network (NAN) to learn the relationships between channels, which smooths the optimization landscape and speeds up the convergence process for training an attention model. Moreover, we introduce the NAN to convolutional network denoising, in which each channel gets a gain and channels can play different roles in the subsequent convolution. To verify the effectiveness of the proposed MFENANN, we conducted experiments on both grayscale and color image sets with noise levels ranging from 0 to 75. The experimental results show that, compared with some state-of-the-art denoising methods, the images restored by MFENANN have larger peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) values and a better overall appearance.


Introduction
Image denoising is a fundamental and classic topic in image processing. Due to the varying environment and sensor noise, a captured image usually contains noise, and the transmission and storage processes may also cause the image to be degraded by noise [1]. Therefore, image denoising is an important and indispensable part of many high-level vision tasks [2][3][4]. Additive white Gaussian noise (AWGN) is the most representative of all kinds of noise, and we make the common assumption that the images are degraded by AWGN. The model of an image degraded by AWGN can be described as y = x + v, where y is the observed degraded image, x is the noiseless clean image and v is AWGN with zero mean and standard deviation σ. The image denoising problem is to restore the noiseless clean image x from the observed image y.
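As a concrete illustration of the degradation model y = x + v above, the following NumPy sketch adds AWGN to a hypothetical clean image; the image, its size and the noise level are illustrative assumptions, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((64, 64))              # hypothetical clean image in [0, 1]

sigma = 25 / 255.0                    # noise level 25 on the 0-255 scale
v = rng.normal(0.0, sigma, x.shape)   # AWGN: zero mean, standard deviation sigma
y = x + v                             # observed degraded image: y = x + v

# Denoising aims to recover x from y; with a perfect estimate of v,
# the clean image is simply y - v.
restored = y - v
assert np.allclose(restored, x)
```

In practice the noise v is of course unknown, which is why estimating it (directly or via a learned residue) is the core of the methods surveyed below.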
Recently, a large number of methods have been proposed for image denoising [5][6][7][8][9]. A direct way to restore the image is to estimate the noise v, so that the noiseless clean image is acquired as y − v. However, for a long period, accurately estimating the noise was a difficult, almost impossible task. After convolutional neural networks became popular, a deep convolutional residual network was proposed in [10] to learn the noise, which achieved results superior to many typical denoising methods. The bilateral filter [11] is a widely used denoising method owing to its adaptability and good performance, but its performance decreases rapidly at high noise levels. An improvement was the non-local means (NLM) denoising method [12], built on the assumption that natural scenes tend to repeat themselves at the same and different scales. NLM performs better than the bilateral filter, but a difficulty for NLM is tuning the hyper-parameters, which depend on the noise's standard deviation, so an improper choice of hyper-parameters causes it to lose edges or leave noise. Ville [13] uses Stein's unbiased risk estimate to monitor the mean square error and avoid tuning the hyper-parameters. BM3D [14] represents the peak of the improved NLM methods and is a benchmark for image denoising. Transform domain denoising is another popular kind of denoising method [15,16]: it transforms a noisy image to a transform domain and removes noise by adjusting the coefficients. Fourier transform denoising (FTD) is a typical transform domain denoising method [17]. It transforms a noisy image to the frequency domain, removes the frequencies associated with noise and recovers the image by the inverse Fourier transform. FTD faces the difficulty of determining whether high-frequency information is noise or image features.
Wavelet domain denoising is a development of Fourier transform denoising, which maps an image to the wavelet domain; the wavelet coefficients of higher amplitude carry information, and noise is removed by clipping the smaller-amplitude coefficients [18,19]. Rajwade in [20] used singular value decomposition for image denoising; noise is considered to relate to the smaller singular values and is removed by dropping them. Sparse and redundant representation is another popular transform domain denoising method, which trains a redundant dictionary from the noisy image and acquires a restored image by optimizing an objective function with sparse-coefficient priors [9]. Protter [21] generalized sparse representation methods to image sequence denoising. Later, sparse representation methods were combined with non-local means to get better performance [22]. There are also many other denoising methods, such as total variation [23,24] and statistical neighborhood approaches [25]. Most of the above methods are model-based methods that rely on prior knowledge and are realized by optimization. Three drawbacks of these methods are balancing noise removal against detail preservation, choosing the prior knowledge and searching for the optimal solution.
An alternative is discriminative learning methods, which learn the mapping from noisy images to the corresponding noiseless clean ones. Burger in [26] proposed a plain neural network for image denoising, which achieves performance comparable to BM3D. Zhang in [10] proposed a fully convolutional network for image denoising; by learning the residue of an image, it can not only remove AWGN but also work on other image processing tasks, such as image super-resolution and image deblocking. However, the above discriminative learning methods demand training a model for each noise level, which brings great inconvenience. To tackle this problem, Isogawa in [3] proposed a novel activation function with a varying threshold, so that noisy images with different noise levels are restored by a single network. Zhang et al. [27] used down-sampled sub-images to train the model and adopted the noisy image and a noise level map as the input; noisy images with different noise levels can thus be handled by a single network. Lefkimmiatis in [28] integrated non-local self-similarity into a convolutional neural network (CNN) and obtained results competitive with many state-of-the-art methods.
Although CNNs achieve excellent denoising effects, most CNN-based methods use convolution kernels of a single size, so features on distinct scales are neglected. In addition, all channels of the features are treated equally; the relationships between channels are not considered. In this paper, we propose a multi-scale feature extraction-based normalized attention neural network for image denoising. In MFENANN, we define a feature extraction block which extracts and combines features at the scales of 1 × 1, 3 × 3 and 5 × 5 from the noisy image. Moreover, we introduce 1D normalization techniques into NAN, which smooths the optimization landscape during training and refines the relationships between channels. Furthermore, we introduce the NAN into MFENANN for denoising, in which every channel is augmented by an amount of gain. We take down-sampled images to train the network, which enlarges the receptive field and reduces the number of calculations during training. A residual net is used to avoid losing shallow features. Moreover, the network learns the residue of the noisy image, and the noiseless clean image is obtained as the difference between the noisy image and the residue.
In general, the contributions of the paper are summarized as follows: (1) We define a feature extraction block to extract and combine different scale features of the noisy image, which causes the feature maps to contain more detailed information of the original image, and enhances the ability of the network to maintain details. (2) We propose a normalized attention network to learn the relationship between channels, which smooths the optimization landscape and speeds up the convergence process for training an attention model. (3) We introduce NAN to image denoising, in which each channel gets an amount of gain, and channels play different roles in the subsequent convolution, which improves the performance of image denoising.

Residual Network
The residual network (ResNet) [29] was proposed by He, in which the underlying mapping of the stacked layers is addressed as H(X) and the output of the residual branch is addressed as F(X), where X is the input. The process of learning F(X) is called residual learning. The concrete model is defined as:

H(X) = F(W, X) + X,

where F(·, ·) is the residual mapping to be learned and F(W, X) = F(X). ResNet helps to avoid vanishing gradients when the network is deep and greatly improves recognition accuracy. Later, He in [30] analyzed the theory of ResNet and improved it. ResNet soon drew great interest and many variants appeared [31,32]. Huang in [33] proposed DenseNet, in which each layer receives the outputs of all preceding layers as its input; it is widely used for its flexibility and good performance. Later, ResNet was widely used in computer vision tasks, such as image super-resolution [34] and pedestrian trajectory prediction [35].
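The residual formulation can be illustrated with a toy NumPy sketch; the single linear-plus-Relu map below is a hypothetical stand-in for the stacked layers, not the architecture used in the paper:

```python
import numpy as np

def residual_block(x, w):
    """Toy residual block computing H(X) = F(W, X) + X, with F a single
    linear map followed by Relu standing in for the stacked layers."""
    f = np.maximum(0.0, x @ w)   # residual mapping F(W, X)
    return x + f                 # skip connection adds the input back

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))              # a batch of 4 feature vectors
w = rng.normal(size=(8, 8)) * 0.1

y = residual_block(x, w)
assert y.shape == x.shape
# The block only has to learn F = H(X) - X: with F = 0 it reduces to
# the identity, which is why deep stacks of such blocks train stably.
assert np.allclose(residual_block(x, np.zeros((8, 8))), x)
```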

Batch Normalization and SENet
With the rapid development of deep learning, many techniques have been proposed to raise efficiency and improve performance. The rectified linear unit (Relu) [36,37] is a widely used unsaturated activation function, which relieves the vanishing gradient problem and accelerates convergence. Convolution [38] greatly reduces the number of calculations by sharing weights. Dropout [39] decreases overfitting in networks. Inception [40] is used to extract features of different scales and concatenate them for the subsequent convolution.
(1) Batch normalization (BN): BN [41] was proposed for increasing the accuracy of classification; it decreases the number of calculations and simplifies parameter adjustment. Santurkar in [42] analyzed why batch normalization can improve performance. He pointed out that no evidence shows that BN is related to internal covariate shift and put forward the idea that BN improves performance by smoothing the optimization landscape during training. He also verified that networks can achieve similar, or even better, performance using other normalization techniques. Recently, BN has been widely used in networks for image denoising [3,10,27].
(2) SENet: SENet [43] is an attention network that learns the relationships between channels, in which every channel is augmented with an amount of gain. SENet squeezes every channel to a single point with its average value; after forward propagation through a two-layer fully connected (FC) network, the outputs are gains, and each channel is scaled by the corresponding gain. Wang in [44] used 1D convolution instead of an FC layer, which improved computational efficiency. Li in [45] proposed a selective kernel network, which chooses the kernel size for each channel by learning.
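The 1D batch normalization described in (1) above can be sketched in NumPy; the batch size, feature count and parameter values are illustrative assumptions:

```python
import numpy as np

def batch_norm_1d(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the k samples of the batch (rows of
    x), then apply the learned affine transform gamma, beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(2)
x = rng.normal(5.0, 3.0, size=(128, 96))   # k = 128 samples, 96 features
y = batch_norm_1d(x, gamma=np.ones(96), beta=np.zeros(96))

# With gamma = 1 and beta = 0 the output is (approximately) standardized.
assert np.allclose(y.mean(axis=0), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=0), 1.0, atol=1e-3)
```

During training, gamma and beta are updated by back propagation, so the network can undo the standardization where that helps.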

Proposed MFENANN for Image Denoising
In this section, we detail the proposed MFENANN for image denoising. In MFENANN, we define a simple multi-scale feature extraction block which extracts features of different scales from the noisy image with convolution kernels of different sizes. Moreover, we propose a normalized attention network to learn the relationships between channels, which improves SENet by adding 1D normalization techniques. In addition, we introduce the normalized attention network to CNN denoising, in which each channel gets an amount of gain and channels play different roles in the subsequent convolution. We also define ResNet blocks for MFENANN, which effectively integrate shallow and deep features and avoid the vanishing gradient problem. In the training phase, we assume the batch size is N and randomly generate N values in the noise level range as the noise standard deviations. We expand each standard deviation into a tensor with the same width and height as the input image as the noise level map and add noise to the corresponding image in the batch. In the testing phase, if the noise level is known, we expand it into a tensor with the same width and height as the input image as the noise level map. Figure 1 shows the architecture of the proposed MFENANN. We down-sample the noisy image in an interlaced sampling way and concatenate the down-sampled sub-images and the noise level map as the input of the network. Suppose the size of the noisy image is W × H × C, where W is the width of the image, H is the height and C is the number of channels. The size of the input of MFENANN is then W/2 × H/2 × (4C + 1) for a grayscale image and W/2 × H/2 × (4C + 3) for a color image. The multi-scale feature extraction block (MFEBlk) is used to extract and combine features of distinct scales for the subsequent convolution. The Relu function has the form max(0, ·). MFEBlk is detailed in Figure 2. ResNetBlock is a defined ResNet block, which is detailed in Figure 3.
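The input construction described above can be sketched as follows; the exact interlacing order and the per-channel replication of the noise level map are assumptions made for illustration:

```python
import numpy as np

def make_network_input(noisy, sigma):
    """Split a W x H x C noisy image into four interlaced
    (W/2) x (H/2) sub-images and concatenate them with a noise level
    map (assumed here to be one constant map per image channel)."""
    W, H, C = noisy.shape
    subs = [noisy[i::2, j::2, :] for i in (0, 1) for j in (0, 1)]
    level_map = np.full((W // 2, H // 2, C), sigma)
    return np.concatenate(subs + [level_map], axis=2)

noisy_gray = np.zeros((44, 44, 1))   # hypothetical grayscale patch, C = 1
noisy_rgb = np.zeros((44, 44, 3))    # hypothetical color patch, C = 3

x_gray = make_network_input(noisy_gray, 25 / 255)
x_rgb = make_network_input(noisy_rgb, 25 / 255)
assert x_gray.shape == (22, 22, 5)   # W/2 x H/2 x (4C + 1)
assert x_rgb.shape == (22, 22, 15)   # W/2 x H/2 x (4C + 3)
```

The down-sampling enlarges the effective receptive field and reduces the number of calculations, since each convolution now operates on quarter-size feature maps.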
NAN is the normalized attention block, which is detailed in Figure 4. In MFENANN, there are five ResNetBlocks and four NAN blocks; the number of layers is 21. The output of the network is the residue rather than the clean image. The clean image is obtained as follows:

X̂ = Y − residue, (1)

where X̂ is the restored image, Y is the observed noisy image and residue is the learned residue.

Network Architecture
Figure 2 shows the architecture of MFEBlk. X_M is the input of MFEBlk; the number of channels of X_M is 5 for a grayscale image and 15 for a color image. In MFEBlk, we define three kinds of convolution: 5 × 5, 3 × 3 and 1 × 1 convolutions with 10, 76 and 10 kernels respectively. Zero-padding is used to keep the spatial size unchanged. Y_M is the concatenation of the feature maps of the different scale convolutions. The mathematical model of MFEBlk is defined as follows:

Y_M = cat(conv5_10(X_M), conv3_76(X_M), conv1_10(X_M)), (2)

where conv5_10(·), conv3_76(·) and conv1_10(·) are convolutions with kernel sizes of 5 × 5, 3 × 3 and 1 × 1 respectively, and the subscripts 10, 76 and 10 are the numbers of convolution kernels. By experiment, we found that the choice of the number of kernels is a balance between computational cost and performance. cat(·, ·, ·) is a function that concatenates the channels of the feature maps. The output Y_M has 96 channels.
Figure 3 shows the architecture of ResNetBlock. ResNetBlock has four "Conv+BN+Relu" blocks; each "Conv+BN+Relu" block includes a 3 × 3 convolution layer, a BN layer and a Relu activation function. Let B_i(·) denote the ith "Conv+BN+Relu" block; then Z, the output of the fourth block, has 96 channels:

Z = B_4(B_3(B_2(B_1(X_R)))). (3)

The output Y_R is the sum of the input X_R and the output Z:

Y_R = X_R + Z. (4)

Equation (4) can be rewritten as Z = Y_R − X_R; the work of learning Z is the residual learning. Learning the residues reduces the computational cost of training and avoids the vanishing gradient problem. A NAN block is applied between two ResNetBlocks, so each channel of the output of a ResNetBlock gets an amount of gain. The architecture of the NAN block is detailed in the next section.

NAN
Figure 4 shows the architecture of the NAN block. Assuming the input X_N has 96 channels, the squeeze operation adopts an average pooling function, which squeezes each channel to a single point. For each channel X_N^l, the mathematical expression of the squeeze operation is described as follows:

x̄_N^l = (1 / (W H)) Σ_{i=1}^{W} Σ_{j=1}^{H} X_N^l(i, j), (5)

where X_N^l(i, j) is the amplitude at position (i, j), W and H are the width and height of the channel, and x̄_N^l is the mean value of channel X_N^l. FC is the fully connected layer, which has 96 neurons. The 1DBN and Relu layers avoid the vanishing gradient problem and accelerate convergence. Suppose the batch size is k, so each training step contains k samples. The 1DBN is described as follows:

x̂_bn = (x_bn − μ_B) / √(σ_B² + ε), (6)
y_bn = γ · x̂_bn + β, (7)

where x_bn and y_bn are the input and output vectors of the BN block, μ_B and σ_B² are the mean and variance of x_bn computed over the k samples of the batch, ε is a small constant for numerical stability, and γ and β are variables updated during back propagation. On the last layer, the sigmoid function maps its input into the range 0-1. We address the output of the sigmoid function as s; s is a vector of 96 dimensions, described as (s_1, s_2, · · ·, s_{n−1}, s_n), where n is 96. For each channel, the scale operation is described as follows:

Y_N^m = s_m · X_N^m, (8)

where Y_N^m and X_N^m are the mth channels of Y_N and X_N, respectively, and s_m is the mth element of s. From Equation (8), we see that every channel of X_N is scaled by the corresponding element of s.
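A minimal single-sample NumPy sketch of the NAN block follows, assuming one FC layer as a stand-in for the layers in Figure 4; note that in the actual model the 1DBN normalizes over the batch, whereas here, to keep the example runnable on one sample, the normalization is taken over the 96 channel descriptors:

```python
import numpy as np

def nan_block(x, w_fc, gamma, beta, eps=1e-5):
    # Squeeze: average pooling reduces each channel to its mean value.
    s = x.mean(axis=(1, 2))                    # shape (96,)
    # FC layer (a stand-in for the layers in Figure 4).
    s = s @ w_fc
    # 1D normalization; over this sample's 96 descriptors for illustration,
    # whereas the paper's 1DBN normalizes over the batch.
    s = gamma * (s - s.mean()) / np.sqrt(s.var() + eps) + beta
    # Sigmoid maps to (0, 1); scale each channel by its gain (Equation (8)).
    s = 1.0 / (1.0 + np.exp(-s))
    return x * s[:, None, None]

rng = np.random.default_rng(3)
x = rng.normal(size=(96, 16, 16))              # 96 channels of 16x16 features
y = nan_block(x, rng.normal(size=(96, 96)), np.ones(96), np.zeros(96))

assert y.shape == x.shape
gains = y[:, 0, 0] / x[:, 0, 0]                # recovered per-channel gains
assert np.all((gains > 0) & (gains < 1))       # sigmoid keeps gains in (0, 1)
```

The final check makes the role of the block visible: every channel is only rescaled by a positive gain below 1, never shifted or sign-flipped, so the subsequent convolution sees reweighted but otherwise unchanged features.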

Role of MFEBlk
Inspired by inception [40], we propose MFEBlk to extract features of distinct scales using kernels of different sizes. For the image denoising problem, we hope the restored image keeps most of the features; thus, in the first layer, we use 5 × 5, 3 × 3 and 1 × 1 convolution kernels to extract features of distinct scales from the noisy image. A larger kernel size commonly brings a larger number of calculations; therefore, we use fewer 5 × 5 and more 3 × 3 convolution kernels to balance feature extraction against the number of calculations. The 1 × 1 convolution kernels are used to increase nonlinearity and reduce the numbers of parameters and calculations [46,47].
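The kernel-count balance can be made concrete with a quick parameter count (weights only, biases omitted), assuming the 5-channel grayscale input to MFEBlk:

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer (biases omitted)."""
    return k * k * c_in * c_out

c_in = 5                          # grayscale input to MFEBlk (4C + 1 = 5)
p5 = conv_params(5, c_in, 10)     # 5 x 5 branch, 10 kernels
p3 = conv_params(3, c_in, 76)     # 3 x 3 branch, 76 kernels
p1 = conv_params(1, c_in, 10)     # 1 x 1 branch, 10 kernels

assert (p5, p3, p1) == (1250, 3420, 50)
assert p5 + p3 + p1 == 4720       # the whole block
# A 76-kernel 5 x 5 branch alone would need 9500 weights, about twice
# the entire block, which motivates using only a few 5 x 5 kernels.
assert conv_params(5, c_in, 76) == 9500
```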

Dataset Generation and Experimental Settings
For the AWGN removal task, in order to train the proposed network, we take 4744 images from the Waterloo Exploration Database [48] and extract image patches of size 44 × 44 with a stride of 30. Approximately 577 × 10^3 image patches are chosen for training the network. For every patch x_i, we add AWGN to it and address the noisy patch as y_i. The added noise has noise levels ranging from 0 to 75. We apply the image sets "BSD68" [49] and "Set12" to verify the performance of the proposed network for grayscale image denoising, and "CBSD68" [49], "Kodak24" [50] and "McMaster" [51] to verify its effectiveness on color image denoising. For MFENANN, the numbers of input channels are 5 and 15 for grayscale and color images respectively.
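The patch extraction step can be sketched as follows; the 240 × 240 training image is a hypothetical example:

```python
import numpy as np

def extract_patches(img, size=44, stride=30):
    """Slide a size x size window over the image with the given stride
    and stack the crops, as in the training-set construction above."""
    H, W = img.shape[:2]
    return np.stack([img[i:i + size, j:j + size]
                     for i in range(0, H - size + 1, stride)
                     for j in range(0, W - size + 1, stride)])

img = np.zeros((240, 240))          # hypothetical training image
patches = extract_patches(img)
# Window starts at 0, 30, ..., 180 -> 7 positions per axis, 49 patches.
assert patches.shape == (49, 44, 44)
```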
The experiments are performed in a PyTorch 1.1 environment on a PC with the Ubuntu 16.04 operating system, an Intel(R) Core(TM) i7-8700 CPU, 16 GB of RAM and an NVIDIA RTX 2070 GPU. We choose a loss function like that of [10]:

L(Θ) = (1 / 2N) Σ_{i=1}^{N} ||R(y_i, Ψ; Θ) − (y_i − x_i)||²,

where Θ denotes the parameters of the network to be learned, L(·) is the loss function, Ψ is the noise level map, R(·; ·; ·) is the output residue and {x_i, y_i} are the clean-noisy image patch pairs used for training. The loss function drives the residue close to the noise. We use PSNR [52] and SSIM [53] to measure the quality of the restored images. We adopt the Adam [54] optimization method with its default settings. We train MFENANN for 60 epochs. The initial learning rate is 1 × 10^−3, and it decays to 1 × 10^−5 and 1 × 10^−6 at 40 and 50 epochs. The batch size is 128. The training process takes approximately 21 h.
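A minimal sketch of this loss, with lists of small arrays standing in for the batch of training pairs (the helper name mfenann_loss is ours, not from the paper):

```python
import numpy as np

def mfenann_loss(residues, noisy, clean):
    """Squared error between the predicted residues and the true noise
    y_i - x_i, summed per pair and averaged over the N training pairs."""
    n = len(noisy)
    return sum(np.sum((r - (y - x)) ** 2)
               for r, y, x in zip(residues, noisy, clean)) / (2 * n)

rng = np.random.default_rng(4)
clean = [rng.random((8, 8)) for _ in range(4)]
noise = [rng.normal(0, 0.1, (8, 8)) for _ in range(4)]
noisy = [x + v for x, v in zip(clean, noise)]

# A network that predicts the noise exactly drives the loss to zero
# (up to floating-point rounding).
assert mfenann_loss(noise, noisy, clean) < 1e-12
assert mfenann_loss([np.zeros((8, 8))] * 4, noisy, clean) > 0.0
```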

Comparison Methods
To measure the performance, we compare our algorithm with some state-of-the-art denoising methods, including conventional methods (i.e., BM3D [14] and WNNM [55]), a sparse and redundant representation method (i.e., SRR [9]) and discriminative learning methods (i.e., MLP [26], DnCNN [10], FFDNet [27] and BDMGIN [56]). For the SRR denoising method, there are three ways (i.e., discrete cosine transform, global training and adaptive training) to build a dictionary, and the SRR methods corresponding to these three ways are addressed as SRR-DCT, SRR-G and SRR-A respectively. For DnCNN, two ways are used to train the networks to remove noise with known and unknown noise levels, addressed as DnCNN-S and DnCNN-B respectively. DnCNN-S needs to train a network for each noise level; DnCNN-B trains a single network to remove noise at all noise levels. BDMGIN is designed to remove mixed Gaussian-impulse noise; in this section, we set the impulse noise density to 0 to remove only Gaussian noise.

Ablation Experiment
In order to verify the roles of MFEBlk and NAN in MFENANN, we train the networks after removing MFEBlk and NAN respectively. We address the network without MFEBlk as NANN and the network without NAN as MFEN. We also trained a plain convolutional network with the same numbers of channels and layers as MFENANN for image denoising and address it as plainNet.

ResNet vs. DenseNet
Densely connected convolutional networks (DenseNet) are effective at improving the performance of object recognition [33]. In this section, we use DenseNet for image denoising and increase the number of convolution layers to 31. The settings of DenseNet are the same as in [33]. We address the network that uses DenseNet blocks instead of ResNet blocks in the proposed network as DenseMFENANN. From Table 2, we find that DenseMFENANN has higher PSNR values for images "Peppers" and "Starfish" at a noise level of 25 and for image "Peppers" at a noise level of 75. MFENANN has higher PSNR values in the other cases and has the higher average PSNR value at all noise levels. This means that ResNet performs better than DenseNet within the proposed network for image denoising. Therefore, in this paper, we use ResNet instead of DenseNet.

Experimental Results and Analysis
We address the networks trained with patch numbers of 577 × 10^3 and 1 × 10^6 as MFENANN-5 and MFENANN-10 respectively. Table 3 shows that increasing the number of training samples does not bring a performance improvement, but it costs more than twice the training time. Therefore, in the following experiments, we used 577 × 10^3 patches to train the networks. Table 4 shows the PSNR values of several state-of-the-art methods at noise levels of 15, 25, 35, 50 and 75. When the noise level was 15, for image "Barbara", WNNM achieved the largest PSNR value, followed by BM3D. That is because "Barbara" contains a lot of stripe textures, whereas the MSE loss function tends to favor smooth and prominent structural information. MFENANN had the largest PSNR values for the other images and achieved the largest average PSNR value among all methods at this noise level. When the noise levels were 25, 35, 50 and 75, the PSNR values showed the same pattern as at noise level 15: for image "Barbara", WNNM got the largest PSNR values, followed by BM3D; MFENANN got the largest PSNR values for the other images and achieved the largest average PSNR values. For each method except BDMGIN, the PSNR value decreased as the noise level increased. BDMGIN had the smallest average PSNR values at noise levels of 15, 25, 35 and 50; that is because BDMGIN is designed to remove mixed Gaussian-impulse noise and so is not good at dealing with pure Gaussian noise. Compared with the other methods, the superiority of MFENANN increases as the noise level increases. This is because as the noise level grows, the effective information decreases and the traditional prior-based methods cannot work well, whereas MFENANN restores images by learning the correlations between noisy-clean image patch pairs. Moreover, MFENANN extracts and integrates features of the noisy image at different scales and pays attention to the correlations between channels, which decreases the influence of increasing noise levels.
Table 5 and Table 6 report the results for "BSD68". Table 6 shows the average PSNR values of images in "BSD68" for several state-of-the-art methods at noise levels of 15, 25, 35, 50 and 75. In general, the discriminative deep learning methods except BDMGIN (i.e., MLP, DnCNN, FFDNet and MFENANN) got larger PSNR values than the traditional denoising methods (i.e., BM3D and WNNM). When the noise level was 15, MFENANN got the largest PSNR value, numerically close to DnCNN; this is because at such a low noise level, DnCNN trains a specific model for this noise level, which offsets the superiority of our network. When the noise levels were 25, 35, 50 and 75, MFENANN got the largest PSNR values at all noise levels, which shows the superiority of MFENANN. BDMGIN had the smallest PSNR values at all noise levels.
Table 3. Average PSNR (dB) values for images in "BSD68" restored by MFENANN trained with patch numbers of 577 × 10^3 and 1 × 10^6. The bold numbers are the largest ones at each corresponding noise level.
Figure 5 shows the restored images for "test033" in "BSD68" of several state-of-the-art methods at a noise level of 50. At such a high noise level, the noisy image has lost many details; it looks very poor visually and could not be used directly for many high-level image processing tasks. BM3D can effectively remove noise, but the restored image is over-smoothed and many details are lost. SRR-A introduces too many artifacts while removing noise. In general, the deep learning methods have a better overall appearance than BM3D and SRR-A, but DnCNN-S, DnCNN-B and FFDNet also over-smooth the restored image and introduce some artifacts. The image restored by MFENANN has the most details and the best overall appearance. Quantitatively, the discriminative learning methods get larger PSNR values than BM3D and SRR-A, and MFENANN gets the largest PSNR value among them.
Figure 6 shows the restored images of some state-of-the-art methods for "test045" in "BSD68" at a noise level of 25. BM3D heavily over-smooths the image, and many details are lost. SRR-A also over-smooths the image and introduces many artifacts. The images restored by DnCNN-S and DnCNN-B have more details than those of BM3D and SRR-A, but these two methods also lose many texture features. FFDNet achieves a better overall appearance than the previous methods, but it still loses many details. The image restored by MFENANN has the most details and the best overall appearance; moreover, it has the largest PSNR value among all methods. Figure 7 shows the restored images of "test064" in "BSD68" at a noise level of 75. At such a high noise level, the observed image is heavily degraded and many significant details are lost. BM3D can effectively remove the noise, but the restored image is over-smoothed and many significant details are lost. SRR (including SRR-DCT, SRR-G and SRR-A) cannot effectively eliminate noise at such a high noise level and causes too many artifacts. The image restored by FFDNet has a better overall appearance than the images restored by the previous methods, but it is still over-smoothed and many details are lost. The image restored by MFENANN has the most details and the best overall appearance. Quantitatively, the deep learning methods (i.e., FFDNet and MFENANN) have larger PSNR values than BM3D and SRR; in addition, MFENANN has a larger PSNR value than FFDNet. In reality, the noise level is usually unknown. In this paper, instead of estimating the noise level, we traversed the entire noise level range with a stride of 1, computed the PSNR values of all restored images and chose the one with the largest value as the output.
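The traversal strategy for unknown noise levels can be sketched as follows; the toy denoiser below is a contrived stand-in for running MFENANN with a given noise level map, used only to show the selection logic:

```python
import numpy as np

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB (guarded against a perfect match)."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10 * np.log10(peak ** 2 / max(mse, 1e-12))

def best_level_output(noisy, clean, denoise, levels=range(0, 76)):
    """Traverse the noise level range with a stride of 1 and keep the
    restoration with the largest PSNR against the reference image."""
    return max((denoise(noisy, s) for s in levels),
               key=lambda restored: psnr(clean, restored))

rng = np.random.default_rng(5)
clean = rng.random((32, 32)) * 255
v = rng.normal(0, 25, clean.shape)             # true noise, sigma = 25
noisy = clean + v

# Contrived "denoiser": subtracts a scaled copy of the true noise, so the
# traversal should select the level that cancels it exactly (s = 25).
toy = lambda y, s: y - (s / 25.0) * v
out = best_level_output(noisy, clean, toy)
assert np.allclose(out, clean)
```

Note that this selection relies on a reference image for the PSNR computation, so it is an evaluation protocol rather than a blind-denoising procedure.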
For BM3D, FFDNet and MFENANN, we used the above-mentioned settings; for the blind denoising methods BDMGIN and DnCNN-B, there was no need to enter the noise levels. Table 7 reports the results. Table 8 shows the average PSNR values for color images in the image sets "McMaster", "Kodak24" and "CBSD68" restored by several state-of-the-art methods. For "McMaster", when the noise levels were 15, 25 and 75, CBM3D got larger PSNR values than CDnCNN, which indicates that CBM3D is superior to CDnCNN at these noise levels. FFDNet got larger PSNR values than CBM3D and CDnCNN at all noise levels. MFENANN has the largest average PSNR values among all methods at all noise levels, which shows that MFENANN performed best for "McMaster". For "Kodak24", when the noise level was 75, CBM3D gave a larger PSNR value than CDnCNN, but it has smaller PSNR values at the other noise levels. FFDNet has larger PSNR values than CBM3D and CDnCNN. MFENANN has the largest PSNR values among all methods. For "CBSD68", similarly to "McMaster" and "Kodak24", MFENANN has the largest PSNR values, which shows that it performed best among all methods at all noise levels. Figure 8 shows the restored images of several state-of-the-art methods for image "kodim05" in "Kodak24" at a noise level of 50. CBM3D over-smooths the restored image while removing noise; the image looks a little blurry. The image restored by FFDNet has a better overall appearance than that of CBM3D, but many details are still lost. The image restored by MFENANN has the most details and the best overall appearance among all methods; quantitatively, it also has the largest PSNR value. These facts show that MFENANN performed better than CBM3D and FFDNet for image "kodim05" at a noise level of 50. Table 9 shows the average runtimes of several state-of-the-art methods for images in "Set12" at noise levels of 15, 25, 35, 50 and 75. To be fair, all algorithms ran on the CPU.
SRR-A used more time than the other methods because it needs to construct a dictionary adaptively, which takes a lot of time. The deep learning methods (i.e., DnCNN-B, FFDNet and MFENANN) used less time than the traditional methods. MFENANN used a little more time than DnCNN-B and FFDNet because it contains the improved SENet (NAN) blocks. In general, the time used by MFENANN is comparable to that of FFDNet; the extra time consumed is negligible for image denoising.

Conclusions
This paper proposes a novel multi-scale feature extraction-based normalized attention neural network for image denoising. The MFEBlk extracts features of distinct scales from the noisy image and integrates them together. The NAN blocks learn the relationships between channels, in which each channel acquires an amount of gain and channels can play different roles in the subsequent convolution. The residual units effectively avoid vanishing gradients and the loss of shallow features. Experimental results showed that the proposed MFENANN can effectively eliminate noise at noise levels ranging from 0 to 75. Moreover, compared to some state-of-the-art denoising methods, MFENANN achieves larger PSNR values and a better overall appearance.
Applications and future research: The proposed MFENANN can be embedded in imaging equipment and integrated into application software to improve the quality of images. In addition, the frames of a noisy image sequence contain different characteristics of the scene. Designing a deep neural network to extract features from a noisy image sequence and fuse them to obtain a restored high-quality image is worthy of further study.
Author Contributions: Y.W. and X.S. proposed the method; Y.W. analyzed the data and drafted the paper; G.G. and N.L. guided students to do the experiments and revised the paper. All authors have read and agreed to the published version of the manuscript.