Median Filter Aided CNN Based Image Denoising: An Ensemble Approach

Abstract: Image denoising is a challenging research problem that aims to recover noise-free images from images contaminated with noise. In this paper, we focus on images contaminated with additive white Gaussian noise. For this purpose, we propose an ensemble learning model that combines the outputs of three image denoising models, namely ADNet, IRCNN, and DnCNN, in the ratio 2:3:6, respectively. The first model (ADNet) is a Convolutional Neural Network (CNN) with attention, to which we add median filter layers after every convolutional layer and a dilation rate of 8. The second model is a feed-forward denoising CNN (DnCNN) with median filter layers after half of the convolutional layers. The third model, Deep CNN Denoiser Prior (IRCNN), contains dilated convolutional layers, with median filter layers added up to the dilated convolutional layer that has a dilation rate of 6. Quantitative analysis shows that our model performs well when tested on the BSD500 and Set12 datasets.


Introduction
Image denoising refers to the recovery of a noise-free digital image from one that has been contaminated by noise. The presence of noise in images can hinder various image processing tasks, such as facial recognition [1], form processing [2], and environmental remote sensing [3]. Thus, the efficient removal of noise at the first stage is crucial to the entire process; otherwise, the final result is erroneous. Noise can be of different types, the most common being Gaussian noise and salt-and-pepper noise, which consists of sparsely occurring white and black pixels on an image. Other types of noise are Poisson noise and impulse noise. Poisson noise, also known as shot noise, is generated through the Poisson process or Poisson random measures. Impulse noise is a variation of color or brightness on an image. The variation happens randomly and usually takes place in the image sensors of digital cameras, which is why it is also known as electronic noise. This paper focuses on the removal of additive white Gaussian noise from images. This type of noise is a random value drawn from the normal (Gaussian) distribution and superimposed on the clean pixels to obtain the noisy image. The noisy image is then fed into our model in an attempt to recover the noise-free image. The performance of our model is evaluated by comparing the output image to the input noisy image.
Researchers face several challenges in image denoising. The quality of the images, coupled with the level of noise, makes it difficult to recover the original image from the noisy image. Further, the contextual data at different noisy regions in an image might differ, so an appropriate method needs to be devised to distinguish all such noisy regions. However, a single model may not be able to denoise all of the noisy pixels. To this end, we propose an ensemble learning approach that combines the denoising capabilities of three models, so that the final model denoises images better than any of the individual models. Further, we design the individual models so that a variety of noisy pixels can be detected in the image and the image can be appropriately denoised.
The paper is organized as follows: in Section 2, we discuss the related work. In Section 3, we describe the steps of our image denoising model in detail. In Section 4, level-wise experimental results are presented and, finally, in Section 5, the paper is concluded. Our main contributions are as follows:
1. Median filter layers are added to ADNet [4] up to the Sparse Block (SB), along with a dilation rate of 8.
2. Dilated convolutional layers are used in IRCNN [5] up to a dilation rate of 6, along with median filter layers.
3. Median filter layers are added to the first half of the convolutional layers in DnCNN [6].
4. An ensemble of these models is formed by taking a weighted average of the output of each model to generate the final denoised image. We take 2/11 of the ADNet output, 3/11 of the IRCNN output, and 6/11 of the DnCNN output.

Related Work
Over the last few decades, a variety of methods have been proposed for image denoising. These include non-linear [7] and non-adaptive filtering methods [8], which help to preserve edge information as well as signal-to-noise ratio information to estimate the noise. Some methods use wavelet shrinkage for denoising. VisuShrink [9] used a universal threshold for every wavelet coefficient, while BayesShrink [10] used an adaptive approach for wavelet soft thresholding, which helps in finding a unique threshold for every wavelet sub-band. In addition, Chambolle's algorithm (TV-Chambolle) [11] uses an iterative clipping algorithm for total variation denoising. All of these methods remove Gaussian noise from images.
However, such methods require manual tuning of parameters, which is tedious. Hence, machine learning methods were introduced to overcome these drawbacks. Dabov et al. [12] proposed a sparsity-based method that uses collaborative filtering to support image denoising. Markov Random Field (MRF) based methods are used in [13] and provide competitive results when compared to earlier methods. However, such methods often do not support all types of images and, hence, are not flexible enough.
With the advent of deep learning, neural network based methods [14,15] were originally proposed. However, such methods are often time consuming for large images, and spatial features are lost. Hence, CNNs were introduced for image denoising [16,17]. Conventional CNNs, such as LeNet [18], have real-world applications in handwritten digit recognition, but they have certain drawbacks. For instance, they use activation functions such as Sigmoid and Tanh, which result in high computational cost and also lead to vanishing gradients. These drawbacks were overcome by AlexNet [19] and then by other architectures, such as VGG [20] and GoogLeNet [21]. After that, many CNN based denoising models were introduced in the literature. In this context, Denoising CNNs (DnCNNs) have proven to be effective while also being efficient in terms of time. Lefkimmiatis proposed a Color Non-Local Network (CNLNet) [22], which uses the inherent non-local self-similarity of natural images to perform denoising efficiently. For blind denoising, Zhang et al. proposed a fast and flexible denoising model known as FFDNet [23], which takes a tunable noise level as input to denoise an image efficiently. Chen et al. [24] combined a Generative Adversarial Network (GAN) with a CNN blind denoiser (GCBD): noise samples are generated first, and the noise patches are then used to create a paired training dataset on which a CNN is trained for denoising. All of the methods mentioned so far remove Gaussian noise from images. Another blind denoising model (CBDNet) [25], proposed by Guo et al., removes real-world noise from a given real noisy image using two sub-networks, one in charge of estimating the noise of the real noisy image and the other in charge of obtaining the latent clean image. For denoising images affected by salt-and-pepper noise, the authors in [26] use a CNN with median filter layers. In [27], responsive median filters and the modified Harris corner point detector are used to reduce Poisson noise in X-ray images. For the reduction of mixed Gaussian-impulse noise, a novel CNN based denoising method is proposed in [28].
The methods mentioned above perform well overall; however, they often do not cover all types of images. To this end, we propose an ensemble model that combines the strengths of multiple CNN models to build a composite model that aims to denoise as wide a variety of images as possible.

Proposed Work
In this paper, we propose an ensemble of three image denoising models that uses the weighted mean of the outputs of the individual models, as shown in Figure 1. The models have been designed so that they can efficiently denoise a wide variety of images. The architecture of the individual models is described hereafter.

Attention-Guided CNN (ADNet)
There are 17 layers in total in ADNet, which are divided into four blocks, as shown in [4]. The first block is the Sparse Block (SB), the second is the Feature Enhancement Block (FEB), the third is the Attention Block (AB), and the fourth is the Reconstruction Block (RB). The SB contains 12 layers of two types, dilated convolution and standard convolution, both followed by Batch Normalization (BN) and the ReLU activation function. The dilated convolution (Dilated Conv) helps to increase the size of the receptive field without increasing the computational cost of the model. A dilated 3 × 3 filter with a dilation factor of a can be interpreted as a sparse filter of size (2a + 1) × (2a + 1).
The dilation rate of the dilated convolution used in the SB is 8. In addition, with each layer in the SB, we add a median filter layer. The median filter is a traditional non-linear filter that replaces the pixel centred in a window with the median of the window, and it helps in the efficient removal of impulse noise, as mentioned in [26] by Liang et al. It is applied to each element of a feature channel in a moving window fashion. For each feature channel, we extract a set of patches of size 3 × 3 or 5 × 5 centred at each pixel. Subsequently, we find the median of the sequence formed by all elements in that patch and replace the centre element with the median value.
The FEB consists of four layers and aims to capture the global and local features of the model in order to enhance its expressive ability in image denoising. The global and local features used by the FEB are the input noisy image and the output of the SB, respectively. The output from the FEB is passed on to the AB. The AB uses the current stage to guide the previous stage in learning the noise information, which is useful for blind denoising and for extremely noisy images. The output of the AB is passed as input to the RB, which uses a residual learning technique to reconstruct the clean image.
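The moving-window operation described above can be sketched in a few lines of plain Python. This is only an illustration of a median filter applied to a single feature channel, not the layer implementation used in our models: the function name `median_filter` and the replicate padding at the borders are assumptions, since the padding mode is not specified here.

```python
from statistics import median


def median_filter(channel, size=3):
    """Apply a size x size median filter to one 2-D feature channel.

    Border handling is an assumption: indices are clamped to the channel
    (replicate padding), because the padding mode is not stated in the text.
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd")
    height, width = len(channel), len(channel[0])
    radius = size // 2
    output = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            window = []
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    # Clamp indices so the window stays inside the channel.
                    r = min(max(row + dr, 0), height - 1)
                    c = min(max(col + dc, 0), width - 1)
                    window.append(channel[r][c])
            # Replace the centre element with the median of its window.
            output[row][col] = median(window)
    return output


# A flat channel with one impulse-like outlier at position (1, 1).
channel = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],
    [10, 10, 10, 10],
]
filtered = median_filter(channel, size=3)
```

In this toy channel, every 3 × 3 window contains at most one outlier among nine values, so the median suppresses the isolated spike.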

Feed Forward Denoising CNN (DnCNN)
DnCNN is an end-to-end trainable deep CNN for Gaussian denoising, and it adopts the residual learning strategy to remove the latent clean image from the noisy observation, as reported in [6] by Zhang et al. The size of each convolutional filter is set to 3 × 3, and no pooling layers are used. Therefore, the receptive field of the architecture with a depth of d should be (2d + 1) × (2d + 1). The total number of layers used in the model is 17, and each layer consists of Conv + BN + ReLU activation function. In addition, median filter layers are used in the network, but only in the first half, because the first half is used for removing the noise, whereas the second half is used to reconstruct the denoised image.
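The residual learning strategy and the receptive field formula can be written out explicitly. The symbols below (y for the noisy observation, x for the latent clean image, v for the noise, and R for the network mapping with parameters Θ) are notation introduced here for illustration, following the usual residual-learning formulation. The receptive field value counts only the 3 × 3 convolutions and ignores any contribution from the added median filter layers.

```latex
% Residual learning: the network estimates the noise, which is then
% subtracted from the noisy observation.
\begin{align}
  y       &= x + v, \qquad v \sim \mathcal{N}(0, \sigma^{2} I), \\
  \hat{v} &= \mathcal{R}(y;\, \Theta), \\
  \hat{x} &= y - \hat{v}.
\end{align}
% Receptive field of d = 17 stacked 3 x 3 convolutions without pooling.
\[
  (2d + 1) \times (2d + 1) = (2 \cdot 17 + 1) \times (2 \cdot 17 + 1) = 35 \times 35 .
\]
```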

Deep CNN Denoiser Prior (IRCNN)
The architecture consists of 12 layers of three types: (i) Dilated Conv + ReLU, (ii) Dilated Conv + BN + ReLU, and (iii) Dilated Conv, which is used in the last layer, as reported in [5] by Zhang et al. The dilation factors of the 3 × 3 dilated Conv layers from the first to the last are set to 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, and 1, respectively. The number of feature maps in each middle layer is set to 64. The use of batch normalization and residual learning helps to accelerate training. In particular, batch normalization and residual learning are helpful for Gaussian denoising, since they are beneficial to each other. Median filter layers are also added up to the dilated Conv layer with a dilation rate of 6, in order to increase the denoising capability of the network.
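To make the effect of the dilation schedule concrete, the following sketch computes the receptive field of a stack of stride-1 convolutions from the list of dilation factors. It is an illustrative calculation only: the function name `receptive_field` is hypothetical, the schedule uses the eleven dilation factors listed above, and the added median filter layers are ignored.

```python
def receptive_field(dilations, kernel_size=3):
    """Side length of the receptive field of stacked stride-1 convolutions.

    A k x k kernel with dilation a spans a * (k - 1) + 1 pixels per side,
    so each layer enlarges the receptive field by a * (k - 1) pixels.
    """
    side = 1
    for dilation in dilations:
        side += dilation * (kernel_size - 1)
    return side


# Dilation factors as listed in the text for the dilated Conv layers.
schedule = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]
dilated_side = receptive_field(schedule)

# For comparison: 17 undilated 3 x 3 layers, matching (2d + 1) with d = 17.
plain_side = receptive_field([1] * 17)

# A single 3 x 3 layer with dilation a = 6, matching (2a + 1).
single_side = receptive_field([6])

print(dilated_side, plain_side, single_side)
```

Under these assumptions, the dilated schedule reaches a larger receptive field with fewer convolutional layers than an undilated stack, which is the motivation for using dilated filters.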

Ensemble of Image Denoising Models
After training the individual image denoising models on the training dataset, we test them on the images of the testing dataset and then use the weighted average of the outputs of the individual models to produce the final denoised image. To choose the weights, we examine the individual performance of the models on the images and use the corresponding results to decide the ratio in which the models are combined in the ensemble. The ratio obtained in this way is 2:3:6 for ADNet, IRCNN, and DnCNN, respectively.
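The weighted averaging step can be sketched as follows. This is a minimal illustration on images stored as nested lists, assuming all three outputs have the same size; the function name `weighted_ensemble` and the toy pixel values are hypothetical, while the 2:3:6 ratio is the one stated above.

```python
def weighted_ensemble(outputs, ratio):
    """Pixel-wise weighted average of equally sized 2-D images.

    `outputs` maps a model name to its denoised image and `ratio` maps the
    same names to integer weights, which are normalised to sum to one.
    """
    total = sum(ratio[name] for name in outputs)
    first = next(iter(outputs.values()))
    height, width = len(first), len(first[0])
    fused = [[0.0] * width for _ in range(height)]
    for name, image in outputs.items():
        weight = ratio[name] / total
        for r in range(height):
            for c in range(width):
                fused[r][c] += weight * image[r][c]
    return fused


# Ratio 2:3:6 for ADNet, IRCNN, and DnCNN, i.e. weights 2/11, 3/11, 6/11.
ratio = {"ADNet": 2, "IRCNN": 3, "DnCNN": 6}

# Toy 1 x 2 "images" standing in for the three model outputs.
outputs = {
    "ADNet": [[110.0, 22.0]],
    "IRCNN": [[110.0, 33.0]],
    "DnCNN": [[110.0, 11.0]],
}
fused = weighted_ensemble(outputs, ratio)
```

Because the weights sum to one, pixels on which all three models agree are left unchanged, and the result always stays within the range spanned by the individual outputs.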

Dataset
For evaluating our ensemble based image denoising model, we use the BSD500 dataset, which was proposed by Martin et al. in [29] and can be downloaded from https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/ (accessed on 27 March 2021), and the Set12 dataset [12], which is a collection of 12 widely used testing images available at https://www.researchgate.net/figure/12-images-from-Set12-dataset_fig11_338424598 (accessed on 27 March 2021). We use 200 grayscale images of the BSD500 dataset and resize each image to 200 × 200 for training our model. The BSD500 dataset is an extension of BSDS300, in which the original 300 images are used for training and validation, and 200 fresh images, annotated by humans, are added for testing. We introduce additive white Gaussian noise into the images by setting the standard deviation of the Gaussian noise distribution to 15, 25, and 50; these values are also known as noise levels. For training, each image is converted into patches of size 40 × 40, which results in a total of 596 patches per image. The model is evaluated on the 200 testing images of the BSD500 dataset, with results given in Table 1, and on the 12 images of the Set12 dataset, with results given in Table 2.
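The noise generation step can be illustrated with a short sketch. It assumes that pixel intensities are on the 0 to 255 scale, so that the noise levels 15, 25, and 50 are standard deviations on that scale, and it does not clip the noisy values; both points are assumptions, as is the function name `add_gaussian_noise`.

```python
import random


def add_gaussian_noise(image, sigma, seed=None):
    """Return a noisy copy of a 2-D image with zero-mean Gaussian noise.

    `sigma` is the noise level (standard deviation), assumed here to be on
    the same 0-255 intensity scale as the image. Values are not clipped.
    """
    rng = random.Random(seed)
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


# A flat 100 x 100 grey image standing in for a clean training image.
clean = [[128.0] * 100 for _ in range(100)]

# One noisy version per noise level used in the experiments.
noisy_by_level = {
    sigma: add_gaussian_noise(clean, sigma, seed=0) for sigma in (15, 25, 50)
}
```

Passing a fixed `seed` makes the noisy images reproducible, which is convenient when several models are compared on exactly the same noisy inputs.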

Hyperparameters
During the training phase, each model takes approximately two and a half hours to train. Hence, the entire ensemble model takes seven and a half hours to train. For the deep learning based models, through extensive experimentation with the dataset, we set the hyperparameters as follows: learning rate = 0.001, number of epochs = 50, and steps per epoch = 1000. For testing, we use the weighted average of the outputs of the three models. We take 6/11 of the output of the DnCNN + median layer model, 3/11 of the output of the IRCNN (dilation rate up to 6) + median layer model, and 2/11 of the output of the ADNet (dilation rate = 8) + median layer model, and we calculate the final output by adding them together. We use the peak signal-to-noise ratio (PSNR) to evaluate the performance of the image denoising models. For the non-deep learning based methods, we use the scikit-image library [30] and fine-tune the methods through extensive experimentation to obtain the best results. For TV-Chambolle [11], the denoising weight is set to 0.3 and the maximum number of iterations is set to 30. For the Wavelet BayesShrink [10] and Wavelet VisuShrink [9] models, we use soft thresholding and estimate the standard deviation of the noise using the method in [9].
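For reference, PSNR can be computed as in the sketch below. It uses the standard definition, 10 · log10(peak² / MSE), and assumes 8-bit images with a peak value of 255; the function name `psnr` and the toy images are illustrative.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images.

    `peak` is the maximum possible pixel value (255 is assumed for 8-bit
    images). Identical images give an infinite PSNR.
    """
    squared_error = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for ref, est in zip(ref_row, est_row):
            squared_error += (ref - est) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy example: two of the four pixels are off by 5 grey levels.
reference = [[100, 100], [100, 100]]
estimate = [[105, 95], [100, 100]]
score = psnr(reference, estimate)
```

A higher PSNR indicates a smaller mean squared error with respect to the reference image, so larger values in the tables correspond to better denoising.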

Results
After thorough testing, as illustrated in Tables 1 and 2, we note that the ensemble model outperforms all of the other model combinations and also some non-deep learning based models for the BSD500 dataset. For the Set12 dataset, it provides extremely competitive results and outperforms the other model combinations in most cases. When compared to the non-deep learning based models, it completely outperforms them. Additionally, in Table 1, for noise level 15, the highest PSNR value among the three non-deep learning models is 23.54 dB, whereas the lowest PSNR value among the CNN based methods is 32.36 dB, which is 8.82 dB more than the best result of the non-deep learning models. The CNN based methods also perform better than the non-deep learning based methods at noise levels of 25 and 50. For the Set12 dataset, the trend is similar, as shown in Table 2. Thus, it is clear that the CNN based denoising models perform far better than the non-deep learning based methods for this purpose. Figure 2 shows the denoised images produced by DnCNN, IRCNN, ADNet, and our proposed ensemble model for a noisy image of noise level 50. Noisy versions of a single image with noise levels of 5, 10, 15, 25, and 50, along with their denoised versions, are shown in Figure 3. Additionally, Figure 4 shows the denoised results produced by the non-deep learning methods and the proposed ensemble method for a noisy image of noise level 25.

Conclusions
In this paper, we have proposed an ensemble of three deep learning models, after suitable customizations, for the purpose of image denoising. For the first model, we use an ADNet with an increased dilation rate, along with median filter layers. For the second model, we use a DnCNN with median filter layers and, for the third model, we use an IRCNN model with dilated convolutional layers and a dilation rate of 6, along with median filter layers. The final output image is a weighted average of the individual models' output images. When comparing the performance of the proposed model to other state-of-the-art models, we observe that our model outperforms the others on the BSD500 dataset and, in the case of the Set12 dataset, our model outperforms most of the existing denoising models. In the future, we would like to use our model for denoising images that contain other types of noise, such as salt-and-pepper noise and Poisson noise.

Figure 1. Our proposed ensemble image denoising model. In the figure, (a) denotes the improved Denoising CNN (DnCNN) model, (b) represents the improved ADNet model, and (c) denotes the improved IRCNN model.

Table 1. Average PSNR values of images predicted by the image denoising models on the images of the BSD500 dataset.

Table 2. Peak signal-to-noise ratio (PSNR) values of images predicted by the image denoising models for different noise levels on the Set12 image dataset.