Denoising Single Images by Feature Ensemble Revisited

Image denoising remains a challenging problem in many computer vision subdomains. Recent studies have shown that significant improvements are possible in a supervised setting. However, a few challenges, such as spatial fidelity and cartoon-like smoothing, remain unresolved or largely overlooked. Our study proposes a simple yet efficient architecture for the denoising problem that addresses these issues. The proposed architecture revisits the concept of modular concatenation, instead of long and deeper cascaded connections, to recover a cleaner approximation of the given image. We find that different modules can capture versatile representations, and that a concatenated representation creates a richer subspace for low-level image restoration. The proposed architecture has fewer parameters than most previous networks and still achieves significant improvements over current state-of-the-art networks.


Introduction
Image denoising is a classic problem in the low-level vision domain. A given image X goes through the following mapping to create its noisy counterpart:

Y = X + N.

Here, Y is the noisy observation and N is the additive noise on the clean image X. Denoising is an ill-posed problem, with no direct means to separate the source image from the corresponding noise. Hence, researchers seek the best possible approximation of X from Y through corresponding algorithmic strategies.
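As a concrete illustration of this degradation model, the following NumPy sketch (illustrative only; variable names are our own) corrupts a toy clean image with additive white Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_awgn(clean, sigma):
    """Apply the degradation model Y = X + N, with N ~ N(0, sigma^2)."""
    noise = rng.normal(0.0, sigma, size=clean.shape)
    return clean + noise

clean = np.full((64, 64), 128.0)   # toy flat "clean" image X
noisy = add_awgn(clean, sigma=25)  # noisy observation Y
residual = noisy - clean           # recovers the additive noise N
```

Because the noise is additive, subtracting the clean image from the observation recovers a zero-mean residual whose standard deviation matches the chosen sigma.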
Typical methods without machine learning employ efficient filtering techniques such as NLM [1], BM3D [2], median [3], Wiener [4], etc. Due to their limited generalization capability, additional knowledge-based priors or matrix properties have been integrated into these denoising strategies. However, despite certain improvements with prior-based methods, many concerns remain unresolved, such as holistic fidelity or the choice of priors.
Convolutional neural network (CNN) denoising methods later offered an unprecedented improvement over the previous strategies through their customized learning setups. Usually, CNN methods achieve better performance through brute-force learning [5], intricate training strategies [6], or inverting image properties [7]. We have observed gradual improvements in denoising solutions over the years. However, methods based on pure brute-force mapping sometimes face fidelity issues on challenging noisy images. Furthermore, due to a lack of generalization properties, these methods often produce reconstructed images with cartoonized smoothing.
In contrast, the proposed approach rebuilds a previous ensemble-oriented denoising network that can successfully estimate a cleaner image with less cartoon-like smoothing. In the design of the proposed denoising network, we carefully maximized detail restoration by providing a variety of low-level ensemble features while keeping the network relatively shallow to prevent an oversized receptive field and hallucination effects. In summary, our study makes the following contributions:
• We propose a shallow ensemble approach through feature concatenation to create a large array of feature combinations for low-level image recovery.
• Due to the ensemble of multiple modules, our model successfully recovers fine details compared to previous data-driven studies.
• The parameter space is relatively small compared to contemporary methods, with a computationally fast inference time.
• Finally, the proposed study shows better performance over a range of synthetic and real noise without cartoonization and hallucination effects. See Figure 1, where the first column shows the ground truth, followed by the inference from DEAMNet [8], and the final column shows the proposed result.

Filtering Based Schemes
Traditional filtering approaches rely on handcrafted filters for separating noise and image. These studies [3,4] utilized low-pass filtering to extract clean images from noisy ones. The iterative filtering approach adopted a progressive reduction for image restoration [9]. Additionally, several methods used nonlocal similar patches for noise reduction based on the similarity between counterpart patches in the same image. For example, NLM [1] and BM3D [2] assumed a redundancy within patches from a given image for noise reduction. Nonetheless, these methods usually produce flat approximations, as heavy noise severely degrades the patch similarity on which they depend.

Prior Based Schemes
Another group of studies focused on selecting priors for the model, which produce clean images when optimized. These methods reformulated the denoising problem as a maximum a posteriori (MAP)-based optimization problem, where the image prior regulated the performance of the objective function. For example, the studies [10,11] assumed sparsity as the prior for their optimization process. The primary intuition was to represent each patch separately through the help of a function. Xu et al. [12] performed real-world image denoising by proposing a trilateral weighted sparse coding scheme. Other studies [12][13][14][15] focused on rank properties to minimize their objective function. Weighted nuclear norm minimization (WNNM) [12] calculated the nuclear norm through a low-rank matrix approximation for image denoising. Additionally, there are several complex model-based derivations using graph-based regularizers for noise reduction. However, their performance degrades monotonically in noisier areas, and recovering detailed information is sometimes difficult [16][17][18]. Additionally, these methods generally produce significantly varying results depending on their prior parameters and the respective target noise levels.

Learning Based Schemes
Due to the availability of paired data and the current success of CNN modules, data-driven schemes have achieved significant improvements in separating clean images from noisy images. Recent CNN studies [19][20][21] utilized residual connections for estimating the noise removal map before inference. These studies estimated the clean image without assuming any priors regarding structure or noise. They achieved enhanced performance by using a noncomplex architecture with repeated blocks of convolution, batch normalization, and ReLU activation. However, these methods can fail to recover some of the detailed texture structure in the presence of heavy noise.
Trainable nonlinear reaction diffusion (TNRD) [22] used a prior in its neural network and extended the nonlinear diffusion algorithm for noise reduction. However, the method suffered from computational complexity, as it required a vast number of parameters. Similarly, the nonlocal color net [23] utilized nonlocal similarity priors for the image denoising operation. Although priors mostly aid denoising, there are cases where the adopted priors degrade denoising performance. Very recently, DEAMNet [8] surpassed the previous state-of-the-art results by using an adaptive consistency prior.
With the success of the DnCNN [19], two similar networks called "Formatting net" and "DiffResNet" were proposed with different loss layers [5]. Later, Bae et al. [24] proposed a residual learning strategy based on the improved performance of a learning algorithm using manifold simplification, providing significantly better performance. After that, Anwar et al. [25] proposed a cascaded CNN architecture with feature attention using a self-ensemble to boost the performance.
A few recent approaches [26,27] followed the blind denoising strategy. CBDNet [27] proposed a blind denoising network consisting of two submodules, noise estimation and noise removal, incorporating multiple losses. However, its performance was limited by manual intervention requirements and slightly lower performance on real-world noisy images. In comparison, FFDNet [28] achieved enhanced results by proposing a nonblind Gaussian denoising network. Consequently, RIDNet [5] utilized a perceptual loss together with ℓ2, apart from the DnCNN architectures, for noise removal and achieved significant success by introducing a single-stage attention denoising architecture for real and synthetic noise. Liu et al. [7] introduced GradNet by revisiting image gradient theory for neural networks. Recently, several GAN-based approaches [29][30][31][32] were introduced that generate denoised images following either a data augmentation strategy for creating diverse training samples or a strategy based on the distribution of the clean images.
The tendency in modern learning-based denoising schemes is to use a deeper network with complex training. However, Figure 1 shows some of the hallucination and oversmoothing problems of deeper networks. A hallucination of a windowed building can be observed in the DEAMNet [8] results in Figure 1. We believe that a deeper network architecture, with its overfitting tendencies, is the cause of the hallucination and oversmoothing. We also suspect that purely PSNR-oriented optimization contributes to oversmoothing. Thus, in this paper, we propose a variety of shallow networks for low-level feature accumulation, as well as a network that finds a balance between PSNR and SSIM. We show that feature ensembles from a variety of shallow networks are more appropriate for image denoising problems than a single deeper and more complex network. The shallow architecture prevents overfitting, and the necessary statistics are obtained from the feature ensemble.

Baseline Supervised Architecture
Recently, supervised model-based denoising methods have embedded a similar baseline formation in their proposals [5,33]. In brief, it is possible to compartmentalize the baseline architecture in Figure 2 into three distinct modules: initial feature extraction, large intermediate blocks, and feature reconstruction. Typically, the primary module consists of a single layer that serves as the initial feature learner or initial noise estimator. For any noisy image I_n, the representation of the initial block E_p is as follows:

E_p = β(I_n),

where β is the initial convolutional layer for basic feature learning. Following that, the main restoration part of the given network operates through an intermediate processor. Typically,

E_i = M_i(E_{i−1}), with E_0 = E_p,

where M_i represents the ith learning stage of the intermediate block and E_i is the corresponding outcome of the intermediate layer.
The final reconstruction module operates through a residual connection followed by a final convolution. If R is the final reconstruction stage before the output, then the recovered image I_r is a combination of E_i, E_p, and I_n:

I_r = f_N(I_n) = R(E_i, E_p) + I_n.

Here, f_N denotes the overall neural network, I_n is the noisy input image, and I_r is the recovered image. A typical choice of cost function for this task is the ℓ1 or ℓ2 loss. Other customized loss functions are available, such as weighted combinations of different loss functions that integrate spatial properties or relevant regularization [5]. In general, the network is optimized by minimizing the difference from clean images.
θ* = argmin_θ (1/N) Σ_i L(f_N(I_n^i; θ), I_c^i).

Here, θ denotes the learnable parameters, I_n^i is the ith noisy image, and I_c^i is the corresponding clean image. Most of the baseline network parameters are placed in the intermediate learning block.

Proposed Architecture
In contrast, we designed our network to allocate more resources to the concatenated learned features. Instead of developing a basic learning block with long cascading connections, we chose variety by proposing several individual feature learning blocks. The proposed network focuses on delivering richer and more diverse low-level features. To further reduce complexity, we avoided attention operations, which are typically more expensive. More details are provided in Figure 3 and the following subsections.

Initial Feature Block
Three consecutive convolution layers are used to extract the initial features for the network. The layers are equal in depth, but their kernel sizes are in descending order. The input image goes through a 5 × 5 convolution at the first layer, followed by a 3 × 3 convolution, and ends with a 1 × 1 pixelwise operation. A larger kernel size makes use of a larger neighborhood of input features and estimates representations over larger receptive fields. By limiting the kernel size and the number of layers, the network learns to focus on smaller receptive fields and disregards the broader view, which we argue is less meaningful in low-level vision tasks such as denoising. Therefore, the purpose of the primary layer is to project the representation for the denoising features from a smaller receptive field into individual responses, which are further diversified in the next four block modules in Figure 3.

Figure 3 presents the overall diagram of the proposed architecture for image denoising. Our pipeline first extracts the initial features using consecutive convolution operations, followed by four modules for feature refinement. These modules are built upon customized convolution and residual setups with supportive activation functions. After refinement, we concatenate all the refined feature maps into a single layer, followed by a final dilated convolution to make the inference.
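The receptive field of the 5 × 5 → 3 × 3 → 1 × 1 stack can be checked with standard receptive-field arithmetic. The short sketch below (our illustration; the function name is our own) computes it for a stack of convolutions:

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a convolution stack (stride 1 by default):
    r grows by (k - 1) times the product of the earlier strides."""
    strides = strides or [1] * len(kernel_sizes)
    r, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * jump
        jump *= s
    return r

rf = receptive_field([5, 3, 1])  # the 5x5 -> 3x3 -> 1x1 initial block
```

For the initial block this yields a 7 × 7 receptive field, i.e., each initial feature sees only a small local neighborhood of the input, consistent with the design goal above.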

Four Modules for Feature Refinement
Before presenting the four modules for feature refinement, we cover the convolutions, activation functions, and residual connections used in the modules. Even though the attention mechanism is a common choice for learning richer representations, we find similar or better results without it in this study. Additionally, we restricted our residual blocks to no more than six consecutive connections. The fundamental operations for our modules are introduced below.
Convolution. In the internal convolution operation, our choice of kernels varied from 1 × 1 to 7 × 7. Due to such a range, our network was naturally focused on both smaller and larger receptive fields.
Activation functions. Recent advancements in nonlinear activation functions have shown that better performance is achievable through the interconnected operation of different activation representations compacted into a single function. Hence, we chose the SWISH [34] and MISH [35] activations in addition to the ReLU operation. As a result, our network learns from the diverse representations obtained from various parallel activation functions.
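For reference, the three activations can be written in a few lines of NumPy (an illustrative sketch; in practice the framework-provided implementations are used):

```python
import numpy as np

def relu(x):
    """'Positive pass' filter: zeroes out negatives."""
    return np.maximum(x, 0.0)

def swish(x):
    """SWISH: x * sigmoid(x); smooth, with a small negative dip near zero."""
    return x / (1.0 + np.exp(-x))

def mish(x):
    """MISH: x * tanh(softplus(x)); smooth, similar shape to SWISH."""
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.linspace(-3.0, 3.0, 7)
responses = {f.__name__: f(x) for f in (relu, swish, mish)}
```

All three approach the identity for large positive inputs, but SWISH and MISH remain differentiable everywhere, which is the smoothness property cited above.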
Residual connections. The efficacy of residual connections in vision tasks is well established. In the literature, the customization of residual connections varies with the task. In the original ResNet paper [36], the authors included batch normalization between the convolution layers, followed by a ReLU layer. In our study, we used convolution layers separated by a ReLU layer. This ReLU-sandwich residual connection is prevalent in regression tasks [37].
Below, we describe the four processing modules that perform the refinement operations on the initial features, together with the basic reasoning behind the proposed architecture.

Residual Feature Aggregation Module
In our residual feature aggregation module, we used the aforementioned residual blocks as the underlying design mechanism. In constructing this module, we took inspiration from traditional pyramid feature extraction [38] and aggregation, which has been very influential in computer vision. A typical pyramid setup is motivated by the need for multiscale feature aggregation, which, in essence, utilizes low-frequency information along with high-frequency features. However, the subsequent downsampling process is a lossy operation by nature. To mitigate information loss in low-frequency features, we chose to employ concurrent residual blocks on the same initial features with three different kernel sizes. Our kernel choice ranged from 1 × 1 to 5 × 5, as seen in Figure 4a. A larger kernel allows the module to learn features from a larger area of the image, while the 1 × 1 kernel operation maintains the initial receptive field and makes use of more high-frequency information. We aggregated the responses from all three residual blocks to learn the overall multiscale impact of the initial features. Finally, a typical 3 × 3 convolution with standard depth yields n diverse representations from this module. As a result, our model can learn important multiscale features without going through a pooling operation.
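The idea of parallel branches with different kernel sizes, aggregated without any pooling, can be sketched as follows (a minimal illustration with fixed averaging kernels standing in for the module's learned residual blocks; all names are our own):

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel 2-D convolution (cross-correlation, as in most
    deep learning frameworks) with zero 'same' padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def multiscale_aggregate(x, kernels):
    """Sum of parallel branches with different kernel sizes; spatial
    resolution is preserved because no pooling is involved."""
    return sum(conv2d_same(x, k) for k in kernels)

x = np.ones((8, 8))
kernels = [np.ones((1, 1)),          # 1x1: keeps the original receptive field
           np.ones((3, 3)) / 9.0,    # 3x3: small neighborhood average
           np.ones((5, 5)) / 25.0]   # 5x5: larger neighborhood average
y = multiscale_aggregate(x, kernels)
```

Each branch sees a different neighborhood size of the same input, yet the output keeps the full input resolution, which is the property the module relies on instead of a lossy pyramid downsampling.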

Multiactivation Feature Ensemble
Activation functions are unavoidable components of neural network construction that aid the learning operation by projecting impactful information to the next layer. Hence, widely different nonlinear functions are available as activation functions in all sorts of neural networks for various purposes. The ReLU is the most widely used activation function, which at heart is a "positive pass" filter. However, in some cases, its zeroed-out negatives and the discontinuity in its gradient are argued to be unhelpful in the optimization process. To address some of these weaknesses, SWISH [34] and MISH [35] were proposed with smooth gradients while maintaining a positive-pass shape similar to the ReLU. A recent experiment [35] showed that these activation functions provide a smoother loss landscape than the ReLU.
Accordingly, we incorporated all three activation functions, as seen in Figure 4b. The SWISH, MISH, and ReLU activation functions were applied to the initial features, each followed by a convolution layer. The subsequent responses were concatenated into a single tensor to learn from the integrated representation of the varying activation functions. No further kernels or residual blocks were utilized in this module. The initial feature results of this module were ensembled with the responses of the other three modules, and the multiactivation functions were also integrated into the multiactivated cascaded aggregation module described in Section 3.3.3.
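The branch-and-concatenate pattern of this module can be sketched in a few lines (illustrative only; the activation definitions are repeated so the sketch is self-contained, and the per-branch convolutions of the real module are omitted):

```python
import numpy as np

def relu(x):  return np.maximum(x, 0.0)
def swish(x): return x / (1.0 + np.exp(-x))
def mish(x):  return x * np.tanh(np.log1p(np.exp(x)))

def multiactivation_ensemble(features):
    """Apply each activation to the same initial features and concatenate
    the responses along the channel axis."""
    return np.concatenate([relu(features), swish(features), mish(features)],
                          axis=-1)

feats = np.random.default_rng(1).normal(size=(8, 8, 16))
ensemble = multiactivation_ensemble(feats)  # channels triple: 16 -> 48
```

The channel dimension triples, so the subsequent layers can draw on all three activation views of the same features rather than committing to one nonlinearity.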

Multiactivated Cascaded Aggregation
In this module, both shallow and relatively deeper layer features were concatenated. Typically, a deep consecutive convolution operation is formulated after the initial feature extraction, and the conventional thinking is to build a deeper network for complex problems. However, we added a single convolution layer feature to complement the deeper layer features because we believed that a shallower interpretation might be more appropriate for low-level vision problems. See Figure 5.

For the single convolution path, a 3 × 3 kernel size was chosen with the same depth as the initial features. For the deeper path, five consecutive convolution layers with different kernel sizes were used. The activation functions between the layers were ReLUs; however, for both paths, a multiactivation feature ensemble was implemented as described earlier. The shallow and deeper responses were concatenated, followed by another convolution layer.

Densely Residual Feature Extraction
The densely residual operation has shown great promise in both regression and classification tasks [39]. Dense residual connections are an efficient way to emphasize hierarchical representations. For this reason, we designed a densely residual module to aggregate features for the network. The proposed design in Figure 6 also utilizes concatenation between the final and previous aggregations in support of a total hierarchy concentration. A final convolution was added to combine the three concatenated features from the densely residual layers. After collecting and concatenating the individual responses from each of the four modules, the responses were merged by a final convolution layer with a dilation rate of 2; see the overall process in Figure 3. This layer's output contains the most refined representation of the restored image. The restored image was fed into a simple loss function consisting of ℓ1 and SSIM terms.

Loss Function
We used two typical loss components, ℓ1 and SSIM, to update the parameter space. The total loss function was a simple addition of the two.
The ℓ1 term measures the distance between the ground truth clean image and the restored image:

L_ℓ1 = (1/N) Σ_i |γ_g^i − γ_p^i|.

Here, γ_g is the ground truth clean image and γ_p is the restored prediction. The secondary component is the loss derived from SSIM, another widely used similarity measure for images.
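Assuming the SSIM term enters the objective as 1 − SSIM (a common convention), the combined loss can be sketched as follows, with a simplified single-window SSIM standing in for the usual 11 × 11 Gaussian-windowed version (all function names are our own):

```python
import numpy as np

def l1_loss(gt, pred):
    """Mean absolute error between ground truth and prediction."""
    return np.mean(np.abs(gt - pred))

def ssim_global(gt, pred, data_range=255.0):
    """Simplified SSIM computed over the whole image as a single window."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = gt.mean(), pred.mean()
    var_x, var_y = gt.var(), pred.var()
    cov = ((gt - mu_x) * (pred - mu_y)).mean()
    return (((2 * mu_x * mu_y + c1) * (2 * cov + c2))
            / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))

def total_loss(gt, pred):
    # Both terms vanish for a perfect restoration, so their sum is a
    # valid joint objective balancing pixel fidelity and structure.
    return l1_loss(gt, pred) + (1.0 - ssim_global(gt, pred))
```

For identical images the loss is zero, and it grows as the prediction drifts from the ground truth in either pixel values or structure.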

Experimental Results
This section describes the overall performance of our method on both real and synthetic noisy images.

Network Implementation and Training Set
For the proposed study, we utilized the TensorFlow framework with NVIDIA GPU support. Most of the convolution layers in our network used 3 × 3 kernels, apart from the specific cases where 1 × 1, 5 × 5, and 7 × 7 kernels were used in addition to 3 × 3 kernels. For the training phase, we used the initialization method of He et al. [40] and the Adam optimizer with a learning rate of 10^−4, a typical default in many vision studies.
For training, the DIV2K dataset was used. To enable diversity in the data flow, typical rotation, blurring, contrast stretching, and inversion augmentation techniques were implemented. The training images were cropped into smaller patches. The noisy input images were created by perturbing the clean patches with additive white Gaussian noise (AWGN) with standard deviations of 15, 25, and 50.

Testing Set
We used the BSD68, Kodak24, and Urban100 datasets for the inference comparison, where clean observations were available and the noisy versions were created through the same artificial noise augmentation. The results are summarized in Table 1.
The DND, SIDD, and RN15 datasets were used to evaluate the proposed approach on images with natural noise. A brief description of the real-world noisy image dataset and the evaluation procedures are described below.
• DND: DND [41] is a real-world dataset consisting of 50 noisy images. However, the near noise-free counterparts are unavailable to the public. The corresponding server provides the PSNR/SSIM results for the uploaded denoised images.
• SIDD: SIDD [42] is another real-world noisy image dataset that provides 320 pairs of noisy images and near noise-free counterparts for training. This dataset follows a similar evaluation process as the DND dataset.
• RN15: The RN15 [26] dataset provides 15 real-world noisy images. Due to the unavailability of the ground truths, we only present visual results for this dataset.

Denoising on Synthetic Noisy Images
For evaluation purposes, we considered previous state-of-the-art studies within various contexts. The evaluation included two non-learning methods, BM3D [2] and WNNM [12], and several convolutional networks, including DnCNN [19], FFDNet [28], IrCNN [20], ADNet [43], RIDNet [5], VDN [44], and DEAMNet [8]. Table 1 shows the average PSNR/SSIM scores for the quantitative comparison. In terms of average PSNR and SSIM scores, the proposed study surpasses the previous studies by a considerable margin. We adopted three widely used datasets, BSD68, Kodak24, and Urban100, with three different AWGN noise levels: 15, 25, and 50. The code for all methods used in this evaluation, including our own source code, can be found in Appendix A.
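For completeness, the PSNR metric used throughout these comparisons is 10 · log10(MAX² / MSE); a minimal implementation for 8-bit images (our sketch) is:

```python
import numpy as np

def psnr(gt, pred, data_range=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((gt.astype(float) - pred.astype(float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images: PSNR is unbounded
    return 10.0 * np.log10(data_range ** 2 / mse)
```

Higher is better, and the score diverges to infinity as the restoration approaches the ground truth; SSIM complements it by measuring structural rather than pixel-wise agreement.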
For a visual comparison, Figures 7-9 from BSD68, Kodak24, and Urban100 are presented, respectively, with a noise level of 50. Figure 7 shows the "fireman" picture from the BSD68 dataset. The differences in the restoration are shown in detail with more controlled smoothing. From Figure 8, we see that the proposed approach avoids image cartoonization and preserves fine structure while restoring clean details. The proposed study manages to restore structural continuity compared to other methods while preserving the appropriate color and contrast of the image. The last visual comparison for the synthetic noisy images is the "Interior" picture from the Urban100 dataset, shown in Figure 9. For a better illustration of the differences, a zoomed image of the interior wall is shown, where the proposed method preserves the bricks' separating lines more clearly. We also provide Figure 10, where multiple images with different noise intensities are shown together with the proposed outputs.

Denoising on Real-World Noisy Images
The results for real-world noisy image restoration are presented in Table 2. Natural noise removal is challenging because real-world noise is not signal independent and varies within the spatial neighborhood. We chose three real noisy image datasets, the SIDD benchmark [42], the DnD benchmark [41], and RN15 [26], to analyze the generalization capability of our proposed method. For the SIDD and DnD benchmarks, the clean counterpart images are not openly distributed. Hence, the PSNR/SSIM values presented in Table 2 were obtained by uploading the results to the corresponding servers. For the RN15 dataset, there is no benchmark utility. Table 2 presents the comparative performance for both the SIDD and DnD benchmarks. Among the existing methods, VDN [44] and DEAMNet [8] perform well. However, our method achieves better results than the existing methods for both real and synthetic noise.
To demonstrate the performance of our method on real images, we also provide visual comparisons in Figures 11-13 on the SIDD, DND, and RN15 datasets, respectively. For the visual comparison on real noisy images, we included the recent VDN [44], RIDNet [5], and DEAMNet [8]. The visual comparison shows that our method tends to avoid cartoonization while effectively removing noise, suppressing artifacts, and preserving object edges. Overall, the qualitative and quantitative comparisons display an effective performance on all fronts.

Figure 12. Visual quality comparison with PSNR and SSIM scores for "Star" from the DnD dataset with real noises (for best view, zooming in is recommended).

Figure 13. Visual quality comparison for "Dog" and "Glass" from the RN15 dataset. The RN15 dataset is a set of real noisy images without clean counterparts (for best view, zooming in is recommended).

Ablation Study on Modules
In this section, we provide an ablation study on the contribution of each module. We used four different modules, which work separately and generate various features. These individual features cannot, on their own, be considered clean images. However, if we concatenate them as described in the proposed method, we obtain a clean image. In Table 4, the modules are the residual feature aggregation block (RFA), the multiactivation feature ensemble block (MFE), the multiactivated cascaded aggregation block (MCA), and the densely residual feature extraction block (DRFE). We removed each module separately and calculated the PSNR and SSIM for three different datasets. We observe that the PSNR value drops every time a module is removed. Removing the multiactivation feature ensemble block (MFE) causes the largest PSNR drop, and removing the multiactivated cascaded aggregation block (MCA) causes the largest SSIM drop. Figure 14 shows sample results for all four modules separately, together with the ground truth and the proposed method's output (for best view, zooming in is recommended).

Conclusions
In this paper, the basic strategy for the low-level denoising problem was to gather a variety of low-level features while keeping the interpretation simple by implementing relatively shallow layers. We argued that for low-level vision tasks, the principle of Occam's razor is more appropriate, and accordingly, we designed a network that focuses on gathering a variety of low-level evidence rather than providing a deep explanation of the evidence. Thus, we revisited the feature ensemble approach for the image denoising problem. Our study offered a new model that concatenates different modules to create large and varied feature maps. To enhance the performance of our network, we utilized different kernel sizes, residual and densely residual connections, and avoided deep unimodule cascaded aggregation. We carefully designed four different modules for our study, each of which helps to restore different spatial properties. Finally, we validated our network on natural and synthetic noisy images. Extensive comparisons showed the overall efficiency of the proposed study. We observed that although our SSIM scores were much higher across the board, the PSNR scores were not the best in the comparison. Our model extracts a variety of shallow features from the image; however, for higher PSNR evaluation, a deeper network may be desirable. In future work, we plan to apply a self-supervised strategy in the training procedure using the same ensemble of shallow networks. Different versions of the noisy input images are planned to be used during the self-supervised denoising training.