1. Introduction
With large numbers of images continuously spreading across the Internet owing to the rapid development of mobile Internet and social networks, there is an increasing demand for accurate, high-quality images. However, during the acquisition, compression and transmission of images, image quality inevitably deteriorates due to noise [1,2,3]. Therefore, image denoising is applied in modern image processing systems [4]. This technique aims to remove noise from a noisy image while minimizing the differences between the original image and the denoised image [5]. Various researchers have continuously investigated this technique, but it remains challenging: from a mathematical standpoint, it is a well-known ill-posed problem with no unique solution [6,7,8].
Generally, image denoising can be modeled mathematically as

$y = x + n,$

where $y$ is the noisy image, $x$ is the clean image, and $n$ is the noise, which is usually modeled as additive white Gaussian noise (AWGN) with standard deviation $\sigma$ [9]. In the past few decades, a variety of denoising methods have been proposed, which can be categorized into two groups: model-based and learning-based denoising methods [10].
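To make the degradation model concrete, the following NumPy sketch synthesizes a noisy observation under AWGN; the helper name `add_awgn` and the chosen noise level are illustrative, not part of any cited method.

```python
import numpy as np

def add_awgn(x, sigma, seed=0):
    """Corrupt a clean image x with additive white Gaussian noise: y = x + n."""
    rng = np.random.default_rng(seed)
    n = rng.normal(0.0, sigma, size=x.shape)  # n ~ N(0, sigma^2), i.i.d. per pixel
    return x + n

# Example: a flat gray image corrupted with sigma = 25 (on a 0-255 scale)
clean = np.full((64, 64), 128.0)
noisy = add_awgn(clean, sigma=25.0)
```

Because the noise is zero-mean and white, the empirical standard deviation of `noisy - clean` approaches the chosen sigma as the image grows.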
In earlier years, the typical methods were model-based. Such methods rely on the statistical analysis of the internal information of the given images. Representative examples of model-based methods are non-local means (NLM) [11] and block matching and 3D filtering (BM3D) [12]. In 2005, Buades et al. first utilized non-local self-similarity in images and proposed NLM. Based on the observation that a natural image patch has several similar counterparts inside the same image, NLM removes noise by averaging over non-local self-similar patches, which works well with images that have repeating patterns in spatially diverse regions. However, searching for similar patches within the whole image is computationally prohibitive; hence, usually, only a small vicinity of the patch is explored while looking for optimal matches [13]. Later, inspired by the idea of NLM, Dabov et al. proposed BM3D, which includes a two-step denoising process. First, the input image is denoised and a basic estimate of image blocks is generated. Then, collaborative filtering of the generated estimate increases the noise-reduction effect. However, the block-wise process may result in distorted output: as the noise level rises, the denoising performance of the model decreases and artifacts emerge, particularly in flat regions. In 2011, Dong et al. presented a non-locally centralized sparse representation (NCSR) [14]. It makes use of a sparse representation of image patches and non-local similarity, recasting the goal of image denoising as suppressing sparse coding noise. NCSR has demonstrated remarkable performance on natural images with both smooth and textured regions. However, the algorithm is unsuitable for many applications owing to the computational complexity of its non-local estimates of unknown sparse coefficients. Although these model-based methods have solid mathematical derivations, they still have a few drawbacks. First, they are frequently time-consuming owing to the complexity of their iterative optimization processes. Second, model-based methods use only the internal information of the noisy image; no external information is exploited, which considerably impairs the denoising effect when recovering flat regions and texture patterns under excessive noise [15].
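The patch-averaging idea behind NLM can be sketched as follows. This is a toy, single-pixel illustration assuming a Gaussian patch-similarity kernel and a reflect-padded search window restricted to a small vicinity, as noted above; it is not the optimized algorithm of [11].

```python
import numpy as np

def nlm_pixel(img, i, j, patch=3, search=7, h=10.0):
    """Estimate pixel (i, j) as a weighted average of pixels whose surrounding
    patches resemble the patch around (i, j) (a toy non-local means step)."""
    r, s = patch // 2, search // 2
    pad = np.pad(img, r + s, mode="reflect")
    ci, cj = i + r + s, j + r + s                      # center in padded coords
    ref = pad[ci - r:ci + r + 1, cj - r:cj + r + 1]    # reference patch
    num, den = 0.0, 0.0
    for di in range(-s, s + 1):
        for dj in range(-s, s + 1):
            pi, pj = ci + di, cj + dj
            cand = pad[pi - r:pi + r + 1, pj - r:pj + r + 1]
            w = np.exp(-np.sum((ref - cand) ** 2) / h ** 2)  # patch similarity
            num += w * pad[pi, pj]
            den += w
    return num / den
```

On a constant region every candidate patch matches the reference exactly, so all weights are equal and the pixel estimate is unchanged; on textured regions, only pixels with similar neighborhoods contribute significantly.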
With the rapid development of deep learning, learning-based methods have shown great potential in image denoising [16]. Learning-based methods aim to train a deep denoising network on datasets and learn the implicit mapping from noisy to clean images to form external priors [2]. In recent years, the popularity of learning-based methods has surpassed that of model-based methods. In 2016, Zhang et al. suggested a feed-forward denoising convolutional neural network (DnCNN) [17]. They incorporated residual learning [18] and batch normalization [19] to speed up the training process while improving denoising outcomes. With the supervision of noisy–clean paired images, it achieved superior performance compared to traditional methods. However, the model handles noisy images well only when the noise level is in the preset range; hence, it lacks generalization capability. To alleviate this lack of flexibility in DnCNN, Zhang et al. introduced FFDNet, a fast and flexible solution for CNN-based image denoising [20]. It takes an adjustable noise level map as input and can be trained to recognize the noise level. Experiments revealed that the method can handle images with different noise levels, with high performance and efficiency. However, the denoising performance of FFDNet is limited by the training dataset: it obtains excellent results only on noisy images similar to the images of the training set. In other words, FFDNet may generalize poorly to input images corrupted with complex noise, especially real-world noisy images. To improve the generalization capability of the deep CNN denoiser, in 2019, Guo et al. used both synthetic noise and real-image noise to train a convolutional blind denoising network (CBDNet) [21]. Although CBDNet performed well under different noise levels, it is less efficient and flexible owing to its separate subnets and the need for manual intervention. Soon after, Yue et al. presented a variational denoising network (VDNet) [22]. It automatically estimates the noise distribution and the noise-free image inside a unified Bayesian framework using variational inference. VDNet surpassed CBDNet in both performance and generalization capability. However, it does not consider signal autocorrelation, making it difficult to estimate some forms of real-world noise. In 2022, for enhanced control and interpretability of a learning-based denoiser, Liang et al. proposed a novel architecture called controllable confidence-based image denoising (CCID) [23]. CCID merges a deep denoising network with a reliable filter to reduce artifacts. The results revealed that CCID improves interpretability and control and outperforms former deep denoisers such as DnCNN, especially when the test data differ from the training data. Compared to model-based denoising methods, learning-based methods are advantageous in two aspects. (1) They benefit from numerous datasets: learning-based methods make full use of external information and demonstrate an extremely competitive denoising effect. (2) Although training a network to extract features is time-consuming, the inference of a CNN is very efficient at the test stage owing to the computational power of the GPU; extensive experiments demonstrate that their processing speed is mostly higher than that of model-based methods [24]. However, their strong dependence on data results in certain drawbacks. Owing to this dependence on external information, the performance of learning-based methods is limited by their training data [25]. These methods struggle with universal image denoising problems because they are not adaptive to a given noisy image whose noise level, texture or scene class differs from the dataset; some fine-scale image details may not be well preserved. Therefore, for learning-based models, strong data dependence tends to result in poor generalization capability and flexibility [26].
Although various image denoising methods have been proposed and have achieved highly promising results over the last decade, none of them is perfect owing to its technical characteristics. We can safely state that it is difficult for an individual denoiser to outperform all others under various conditions. Conversely, model-based methods and learning-based methods are complementary to each other. For example, model-based methods adapt well to images with strong self-similarity [27], whereas learning-based methods need cumbersome training to learn a model before the test stage and tend to be restricted to certain images. As a result, various strategies have been proposed to combine these methods for denoising, leveraging their respective advantages. To investigate such an integration, Choi et al. [28] considered a combination of the denoised images obtained using these methods and proposed a Consensus Neural Network (CsNet). However, there is still scope for further improvement: CsNet ignores the fact that each initial denoiser has distinct denoising characteristics in the details of local areas, so estimating a single weight for each initial denoised image is unsuitable. This study therefore aims to further enhance the denoising effect. We propose a weight map generative network for the pixel-level combination of initial denoised images to obtain an optimal, well-combined image. The contributions of this study are as follows:
Compared to CsNet [28], which sets a stand-alone weight for each initial denoised image, we introduce a finer combining granularity in this study. Specifically, we use a deep learning network to set an appropriate weight for each pixel, encoding the contribution of each initial denoised image to the combined image and ensuring that details are well retained. We call this a pixel-level combination strategy.
Conventional supervised CNN-based denoising methods need noisy–clean paired images to achieve high performance. In our study, we use an unsupervised learning method to obtain the optimal weight maps without noisy–clean paired images [29]. This reduces the cost of collecting large numbers of training pairs and gives our method excellent generalization capability.
The proposed method can be easily extended, allowing the free combination of several initial denoised images generated by any denoiser. Therefore, in our study, we performed combination experiments with different classes as well as different numbers of denoised images to find the complementarity between different denoising methods. As more efficient denoising methods are proposed in the future, our approach will continue to perform well: we can employ several methods that are more efficient and complementary to each other to further enhance the denoising effect.
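As a minimal sketch of the pixel-level combination strategy described above, the following blends two initial denoised images with a per-pixel weight map. The function name and the hand-set weight map are illustrative; in the proposed method, the weight map is produced by the generative network.

```python
import numpy as np

def combine_pixelwise(denoised_a, denoised_b, weight_map):
    """Blend two initial denoised images pixel by pixel.
    weight_map holds, per pixel, the contribution of denoised_a; the
    complementary weight (1 - weight_map) goes to denoised_b, so the
    per-pixel weights always sum to one."""
    w = np.clip(weight_map, 0.0, 1.0)
    return w * denoised_a + (1.0 - w) * denoised_b

# Toy usage: trust image A in the left half, image B in the right half.
a = np.full((4, 4), 10.0)
b = np.full((4, 4), 20.0)
w = np.zeros((4, 4)); w[:, :2] = 1.0
out = combine_pixelwise(a, b, w)
```

Because the weights are set per pixel rather than per image, each region of the output can follow whichever initial denoiser performs better there.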
The remainder of this study is structured as follows: Section 2 presents image-level and patch-level combination strategy models relevant to our study. In Section 3, we outline the image combination problem and introduce the proposed pixel-level combination strategy. In Section 4, we present our datasets, experimental setup and experimental results, including the selection of combinatorial patterns, ablation experiments and a comparison with other state-of-the-art denoising methods. Finally, this study is concluded in Section 5.
2. Related Work
Generally, the granularity of image combinations can be divided into three levels: image level, patch level and pixel level. Theoretically, the smaller the granularity of the combination, the higher the potential for denoising effect enhancement. Our method belongs to the pixel level and has the potential to obtain optimal combination results. To help the reader gain a better understanding of the background, in this section, we briefly review both the image-level image combination strategy and the patch-level image combination strategy. We first introduce the image-level combination strategy, CsNet. Subsequently, a patch-level combination strategy called structural patch decomposition (SPD) is reviewed.
2.1. Consensus Neural Network
In 2019, CsNet was proposed by Choi et al. They observed that images denoised using different denoising methods are significantly complementary, and that there may be a way to combine the denoised images to generate an excellent overall result. Therefore, they integrated different denoisers in CsNet to generate a combined denoised image that is better than those obtained by the individual denoising techniques. It mainly handles Gaussian noise. As can be seen in Figure 1, it can be divided into two stages: the combining stage and the boosting stage.
The first stage is employed for preliminary estimates to obtain the combined image. For simplicity, we use the same notation as in [28]. A set of initial estimates $\{\hat{x}_1, \ldots, \hat{x}_K\}$ can be derived using $K$ image denoisers $\{D_1, \ldots, D_K\}$. To concatenate these initial estimates, the matrix $\hat{X} = [\hat{x}_1, \ldots, \hat{x}_K]$ is constructed. For image combination, CsNet focuses on the linear combination of estimators, which keeps the model unbiased and ensures lower variance. That is, given $\hat{X}$, the linearly combined estimate can be obtained as

$\hat{z} = \hat{X} w,$

where $w = [w_1, \ldots, w_K]^{\mathsf{T}}$ is the vector of combination weights. To identify the weights $w$, Choi et al. used a convex optimal combination framework based on the previous work of Jaggi [30] to solve a convex optimization problem.
Using the estimator $M$, the CsNet method evaluates a set of typical denoisers to generate the respective mean squared error (MSE) [31] of each initial denoised image. The MSE between the ground truth $x$ and the combined estimate $\hat{z}$ is

$\mathrm{MSE}(\hat{z}) = \mathbb{E}\big[\|x - \hat{z}\|^2\big].$

According to the MSE of each denoised image, the optimal combination problem can be solved by obtaining the weight vector $w$ that minimizes the MSE:

$\min_{w} \ \mathbb{E}\big[\|x - \hat{X} w\|^2\big] \quad \text{subject to} \quad \mathbf{1}^{\mathsf{T}} w = 1, \ \ w \geq 0,$

where the constraints $\mathbf{1}^{\mathsf{T}} w = 1$ and $w \geq 0$ ensure that the total weights sum to 1 and the combined estimate is nonnegative, respectively. Subsequently, depending on the optimal weights, the denoised images are composited together.
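The convex combination objective can be illustrated with a small oracle experiment for $K = 2$. Here the true image is used to evaluate the MSE directly, whereas CsNet must estimate it without the ground truth; the sketch only conveys the shape of the optimization problem, and the grid search stands in for a proper convex solver.

```python
import numpy as np

def convex_combination_weight(x, est_a, est_b, steps=1001):
    """Oracle illustration: pick the simplex weight w in [0, 1] minimizing the
    MSE between the ground truth x and w*est_a + (1-w)*est_b."""
    ws = np.linspace(0.0, 1.0, steps)
    mses = [np.mean((x - (w * est_a + (1 - w) * est_b)) ** 2) for w in ws]
    return ws[int(np.argmin(mses))]

# Two oppositely biased estimates of a zero image: the best blend cancels the biases.
x = np.zeros((8, 8))
est_a = x + 1.0   # biased high
est_b = x - 1.0   # biased low
w = convex_combination_weight(x, est_a, est_b)
```

For these two estimates the MSE is $(2w-1)^2$, so the optimal simplex weight is $w = 0.5$, which cancels both biases exactly.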
However, after the first stage, the image quality of the combined image is not excellent. To enhance the contrast and recover the lost features of the combined image, a deep learning-based booster is used at the end of the CsNet method, where variables are updated one by one in the iterations. Specifically, based on the current estimate $\hat{z}^{(t)}$ and the observed image $y$, the next estimate $\hat{z}^{(t+1)}$ can be obtained by the equation

$\hat{z}^{(t+1)} = f\big(\hat{z}^{(t)}, y\big).$

Here, $f$ is a multi-layer convolutional neural network that implements a nonlinear mapping. In the CsNet model, $f$ is composed of three convolutional layers and three deconvolutional layers with convolution kernels of size $k \times k$. The input of the network is the pair $(\hat{z}^{(t)}, y)$. In addition, Choi et al. provided skip connections to ensure that information is not lost while passing through the network layers. To obtain a better improvement effect, the CsNet model uses cascaded networks $f^{(1)}, \ldots, f^{(T)}$, where $T$ is 3.
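The boosting stage amounts to threading the estimate through a cascade of stage networks. The sketch below uses a simple stand-in function in place of the convolutional stages, purely to show the iteration structure; the stand-in stage and its mixing rule are illustrative, not CsNet's actual booster.

```python
import numpy as np

def boost(z0, y, nets):
    """Cascaded boosting: feed the current estimate and the observed noisy
    image through each stage network in turn (stand-ins for the CNN stages)."""
    z = z0
    for f in nets:
        z = f(z, y)  # z^(t+1) = f(z^(t), y)
    return z

# Toy stand-in stage: each step nudges the estimate toward the observation.
stage = lambda z, y: 0.9 * z + 0.1 * y
z0 = np.zeros(4)
y = np.ones(4)
out = boost(z0, y, [stage] * 3)   # T = 3 cascaded stages
```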
CsNet uses multiple denoisers to improve the output image and achieves superior denoising performance compared to both model-based and learning-based denoisers. It contributes to solving noise level mismatch and the selection and combination of different denoiser types, but its granularity is still coarse, which affects the quality of the final combined images. Thus, it must employ the booster module to repeatedly refine the preliminary combined image to obtain the final high-quality noise reduction effect. Moreover, the denoising effect of CsNet is mainly dominated by the MSE estimator, which does not perform well in terms of accuracy and suppresses further improvement of the image quality. In general, the implementation process of CsNet is complicated, resulting in a long execution time and limited improvement in image quality.
2.2. Structural Patch Decomposition Approach
In addition to the image-level combination strategy, image combination can also be performed at the patch level. The patch-level combination strategy improves the granularity of image processing and retains more available details in images compared to the image-level combination strategy. In recent years, various researchers have proposed patch-level fusion schemes, among which Ma et al. proposed SPD for multi-exposure image fusion in [32]. Although SPD was not originally used for image denoising, we found that applying it to denoised image combination yields higher-quality results than typical methods such as CsNet. Therefore, in this section, we present a discussion of structural patch decomposition.
Initially, the SPD approach uses a moving window with a fixed stride to extract $K$ overlapping image patch sequences $\{x_k\}_{k=1}^{K}$ from the entire image. It decomposes a given patch $x_k$ into three parts: signal strength, signal structure and mean intensity. The decomposition formula is

$x_k = \|x_k - \mu_{x_k}\| \cdot \dfrac{x_k - \mu_{x_k}}{\|x_k - \mu_{x_k}\|} + \mu_{x_k} = c_k \cdot s_k + l_k,$

where $\|\cdot\|$ is the $\ell_2$ norm of a vector, $\mu_{x_k}$ is the mean value of the patch, and $\tilde{x}_k = x_k - \mu_{x_k}$ is the mean-removed patch. As a result, the scalars $c_k$ and $l_k$ and the vector $s_k$ have physical significances that correspond to the strength component, mean intensity component and structure component of $x_k$, respectively.
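The decomposition can be sketched in a few lines of NumPy; `spd_decompose` is an illustrative helper operating on a flattened patch, and the product $c \cdot s + l$ recovers the patch exactly.

```python
import numpy as np

def spd_decompose(patch):
    """Decompose a flattened patch into (strength c, structure s, mean l),
    so that patch == c * s + l (structural patch decomposition)."""
    l = patch.mean()                         # mean intensity component
    residual = patch - l                     # mean-removed patch
    c = np.linalg.norm(residual)             # signal strength (l2 norm)
    s = residual / c if c > 0 else residual  # unit-norm signal structure
    return c, s, l

patch = np.array([1.0, 2.0, 3.0, 6.0])
c, s, l = spd_decompose(patch)
recon = c * s + l   # exact reconstruction of the original patch
```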
First, the strength component is analyzed. The contrast of the combined patch is determined by the highest contrast among the source patches:

$\hat{c} = \max_{1 \leq k \leq K} c_k.$

Regarding the structure component, to make the combined image patch structure reflect all source image patch structures, a basic implementation of this relationship is provided by

$\bar{s} = \dfrac{\sum_{k=1}^{K} S(\tilde{x}_k)\, s_k}{\sum_{k=1}^{K} S(\tilde{x}_k)}, \qquad \hat{s} = \dfrac{\bar{s}}{\|\bar{s}\|},$

where $S(\cdot)$ is a weighting function that defines the contribution factor of each source image patch.

The mean intensity component is denoted as

$\hat{l} = \dfrac{\sum_{k=1}^{K} L(\mu_k, l_k)\, l_k}{\sum_{k=1}^{K} L(\mu_k, l_k)},$

where $L(\cdot,\cdot)$ is a weighting function. It takes two parameters as inputs: the local mean value $l_k$ of the current patch and the global mean value $\mu_k$ of the image. $L(\mu_k, l_k)$ measures the exposure of $x_k$ in its source image, imposing a high penalty when the image and/or the current patch is under- or over-exposed. After determining the $\hat{c}$, $\hat{s}$ and $\hat{l}$ components using Equations (7)–(9), respectively, the reconstructed patch can be expressed as

$\hat{x} = \hat{c} \cdot \hat{s} + \hat{l}.$
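A toy combination of co-located patches, following the three rules above, might look as follows. The power-$p$ strength weighting for the structure and the uniform weighting for the mean intensity are illustrative stand-ins for the hand-crafted weighting functions $S$ and $L$ of the original method.

```python
import numpy as np

def spd_combine(patches, p=4):
    """Combine co-located patches by SPD: take the largest strength, a
    strength-weighted average of the unit structures (re-normalized), and an
    average of the mean intensities. The power-p strength weighting and the
    uniform intensity weighting here are illustrative choices only."""
    cs = [np.linalg.norm(pt - pt.mean()) for pt in patches]
    ss = [(pt - pt.mean()) / c if c > 0 else pt - pt.mean()
          for pt, c in zip(patches, cs)]
    ls = [pt.mean() for pt in patches]
    c_hat = max(cs)                               # highest contrast wins
    weights = np.array(cs, dtype=float) ** p      # stand-in for S(.)
    s_bar = sum(w * s for w, s in zip(weights, ss)) / weights.sum()
    s_hat = s_bar / np.linalg.norm(s_bar)         # re-normalize the structure
    l_hat = float(np.mean(ls))                    # stand-in for L-weighted mean
    return c_hat * s_hat + l_hat                  # reconstructed patch

# Two co-located patches sharing the same structure but different strengths.
patches = [np.array([0.0, 2.0, 4.0]), np.array([1.0, 2.0, 3.0])]
combined = spd_combine(patches)
```

Since both toy patches share the same unit structure and mean, the combined patch keeps that structure at the higher of the two strengths.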
After the overlapping image patches are reconstructed, they are placed back into the entire image and the pixels in the overlapping patches are averaged to produce a smooth overall result.
In our study, we apply SPD to image denoising. For the co-located initial denoised patches produced by stand-alone methods, SPD decomposes them into signal intensity, signal structure and average intensity. Then, it combines the initial denoised patches by evaluating these three components. The combined patch always retains rich details, image structure information and intensity information. Finally, these combined patches are placed back into the entire denoised image. As a result, in comparison with image-level combination methods, SPD uses the image local information more effectively during combination and produces a higher-quality denoised image.
However, some artifacts around edges are produced because the model cannot create a sufficiently smooth transition between exposures near strong edges. In addition, the weights of each component of the SPD method are handcrafted, which does not ensure that the optimal combination results are achieved in all parts of the image. Hence, the patch-level combination strategy is still rough and has scope for improvement.
5. Conclusions
In this study, a weight map generative network was proposed. Rather than designing a novel denoising model from scratch, we combined the initial denoised images produced by different typical and effective denoisers at the pixel level to obtain optimal denoised images. Specifically, in our experiments, we found that the combination of BM3D and FFDNet performs best. BM3D exploits non-local self-similarity, so the internal information of the target image can be fully mined, alleviating excessive reliance on the dataset; FFDNet uses its deep learning network to construct an external prior. After the two models denoise the noisy image separately, the superior parts of the two results are combined via the corresponding weight maps to generate an optimal denoising result. We used an unsupervised deep learning approach to generate the weight map, as manually setting a specific weight value for each pixel of an image is complex and difficult. We call this method the pixel-level image combination strategy. The experimental results demonstrate that our method presents a significant overall improvement compared to other state-of-the-art methods, surpassing the second-best method by 0.03 dB to 1.42 dB on average on two datasets with different noise levels. In particular, our method increased the PSNR results by an average of 0.69 dB compared to BM3D, 0.48 dB compared to FFDNet and 0.34 dB compared to CsNet. The method also performed well in visual comparisons of image details and textures, in processing both common and color test images. It should be noted that our method is an extensible image combination strategy, which is reflected in two aspects. Our method is flexible in the selection of the type and number of initial denoising methods.
As new denoising methods are continuously proposed, future studies can select more efficient complementary methods to further increase the performance in terms of execution efficiency and denoising effect. Moreover, our method is based on a general image combination strategy, meaning that it is not limited to image denoising. It would be reasonable to extend our image combination strategy to other image processing problems, such as image deblurring or low-light image enhancement.
Generally speaking, the metrics for evaluating denoising results cover two aspects: effect and efficiency. For every new denoising method, the most important goal is to improve the denoising effect. In this work, we focus on a combination of stand-alone methods to improve the denoising effect; therefore, the processing time is inevitably longer than that of stand-alone denoising methods, and investigating how to reduce the processing time was beyond the scope of this study. In future research, we will explore methods to reduce the processing time while maintaining an excellent denoising effect. Specifically, we may achieve this in the following ways: with the development of denoising techniques, a natural but fundamental idea is to find more efficient stand-alone denoising methods to reduce the pre-processing time; moreover, we can also improve the efficiency of our method by optimizing our weight map generative network. At present, there are two general implementations of combination networks: unsupervised and supervised. Our network belongs to the unsupervised approach, which allows its network parameters to adapt to any given initial denoised images without requiring training pairs. This image-specific denoising network allows us to produce optimal combined images, but with a longer processing time. Although supervised networks have advantages in terms of testing time, they require a large number of labeled training images, which are expensive to collect. In addition, most supervised combination networks suffer from poor generalization ability and fail to achieve optimal performance on testing images that differ from the training images. The recently proposed U2Fusion [37] has inspired us to develop a novel method to combine the advantages of both supervised and unsupervised approaches. We can design our proposed network to cooperate with a supervised network, such as the DenseNet [37] used in U2Fusion. For a given noisy image, our UWMGN always produces its optimal denoised image. Therefore, in the absence of ground truth images, the labeled images required by the supervised network in the training phase can be generated by our UWMGN. In the testing phase, the combined images are then generated directly from the trained supervised network, which improves the execution efficiency while maintaining excellent denoising results.