Image Denoising Using a Novel Deep Generative Network with Multiple Target Images and Adaptive Termination Condition

Abstract: Image denoising, a classic ill-posed problem, aims to recover a latent image from a noisy measurement. Over the past few decades, a considerable number of denoising methods have been studied extensively. Among these methods, supervised deep convolutional networks have garnered increasing attention, and their superior performance is attributed to their capability to learn realistic image priors from a large number of paired noisy and clean images. However, if the image to be denoised is significantly different from the training images, this can lead to inferior results, and the networks may even produce hallucinations by using inappropriate image priors to handle an unseen noisy image. Recently, deep image prior (DIP) was proposed, and it overcame this drawback to some extent. The structure of the DIP generator network is capable of capturing the low-level statistics of a natural image using an unsupervised method with no training images other than the image itself. Compared with a supervised denoising model, the unsupervised DIP is more flexible when processing image content that must be denoised. Nevertheless, the denoising performance of DIP is usually inferior to that of current supervised learning-based methods using deep convolutional networks, and it is susceptible to the over-fitting problem. To solve these problems, we propose a novel deep generative network with multiple target images and an adaptive termination condition. Specifically, we utilized mainstream denoising methods to generate two clear target images to be used with the original noisy image, enabling better guidance during the convergence process and improving the convergence speed. Moreover, we adopted the noise level estimation (NLE) technique to set a more reasonable adaptive termination condition, which can effectively solve the problem of over-fitting.
Extensive experiments demonstrated that, according to the denoising results, the proposed approach significantly outperforms the original DIP method in tests on different databases. Specifically, the average peak signal-to-noise ratio (PSNR) performance of our proposed method on four databases at different noise levels is increased by 1.90 to 4.86 dB compared to the original DIP method. Moreover, our method achieves superior performance against state-of-the-art methods in terms of popular metrics, which include the structural similarity index (SSIM) and feature similarity index measurement (FSIM). Thus, the proposed method lays a good foundation for subsequent image processing tasks, such as target detection and super-resolution.


Introduction
During acquisition and transmission, the quality of digital images inevitably degrades owing to corruption from various causes. Therefore, the ability to recover a clean image from a noisy one is of great importance, and image denoising is a fundamental step in all image processing pipelines. In the computer vision field, image denoising has been a research hotspot since the 1990s. After decades of research, many denoising algorithms have achieved good results through approaches such as non-local self-similarity in natural images [1][2][3], low-rankness-based models [4,5], sparse representation-based models [3,6,7], and fuzzy (or neuro-fuzzy)-based models [8,9]. Nevertheless, researchers are still aiming to further improve the performance of image denoising algorithms.
Existing denoising algorithms can be roughly divided into internal algorithms and external algorithms [10]. Internal algorithms utilize the noisy image itself, while external algorithms exploit clean, natural images related to the noisy image. Internal image denoising algorithms include filter algorithms, low-rankness-based models, and sparse representation-based algorithms. Representative examples of filter algorithms are the non-local means (NLM) algorithm and the block-matching and 3D filtering (BM3D) algorithm. The NLM algorithm [1], proposed by Buades et al. in 2005, exploits non-local self-similarity in natural images. It first finds similar patches and then takes their weighted average to obtain the denoised patches. Although it exhibits excellent performance, the NLM algorithm is limited by its inability to identify truly similar patches in a noisy environment. BM3D [2], a benchmark denoising algorithm, starts with block-matching of each reference block and obtains 3D arrays by grouping similar blocks together. The authors used a two-step algorithm to denoise an image: first, they denoised the input image simply and obtained a basic estimate; next, they achieved an improved denoising effect through collaborative filtering of the basic estimate. Among low-rankness-based methods, nuclear norm minimization (NNM) and weighted nuclear norm minimization (WNNM) are two well-known algorithms. The NNM algorithm [4] was proposed by Ji et al. for video denoising. In their work, the problem of removing noise was transformed into a low-rank matrix completion problem, which can be solved well by singular value decomposition. However, the authors weighted each singular value equally to ensure the convexity of the objective function, which severely restricts the method's capability and flexibility when dealing with denoising problems. Building on the NNM algorithm, the WNNM algorithm proposed in [5] takes advantage of the non-local self-similarity of the image for denoising. Among sparse representation-based algorithms, the K-singular value decomposition (K-SVD) algorithm, the learned simultaneous sparse coding (LSSC) algorithm, and the non-locally centralized sparse representation (NCSR) algorithm are three noteworthy examples. K-SVD [6] is a classic dictionary learning algorithm, which utilizes the sparsity and redundancy of over-complete learned dictionaries to produce high-quality denoised images. LSSC [3] combines the self-similarity of image patches with sparse coding to further boost denoising performance. The NCSR algorithm [7] was proposed by Dong et al. and utilizes the non-local self-similarity and sparse representation of images. It introduces the concept of sparse coding noise, with the goal of suppressing this noise to denoise an image. In general, most of these traditional denoising methods use hand-crafted image priors and multiple manually selected parameters, leaving ample room for improvement.
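The patch-averaging idea behind NLM can be illustrated with a minimal, unoptimized sketch; the patch size, search radius, and filtering parameter h below are illustrative choices, not the values used in [1]:

```python
import numpy as np

def nlm_denoise(img, patch=3, search=7, h=0.1):
    """Minimal non-local means sketch: each pixel is replaced by a weighted
    average of pixels whose surrounding patches look similar to its own."""
    pad = patch // 2
    padded = np.pad(img, pad, mode="reflect")
    H, W = img.shape
    out = np.zeros_like(img, dtype=float)
    for i in range(H):
        for j in range(W):
            ref = padded[i:i + patch, j:j + patch]  # patch centered at (i, j)
            weights, values = [], []
            # restrict comparisons to a local search window for speed
            for di in range(max(0, i - search), min(H, i + search + 1)):
                for dj in range(max(0, j - search), min(W, j + search + 1)):
                    cand = padded[di:di + patch, dj:dj + patch]
                    d2 = np.mean((ref - cand) ** 2)
                    weights.append(np.exp(-d2 / (h * h)))  # similar patch -> large weight
                    values.append(img[di, dj])
            out[i, j] = np.average(values, weights=weights)
    return out
```

The exponential weighting is what makes the averaging "non-local": any pixel in the search window contributes, but only in proportion to how well its neighborhood matches the reference patch.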
In recent years, deep learning-based methods have become a popular research direction in the field of image denoising. These methods can be categorized as external methods, and their denoising performance is superior to that of internal methods. The main idea is to collect a large number of noisy-clean image pairs and then train a deep neural network denoiser using end-to-end learning. These methods have significant advantages in accumulating knowledge from big datasets; thus, they can achieve superior denoising performance. In 2017, Zhang et al. proposed the denoising CNN (DnCNN) [11], which exploits the residual learning strategy to remove noise. They introduced the batch normalization technique, as it not only reduced the training time but also boosted the denoising effect quantitatively and qualitatively. However, DnCNN is only effective when the noise level is within a pre-set range. Hence, Zhang et al. proposed FFDNet in [12]. FFDNet showed considerable improvement in flexibility and robustness using a single network. Specifically, it is formulated as x = F(y, M; Θ), where x is the expected output, y is the input noisy observation, and M is a noise level map. In the DnCNN model x = F(y; Θ), the parameters Θ change with the noise level. In the FFDNet model, M is provided as an input, and the parameters are independent of the noise level. Therefore, FFDNet can handle different noise levels in a flexible manner using a single network. The consensus neural network (CsNet) was proposed by Choi et al. [13]; it combines multiple relatively weak image denoisers to produce a satisfactory result. CsNet exhibits superior performance in three aspects: solving the noise level mismatch, incorporating denoisers for different image classes, and uniting different denoiser types. In summary, these supervised denoising networks are exceedingly effective when supplied with plenty of noisy-clean image pairs for training, but collecting clean ground-truth images in many real-world scenarios is very difficult. Moreover, if the learned image priors differ significantly from the image to be denoised, supervised denoising networks tend to produce hallucination-like effects when handling an unseen noisy image, because the previously learned image statistics cannot handle the unseen image content and noise level well. These networks have a strong data dependence [14], leading to a lack of flexibility.
To overcome the aforementioned limitations, researchers have focused on training unsupervised denoising networks without training images. Recently, research on generative networks using the deep image prior (DIP) framework [15] demonstrated that even if only the input image itself is used in training, deep convolutional neural networks (CNNs) can still provide superior performance on various inverse problems. No prior training is required, and random noise is used as the network input to generate denoised images. DIP can be widely used in image denoising, super-resolution, and other image restoration problems. Because the hyper-parameters of DIP are determined based on the specific noisy image, it may, in some cases, achieve better denoising results than supervised denoising models. As shown in Figure 1, although the general denoising performance of DIP is inferior to that of DnCNN, we can still find some local details, such as the magnified part of the images, that DIP preserves more accurately than DnCNN. The reason is that the prior knowledge captured by DnCNN cannot handle the subtle information that DIP can. DIP performs inference by stopping the training early. However, in the original DIP model, the early stopping point is set as a fixed number of iterations based on experimental data, so the result is not always optimal. Furthermore, the noisy image is used as the target image, which provides poor guidance and leads to slow convergence of the generative network. Thus, the denoising performance of DIP is much lower than that of supervised deep learning methods in some cases, and it still leaves room for improvement. In view of the limitations of the existing DIP method, we propose a novel deep generative network with multiple target images and an adaptive termination condition, which not only retains the flexibility of the original DIP model but also improves denoising performance. Specifically, in addition to the noisy image, we use two target images of higher quality to participate in the formation of the loss function. Furthermore, we adopt a noise level estimation (NLE) method to automatically terminate the iterative process to resolve the early stopping problem, prevent over-fitting, and ensure an optimal output image. The remainder of the paper is organized as follows: Section 2 introduces a literature review of related work. In Section 3, we describe the proposed approach in detail. Section 4 discusses our experimental results and analysis. We discuss our current work and future work in Section 5. Finally, we conclude this paper in Section 6.

Background
Image denoising is the most fundamental inverse problem of image processing, and its purpose is to recover the underlying image from its noisy measurement. In most cases, image denoising is an ill-posed problem; based on the noisy observation, we can always find many reasonable images that could belong to the clean image manifold. The image denoising problem can be described by a simple mathematical formula: y = x + n, where y is the noisy observed image, x is the clean, noise-free image, and n is the noise. Generally, n is assumed to be additive white Gaussian noise (AWGN), which is widely used in the field of image denoising. The denoising problem requires finding the denoised image x that is closest to the ground-truth image.
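For concreteness, the degradation model y = x + n with AWGN can be simulated in a few lines; the image size and noise level below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((64, 64))              # latent clean image (toy stand-in), values in [0, 1]
sigma = 25 / 255.0                    # AWGN standard deviation (sigma = 25 on the 8-bit scale)
n = rng.normal(0.0, sigma, x.shape)   # additive white Gaussian noise
y = x + n                             # observed noisy image: y = x + n
```

Every denoiser discussed below receives only y (and, for supervised methods, training pairs) and tries to produce an estimate as close to x as possible.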

Deep Neural Network with Training Pairs
A deep neural network with training pairs is a type of supervised learning method that requires training on large datasets. It aims to map a noisy image to the clean image manifold so that, once trained, it can remove noise. When a large number of training pairs are available, a neural network can be trained in the following manner: θ* = arg min_θ Σ_i || f(x_i^noise; θ) − x_i^label ||², where θ ∈ R^L represents the trainable variables, f: R^N → R^N indicates the neural network, x_i^label ∈ R^N denotes the i-th training label, and x_i^noise ∈ R^N is the network input for the i-th training pair. In a CNN, θ contains the convolution filters and bias terms of all layers. Once trained, the network can be applied to image denoising [16][17][18][19]. Compared with traditional denoising methods such as BM3D, WNNM, and NLM, deep learning-based methods show superior denoising performance by restoring more image details. These supervised deep learning-based methods require a large number of training pairs to learn the network hyper-parameters. These parameters, denoted by θ, have a strong data dependence [14]; specifically, when the image content and noise level values are not uniformly distributed in the image database, the denoising results will be poor. Once the training model is determined, the parameters do not change during the test stage. Therefore, when a noisy image is noticeably different from the training images, the neural network may produce non-existent reconstructed content, resulting in poor denoising performance.
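As a toy illustration of this supervised objective, the sketch below replaces the CNN with a deliberately tiny per-pixel affine map a*y + b and fits θ = (a, b) to synthetic training pairs by gradient descent; it is a minimal instance of the objective above, not a realistic denoising network:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(y, theta):
    """Tiny stand-in for the network f(.; theta): a per-pixel affine map."""
    a, b = theta
    return a * y + b

def mse_loss(theta, pairs):
    # supervised L2 objective: mean over pairs of || f(x_noise; theta) - x_label ||^2 / N
    return float(np.mean([np.mean((f(xn, theta) - xl) ** 2) for xl, xn in pairs]))

# synthetic (label, noisy) training pairs, sigma = 0.1
labels = [rng.random((16, 16)) for _ in range(20)]
pairs = [(xl, xl + rng.normal(0.0, 0.1, xl.shape)) for xl in labels]

theta = np.zeros(2)
for _ in range(300):                                   # plain gradient descent on theta
    grad = np.zeros(2)
    for xl, xn in pairs:
        r = f(xn, theta) - xl                          # residual of the L2 data term
        grad += np.array([np.mean(r * xn), np.mean(r)])  # d(loss)/da, d(loss)/db
    theta -= 0.5 * grad / len(pairs)
```

Even this two-parameter "network" learns a shrinkage factor a below 1, the statistically sensible response to additive noise; a real CNN learns a far richer, spatially adaptive mapping, but through exactly the same kind of objective.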

Deep Image Prior
Different from supervised deep learning-based methods, DIP is regarded as an unsupervised learning method that does not require a dataset with a large number of clean target images for training. The general idea is similar to the adaptive dictionary learning method. Ulyanov et al. [15] demonstrated that untrained networks can capture some low-level statistics of natural images, particularly through the translation invariance of local convolutions. A series of such operators can capture pixel neighborhoods on multiple scales. Let x_0 ∈ R^N be a distorted image; the training process can be characterized as: θ* = arg min_θ || x_0 − f(θ, z) ||², x = f(θ*, z), where the network input z ∈ R^M is random noise, and x ∈ R^N is the denoised output image. The network mainly uses a U-Net-style encoder-decoder architecture [20], where z is a fixed 3D tensor with the same spatial size as x and 32 feature maps. The network has a large number of parameters. Specifically, the encoder portion is a contracting path containing maximum pooling layers and stacked convolutions, while the decoder portion is an expanding path containing nearest-neighbor and bilinear upsampling. The encoder is composed of four downsampling layers and four convolutional blocks, while the decoder contains four upsampling layers and four convolutional blocks. No training pair is required, and f(θ, z) starts from a random initialization. Given the noisy target x_0, the denoised image x is acquired by minimizing the reconstruction error x_0 − f(θ, z) over z and θ. The method starts with initial values of z drawn from a zero-mean Gaussian distribution, and θ is optimized by gradient descent.
Figure 2 schematically depicts the use of DIP with a fixed number of iterations in the optimization process. Here, Ulyanov et al. optimized Equation (3) by using a data term such as the L2 distance, which compares the generated image with x_0: E(x, x_0) = || x − x_0 ||². The ground truth x_gt has the non-zero cost E(x_gt, x_0) > 0. As shown in Figure 2, if it runs for long enough, DIP will reach a solution (x_i = x_0) that is quite far from x_gt. However, the optimization path will usually pass close to x_gt, and an early stopping point (here at step t*) will yield a good solution. Ulyanov et al. [15] showed that this prior is comparable to state-of-the-art learning-free image denoising methods such as BM3D [2]. The prior encodes the hierarchical self-similarity exploited by dictionary-based methods [21] and non-local methods (such as BM3D). Deep layered networks with skip connections are used for denoising, and these skip connections play a vital role in the network architecture.
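The benefit of early stopping can be reproduced in a linear 1-D toy model. Below, a generator with decaying per-frequency gains stands in for the conv-net's bias toward smooth images (an assumption made purely for illustration, not the actual DIP architecture); the closed-form gradient-descent iterates fit the smooth signal first and the noise only later, so the error to the ground truth dips and then rises, exactly the behavior sketched in Figure 2:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 128
x_gt = np.sin(2 * np.pi * np.arange(N) / N)   # smooth ground-truth signal
x0 = x_gt + rng.normal(0.0, 0.5, N)           # noisy target: x0 = x_gt + n

X0, Xgt = np.fft.rfft(x0), np.fft.rfft(x_gt)  # work per Fourier coefficient
k = np.arange(len(X0))
s = 1.0 / (1.0 + k)    # decaying per-frequency gains: low frequencies are
                       # fitted fastest, mimicking a conv-net's smoothness bias
eta = 0.5
errors = []
for t in [0, 50, 200, 1000, 20000]:
    # closed-form gradient-descent iterate of || x0 - generator ||^2
    # for each frequency: fitted fraction 1 - (1 - eta * s_k^2)^t
    Xt = X0 * (1.0 - (1.0 - eta * s ** 2) ** t)
    errors.append(float(np.mean(np.abs(Xt - Xgt) ** 2)))
```

Because the signal lives at low frequencies and the noise is spread over all of them, an intermediate iteration count reconstructs the signal while leaving most of the noise unfitted; running to convergence reproduces the noisy target itself.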

Drawbacks of DIP
Despite the flexibility DIP shows in image denoising, its results are in some cases not optimal. First, the generators used for DIP are usually over-parameterized; that is, the number of network parameters is greater than the number of output dimensions, and too many iterations result in an empirically over-fitted image. In Figure 3, it can be seen that for each curve, the peak signal-to-noise ratio (PSNR) result continuously improves until it reaches a specific iteration; beyond this iteration, the PSNR curve begins to decline. Thus, if the iteration process does not stop at the appropriate iteration, DIP experiences under- or over-fitting problems. Though Ulyanov et al. set a fixed number of iterations (early stopping point) in the deep neural network, it was based on experimental data; thus, it could not guarantee the optimal denoising effect. As shown in Figure 3, regardless of where the iterations end, the three output images could not all achieve the optimal denoising effect at the same iteration, owing to differences in their optimal early stopping points. Second, according to the data listed in Table 1, the denoising effect of DIP is inferior to that of the mainstream FFDNet method. The main reason is that DIP's loss function is defined as: Loss = MSE(x̂_i, x_0), where MSE is the mean square error, x̂_i is the output image of the deep neural network, and x_0 is the noisy image. The guiding ability of the noisy image x_0, which controls the final convergence direction of the output image, is limited. Notably, more noise in the noisy image further weakens its guiding ability, causing slow iteration convergence and poor denoising performance.

Multiple Target Images
First, we considered a new approach to enhance the guidance of the loss function described in Equation (5) by adding two sub-terms. In other words, we added two images with higher guiding ability (higher image quality) to participate in the calculation of the loss function. Specifically, we applied two mainstream denoising methods (FFDNet and BM3D) to each noisy image, thereby obtaining two preliminary denoised images x_1 and x_2. Next, we added the MSE values with respect to the two preliminary denoised images (x_1 and x_2) to the loss function. The new loss function can be computed as: Loss = MSE(x̂_i, x_0) + MSE(x̂_i, x_1) + MSE(x̂_i, x_2), where x̂_i is the output image of the network, x_0 represents the noisy image, and x_1 and x_2 are the preliminary denoised images produced by FFDNet and BM3D, respectively. As shown in Figure 4, the proposed approach starts with random weights θ_0, which we iteratively update to minimize the objective function described in Equation (6). For each iteration i, the weights θ_i are used to generate the image x̂_i = f_θi(z), where the mapping f is a neural network with parameters θ_i and z is a fixed tensor. The image x̂_i is used to calculate the non-zero cost E(x̂_i; x_0, x_1, x_2). The weights θ_i are then updated using the stochastic gradient descent (SGD) training method. The advantage of this method is that it utilizes the preliminary denoised images to construct the loss function and can thereby adjust the evolution direction of the generative network model. This ensures that the network output image x̂_i evolves in a reasonable direction within the solution space (close to the ground truth x_gt). The schematic diagram in Figure 5 shows the center of gravity of x_0, x_1, and x_2, i.e., the point where the network output image x̂_i finally converges in the solution space after adopting the new hybrid loss function. This results in an output image x̂_i that is closer to the undistorted image x_gt after a given number of iterations. It should be noted that the preliminary denoised image x_1 is obtained from FFDNet, which utilizes external information captured from a training image set, while x_2 is obtained from BM3D, which utilizes the internal self-similarity information of the image. Consequently, the proposed method essentially utilizes both the internal and external prior constraints of the image to remove noise.
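The hybrid objective of Equation (6) is straightforward to express in code (a sketch; in the actual method, the output image is produced by the generative network rather than chosen freely):

```python
import numpy as np

def hybrid_loss(x_out, x0, x1, x2):
    """Equal-weight three-term objective of Equation (6): MSE to the noisy
    image x0 plus MSE to the two preliminary denoised targets
    x1 (from FFDNet) and x2 (from BM3D)."""
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    return mse(x_out, x0) + mse(x_out, x1) + mse(x_out, x2)
```

Note that the unconstrained minimizer of this objective is the pixel-wise centroid (x_0 + x_1 + x_2)/3, which corresponds to the center of gravity depicted in Figure 5; the network parameterization determines the path the output takes toward it.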

Adaptive Termination Condition
Second, to solve the problem of over- or under-fitting, we adopted the previously proposed NLE module [22], which can assess the severity of the noise interference and obtain the noise level value of the noisy image, allowing us to set a more reasonable adaptive termination condition. Specifically, the residual image n_i = x_0 − x̂_i can be obtained by subtracting the i-th output image of the deep generative network from the noisy image x_0. In the early stages of the network iterations, the network output image x̂_i is far from the undistorted image, so the standard deviation std(n_i) of the residual image n_i is relatively large. Upon arriving at the appropriate i-th iteration, the standard deviation std(n_i) of the residual image n_i should be close to the noise level value σ of the noisy image measured by the previously proposed NLE module. Therefore, std(n_i) ≈ σ is used in our work to adaptively terminate the iteration process. In short, with an accurate noise level value σ, we can determine when to terminate the iteration process after the appropriate number of iterations has been completed. As shown in Figure 5, the adaptive termination point x*_proposed of the proposed method is closer to the optimal point x_gt than that of the DIP method. Meanwhile, Figure 6 shows that our adaptive termination condition set the termination step number at 2801, which is very close to the optimal iteration step, 2835, in the iterative process. Thus, the proposed approach can resolve the early stopping problem and achieve near-optimal denoising performance.
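The stopping rule std(n_i) ≈ σ can be sketched as a simple per-iteration check (the tolerance below is an assumed hyper-parameter for illustration; the paper does not specify one):

```python
import numpy as np

def should_stop(x0, x_i, sigma_est, tol=0.005):
    """Adaptive termination rule: stop when the standard deviation of the
    residual n_i = x0 - x_i matches the NLE-estimated noise level sigma_est.
    tol is an assumed tolerance, not a value from the paper."""
    return abs(float(np.std(x0 - x_i)) - sigma_est) < tol
```

Intuitively, while the network is still under-fitted the residual contains leftover image structure and its std exceeds σ; once the network starts reproducing the noise itself, the residual shrinks below σ, so crossing σ marks the right moment to stop.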

Experiments

Datasets and Experimental Setup
To evaluate our method comprehensively and verify its effectiveness, we conducted extensive experiments and compared it with DIP and eight other state-of-the-art image denoising methods: BM3D [2], NCSR [7], WNNM [5], DnCNN [11], FFDNet [12], TWSC [23], RED-Net [24], and CsNet [13]. We conducted denoising experiments on four datasets. The first dataset, shown in Figure 7, comprises 10 images commonly used in the literature: six images with a size of 512 × 512 (Barbara, Boat, Couple, Hill, Lena, and Man) and four images with a size of 256 × 256 (Cameraman, House, Monarch, and Peppers). For the second dataset, illustrated in Figure 8, we randomly selected 50 natural images from the Berkeley segmentation dataset (BSD) [25]. The third dataset contains 10 images obtained randomly from the Flickr1024 database [26], which consists of 1024 high-quality images covering diverse scenarios; Figure 9 shows some examples. The 10 images in the fourth dataset were randomly selected from Urban100, which contains 100 high-resolution images with various real-world structures; Figure 10 shows some representative images from this dataset.
To test the denoising performance of the proposed method objectively, we utilized three widely accepted image quality evaluation criteria [27][28][29]: PSNR, the structural similarity index (SSIM) [30], and the feature similarity index measurement (FSIM) [31]. In addition, we compared the results visually to assess the quality of the denoising effects subjectively. We performed our experiments on a Lenovo desktop with a 4.00 GHz eight-core Intel Core i7-6700K CPU and 16 GB of RAM.
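Of the three criteria, PSNR has a particularly simple form; a reference implementation for images on a [0, 1] scale might look like:

```python
import numpy as np

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB for images on a [0, peak] scale:
    10 * log10(peak^2 / MSE)."""
    mse = float(np.mean((reference - test) ** 2))
    return 10.0 * np.log10(peak ** 2 / mse)
```

For 8-bit images, peak would be 255 instead; SSIM and FSIM are structurally more involved (local statistics and feature maps, respectively) and are typically taken from an image-processing library.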

Experimental Results and Analysis
In this subsection, we present the PSNR results of the proposed method and the original DIP method on 10 commonly used test images. Table 2 shows the results for noise levels σ ∈ [10, 20, 30, 40, 50, 60]. The highest PSNR values for each noise level are highlighted in bold. According to the PSNR results shown in Table 2, our approach achieved better performance in all cases compared with the original DIP method. In particular, the result for the Barbara image at noise level σ = 30 is notable. Even though the Barbara image has abundant texture details and is complex, our proposed method increased the PSNR result by 4.35 dB. The reason is that our proposed method is especially suited to denoising images with complex textures because it utilizes two target images of high quality. Additionally, the minimum increase in the PSNR value was observed on the Hill image with the noise level σ = 10, where it reached 1.15 dB. The SSIM and FSIM indexes of the two methods on the 10 commonly used test images were also computed and are shown in Tables 3 and 4, respectively. These results show that the proposed method surpassed the DIP method in both SSIM and FSIM, which confirms that our method significantly improved local structure preservation and global brightness consistency. We also performed experiments to compare the time efficiency of the two methods. In Table 5, the PSNR value in the second column is the best result that the original DIP method can achieve. The following rows list the number of iterations and the time required for DIP and our method, respectively, to achieve that value. Compared with the original DIP method, the proposed method requires less time to reach the same PSNR value, which shows that it surpasses the original DIP method not only in denoising performance but also in time efficiency. Moreover, we present the average PSNR, SSIM, and FSIM results of the eight other denoising methods on the 10 commonly used images for noise levels σ ∈ [10, 20, 30, 40, 50, 60] in Tables 6-8, respectively. From Table 6, we can draw the following conclusions. First, although the original DIP method is more flexible, its overall denoising effect is inferior to that of the mainstream denoising methods. Second, our method surpassed the other mainstream denoising methods and obtained the highest average PSNR results. Specifically, it outperformed the deep learning-based method FFDNet by 0.48 to 1.23 dB. Table 7 shows that the original DIP method was the second-best method and achieved impressive SSIM results at noise levels σ ∈ [10, 20, 30, 40, 50]. However, the SSIM results of our method were higher than those of DIP, improving the SSIM values by 0.0184 to 0.0901. Further, from Table 8, it can be seen that the proposed method also achieved the highest FSIM results in all cases.
To test the robustness of the proposed method, we conducted experiments on the BSD dataset, in which the texture of the images is more complex, making the task of image denoising more difficult. The PSNR performance of the nine competing denoising methods is shown in Table 9. The denoising effect of each method on this dataset showed varying degrees of decline compared to the average PSNR results achieved on the first dataset, shown in Table 6. Nevertheless, it is apparent that the PSNR results obtained by our method still outperformed all other methods. Especially when the noise level was set to 10, the improvement was significant (e.g., an average improvement of 3.31 dB over the DIP method). Tables 10 and 11 list the average SSIM and FSIM results for all 10 methods under six different noise levels. We can observe that our method obtained the highest SSIM and FSIM values; additionally, the improvements obtained by our proposed method in both the SSIM and FSIM results are noteworthy.
In addition to the traditional databases, we performed experiments on a larger dataset, Flickr1024, which contains a wide variety of images. The average PSNR results are shown in Table 12. The PSNR results obtained by our method are clearly superior to those of the other nine methods. Tables 13 and 14 list the average SSIM and FSIM values obtained by the 10 methods under six different noise levels. The results show that our method also achieved excellent performance in terms of SSIM and FSIM. Compared to DIP, the proposed method boosts the average SSIM and FSIM values by 0.0607 to 0.1423 and by 0.0144 to 0.0508, respectively.
To further evaluate the applicability of our method comprehensively, we randomly selected 10 high-resolution images from Urban100. From Table 15, we can observe that the WNNM method obtains good results when processing high-resolution images. Nevertheless, our method still outperforms it and achieves the highest PSNR results. As shown in Table 16, our method obtained the highest average SSIM results; compared with the other nine methods, the improvement in the values was approximately 0.0104 to 0.1023. Moreover, Table 17 shows that our method obtained the best average FSIM results.
The experimental results clearly show that our proposed approach outperformed the existing state-of-the-art denoising methods on four classical, widely used datasets. Figure 11 shows a visual comparison of the denoising results for one image selected randomly from BSD with a noise level σ = 40. In the denoised images, we chose to evaluate a portion of the back thigh of a tiger, which is magnified and displayed in the bottom right corner of each image for better visualization. It can be seen that DIP exhibited poor denoising performance, as some details were lost: the grass that overlaps the thigh of the tiger is completely unobservable, and the position of the spots on the thigh is distorted compared to the original image. In the image denoised by RED-Net, the blurry spots of noise could not be removed effectively, leading to unsatisfactory results. Although WNNM, NCSR, and TWSC produced smoother edges compared to DIP, the texture details were not preserved. While DnCNN, FFDNet, and CsNet retained more texture details, they were prone to generating over-smoothed artifacts. From the magnified part of the image obtained by our proposed approach, we can observe that the details of the grass were strengthened and the discernibility of the fur was improved, which demonstrates that the image denoised by our method is close to the original image. Compared with the nine above-mentioned methods, our proposed method preserved more local edges and high-frequency components, leading to a denoised image with better visual effects. Overall, our proposed method yielded satisfactory visual quality compared with the state-of-the-art denoising methods and increased the PSNR value to 28.08 dB.

Discussions
It is well known that Gaussian noise is widely used in image denoising research; thus, our generative network can fully handle such noise. However, real-world noise, such as Poisson noise, Gaussian-Poisson noise, and salt-and-pepper noise, is usually non-Gaussian. Poisson noise and Gaussian-Poisson noise are so-called signal-dependent noise: within an image, their noise levels are variable, while the noise level of Gaussian noise is fixed. To handle these cases, we can exploit the average noise level [32] rather than a fixed noise level in our method. To handle salt-and-pepper noise, we must utilize the corresponding denoising algorithms to obtain the preliminary images and use the noise ratio as the termination condition; that is, the criterion for over-fitting is no longer the noise level, but the noise ratio. Therefore, under the framework of our method, as long as the preliminary denoised images and the iteration termination conditions are modified accordingly, salt-and-pepper noise and other types of noise can also be handled well.
In this work, we adopted a mixed loss function in which the three terms have the same weight, and achieved satisfactory results. We include the noisy image to utilize its internal information, but its guiding ability is degraded by noise to varying degrees. Theoretically, when the noise level is relatively low, the noisy image contains more useful information, so it can be given more weight; when the noise level is relatively high, the noisy image is seriously disturbed and contains less useful information, so it should be given less weight. In future work, we will consider assigning different weights to the terms to further improve the denoising performance of our generative network.
Further, we use the MSE loss function, which employs the L2 norm to characterize the distance between the generated image and the noisy image and the preliminary denoised images, respectively. Although the MSE loss function is easy to optimize, it still has some defects. For example, it can over-penalize larger errors and may fail to capture complex characteristics in some cases; meanwhile, the mean absolute error (MAE) loss function, which uses the L1 norm to describe the distance, may allow our network to obtain better results. Thus, we will consider exploring a mixed loss function with more norms in future work.
It should be noted that although the proposed method obtains the optimal result through online training, it requires a large number of gradient updates, resulting in long inference times; thus, its execution efficiency is relatively low. In the future, we will consider adopting transfer learning [33] to first find suitable general initial parameters, enabling a faster denoising process.

Conclusions
In this paper, image denoising is modeled as image generation by exploiting DIP with multiple target images and an adaptive termination condition.The experimental results confirm that the proposed generative network exhibits better denoising performance than the original DIP.Moreover, experiments also show that our approach achieves significant performance gains over the state-of-the-art methods according to quantitative evaluation indicators and visual comparisons.The main reason for the increased performance is the integration of preliminary denoising images into the loss function.This allows the proposed generative network to ensure a reasonable convergence position for the output image in the image solution space, thus obtaining an output image as close to the ground truth as possible.Moreover, the adaptive termination condition guarantees the optimal early stopping point in the convergence process that can ensure superior denoising performance.

Figure 2. Image space visualization of image denoising using deep image prior.

Figure 3. PSNRs (dB) of DIP evaluated on the Barbara, Cameraman, and Peppers images with the noise level σ = 50.

Figure 5. Image space visualization of image denoising using the proposed method and deep image prior.

Figure 6. Performance comparison between the adaptive termination step and the optimal step.

Figure 7. A range of 10 widely used test images from the literature.

Figure 8. Some representative images in the BSD database.

Figure 9. Some representative images in the Flickr1024 database.

Figure 10. Some representative images in the Urban100 database.

Table 1. PSNR (dB) results of DIP and FFDNet on 10 commonly used images with noise levels σ = 30 and 40.

Table 2. PSNR (dB) results of two methods on 10 commonly used images with various noise levels.

Table 3. SSIM results of two methods on 10 commonly used images with various noise levels.

Table 4. FSIM results of two methods on 10 commonly used images with various noise levels.

Table 5. Number of iterations and time (s) of two methods on 10 commonly used images with σ = 30.