Deep Image Prior for Super Resolution of Noisy Image

Abstract: Single image super-resolution aims to reconstruct a high-resolution image from a low-resolution image. Recently, it has been shown that with a deep image prior (DIP), a single neural network is sufficient to capture low-level image statistics from only a single image, without data-driven training, so that it can be used for various image restoration problems. However, super-resolution with DIP is difficult when the target image is noisy: the super-resolved image becomes noisy because the reconstruction loss of DIP does not account for the noise in the target image. Furthermore, when the target image contains noise, the optimization process of DIP becomes unstable and sensitive to noise. In this paper, we propose a noise-robust and stable framework based on DIP. To this end, we propose a noise-estimation method that uses a generative adversarial network (GAN) and a self-supervision loss (SSL). We show that, with the proposed framework, the generator of DIP can learn the distribution of noise in the target image. Moreover, we argue that the optimization process of DIP is stabilized when the proposed self-supervision loss is incorporated. Experiments show that the proposed method quantitatively and qualitatively outperforms existing single image super-resolution methods on noisy images.


Introduction
Single image super-resolution (SISR) aims to generate a high-resolution (HR) image from a low-resolution (LR) image and has become an important task in computer vision. Unlike most deep learning models, which are trained on large-scale datasets, the deep image prior (DIP) recently proposed by Ulyanov et al. [1] utilizes a deep neural network (DNN) as a strong prior for image restoration using only a single image. The results of DIP show that a DNN is useful for capturing meaningful low-level image statistics. Following this success, DIP [1] has been utilized for a variety of purposes. DIP is particularly valuable in applications where collecting large-scale datasets is difficult and expensive, such as hyperspectral image processing [2,3]. Furthermore, DIP can be used in optimization methods for solving inverse problems such as super-resolution, deblurring and denoising [4,5].
In particular, it was demonstrated that the super-resolution (SR) problem for a given target image x_0 can be solved using DIP by minimizing the following reconstruction loss:

E(x; x_0) = ||DS(x) − x_0||^2,    (1)

where DS(·) is a downsampling operation and x is the restored HR image. The downsampling operation brings the spatial resolution of x down to that of x_0. In practice, images taken by cameras in mobile embedded systems are prone to low resolution and noise corruption due to the small sizes of the camera sensors and apertures [6]. In such situations, the performance of DIP on the SR task (DIP-SR) [1] is significantly degraded (see Figure 1a). The degradation is attributable to two reasons. First, the reconstruction loss (Equation (1)) of DIP-SR does not consider the noise in x_0. The loss only minimizes the pixel-wise difference between DS(x) and x_0; hence, DS(x) tends to be noisy. As DS(x) depends only on x, the fact that DS(x) contains noise implies that x also contains noise. Therefore, DIP-SR requires an additional constraint to handle noise effectively. Second, the DIP optimization process is unstable and sensitive to noise. It has been shown that, for a noisy input image, DIP requires early-stopping during optimization to avoid overfitting the generated image to the noise, so that a clean image can be obtained. However, without a ground-truth image, it cannot be determined whether the result at the early-stopping point is the optimal solution. Therefore, DIP needs a way to achieve noiseless results through a reliable optimization process without early-stopping.
Herein, we propose a novel DIP-based SR framework that restores a clean HR image from a noisy LR image. As mentioned earlier, one of the main drawbacks of DIP-SR [1] is that it does not consider noise when minimizing the reconstruction loss in the LR space. Since the noisy LR image contains both signal and noise, the signal must be learned by separating the noise from the LR image. However, separating noise from an image is very challenging in the absence of ground-truth information. To overcome this, we propose a framework that learns the distribution of the noise even when its ground truth is unknown. As shown by Chen et al. [7], generative adversarial networks (GANs) [8] have the capacity to learn complex noise distributions. Inspired by this finding, we employ the GAN framework to estimate noise. Specifically, our framework consists of a generator and a discriminator. The generator aims to reconstruct a clean HR image from a noisy LR image. If the generator restores a clean HR output, the downsampled result is also a noise-free LR image; thus, the difference between the downsampled result and the noisy target image must follow the distribution of the real noise. Based on this, we train our discriminator to determine whether the distribution of the extracted noise follows the real noise distribution. We sample the real noise from a Gaussian distribution because it is one of the most common noise models in the image restoration field [9]. This adversarial framework allows our generator to learn how to reconstruct a noiseless HR image. In contrast to [7], which utilizes large-scale datasets, our framework is trained to extract the noise from only a single image.

In addition, we propose a self-supervision loss that increases the stability of the optimization process and removes the need for early-stopping. In general, signals tend to have high self-similarity (low entropy and low patch diversity), whereas noise has low self-similarity (high entropy and high patch diversity) [1,10]. Ulyanov et al. [1] showed that the parameters of convolutional neural networks (CNNs) have high impedance to noise and low impedance to signals. Owing to this characteristic, when the target image is noisy, the CNN learns the signal in the early stages of optimization, before overfitting to the noise occurs. In other words, the results of the early stages of optimization are noiseless and meaningful. We therefore assume that the result of an early stage of optimization can serve as an effective regularizer for noise-free signal reconstruction. Based on this assumption, we propose a self-supervision loss that utilizes the result of the previous iteration step during optimization. By comparing the output image of the current step with that of the previous step, the reconstructed image retains the learned signal without following the noise in the target image. The proposed loss thus prevents the reconstructed image from becoming noisy and yields a stable optimization process without the need for early-stopping.
Extensive experiments on the SISR task in various scenarios show that our method achieves the best quantitative and qualitative results in comparison to existing SISR methods. Figure 1 exemplifies that our method generates a realistic and clean HR image, whereas DIP-SR [1] suffers from noise.
Our main contributions can be summarized as follows:
• We present a GAN [8] framework to estimate the noise in a target image. Given only a noisy LR image without the ground truth, our generator reconstructs a clean HR image. The noise is estimated by learning the noise distribution in the LR image.
• We introduce the self-supervision loss (SSL), a novel approach for resolving the dependency on early-stopping and the instability of the DIP [1] optimization process.
• We achieve competitive results in various experiments on the Set5 [11] and Set14 [12] datasets. The proposed method outperforms existing SISR methods.

Related Works
Learning-based approaches using convolutional neural networks (CNNs) have recently achieved excellent performance in image SR. Most CNN-based SR models are trained in a supervised manner on large-scale datasets that contain LR and HR image pairs; thus, these models learn a well-generalized distribution of HR images from the training data. The pioneering SRCNN [13] learns a mapping from an interpolated LR image to an HR image. However, directly mapping the input image to the target image is difficult. To alleviate this difficulty, VDSR [14] learns only the residual between the input and target images, in a process called global residual learning. Since global residual learning greatly reduces the learning difficulty and model complexity [15], it has been used in many SR models, including [16-22]. Ledig et al. [16] proposed SRResNet, which combines the ResNet [23] architecture with global residual learning; the authors also applied adversarial training [8] to image SR in order to generate realistic images. EDSR [17] employs a multi-scale architecture with global residual learning and can restore HR images at various upscaling factors within a single model. Guo et al. [18] proposed a wavelet prediction network for SR using residuals. SRDenseNet [24], RDN [20], ESRGAN [19] and DRLN [22] combine DenseNet [25] blocks with global residual learning in order to capture rich features. Benefiting from global residual learning, most existing SR methods are trained to enhance high-frequency information; due to this characteristic, they also amplify the noise in LR images. In addition, they do not leverage the information specific to a single image as a prior, because they are trained to model the distribution of large external datasets. By contrast, we propose a noise-robust image SR method that focuses on the internal information of a given single image.
Instead of using large-scale training datasets, the deep image prior (DIP) [1] framework, which requires only a single observation for image SR, was recently proposed. The authors found that convolutional layers can serve as a prior for image restoration tasks such as SR, denoising and inpainting. DIP optimizes the CNN in a self-supervised training scheme without the use of a ground-truth image. By minimizing the pixel-wise difference between the reconstructed image and the target image, DIP generates a natural image with fine details. However, DIP-based SR often fails when the target LR image contains noise. Moreover, the performance of DIP relies heavily on early-stopping. In contrast to DIP, our method can restore a clean HR image from a noisy LR image without early-stopping.

Proposed Method
Our goal is to restore a clean HR image from a noisy LR image based on the DIP framework. In this section, we first introduce DIP for the SR task (DIP-SR) [1], which is closely related to our work, and analyze why DIP-SR fails to restore a high-quality image from a given noisy LR image. We then describe the proposed noise-estimation method, which effectively reduces noise while performing image SR. We subsequently describe our novel loss function, called the self-supervision loss (SSL), which provides a stable optimization process for our network. Finally, we introduce the total loss.

Deep Image Prior (DIP)
Given an input LR image I^LR ∈ R^{H×W×C} and a scaling factor s, DIP-SR [1] generates an HR image I^HR ∈ R^{sH×sW×C}. Using a generator G, a code tensor z ∈ R^{sH×sW×C} is mapped to a super-resolved image Î^HR ∈ R^{sH×sW×C} as Î^HR = G(z). The reconstruction loss measuring the error between the downsampled generated image and I^LR is defined as follows:

L_rec = ||DS(Î^HR) − I^LR||^2,    (2)

where DS(·) is a downsampler with scaling factor s. Since DIP uses common downsampling operators such as Lanczos, the downsampler is not trainable. However, when DIP-SR attempts to super-resolve a noisy LR image, DS(Î^HR) is likely to be noisy because the reconstruction loss (Equation (2)) performs a pixel-wise comparison between DS(Î^HR) and I^LR. Since DS is not trainable, DS(Î^HR) depends only on Î^HR; thus, the fact that DS(Î^HR) contains noise signifies that Î^HR also contains noise. In addition, we observe that there exists a point at which the quality of the reconstructed image deteriorates as the optimization proceeds further. From that point on, the output overfits to the noisy input image and the performance of DIP deteriorates noticeably. This observation emphasizes that early-stopping is required in DIP to obtain a reasonable result. However, it is difficult to determine when to stop the optimization process in the absence of a clean image.
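For concreteness, the following PyTorch sketch shows one optimization step of this objective. It is a minimal illustration under stated assumptions, not the authors' implementation: bilinear interpolation stands in for the Lanczos downsampler (which F.interpolate does not provide), and `G`, `z` and `I_LR` denote a generator network, a code tensor and the LR target, respectively.

```python
import torch
import torch.nn.functional as F

def downsample(x, s):
    # DS(.): fixed downsampler with scaling factor s, no learnable parameters.
    # Bilinear interpolation is a stand-in for the Lanczos operator.
    return F.interpolate(x, scale_factor=1.0 / s, mode='bilinear',
                         align_corners=False)

def dip_sr_step(G, z, I_LR, s, optimizer):
    optimizer.zero_grad()
    I_HR_hat = G(z)                                   # super-resolved image G(z)
    loss = F.mse_loss(downsample(I_HR_hat, s), I_LR)  # Equation (2)
    loss.backward()
    optimizer.step()
    return I_HR_hat.detach(), loss.item()
```

As the sketch makes plain, the loss touches Î^HR only through the fixed downsampler, which is why noise in the LR target propagates directly into the super-resolved output.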
To this end, both a solution to handle noise in the target image and a method to avoid early-stopping are required for DIP [1]. In order to address these problems, we first propose a noise estimation method to help our generator estimate the noise in the target image using the GAN [8] framework in Section 3.2. We also propose a self-supervision loss, which provides a stable optimization process and is described in detail in Section 3.3. Finally, the total loss and the algorithm of our framework are introduced in Section 3.4.

Noise Estimation Using GAN
In general, a noisy image I_N can be modeled as the summation of a clean image I_C and noise n as follows:

I_N = I_C + n.    (3)
The noisy LR image can be handled more easily if the noise can be estimated and extracted. We therefore propose a GAN-based [8] noise estimation method to separate the noise from the reconstructed image.
As illustrated in Figure 2, our framework consists of a generator G and a discriminator D. Given a noisy LR image I^LR_N, our generator G maps a code tensor z to the reconstructed image Î^HR_C:

Î^HR_C = G(z).    (4)
For comparison with the target LR image, Î^HR_C is downsampled to Î^LR_C through the downsampler DS(·). Our discriminator D is trained to produce a probability y = D(n_in) predicting whether an input noise sample n_in is real or fake; y becomes y_real when n_in is real and y_fake when n_in is fake. While the real noise sample n is generated synthetically, the fake noise sample is extracted as

n̂ = I^LR_N − Î^LR_C.    (5)

The extracted noise n̂ is made to follow the distribution of the real noise using the GAN framework. Adopting the WGAN loss [26], which stabilizes the optimization, the min-max game between the generator G and the discriminator D is defined as

min_G max_D E[D(n)] − E[D(n̂)],    (6)

where E[·] represents the expectation. Finally, the adversarial loss for the generator is defined as

L_adv = −E[D(n̂)].    (7)
This adversarial loss L_adv penalizes the generator G according to the distance between the distribution of n and that of the extracted sample n̂.
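Below is a hedged PyTorch sketch of this noise-estimation step. It assumes weight clipping for the Lipschitz constraint, as in the original WGAN [26]; the clipping threshold, the form of L_adv as −E[D(n̂)] (the standard WGAN generator term), and the helper names are assumptions rather than the authors' implementation.

```python
import torch

def discriminator_step(D, I_LR_N, I_LR_C_hat, sigma, opt_D, clip=0.01):
    # sigma is the known noise level, on the same scale as the image values.
    n_real = sigma * torch.randn_like(I_LR_N)   # synthetic real noise sample n
    n_fake = I_LR_N - I_LR_C_hat.detach()       # extracted noise (Equation (5))

    # Critic update: maximize E[D(n)] - E[D(n_hat)] (Equation (6)).
    opt_D.zero_grad()
    loss_D = D(n_fake).mean() - D(n_real).mean()
    loss_D.backward()
    opt_D.step()
    for p in D.parameters():                    # Lipschitz constraint via clipping
        p.data.clamp_(-clip, clip)
    return loss_D.item()

def adversarial_loss(D, I_LR_N, I_LR_C_hat):
    # Generator term L_adv (Equation (7)); gradients flow into the generator
    # through I_LR_C_hat, pushing the extracted noise toward the real one.
    n_fake = I_LR_N - I_LR_C_hat
    return -D(n_fake).mean()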

Self-Supervision Loss (SSL)
In general, noise has low self-similarity and high entropy because it contains no structure. Unlike noise, signals have high self-similarity and low entropy [27]. In a previous study on DIP [1], it was found that the parameters of CNNs have high impedance to noise and low impedance to signals. Due to this, when the target image is noisy, CNNs learn the signals in the early stage of the DIP optimization process before learning the noise components.
Inspired by this property of CNNs, we present a novel loss function called the self-supervision loss. The proposed framework is optimized over several iterations, and at each optimization step the network outputs a reconstructed image. We hypothesize that the result of an earlier step can serve as a constraint for reconstructing a noiseless HR image at the following step. Accordingly, our self-supervision loss utilizes the output of the previous iteration step during training: SSL compares the output image of the current step with that of the previous step. In this way, the reconstructed image maintains the learned signal without following the noise in the target image. In other words, by constraining the output image to preserve the learned signal, we avoid early-stopping and any dependency on the number of steps. The SSL at the ith step is defined as follows:

L_ssl = ||Î^HR_{C,i} − Î^HR_{C,i−1}||^2 + ||Î^LR_{C,i} − Î^LR_{C,i−1}||^2,    (8)

where Î^HR_{C,i} and Î^LR_{C,i} represent Î^HR_C and Î^LR_C at the ith optimization step, respectively.
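A minimal sketch of this loss under the definition above follows. Whether gradients are blocked through the previous-step outputs is not stated in the paper, so detaching them (treating them as fixed self-supervision targets) is an assumption.

```python
import torch.nn.functional as F

def ssl_loss(I_HR_curr, I_LR_curr, I_HR_prev, I_LR_prev):
    # Equation (8): penalize deviation of the current outputs from the
    # previous-step outputs, in both HR and LR space.
    return (F.mse_loss(I_HR_curr, I_HR_prev.detach())
            + F.mse_loss(I_LR_curr, I_LR_prev.detach()))
```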

Total Loss Functions
Our total loss function L_total consists of the reconstruction loss L_rec (Equation (2)), the adversarial loss L_adv (Equation (7)) and the self-supervision loss L_ssl (Equation (8)):

L_total = L_rec + λ_adv L_adv + λ_ssl L_ssl,    (9)

where λ_adv and λ_ssl are hyperparameters that are empirically set to 1.2 and 1, respectively. The proposed procedure is summarized in Algorithm 1. z and n are sampled from a uniform distribution and a Gaussian distribution, respectively. We solve the SR problem for a noisy image in the case where the noise distribution and the noise level σ are known. The code tensor z is perturbed with additional noise before it enters the network. At each iteration, we first train the discriminator and then train the generator using Equation (9). Note that randomly-initialized parameters are used in the downsampler DS(·).

[Algorithm 1 (fragment): at each iteration i, the discriminator is updated first and then the parameters of the generator G_i are updated; after the final iteration T, the clean HR image is obtained as I^HR_C ← G_T(z) and returned.]
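Putting the pieces together, the following is a hedged end-to-end sketch of the optimization loop in Algorithm 1, reusing the helper functions sketched above (downsample, discriminator_step, adversarial_loss, ssl_loss). The magnitude of the code perturbation (0.03) is an assumption; the paper only states that z is perturbed with additional noise.

```python
import torch
import torch.nn.functional as F

def optimize(G, D, I_LR_N, s, sigma, T=2000, lam_adv=1.2, lam_ssl=1.0):
    _, C, H, W = I_LR_N.shape
    z = torch.rand(1, C, s * H, s * W)           # code tensor z ~ U
    opt_G = torch.optim.Adam(G.parameters(), lr=1e-2)
    opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
    I_HR_prev = I_LR_prev = None

    for i in range(T):
        z_in = z + 0.03 * torch.randn_like(z)    # perturb z before the network

        # 1) Train the discriminator first (Equation (6)).
        with torch.no_grad():
            I_LR_hat = downsample(G(z_in), s)
        discriminator_step(D, I_LR_N, I_LR_hat, sigma, opt_D)

        # 2) Train the generator with the total loss (Equation (9)).
        opt_G.zero_grad()
        I_HR_hat = G(z_in)
        I_LR_hat = downsample(I_HR_hat, s)
        loss = F.mse_loss(I_LR_hat, I_LR_N)                        # L_rec
        loss = loss + lam_adv * adversarial_loss(D, I_LR_N, I_LR_hat)
        if I_HR_prev is not None:                                  # L_ssl
            loss = loss + lam_ssl * ssl_loss(I_HR_hat, I_LR_hat,
                                             I_HR_prev, I_LR_prev)
        loss.backward()
        opt_G.step()
        I_HR_prev, I_LR_prev = I_HR_hat.detach(), I_LR_hat.detach()

    with torch.no_grad():
        return G(z)                              # clean HR image I^HR_C
```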

Dataset
We evaluate our method on the common SR test sets Set5 [11] and Set14 [12]. Unlike existing SR methods, our approach considers degradation of the given LR image by noise. Therefore, we prepare noisy LR images by downsampling the HR images by a factor s and then adding Gaussian noise of level σ. In order to evaluate SR performance under various degradations, we use multiple upsampling factors (×2 and ×4) and noise levels (σ = 15 and σ = 25).
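As a concrete sketch, this degradation pipeline can be written as follows. The choice of bicubic interpolation, the [0, 1] value range and the final clamping are assumptions; the paper specifies only the downsampling factor and the Gaussian noise level.

```python
import torch
import torch.nn.functional as F

def make_noisy_lr(I_HR, s=4, sigma=25):
    # Downsample the HR image by factor s, then add Gaussian noise of
    # level sigma (sigma given on the 0-255 scale, images in [0, 1]).
    I_LR = F.interpolate(I_HR, scale_factor=1.0 / s, mode='bicubic',
                         align_corners=False)
    I_LR_N = I_LR + (sigma / 255.0) * torch.randn_like(I_LR)
    return I_LR_N.clamp(0.0, 1.0)
```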

Implementation Details
Our framework is implemented in PyTorch [28]. The proposed generator is similar to U-Net [29], and the discriminator is a Markovian discriminator [30] with a patch size of 11 × 11. Both networks are trained with the Adam optimizer [31], using learning rates of 1 × 10^{-2} for the generator and 1 × 10^{-4} for the discriminator. We optimize the generator and discriminator with our objectives for 2000 iterations, in the same manner as DIP-SR [1]. All experiments are run on a single NVIDIA TITAN Xp GPU, processing one image at a time.
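For illustration, one way to realize a Markovian (patch) discriminator with an 11 × 11 receptive field is to stack five 3 × 3 stride-1 convolutions, each of which grows the receptive field by 2 pixels (1 → 3 → 5 → 7 → 9 → 11). This is a sketch only: the channel width and activation below are assumptions, as the paper does not specify the discriminator's internals.

```python
import torch.nn as nn

def patch_discriminator(in_ch=3, width=64):
    # Five 3x3 stride-1 convolutions -> 11x11 receptive field per output pixel.
    layers, ch = [], in_ch
    for _ in range(4):
        layers += [nn.Conv2d(ch, width, kernel_size=3, stride=1, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch = width
    layers += [nn.Conv2d(ch, 1, kernel_size=3, stride=1, padding=1)]
    return nn.Sequential(*layers)                # outputs a per-patch score map
```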

Comparison with Existing Methods
We compare our approach with various SR methods: DIP [1] and data-driven deep learning methods (DRLN [22], HAN [32] and SAN [33]). Two different sets of experiments with DIP were performed, because DIP solves the denoising problem and the SR problem individually with different architectures and optimization settings. The first applies DIP to the SR task with noisy LR images; this is denoted DIP-SR. The second sequentially applies two DIP networks, one for noise removal and one for SR; this is denoted DIP-Seq. For DIP-Seq, we optimize DIP for noise reduction over 1800 iterations and for the SR task over 2000 iterations. All experiments are performed with the authors' official code.

Quantitative Comparison
We evaluate the performance of our method using PSNR, SSIM [34] and FSIM [35], which are widely used in image quality assessment. Table 1 shows quantitative comparisons on Set5 [11] and Set14 [12] at scaling factors of ×2 and ×4 and noise levels σ = 15 and σ = 25. Our method significantly outperforms the existing methods and achieves the best performance at all scaling factors and noise levels, except at s = 2 and σ = 15 on the Set14 dataset. The results for the SR methods (DIP-SR [1], DRLN [22], HAN [32] and SAN [33]) show that the existing approaches are vulnerable to noise: even at the lower noise level (σ = 15), their performance is significantly worse than that of our method (see Table 1). When DIP is applied sequentially for noise removal and SR, the performance improves over DIP-SR (compare DIP-SR and DIP-Seq in Table 1), but it is still not as good as that of our method. We attribute the superior performance of our method to our GAN [8] framework, in which the discriminator estimates the noise and encourages the generator to reconstruct a clean output image. In addition, the results show that the proposed self-supervision loss L_ssl (Equation (8)) permits a more reliable optimization of the DIP algorithm for image restoration.
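For reference, PSNR and SSIM can be computed with scikit-image as sketched below; FSIM is not available in scikit-image and requires a separate implementation. The [0, 1] float format and H × W × C layout are assumptions, and `channel_axis` requires scikit-image >= 0.19 (older versions use `multichannel=True`).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(hr, sr):
    # hr, sr: float arrays in [0, 1], shape H x W x C.
    psnr = peak_signal_noise_ratio(hr, sr, data_range=1.0)
    ssim = structural_similarity(hr, sr, data_range=1.0, channel_axis=-1)
    return psnr, ssim
```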

Qualitative Comparison
Visual comparisons are shown in Figures 3-6. The results of bicubic upsampling suffer significantly from noise, because the upsampled images are generated from the pixel values of the given image, which contains the unexpected noise. The results of DIP-SR, DRLN, HAN and SAN clearly show a side effect of the existing SR algorithms: they amplify the noise when the input images are contaminated (see the second, fourth, fifth and sixth columns in Figures 3-6). By contrast, the proposed method restores clean SR images that are close to the ground truth. As shown in the third columns of Figures 3-6, the results of DIP-Seq are less noisy than those of the other existing SR methods; however, noise artifacts still remain prominent in the resulting images. This indicates that sequential optimization with two DIP networks, one for denoising and one for SR, is insufficient for handling both tasks. In contrast to the existing methods, ours effectively removes the noise during the SR process and achieves a clean HR image.

Table 1. Quantitative comparisons on Set5 [11] and Set14 [12]. The best results are highlighted in bold.

Runtime Comparison
As shown in Table 2, we compare the runtime of our method with those of existing methods. The reported runtime is the average over 10 images of size 256 × 256 × 3 on a PC with a single NVIDIA TITAN Xp GPU. Although the data-driven deep learning methods show fast inference times, they require training time on a large dataset; since our method optimizes the network only for a given image, it needs no additional training time. The runtime of our method is similar to that of DIP-SR, and it is faster than that of DIP-Seq because DIP-Seq performs noise removal and SR sequentially, while our method generates noise-free SR images in a single optimization.

Ablation Study
We propose a noise-estimation framework using a GAN [8] to estimate the noise, and an SSL to provide stable optimization for DIP [1]. To demonstrate the effectiveness of each component, we conduct ablation studies by gradually adding the noise-estimation method (Equation (7)) and the SSL (Equation (8)) on top of the reconstruction loss (Equation (2)). For the ablation studies, the scale factor and noise level are set to 2 and 25, respectively.
As depicted in Figure 7, when only the reconstruction loss is used, the optimization overfits within approximately 500 iterations, resulting in poor performance. When the noise-estimation method is applied, the optimization proceeds stably without overfitting in the early stage; furthermore, after 800 iterations, the framework with the noise-estimation method outperforms the reconstruction-loss-only baseline. The final proposed model, which includes both the noise-estimation method and the SSL, not only shows the most stable optimization process but also achieves the best performance at the 2000th iteration. Although the number of iterations is set to 2000 in DIP [1], we ran each algorithm for 3500 iterations to demonstrate independence from early-stopping; even when the number of iterations exceeds 2000, the performance of our method continues to improve steadily.

The results in Figure 8 clearly show the qualitative effectiveness of the proposed method. In the early stages, the reconstruction-loss-only baseline converges more quickly than our method (see the results at 100 and 600 iterations in Figure 8). However, as the iterations progress, its generator reconstructs more unwanted noise elements, producing unpleasant images that suffer from the noise components of the target image. By contrast, when only the noise-estimation method is applied, the results are restored reliably as the iterations proceed, although noise elements appear at approximately 1300 iterations. Our final model, which adopts both the noise-estimation method and the SSL, restores the details well without generating noise elements up to 2000 iterations.

The PSNR, SSIM and FSIM results are shown in Table 3. Adding the noise-estimation method improves performance substantially over the reconstruction-loss-only baseline, with average gains of 7.5 dB in PSNR, 0.3094 in SSIM and 0.2217 in FSIM. Additionally adopting the SSL yields higher-quality HR images, with further average gains of 0.87 dB in PSNR, 0.0136 in SSIM and 0.0111 in FSIM.

Conclusions
In this paper, we propose a DIP-based noise-robust SR method. Our framework combines a noise-estimation method and a self-supervision loss with DIP-SR. The proposed noise-estimation method estimates the noise in the given LR target image, and the self-supervision loss increases the stability of the optimization process. Extensive experiments show that our method achieves outstanding performance both quantitatively and qualitatively.