Joint Demosaicing and Denoising Based on a Variational Deep Image Prior Neural Network

A joint demosaicing and denoising task refers to the task of simultaneously reconstructing and denoising a color image from a patterned image obtained by a monochrome image sensor with a color filter array. Recently, inspired by the success of deep learning in many image processing tasks, there has been research on applying convolutional neural networks (CNNs) to the task of joint demosaicing and denoising. However, such CNNs require large amounts of training data and work well only on patterned images whose noise level matches the one they were trained on. In this paper, we propose a variational deep image prior network for joint demosaicing and denoising which can be trained on a single patterned image and works for patterned images with different levels of noise. We also propose a new RGB color filter array (CFA) which works better with the proposed network than the conventional Bayer CFA. Mathematical justifications of why the variational deep image prior network suits the task of joint demosaicing and denoising are also given, and experimental results verify the performance of the proposed method.


Introduction
Nowadays, many digital imaging systems use a single monochrome sensor with a color filter array (CFA) to capture a color image. Without color filters, the monochrome camera sensor would give only brightness or luminance information and could not recover the colors of the light that falls on each pixel. To obtain color information, every pixel is covered with a color filter that only lets through a certain color of light: red, green or blue. These sampled red, green and blue channels are then interpolated to fill in the missing information at the pixels for which certain colors could not be sampled. This procedure is called demosaicing, for which many methods have been proposed [1][2][3][4][5][6][7][8][9].
Current image sensors come in CCD (charge-coupled device) or CMOS (complementary metal-oxide-semiconductor) types, which are both sensitive to thermal noise. Therefore, CFA pattern images taken in low illumination suffer from Poisson-distributed noise. The noise in the CFA pattern image has a large effect on the reconstruction of the color image, as the noise in the noisy pixels spreads out to neighboring regions during the demosaicing process. The denoising is also a challenging task, since at least two-thirds of the data are missing. Complex aliasing problems can occur in the demosaicing process if poor denoising is applied beforehand. As most digital camera pipelines are sequential, quite often the demosaicing and denoising are also performed in an independent and sequential way. This leads to an irreversible error accumulation, since both demosaicing and denoising are ill-posed problems and the error occurring in one of the procedures cannot be undone in the other. It has been shown that coping with the errors of the denoising and the demosaicing simultaneously has advantages, and some joint demosaicing and denoising methods based on optimization techniques have been developed [10][11][12]. Recently, inspired by the success of the convolutional neural network (CNN) in many image processing tasks, methods which use CNNs for joint demosaicing and denoising have been proposed [13][14][15]. The work of [13] is a first attempt to apply a learning approach to demosaicing, and the work in [14] is a first attempt to use a convolutional network for joint demosaicing and denoising, but it works only on a single noise level. The work in [15] exposes a runtime parameter and trains the network so that it adapts to a wider range of noise levels, but it can still only cope with relatively low levels of noise.
The work of [16] proposes a residual learning neural network structure for the joint demosaicing and denoising problem based on an analysis of the problem using sparsity models, and that of [17] presents a method to learn demosaicing directly from mosaiced images without requiring ground truth RGB data, showing that fine-tuning the network on a specific burst of images improves its performance. Furthermore, the work in [18] proposes a demosaicing network which can be described as an iterative process, along with a principled way to design a denoising network architecture. Such CNN-based methods need a lot of data to be trained and normally perform poorly under varying noise.
In this paper, we propose a deep image prior based method which needs only the noisy image as the training data for the demosaicing. The proposed method uses as the input a sum of a constant and varying noise. We give mathematical justifications as to why the added varying input noise results in the denoising of the demosaiced image. Furthermore, we propose a color filter array which suits the proposed demosaicing method and show experimentally that the proposed method yields good joint demosaicing and denoising results.

Related Works
The following works are related to the proposed method, which can be seen as a variation of them adapted to the joint demosaicing and denoising problem.

Deep Image Prior
Recently, in [19], a deep image prior (DIP) has been proposed for image restoration. The DIP is a type of convolutional neural network which resembles an auto-encoder, but which is trained with a single image x_0; i.e., only with the image to be restored. The original DIP converts a 3D tensor z into a restored image f_θ(z), where f_θ(·) denotes the deep image prior network with parameters θ. The tensor z is filled with random noise from a uniform distribution. The DIP can be trained to inpaint an image with a loss function as follows:

L(θ) = D(f_θ(z) ⊙ m, x_0 ⊙ m),  (1)

where m ∈ {0, 1}^{H×W} is a binary mask with values of zero corresponding to the missing pixels to be inpainted and values of one corresponding to the existing pixels which have to be kept, and ⊙ is the element-wise multiplication operator. Here, D is a distance measure, which is normally set as the square of the L_2 difference; i.e., D(a, b) = ||a − b||_2^2. The minimization of L in Equation (1) with respect to the parameters θ of the DIP network has been shown to be capable of inpainting an image; i.e., the minimization results in an inpainted image f_θ(z). Inpainting and demosaicing are similar in that they both try to fill in missing pixels. The difference is that in inpainting the existing pixels have full channel information, i.e., all R, G and B values are available, whereas in demosaicing the existing pixels have only one of the R, G and B values.
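As a concrete illustration, the masked loss in Equation (1) can be written in a few lines of NumPy (a minimal sketch with D taken as the squared L_2 distance; the function name `masked_mse` is ours, not from [19]):

```python
import numpy as np

def masked_mse(f_z, x0, m):
    """DIP inpainting loss D(f_theta(z) * m, x0 * m) with D the squared
    L2 distance; m is a binary mask (1 = kept pixel, 0 = missing)."""
    diff = (f_z - x0) * m
    return float(np.sum(diff ** 2))

# Toy 2x2 single-channel example: only pixels where m == 1 contribute,
# so the large errors at the masked-out pixels are ignored.
x0  = np.array([[1.0, 2.0], [3.0, 4.0]])
m   = np.array([[1.0, 0.0], [0.0, 1.0]])
f_z = np.array([[1.5, 9.0], [9.0, 4.0]])
loss = masked_mse(f_z, x0, m)  # only (1.5 - 1.0)^2 = 0.25 contributes
```

Minimizing this quantity over θ (with f_z = f_θ(z)) leaves the network free at the masked-out pixels, which the DIP fills in with plausible content.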

Variational Auto Encoder
The variational auto-encoder [20] is a stochastic variant of the auto-encoder which consists of an encoder q_θ(z | x) and a decoder p_φ(x | z), where both the encoder and decoder are neural networks with parameters θ and φ, respectively. Given an image x as the input, the encoder q_θ(z | x) outputs the parameters of a Gaussian probability distribution. After that, samples are drawn from this distribution to get a noise input z to the decoder p_φ(x | z). The space from which z is sampled is stochastic and of lower dimension than the space of x. By drawing different samples each time, the variational auto-encoder learns to generate different images. The proposed method also samples noise from a Gaussian distribution, but, unlike in the variational auto-encoder, the noise consists of a constant noise and a varying noise and is fed directly into the network rather than serving as an intermediate input to the decoder.
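For reference, drawing the latent z from the encoder's Gaussian output is commonly done with the reparameterization trick; a minimal NumPy sketch (the name `sample_latent` and the toy latent dimension are our own illustration):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
    where mu and log_var would be the outputs of the encoder q_theta(z | x)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu, log_var = np.zeros(8), np.zeros(8)   # toy 8-dimensional latent, sigma = 1
z1 = sample_latent(mu, log_var, rng)
z2 = sample_latent(mu, log_var, rng)
# Different draws give different latents, hence different decoded images.
```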

Variational Deep Image Prior for Joint Demosaicing and Denoising
In this section, we propose a variational deep image prior (DIP) for joint demosaicing and denoising. We denote by f_θ the DIP network and use the same network structure as the DIP for inpainting defined in [19]; i.e., a U-Net type network which is downsampled five times and upsampled five times. The loss function for the variational DIP differs from that of the original DIP as follows:

L(θ) = D(f_θ(z_c + z_v) ⊙ m_r, y_k ⊙ m_r).  (2)

Here, z_c and z_v denote the constant noise and the varying noise, respectively, both drawn from a Gaussian distribution, and m_r is the binary mask corresponding to the proposed random CFA; i.e., it consists of three channels, where each channel has the value one at a random 33% of the pixel positions and the value zero at the remaining positions. Unlike in the inpainting problem, the positions of the pixels having the value one differ for each channel. The input to f_θ is the constant noise z_c alone until the (P − 1)-th training step. After the (P − 1)-th training step, the input becomes the sum of the constant noise z_c and a varying noise z_v, where z_v is newly generated and differs for each training step. The effect of adding this varying noise z_v will be explained later. The target image y_k also differs for the different iteration steps:

y_k = x_0 for k < P,  y_k = (1 − α − β) y_{k−1} + α x_0 + β f_θ(z_c) for k ≥ P.  (3)

For the steps k < P, the target y_k = x_0 is a three-channel image, wherein each channel contains the R, G or B intensity values at the 33% of pixel positions where those values are sensed by the random CFA, and zero values at the remaining positions. Furthermore, we assume that x_0 contains noise. Therefore, if y_k = x_0 for all steps k, the reconstructed image f_θ will converge to a noisy demosaiced image. To avoid this, after the step k = P, the target image y_k becomes a weighted average of the previous target image y_{k−1}, the given noisy image x_0 and f_θ(z_c), with the weights controlled by α and β.
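The input and target schedules described above can be sketched as follows (a minimal NumPy sketch; the function names and the (1 − α − β) weight kept on y_{k−1} in the target update are our reading of the schedule, not code from the paper):

```python
import numpy as np

def make_input(z_c, k, P, sigma_v, rng):
    """Network input: the constant noise z_c alone for steps k < P, then
    z_c plus a freshly drawn varying noise z_v at every step k >= P."""
    if k < P:
        return z_c
    return z_c + sigma_v * rng.standard_normal(z_c.shape)

def update_target(y_prev, x0, f_zc, alpha=0.003, beta=0.007):
    """Target update for steps k >= P: a weighted average of the previous
    target y_{k-1}, the noisy observation x0 and the denoised output
    f_theta(z_c), with the bulk of the weight kept on y_{k-1}."""
    return (1.0 - alpha - beta) * y_prev + alpha * x0 + beta * f_zc

rng = np.random.default_rng(0)
z_c = rng.standard_normal((3, 8, 8))
z0  = make_input(z_c, k=0,   P=500, sigma_v=0.1, rng=rng)  # == z_c
z_P = make_input(z_c, k=500, P=500, sigma_v=0.1, rng=rng)  # z_c + fresh z_v
```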
In the experiments, we let α = 0.003 and β = 0.007 for all images. It should be noted that f_θ(z_c) is a denoised version of y_{k−1}, and adding it denoises the target image y_k. Adding x_0 has the effect of adding back the noise, a trick which is widely used in denoising algorithms to restore the fine details of the target image. Adding the previous target image y_{k−1} keeps a record of the denoised target image. By using the improved target image y_k, the network is no longer trained toward the given noisy image x_0, which results in a better denoised output image. Figure 1 shows the workflow of the proposed method. Here, the orange bullets and the solid lines refer to the neural network used in the computation with the specific parameters θ_0, θ_1, θ_2, ..., and the dashed lines refer to the inputs to the network. Again, it can be observed that z_c remains constant, while z_v^P, z_v^{P+1}, ... change over time. Next, we give mathematical justifications of the joint demosaicing and denoising property of the variational DIP. First, the reason why the DIP performs demosaicing can be found in [21], where it is proven that an auto-encoder with an insufficient number of input channels performs an approximation of the target signal under a Hankel structured low-rank constraint:

f̂ = arg min_f ||f* − f||_2^2 subject to RANK(H_{d|p}(f)) ≤ r,  (4)

where H_{d|p}(f) denotes the Hankel matrix of f with p input channels and a convolution filter of size d, f* is the target signal, and r bounds the rank. The above approximation can easily be extended to the two-dimensional case. Thus, letting f = f_θ(z_c) and f* = y_k for the two-dimensional case, we see that we get a low-rank approximation of y_k. It has already been shown in [22] that a low-rank approximation can reconstruct missing pixels. When applied to the CFA patterned image, this results in demosaicing.
The Hankel structured low-rank approximation in Equation (4) performs a better approximation than the method in [22], since in [22] the low-rank approximation is with respect to the Fourier basis, whereas in Equation (4) it is with respect to a learned basis which best reconstructs the given image, and therefore yields a better demosaicing result. Now, to consider the effect of adding the varying noise z_v to the constant noise z_c, we consider a multi-variable vector-valued function f_θ(z) which can be expressed as a set of M multi-variable scalar-valued functions f_θ^i(z), i = 1, ..., M. The Taylor expansion of the i-th component is

f_θ^i(z + δz) = f_θ^i(z) + g^T δz + (1/2) δz^T H δz + O(||δz||^3),  (5)

where δz = [δz_1, ..., δz_N]^T is a vector, and the gradient vector g = ∇_z f_θ^i(z) and the Hessian matrix H = ∇_z^2 f_θ^i(z) are the first- and second-order derivatives of the function f_θ^i(z). We can extend Equation (5) to f_θ(z) in vector form; the first two terms can be written as

f_θ(z + δz) ≈ [f_θ^1(z) + g_1^T δz, ..., f_θ^M(z) + g_M^T δz]^T.  (6)

The second-order term would require a tensor form which is difficult to write in vector notation, and therefore we replace it with the error term O(||z_v||^2). Then, we can express f_θ(z_c + z_v) by the Taylor expansion

f_θ(z_c + z_v) = f_θ(z_c) + J_{f_θ}(z_c) z_v + O(||z_v||^2),  (7)

where J_{f_θ}(z) is the Jacobian matrix defined over the vector function f_θ(z); i.e., the matrix whose i-th row is g_i^T = ∇_z f_θ^i(z)^T. Hence, whenever J_{f_θ}(z_c)(z_v^1 − z_v^2) ≠ 0, the outputs f_θ(z_c + z_v^1) and f_θ(z_c + z_v^2) cannot be the same. This is contradictory to the loss function in Equation (2), whose minimization forces the outputs f_θ(z_c + z_v^k) to converge to the same target image for all the different inputs z_v^k, k ≥ P. It should be noted that, with very high probability, z_v^1 ≠ z_v^2 and J_{f_θ}(z_c)(z_v^1 − z_v^2) ≠ 0, since z_v^1 and z_v^2 are random noises. Therefore, the different inputs z_v^k act as regularizers which eliminate the components with small L_2 norm energy from f_θ(z_c + z_v^k). As the components with small energy are mostly the noise, this results in a noise removal from the output. Furthermore, if we take the expectation of the different outputs f_θ(z_c + z_v^k) with respect to z_v^k, we get

E_{z_v^k}[f_θ(z_c + z_v^k)] ≈ f_θ(z_c) + J_{f_θ}(z_c) E[z_v^k] = f_θ(z_c),  (8)

since z_v^k has zero mean. This shows that, if we put z_c as the input after the DIP has been trained, the output f_θ(z_c) will be approximately the average of the outputs f_θ(z_c + z_v^k) for different z_v^k. This averaging has a further denoising effect which removes the remaining noise. The fact that different inputs z_v^k result in different outputs can also be shown by the mean value theorem, according to which there always exists a point z̄ between z_c + z_v^1 and z_c + z_v^2 such that the following equality holds:

f_θ^i(z_c + z_v^1) − f_θ^i(z_c + z_v^2) = ∇_z f_θ^i(z̄)^T (z_v^1 − z_v^2).  (9)

When z_v^1 ≠ z_v^2, the right-hand side of Equation (9) is, with very high probability, non-zero, since it is very unlikely that ∇_z f_θ^i(z̄) and (z_v^1 − z_v^2) are orthogonal to each other. Therefore, with very high probability,

|f_θ^i(z_c + z_v^1) − f_θ^i(z_c + z_v^2)| = ||∇_z f_θ^i(z̄)||_2 ||z_v^1 − z_v^2||_2 |cos γ| > 0,  (10)

where γ is the angle between ∇_z f_θ^i(z̄) and z_v^1 − z_v^2.
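The first-order argument above can be checked numerically with a toy smooth stand-in for f_θ: for zero-mean varying noise of small variance, the sample average of f(z_c + z_v^k) approaches f(z_c), since the Jacobian term J_{f_θ}(z_c) z_v^k averages out and only second-order effects remain (a sketch using our own toy function, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy smooth "network": f(z) = tanh(W z) with fixed random weights.
W = rng.standard_normal((4, 6))
f = lambda z: np.tanh(W @ z)

z_c = rng.standard_normal(6)
sigma_v = 0.01                      # small varying-noise standard deviation
samples = np.stack([f(z_c + sigma_v * rng.standard_normal(6))
                    for _ in range(20000)])
avg = samples.mean(axis=0)
# The deviation from f(z_c) is dominated by O(sigma_v^2) terms.
err = np.max(np.abs(avg - f(z_c)))
```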
This means that if there is a difference between the inputs, the outputs of the DIP cannot be the same, so there will be an averaging which removes the noise. Next, we propose a CFA pattern which we think works well with the proposed demosaicing method. The proposed CFA consists of randomly distributed pixels, where the pixels corresponding to the R, G and B channels each take up 33% of the whole CFA pattern. The design of the proposed CFA is not based on a rigorous analysis, as done in classical CFA designs [23][24][25], but on simple reasoning and experimental results. We reason that if the filters are to learn to generate the R, G and B pixels without any bias toward a specific color or a specific position, the best training method would be to train the filters to generate any random color at any random position. For example, if the CFA pattern has 50% green pixels, as in the Bayer format, the convolutional filters will mainly be trained on how to generate the green pixels from the noise. When trained like this, the same convolutional filters may be less effective in generating the R or B pixels. Therefore, we reason that the amount of information should be the same for all three channels; i.e., the CFA should consist of 33% R, G and B pixels each. In the same manner, to avoid any bias toward a specific position, we propose a pattern with randomly distributed color pixels. Experimental results show that the randomly patterned CFA works better with the proposed demosaicing method than the Bayer pattern or the Fuji X-Trans pattern. Figure 2 shows the different color filter arrays (CFAs) used in the experiments, including the proposed CFA.
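A random CFA mask with the 1:1:1 channel ratio described above can be generated as follows (a minimal NumPy sketch; the function name and the exact sampling scheme are our own illustration):

```python
import numpy as np

def random_cfa_mask(h, w, rng, ratios=(1, 1, 1)):
    """Build a 3-channel binary CFA mask in which every pixel senses
    exactly one of R, G, B, chosen at random with the given channel
    ratios (1:1:1 gives roughly 33% per channel; 1:2:1 would mimic a
    Bayer-like green emphasis)."""
    p = np.asarray(ratios, dtype=float)
    p /= p.sum()
    labels = rng.choice(3, size=(h, w), p=p)   # 0 = R, 1 = G, 2 = B
    mask = np.zeros((3, h, w))
    for c in range(3):
        mask[c] = (labels == c)
    return mask

rng = np.random.default_rng(0)
m_r = random_cfa_mask(128, 128, rng)   # exactly one active channel per pixel
```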

Experimental Results
We compared the proposed method with other deep-learning-based demosaicing methods on three different datasets. We added noise generated from Gaussian distributions with different standard deviations σ_R, σ_G and σ_B for the R, G and B channels, respectively. This reflects the fact that the R, G and B filters absorb different amounts of light energy. We compared the proposed method with the ADMM method in [28] as a representative of non-deep-learning demosaicing methods; the sequential energy minimization (SEM) method [14]; the DemosaicNet [15] with two different CFAs, i.e., the DemosaicNet with the Bayer CFA (DNetB) and the DemosaicNet with the Fuji X-Trans CFA (DNetX); and the plain DIP [19]. We made quantitative comparisons with the PSNR (peak signal-to-noise ratio), the CPSNR (color peak signal-to-noise ratio), the SSIM (structural similarity index), the FSIM (feature similarity index) and the FSIMc (feature similarity index chrominance) measures and summarized the results in Tables 2-4. The values in the tables are the average values for the Kodak images and the McMaster (also known as the IMAX) images, respectively. Furthermore, the values corresponding to the red, green and blue channels are the average values for those particular channels. The parameters α and β in Equation (3) were set to 0.003 and 0.007, respectively, throughout all the experiments. Table 1 shows the results of a performance comparison when the proposed method was applied to the different CFA patterns, i.e., the Bayer [26], the Fuji X-Trans [27], the Lukac [23], the Hirakawa [25] and the proposed CFA patterns, with different noise levels. For the Hirakawa CFA we used the pattern-A and pattern-B patterns, which have RGB ratios of 1:1:1 and 1:2:1, respectively. Likewise, for the proposed random patterned CFA, we used the Random1 pattern (RGB ratio of 1:1:1) and the Random2 pattern (RGB ratio of 1:2:1).
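The per-channel noise model and the CPSNR measure used in the comparison can be sketched as follows (NumPy; `add_channel_noise` and `cpsnr` are our own helper names, with CPSNR computed as a single PSNR over the MSE pooled across all three channels):

```python
import numpy as np

def add_channel_noise(img, sigmas, rng):
    """Add zero-mean Gaussian noise with a separate standard deviation
    (sigma_R, sigma_G, sigma_B) per channel; img is (3, H, W) in [0, 255]."""
    noise = np.stack([s * rng.standard_normal(img.shape[1:]) for s in sigmas])
    return np.clip(img + noise, 0.0, 255.0)

def cpsnr(ref, test, peak=255.0):
    """Color PSNR: one PSNR value over the MSE pooled across all channels."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
clean = np.full((3, 64, 64), 100.0)
noisy = add_channel_noise(clean, (19.67, 12.48, 13.67), rng)
```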
The CPSNR, SSIM and FSIMc values are the average values over the images in the Kodak image dataset. When the noise is low, the Hirakawa pattern-A CFA shows the largest CPSNR and SSIM values. However, when the noise increases, the proposed random pattern shows larger PSNR, SSIM and FSIM values. This may be due to the fact that when the noise increases, the tasks of demosaicing and denoising become similar, i.e., the task of removing the random noise and the task of filling in random colors become similar, so finding the parameters which perform the demosaicing and denoising tasks simultaneously becomes easier with the proposed random CFA than with other CFAs. It can be seen that the proposed CFA pattern mostly shows the largest value, especially when the noise is large. Figure 3 shows the results of the different demosaicing methods on the first dataset with color noise of standard deviations σ_R = 9.83, σ_G = 6.24 and σ_B = 6.84 for the R, G and B channels, respectively. The parameter P in Equation (3) was set to 1200 for the experiments with this color noise, and to 500 for all the other experiments. When the noise is light, the SEM, the DNetB and the DNetX also produce good joint demosaicing and denoising results. The SEM shows the best quantitative results in the PSNR values for the Kodak dataset, as can be seen in Table 2. However, the proposed method achieves the best results in the SSIM and the FSIM measures for all datasets, and the best PSNR values for the McMaster dataset. Figures 4 and 5 and Tables 2-4 show the results on the dataset with color noise of standard deviations σ_R = 19.67, σ_G = 12.48 and σ_B = 13.67. The ADMM method achieved the highest CPSNR value for the McMaster dataset; this is due to the fact that the ADMM method incorporates the powerful BM3D denoising [29] and the total variation minimization into a single framework, which results in a large denoising power.
Therefore, we also experimented with a combination of the proposed method and the BM3D. In this case, the proposed training method can focus more on finding the parameters for the demosaicing task, leaving a large part of the denoising to the BM3D. The results of the SEM, DNetB and DNetX are those without using the external BM3D denoising method. The proposed + BM3D outperforms the other methods on the Kodak dataset with respect to the PSNR and SSIM measures. As the noise increases, the ADMM, SEM, DNetB and DNetX produce severe color artifacts, as can be observed in the fence regions of the enlarged images in Figure 5b-e. However, the DIP and the proposed method overcome such color artifacts due to their inherent rank minimization property. The figures are selected according to the best PSNR values, which is why the figures for the DIP are a little blurrier than those for the proposed method. The DIP reconstructs the noise when reconstructing the high frequency components, while the proposed method does not. Finally, Figures 6 and 7 and Tables 2-4 show the results on the dataset with color noise of standard deviations σ_R = 26.22, σ_G = 16.64 and σ_B = 18.23. For this dataset, the non-deep-learning ADMM method outperforms all the deep-learning-based methods, including the proposed method, in the quantitative measures. However, the proposed method outperforms all the other deep-learning-based methods. Furthermore, while the ADMM shows large aliasing artifacts, as can be seen in Figure 8b, the proposed method is free from such artifacts. Again, it should be taken into account that this result comes from training on the noisy CFA pattern image only. Furthermore, we fixed all the hyper-parameters of the network for all the different noise levels, which shows that the proposed method is not sensitive to the noise levels.
[Captions of Figures 3-8: results of (b) ADMM [28], (c) SEM [14], (d) DNetB [15], (e) DNetX [15], (f) DIP [19], (g) proposed and (h) proposed + BM3D.] Figure 9 compares the convergence of the PSNR values over the training iterations of the plain DIP and the proposed variational DIP. As can be seen, the plain DIP converges to a lower PSNR value as the training iterates, which is due to the fact that the noise in the target image is reconstructed. In comparison, with the proposed variational DIP the noise is not reconstructed, for the reasons explained in the previous section. Therefore, the final output converges to a jointly demosaiced and denoised image, which results in convergence to a higher PSNR value. Table 5 shows the computational time costs of the different methods. All the methods were run on a PC with an Intel Core i9-9900K processor, an NVIDIA GeForce RTX 2080 Ti and 32 GB RAM. The proposed method is the slowest of all the methods, which is due to the fact that it requires a training run for each incoming CFA image. The computational time could be reduced if the proposed method were combined with a meta-learning approach; one possible method would be to initialize the neural network with good initial parameters obtained by pre-training on many images. This should be one of the major topics of further studies. Figure 9. Comparison of the convergence between the DIP and the proposed variational DIP.

Conclusions
In this paper, we proposed a variational deep image prior for the joint demosaicing and denoising of images captured with the proposed random color filter array. We mathematically explained why the variational model achieves joint demosaicing and denoising, and experimentally verified the performance of the proposed method. The experimental results showed that the proposed method is superior to other deep-learning-based methods, including the deep image prior network. How to apply the proposed method to the demosaicing of color filter arrays including channels other than the three primary color channels could be a topic of further studies.

Conflicts of Interest:
The authors declare no conflict of interest.