Disentangling Noise from Images: A Flow-Based Image Denoising Neural Network

The prevalent convolutional neural network (CNN)-based image denoising methods extract features of images to restore the clean ground truth, achieving high denoising accuracy. However, these methods may ignore the underlying distribution of clean images, inducing distortions or artifacts in denoising results. This paper proposes a new perspective to treat image denoising as a distribution learning and disentangling task. Since the noisy image distribution can be viewed as a joint distribution of clean images and noise, the denoised images can be obtained via manipulating the latent representations to the clean counterpart. This paper also provides a distribution-learning-based denoising framework. Following this framework, we present an invertible denoising network, FDN, without any assumptions on either clean or noise distributions, as well as a distribution disentanglement method. FDN learns the distribution of noisy images, which is different from the previous CNN-based discriminative mapping. Experimental results demonstrate FDN’s capacity to remove synthetic additive white Gaussian noise (AWGN) on both category-specific and remote sensing images. Furthermore, the performance of FDN surpasses that of previously published methods in real image denoising with fewer parameters and faster speed.


Introduction
Despite decades of research, image denoising [1,2] is still an ongoing low-level image processing task in computer vision. The long-standing interest in image denoising has provided roots for a vast array of downstream applications, such as segmentation [3] and deblurring [4]. Nearly all images need to be denoised before further processing, especially those obtained in dark environments.
The purpose of image denoising is to reconstruct clean images from corrupted noisy observations. Traditional denoising methods rely on certain assumptions on noise distributions [5,6] or priors on ground truth clean images [7,8] to build optimization models. However, these assumptions and priors may differ from the real case, which can compromise the denoising accuracy. Deep learning denoising approaches proposed in recent years use convolutional neural networks (CNNs) to learn the models from a large number of noise-free and noisy image pairs and have achieved superior denoising performance [9,10]. These methods employ CNNs to learn the mapping functions between noisy images and clean ones. However, they usually overemphasize the pixel similarity between the denoised image and the clean ground truth while omitting the underlying distribution of clean images. Thus, although some deep methods can obtain high quantitative results, over-smoothed regions and artifacts are often brought into the restored images, resulting in degraded visual results.
Figure 1. Denoising results on a bird image from [16] with σ = 50 AWGN. Our method restores finer feathers, clearer eyes, and a sharper beak. Zooming in on a high-resolution display will allow better observation of the differences.
For the second challenge, we take advantage of the characteristics of the latent variables, which follow the distribution N(0, I); thus, the dimensions are independent of each other. We assume that these dimensions can be disentangled into two groups, i.e., some of the dimensions encode clean images while the others correspond to noise. If we set the noise dimensions to constants, such as 0, the joint distribution of the clean representations and the new noise codes will be the same as the marginal distribution of the clean images. The denoised images can be obtained by passing the new latent representations through the reverse pass of the network.
The contributions of our work are listed below.
• We rethink the image denoising task and present a distribution-learning-based denoising framework.
• We propose a Flow-Based Image Denoising Neural Network (FDN). Unlike the widely used feature-learning-based CNNs in this area, FDN learns the distribution of noisy images instead of low-level features.
• We present a disentanglement method to obtain the distribution of clean ground truth from the noisy distribution, without making any assumptions on noise or employing the priors of images.
• We achieve competitive performance in removing synthetic noise from category-specific images and remote sensing images. For real noise, we also verify our denoising capacity by achieving a new state-of-the-art result on the real-world SIDD dataset.

Recent Trends of Image Denoising
Traditional Methods. Traditional denoising methods usually construct an optimization scheme, modeling the distributions of noise or the priors of natural images as penalties or constraints. The widely used natural image priors include sparsity [17], total variation [18,19], non-local similarity [20,21], and external statistical prior [22,23]. NLM [20] computes a weighted average of non-local similar patches to denoise images. The weights are calculated by the Euclidean distance between pixels. BM3D [7] employs the structure similarity of patches in a transform domain, achieving excellent accuracy on denoising additive white Gaussian noise (AWGN).
However, most traditional methods are designed to tackle generic natural images. Very few works study category-specific image denoising and consider the class-specific priors while designing algorithms. CSID [24] was the first to adopt external similar clean patches to facilitate denoising category-specific object images. The authors formulated an optimization problem using the priors in the transform domain. The objective consists of a Gaussian fidelity term that incorporates the category-specific information and a low-rank term that fortifies the similarity between noisy and external similar clean patches. They achieve superior denoising accuracy in removing noise from category-specific images. Nonetheless, a common problem that lies in most of these traditional model-driven methods is that they require noise levels as input. These methods usually implement various hard thresholds to deal with different noise levels. However, the noise level is usually unavailable, and we can only perform blind denoising in practice, limiting the application of these methods.
Deep Learning Methods. Deep learning denoising methods learn models from a large number of clean and noisy image pairs with CNNs, without requiring manually designed image priors. These methods have progressed rapidly in recent years, significantly improving denoising quality. The notable DnCNN [9] achieves good results on AWGN removal. After that, RIDNet [1] introduces attention into denoising models, boosting the denoising performance further. VDN [25] makes assumptions on the distribution of clean images and noise, deriving a new form of evidence lower bound (ELBO) under the variational inference framework as the training objective. These CNN-based denoising methods learn low-level features in the network to restore the details of clean images.
There have also been a few recent attempts at designing category-specific denoising networks. For example, [26] proposes a class-aware CNN-based denoising method. The authors first use a classifier to assign the noisy image to one of the supported classes and then apply the pre-trained class-specific denoising model for denoising. For each of the supported classes, the denoising model is pre-trained on the images from the same class of ImageNet [27]. The denoising architecture they proposed is a feature-learning-based CNN. However, for category-specific images, the feature-learning-based denoising methods usually enforce the pixels of denoised images to be close to the clean ones but ignore the underlying distribution of the specific category. Thus, over-smoothed regions and artifacts are seen in restored images, degrading the visual effects of denoising. As far as we know, we are the first to conduct image denoising with distribution learning and disentanglement.

Flow-Based Invertible Networks
We employ normalizing-flow-based invertible neural networks to learn the distributions. Normalizing flows [28] are models for computing complex distributions accurately. They apply a sequence of invertible transformations to map a simple prior distribution to a complex distribution, so that the exact log-likelihood of the complex distribution can be computed.
The key design concept of normalizing flows is invertibility, ensuring that the mapping between an input and its output is one-to-one. Therefore, to estimate the probability density of image y, we can alternatively achieve the same purpose by measuring the probability density of the counterpart latent variable z ∼ N(0, I). Estimating the probability density of y from the probability of z requires taking the change of volume between the two spaces into consideration. Consequently, we have
$$p_y(y) = p_z(z) \left| \det \frac{\partial z}{\partial y} \right| = p_z(f(y)) \left| \det \frac{\partial f(y)}{\partial y} \right|, \tag{1}$$
where z = f(y) and y = f⁻¹(z); f is the invertible function learned by normalizing flows.
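The change-of-variables rule above can be checked on a one-dimensional toy flow. The following is a minimal sketch, not part of FDN: it assumes a hypothetical scalar affine map f(y) = a·y + b (the names `flow_forward`, `flow_inverse`, and `log_likelihood` are illustrative only) and compares the flow-based log-density with the closed-form Gaussian density that this map implies for y.

```python
import math


def standard_normal_logpdf(z: float) -> float:
    """Log-density of N(0, 1) evaluated at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


def flow_forward(y: float, a: float, b: float):
    """Toy invertible map z = f(y) = a*y + b; returns z and log|df/dy|."""
    if a == 0.0:
        raise ValueError("a must be non-zero for the map to be invertible")
    return a * y + b, math.log(abs(a))


def flow_inverse(z: float, a: float, b: float) -> float:
    """Inverse map y = f^{-1}(z)."""
    return (z - b) / a


def log_likelihood(y: float, a: float, b: float) -> float:
    """log p_y(y) = log p_z(f(y)) + log|det df/dy| (change of variables)."""
    z, log_det = flow_forward(y, a, b)
    return standard_normal_logpdf(z) + log_det


def gaussian_logpdf(y: float, mean: float, std: float) -> float:
    """Closed-form log-density of N(mean, std^2), used only as a reference."""
    return (-0.5 * ((y - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


if __name__ == "__main__":
    a, b, y = 2.0, -1.0, 0.7
    # If z = a*y + b follows N(0, 1), then y follows N(-b/a, 1/a^2).
    print(log_likelihood(y, a, b), gaussian_logpdf(y, -b / a, 1.0 / abs(a)))
```

The two printed values agree because the log-determinant term accounts for how the map stretches or compresses the space; a deep flow accumulates one such term per invertible layer.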
To reduce the complexity of computing the determinants of Jacobian matrices, special designs are proposed in NICE [28] and Real NVP [13] to make each flow module have a triangular Jacobian matrix. Glow [14] extends the channel permutation methods in these two models and proposes invertible 1 × 1 convolutional layers. These models are usually used in image generation, demonstrating superior generation quality of natural images.
So far, few studies have applied invertible networks to image denoising. Noise Flow [29] employs Glow [14] to learn the distribution of real-world noise and generate real noisy images for data augmentation. Extra information, such as raw images, ISO, and camera-specific parameters, is required during noisy image generation. Different from these studies, we are the first to exploit normalizing flows to learn the distribution of noisy images and disentangle the clean representations to restore images.

Our Method
In this section, we explain the design concept of FDN. Then, we introduce the detailed components of the network architecture. The objective function, as well as some training details, are also presented.

Concept of Design
We rethink the image denoising task from the perspective of distribution learning and disentanglement. Suppose the noisy image is y and the corresponding clean ground truth is x. The noise is n = y − x. We have p(y) = p(x, n) = p(x)p(n|x); that is, the distribution of the noisy images p(y) is a joint distribution of clean images and noise. The clean representation can be obtained if we can disentangle the clean and noise representations from p(y). Then, the clean images can be restored with the disentangled clean representation.
A framework of this scheme is presented in Figure 2, which contains three steps: (i) learn the distribution of noisy images by encoding y to a noisy latent representation z, (ii) disentangle the clean representation z_C from z, and (iii) restore the clean image by decoding z_C to the clean image space. To ensure the denoising effect, the mappings between y and z, and between z_C and x, should be one-to-one.

An invertible normalizing-flow-based network is employed to learn the distribution of noisy images p(y), transforming y into latent variables z that follow a simple prior distribution N(0, I). Thus,
$$z = f(y), \tag{2}$$
where f(·) is the model learned by the network. The dimensions of z are independent of each other. We assume that z can be disentangled: some of the dimensions of z encode the distribution of clean images (denoted as z_C) and the remaining dimensions embed noise (denoted as z_N). The clean image x can be restored through the following transformation:
$$x = f^{-1}(z_C). \tag{3}$$
However, how to obtain z_C from z is not obvious. We propose a way of disentanglement by setting
$$\hat{z} = m \odot z, \tag{4}$$
where m is a mask that is 1 in the dimensions for clean variables and 0 in those for noise, ẑ is a new latent code that only contains the clean representations, and ⊙ denotes the element-wise product. Thus, we have p(z_N = 0) = 1, and the distribution of ẑ becomes
$$p(\hat{z}) = p(z_C, z_N = 0) = p(z_C)\, p(z_N = 0) = p(z_C). \tag{5}$$
Then, the clean image can be obtained via Equation (3).
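The three steps (encode, mask, decode) can be illustrated with a deliberately simple toy. This is a sketch under strong assumptions, not the FDN network: the invertible map is a fixed hypothetical 2 × 2 matrix `TOY_FORWARD` whose latent axes are aligned with the clean and noise directions by construction, whereas FDN has to learn such an alignment during training. All function names are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def inverse_2x2(matrix):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("matrix is singular, so the map is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


def disentangle(z, mask):
    """z_hat = m (element-wise product) z: keep clean dims, zero noise dims."""
    return [m * value for m, value in zip(mask, z)]


def denoise(y, forward_matrix, mask):
    """Encode y to z, mask out the noise dimensions, decode back to images."""
    z = matvec(forward_matrix, y)                     # z = f(y)
    z_hat = disentangle(z, mask)                      # z_hat = m * z
    return matvec(inverse_2x2(forward_matrix), z_hat)  # x_hat = f^{-1}(z_hat)


# Hypothetical toy map: its inverse has columns u = (1, 1) (clean direction)
# and v = (1, -1) (noise direction).
TOY_FORWARD = [[0.5, 0.5], [0.5, -0.5]]
CLEAN_MASK = [1.0, 0.0]  # first latent dim is "clean", second is "noise"

if __name__ == "__main__":
    noisy = [3.5, 2.5]  # equals 3*u + 0.5*v
    print(denoise(noisy, TOY_FORWARD, CLEAN_MASK))  # the clean part, 3*u
```

In this toy the noise component along v is removed exactly because the second latent dimension carries nothing else; in a learned flow the reconstruction loss is what encourages the masked dimensions to absorb the noise.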

Network Architecture
The details of our FDN architecture are presented in this section. FDN is composed of several invertible DownScale Flow Blocks, as shown in Figure 3. Each block consists of a Squeeze layer, which downscales the latent representations, followed by several Step-Of-Flow (SoF) Blocks, which perform the distribution transformation. The details of each layer are described below.
Actnorm. The Actnorm layers apply an affine transformation to the latent variables, as illustrated in Equation (7):
$$h_{i+1} = s_1 \odot h_i + b_1, \tag{7}$$
where h_i and h_{i+1} are the intermediate latent representations during transformation. The output of the forward pass z can be treated as the final layer of the latent representations. s_1 and b_1 are the scale and translation parameters, respectively, and ⊙ is the Hadamard product of tensors. The reverse operation of the Actnorm layer is
$$h_i = (h_{i+1} - b_1) / s_1,$$
where the division is element-wise. s_1 and b_1 are initialized so that each channel of the representations has zero mean and unit variance, as in a normalization operation. However, during training, this operation differs from the widely used normalization methods. Specifically, s_1 and b_1 are updated through back-propagation, without any further constraints on the mean and variance of the latent variables. Employing Actnorm layers improves the training stability and performance.
Invertible 1 × 1 Convolutional Layers. Instead of ordinary convolutional layers, we use invertible 1 × 1 convolutional layers, which are designed for normalizing flows to support invertibility. The operation in these layers can be represented as
$$h_{i+1} = W h_i,$$
where W is a square matrix that is initialized randomly. Its reverse function is
$$h_i = W^{-1} h_{i+1}.$$
These layers are used to permute different channels of latent representations.
Affine Coupling Layers. The Affine Coupling layers capture the correlations among spatial dimensions [13,28]. The forward operations are
$$\begin{aligned}
h^a_i,\ h^b_i &= \mathrm{Split}(h_i), \\
h^a_{i+1} &= h^a_i + g_1(h^b_i), \\
h^b_{i+1} &= g_2(h^a_{i+1}) \odot h^b_i + g_3(h^a_{i+1}), \\
h_{i+1} &= \mathrm{Concat}(h^a_{i+1}, h^b_{i+1}),
\end{aligned}$$
where Split(·) and Concat(·) operate along the channel dimension. Split(·) splits h_i into two tensors h^a_i and h^b_i. Concat(·) concatenates the two tensors h^a_{i+1} and h^b_{i+1} channel-wise to obtain h_{i+1}. g_i(·) (i = 1, 2, 3) are neural networks. The reverse operations are
$$\begin{aligned}
h^a_{i+1},\ h^b_{i+1} &= \mathrm{Split}(h_{i+1}), \\
h^b_i &= \big(h^b_{i+1} - g_3(h^a_{i+1})\big) / g_2(h^a_{i+1}), \\
h^a_i &= h^a_{i+1} - g_1(h^b_i), \\
h_i &= \mathrm{Concat}(h^a_i, h^b_i).
\end{aligned}$$
Compared with the forward pass, the additions become subtractions and the element-wise product becomes an element-wise division. g_i(·) (i = 1, 2, 3) can be any neural network. Following [31,32], we employ the Dense Block (DB) as g_i(·) in our network.
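The invertibility of the Actnorm and coupling operations can be verified numerically. The sketch below is an illustration under stated assumptions, not the FDN implementation: tensors are flat Python lists, Actnorm handles a single channel, and `g1`, `g2`, `g3` are hypothetical closed-form stand-ins for the Dense Blocks (`g2` is kept strictly positive here so that the division in the reverse pass is always defined, which is an assumption of this sketch).

```python
import math


def actnorm_init(samples, eps=1e-6):
    """Data-dependent init: pick s, b so s*h + b has zero mean, unit variance."""
    mean = sum(samples) / len(samples)
    var = sum((v - mean) ** 2 for v in samples) / len(samples)
    s = 1.0 / math.sqrt(var + eps)
    return s, -mean * s


def actnorm_forward(h, s, b):
    return [s * v + b for v in h]


def actnorm_reverse(h_next, s, b):
    return [(v - b) / s for v in h_next]


# Hypothetical stand-ins for the learned networks g_1, g_2, g_3.
def g1(v):
    return [0.5 * math.tanh(u) for u in v]


def g2(v):
    return [math.exp(0.5 * math.tanh(u)) for u in v]  # strictly positive scale


def g3(v):
    return [0.1 * u for u in v]


def coupling_forward(h):
    """Split, additive update of one half, affine update of the other, concat."""
    assert len(h) % 2 == 0, "this sketch splits the input into equal halves"
    half = len(h) // 2
    ha, hb = h[:half], h[half:]
    ha_next = [a + t for a, t in zip(ha, g1(hb))]
    scale, shift = g2(ha_next), g3(ha_next)
    hb_next = [s * b + t for s, b, t in zip(scale, hb, shift)]
    return ha_next + hb_next


def coupling_reverse(h_next):
    """Undo the affine update first, then the additive one."""
    half = len(h_next) // 2
    ha_next, hb_next = h_next[:half], h_next[half:]
    scale, shift = g2(ha_next), g3(ha_next)
    hb = [(b - t) / s for s, b, t in zip(scale, hb_next, shift)]
    ha = [a - t for a, t in zip(ha_next, g1(hb))]
    return ha + hb


if __name__ == "__main__":
    h = [0.3, -1.2, 2.0, 0.7]
    print(coupling_reverse(coupling_forward(h)))  # recovers h up to rounding
```

The reverse pass never inverts `g1`, `g2`, or `g3` themselves; it only re-evaluates them on quantities that are already known, which is why arbitrary networks can be used inside a coupling layer.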

Objective Function
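Our objective function consists of two components: the distribution learning loss to encode the input noisy image y into the latent code z, and the reconstruction loss to restore the corresponding clean image x with the clean code z_C. The details of these two losses are given below.
Distribution Learning Loss. The distribution learning loss is the negative log-likelihood of the noisy image:
$$\mathcal{L}_{\mathrm{dist}} = -\log p_y(y) = -\left( \log p_z(z) + \sum_{i=1}^{L} \log \left| \det \frac{\partial f_i}{\partial h_{i-1}} \right| \right),$$
where L is the number of invertible layers in FDN, f_i is the function learned by the i-th layer, and h_{i−1} is the input of that layer (with h_0 = y and h_L = z). p_z(z) ∼ N(0, I) and z = (z_C, z_N). To reconstruct the clean image, we set z_N = 0 and obtain a new latent representation ẑ = (z_C, 0), which lies in a subspace of z.
Reconstruction Loss. ẑ is passed through the reverse network to restore the clean ground truth. The reconstruction loss, denoted L_rec, measures the difference between the restored image f⁻¹(ẑ) and the clean image x.
Total Loss. The total objective function we use during training is
$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{dist}} + \lambda_2 \mathcal{L}_{\mathrm{rec}},$$
where λ_1 and λ_2 are the weights for the two loss components.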
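As a rough illustration of how the two components combine, the sketch below computes a negative log-likelihood under a standard normal prior plus per-layer log-determinants, and a weighted reconstruction term. Several choices are assumptions made only for this example: the reconstruction term is taken to be a mean absolute error, the weights default to 1.0, and all function and parameter names are hypothetical.

```python
import math


def distribution_loss(z, layer_log_dets):
    """Negative log-likelihood: -(log p_z(z) + sum_i log|det J_i|).

    z is a flat list of latent values assumed to follow N(0, I);
    layer_log_dets holds one log-determinant per invertible layer.
    """
    log_pz = sum(-0.5 * (v * v + math.log(2.0 * math.pi)) for v in z)
    return -(log_pz + sum(layer_log_dets))


def reconstruction_loss(restored, clean):
    """Mean absolute error between restored and clean pixels (an assumption)."""
    return sum(abs(r - c) for r, c in zip(restored, clean)) / len(clean)


def total_loss(z, layer_log_dets, restored, clean, lambda_1=1.0, lambda_2=1.0):
    """Weighted sum of the distribution learning and reconstruction losses."""
    return (lambda_1 * distribution_loss(z, layer_log_dets)
            + lambda_2 * reconstruction_loss(restored, clean))


if __name__ == "__main__":
    z = [0.1, -0.4, 0.2, 0.0]
    log_dets = [0.3, -0.1]          # one entry per invertible layer
    restored, clean = [0.5, 0.7], [0.4, 0.9]
    print(total_loss(z, log_dets, restored, clean, lambda_1=1.0, lambda_2=2.0))
```

The first term shapes the latent space so that z is close to the prior, while the second ties the masked code to the clean image; the relative weights trade one objective against the other.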

Data Preprocessing
Since FDN is the first distribution-learning-based denoising network, we explore different data preprocessing techniques to demonstrate how to make the best use of it.
Random Crop vs. Center Crop with Resizing. The widely used training strategy in feature-learning-based CNN denoising networks is to randomly crop patches from the training dataset to learn features. However, it is not obvious whether the random crop strategy is also superior in distribution-learning-based networks. Take face image denoising as an example: if we center crop the face region and resize it to an appropriate size, we obtain a downscaled face image that follows a distribution similar to that of the face image test set. Intuitively, the center crop with resizing should help the network learn the distribution better and achieve superior denoising results.
Thus, we compare training with random crop and center crop with resizing, illustrating the curves of the validation results in Figure 5a. Contrary to our intuition, cropping training patches randomly outperforms the center crop with resizing consistently and significantly. Therefore, we apply random crops while training the FDN.

Data Augmentation vs. No Data Augmentation.
Feature-learning-based networks usually employ horizontal and vertical flip and rotation with 90, 180, and 270 degrees for data augmentation. However, these methods will bring in unrealistic patches, compromising the learning of distributions. For example, if we rotate a patch of a face image, we may obtain patches with the eyes under the mouth or on the mouth's left side, which is impossible in real face images. Although data augmentation can lead to better generalizability for discriminative models, it may bring noise when learning the distributions.
We train three models for comparison: one with flip and rotation as data augmentation, one with only flip as data augmentation, and one without any augmentation. The validation results are shown in Figure 5b. The results verify our concern that inappropriate data augmentation such as rotation introduces noise to the distribution model, resulting in the lower denoising accuracy shown by the blue curve. Training with only flip as data augmentation achieves almost the same results as training without data augmentation; however, the latter is more stable during training. A potential reason is that the horizontal flip still generates realistic face images, while the vertical flip creates impossible face images, making the training unstable. Thus, to avoid unrealistic samples, we train our distribution-learning-based networks without data augmentation when the training set is large enough to learn the distribution. If the training set is small, we only conduct data augmentation that will not generate unrealistic patches.

Experiment
We perform thorough experiments to demonstrate the effectiveness of our method. We first apply FDN to denoise category-specific images. Since category-specific images usually share similar patterns, such as similar facial contours and features in human faces, their distribution is easier to learn than that of generic natural images. The experiment is then extended to the more difficult task of denoising remote sensing images, which contain diverse terrain patterns, such as mountains or forests, and follow intricate distributions. Finally, we investigate our capacity to remove real noise, which follows a complicated distribution, on a real noise dataset. Further details are provided about the datasets, training strategies, and qualitative and quantitative results.
The optimizer of [34] is applied. The learning rate is initialized as 2 × 10⁻⁴ and halved after every 50K iterations. To evaluate the methods, we employ the Peak Signal-to-Noise Ratio (PSNR) as the evaluation metric.
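The step schedule and the evaluation metric mentioned above are simple enough to write down directly. This is a minimal sketch with hypothetical function names; it assumes zero-based iteration counting and 8-bit images with a peak value of 255, neither of which is stated in the text.

```python
import math


def learning_rate(iteration, base_lr=2e-4, halve_every=50_000):
    """Step schedule: start at base_lr and halve after every 50K iterations."""
    return base_lr * 0.5 ** (iteration // halve_every)


def psnr(clean, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two flat pixel lists."""
    mse = sum((c - r) ** 2 for c, r in zip(clean, restored)) / len(clean)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    for it in (0, 49_999, 50_000, 120_000):
        print(it, learning_rate(it))
    print(psnr([10.0, 20.0, 30.0], [12.0, 19.0, 33.0]))
```

Because PSNR is a logarithm of the inverse mean squared error, a gain of about 3 dB corresponds to halving the mean squared error.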

Datasets
Next, we provide information about the category-specific, remote sensing, and real-world datasets.
Category-Specific Datasets: We investigate the capacity of FDN in removing AWGN on three category-specific datasets: faces, flowers, and birds.

Remote Sensing Datasets:
We attempt to denoise two remote sensing datasets (RICE1 and RICE2 [36]) with AWGN added to explore our capability when the distribution of ground truth images becomes complex. The datasets contain 500 and 450 pairs of images, respectively, each with a size of 512 × 512. We randomly crop patches of size 64 × 64 from the images for training and add AWGN with σ = 30, 50, and 70 to obtain noise-free and noisy pairs, respectively. Random flipping, as well as rotation, are utilized for data augmentation.
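The construction of a noise-free and noisy training pair can be sketched as follows. The example makes several assumptions that the text does not specify: images are single-channel nested lists, σ is expressed on the 0 to 255 intensity scale, no clipping is applied after adding noise, and the function names are hypothetical.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size patch at a uniformly random location."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("patch size exceeds image size")
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def add_awgn(patch, sigma, rng):
    """Add zero-mean white Gaussian noise with standard deviation sigma."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in patch]


def make_training_pair(image, sigma, size=64, rng=None):
    """Return a (clean patch, noisy patch) pair for one training sample."""
    rng = rng or random.Random()
    clean = random_crop(image, size, rng)
    return clean, add_awgn(clean, sigma, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 512 x 512 gradient standing in for a remote sensing image.
    image = [[float((r + c) % 256) for c in range(512)] for r in range(512)]
    clean, noisy = make_training_pair(image, sigma=50, rng=rng)
    print(len(clean), len(clean[0]), noisy[0][0] - clean[0][0])
```

Drawing a fresh noise realization every time a patch is sampled means the network rarely sees the same noisy input twice, even when the set of clean images is small.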
Real Noisy Datasets: Finally, we verify FDN's effectiveness in removing real noise, which follows a complex distribution. Real image noise can result from photon shot noise, fixed pattern noise, dark current, readout noise, quantization noise, etc. during the imaging process [37]. We conduct real noise removal on the dataset SIDD [38], which is taken by five smartphone cameras with small apertures and sensor sizes. The medium SIDD dataset contains 320 clean and noisy pairs. Patches with a size of 144 × 144 are randomly cropped for training. Flipping and rotation are adopted for augmentation.

Category-Specific Image Denoising
Quantitative Results. In Table 1, we report numeric values for the three category-specific datasets with added AWGN at the levels σ = 15, 25, and 50. Compared with other competitive methods in synthetic noise removal, FDN achieves the highest PSNR on all the datasets and noise levels. We also employ the same FDN for blind denoising with noise levels in the range [0, 55], as shown in Table 1. The distribution of blind noise is a Gaussian Mixture Model, which is much more complicated than the Gaussian distribution with a fixed noise level. Although traditional methods such as BM3D [7] and EPLL [39] are good at removing Gaussian noise, they cannot perform blind denoising because they require the noise level as input. Compared with other feature-learning-based CNN methods, FDN outperforms them by a large margin, exhibiting its superiority in category-specific image denoising.
Qualitative Results. The visual results are shown in Figures 6-8. For CelebA, we observe that, although the other competitive methods can restore the facial contour, they lose many detailed facial features. Thus, the denoised images of these methods are blurred with artifacts. In contrast, our denoising results are much clearer and closer to the ground truth images. For the Flower and CUB-200 datasets, the foreground and background are more diverse and complicated than in CelebA. Our results are clean with sharp edges (in close-up versions), while other methods have artifacts near and at the edges. This illustrates that FDN can handle category-specific image denoising very well.
Figure 6. Image denoising results of FDN on the CelebA dataset with σ = 50 against competitive methods (panels: GT, Noisy, BM3D, DnCNN, EPLL, IRCNN, FDN (Ours)). Our network produces results close to the ground truth without any kind of deformation or artifact. The effects are best viewed zoomed-in.

Remote Sensing Image Denoising
Quantitative Results. The results of denoising the RICE1 and RICE2 [36] datasets with σ = 30, 50, and 70 are reported in Table 2. FDN improves by 1.13 dB to 1.74 dB on RICE1 across the noise levels compared with the highest results from other competitive methods. On RICE2, FDN outperforms other methods when the noise levels are large, i.e., achieving increases of 0.97 dB and 1.55 dB for σ = 50 and 70, respectively. These results demonstrate that FDN is also capable of restoring images following complex distributions.
Qualitative Results. The visual results are illustrated in Figure 9. In the right-hand regions of the figure, each pixel difference is displayed as 255 minus its value, so whiter pixels indicate smaller errors. The remote sensing datasets are mainly composed of images with two types of regions: texture regions, such as mountains, and smooth regions, such as deserts. An example of a smooth region from RICE1 with σ = 70 is taken. As the right-hand regions of Figure 9b-f show, FDN outperforms the other methods significantly. Thus, our distribution learning and disentanglement based denoising method, i.e., FDN, proves effective not only for category-specific data but also for images following more complex distributions.
Figure 9. Visual results on RICE1 with σ = 70. For (a), the left part is the clean image and the right part is the noise. For (b-f), the left part is the denoised image and the right region reflects the difference between the denoised and GT images. Whiter pixels represent better denoising performance. The denoised image restored by FDN is closer to the ground truth.

Real Image Denoising
Quantitative Results. The performance comparison on the test set of the real noise dataset SIDD [38] is listed in Table 3. We achieve a new state-of-the-art denoising accuracy compared with other methods. In addition, our model size (4.38 M parameters) is much smaller than that of the competitive AINDNet [42] (13.76 M) and VDN [25] (7.81 M), illustrating that FDN is suitable for deployment on small edge devices. We also report the computational cost of inference (in GFLOPs) on one 256 × 256 image for each method. FDN requires far less computation than VDN [25].
Qualitative Results. To further present the effectiveness of FDN against other state-of-the-art methods, we show the visual results of denoised images in Figure 10. FDN restores accurate textures and well-shaped edges, while other methods blur details and introduce artifacts. This indicates that FDN is also superior in removing real-world noise.

Figure 10. Visual comparison on the SIDD dataset against state-of-the-art methods (panels: Noisy, CBDNet, GradNet, VDN, FDN (Ours)). In the first row, FDN reconstructs the white dot patterns clearly in a dark environment without smoothing and artifacts. In the second row, FDN preserves crisper edges.

Number of Flow and SoF Blocks.
We study the denoising effects of employing different numbers of Flow and SoF blocks in FDN. We train models on CelebA [33] with σ = 50 AWGN added. The results of the 50Kth iteration on the validation set are reported in Table 4. In general, given the same number of Flow blocks, the more SoF blocks in each of the Flow blocks, the higher the denoising accuracy. However, the improvement is not significant when we have three Flow blocks. On the other hand, with the same number of SoF blocks in each Flow block, increasing the number of Flow blocks from one to two improves the performance. Nevertheless, further increments result in similar accuracy when SoF = 4 and even a slight decrease when SoF = 8. Thus, we adopt two Flow blocks, each containing eight SoF blocks, in our experiment.
The split of z_C and z_N. We also study the effects of different dimension numbers of z_C (i.e., dim(z_C)). Models are trained with dim(z_C) = 1/8, 1/4, 1/2, 3/4, and 7/8 of dim(z) separately, and the validation results of the 50Kth iteration are reported in Table 5. In general, the denoising accuracy improves as the proportion of dim(z_C) increases. On the other hand, the results are almost the same when dim(z_C) = 3/4 and 7/8 of dim(z), illustrating that extending the dimensions of z_C further will not boost the denoising performance. Thus, we use dim(z_C) = 3/4 dim(z) in our experiment.

Conclusions
The widely used image denoising CNNs are discriminative models, learning the mapping between noisy images and their clean counterparts via learning features of images. However, these methods may overlook the underlying distribution of the clean ground truth, resulting in degraded visual results with blurry regions or artifacts. This paper provides a new perspective to understand image denoising as a distribution disentangling task. Since the distribution of noisy images can be treated as a joint distribution of clean images and noise, the denoised images can be obtained via the clean images' latent representations. A distribution-learning-based denoising framework is proposed in this paper. We also present a novel denoising network, FDN, based on normalizing flows without adding any assumptions on clean images and noise distributions. FDN learns the distribution instead of features from noisy images, which is different from previous feature-learning-based networks. A distribution disentanglement method for denoising is introduced as well. Experimental results verify the effectiveness of FDN on both category-specific and remote sensing image denoising with synthetic AWGN. Moreover, FDN shows its superiority in real image denoising with fewer parameters and a lower running time.
In conclusion, this paper presents a new potential direction to optimize image denoising methods in the future.