SCENet: Secondary Domain Intercorrelation Enhanced Network for Alleviating Compressed Poisson Noises

In real image coding systems, block-based coding is often applied to images contaminated by camera sensor noises such as Poisson noises, which produces a complicated type of noise called compressed Poisson noise. Although many restoration methods have recently been proposed for compressed images, they do not provide satisfactory performance on the challenging compressed Poisson noises. This is mainly due to (i) inaccurate modeling of the image degradation, (ii) the signal-dependent noise property, and (iii) the lack of analysis of the intercorrelation distortion. In this paper, we focus on these challenging issues in practical image coding systems and propose a compressed Poisson noise reduction scheme based on a secondary domain intercorrelation enhanced network. Specifically, we introduce a compressed Poisson noise corruption model and combine the secondary domain intercorrelation prior with a deep neural network especially designed for signal-dependent compression noise reduction. Experimental results show that the proposed network is superior to the existing state-of-the-art restoration alternatives on classical images, the LIVE1 dataset, and the SIDD dataset.


Introduction with Preliminary Examination
The block-based discrete cosine transform (BDCT) coding scheme is typically adopted in various coding standards, such as JPEG, MPEG4, H.264/AVC, and H.265/HEVC, for image and video compression. However, block-based coding suffers from well-known blocking artifacts caused by the distortion of the spatial correlation between neighboring blocks, called the intercorrelation. Meanwhile, according to [1], noises in real images captured by charge-coupled device (CCD) imaging sensors generally tend to have signal-dependent characteristics, such as a Poisson distribution. Accordingly, coding a Poisson noise-corrupted image generates complex signal-dependent compressed noises, which are called compressed Poisson noises. By applying recent compressed image restoration algorithms [2-14] based on convolutional neural networks (CNN) to the degraded image, it may be possible to reduce the conventional blocking and ringing artifacts. However, despite their promising results on conventional artifacts, the existing algorithms still do not perform well on compressed Poisson noises. This is because neither an accurate image degradation model that considers practical imaging systems nor the signal-dependent noise characteristics have been seriously dealt with in the existing neural networks. To cope with the issues caused by compressed Poisson noises, in this paper, we introduce an image degradation model suited to practical applications and present a multi-band neural network that is robust to the signal-dependent noise property by exploiting a variance-stabilizing transformation (VST) [15]. In addition, the existing restoration algorithms [2-14] may not deliver their best performance on coded images, especially at low bit rates, because the recovery of the intercorrelation distortion of each BDCT coefficient is not analytically reflected in the existing networks.
In this paper, to verify the effect of block-based coding on the intercorrelation, we performed a preliminary study that investigated the intercorrelation of each BDCT coefficient between neighboring blocks. To this end, we first prepared four ground truth (GT) images, given in the top left of Figure 1, and produced their JPEG-coded images with quality factors, q, of 10, 20, and 30, corresponding to low bit rates. For the four GT images and their twelve coded images, we obtained the secondary block, b, which is composed of 4 × 4 BDCT coefficients [16]. We then computed the spatial correlation of the coefficients in each secondary block, that is, the intercorrelation ρ. Note that we investigated the intercorrelation using both GT images and the corresponding coded images, while the previous study [16] only used GT images. We further observed the change of the intercorrelation distribution according to the compression level of the coded images, unlike in an earlier study [17].
Figure 1 shows the distributions of the computed high intercorrelation values (ρ > 0.6) for the three lowest frequency (LF) BDCT coefficients: DC, AC(1,0), and AC(0,1). As the compression level increases, the relative frequencies of the high intercorrelation values for the three LF coefficients commonly decrease. For example, the relative frequency of 63.4% for the DC of GT images decreases to 58.0%, 52.5%, and 40.5% in JPEG-coded images with q of 30, 20, and 10, respectively. The same tendency was also observed for the other two LF coefficients, AC(1,0) and AC(0,1), while the remaining high-frequency (HF) coefficients did not necessarily follow this trend. From this examination, we considered that the intercorrelation enhancement of the three LF coefficients was required for the effective restoration of coded images.
In addition, the intercorrelation distortion of the BDCT coefficients occurred differently for each coefficient, as a different quantization step size was applied for each one in JPEG. Hence, in this paper, we propose an intercorrelation enhancement network in the secondary domain, which enabled us to improve the distorted intercorrelation of each frequency coefficient adaptively.

Degradation Model
Considering the practical imaging systems described above, we defined a simple but effective image degradation model consisting of three procedures: camera sensor noise corruption, image coding based on quantization in the BDCT domain, and image decoding based on dequantization in the BDCT domain, as illustrated in Figure 2. Here, z denotes a decoded block in the receiver and x denotes a spatial coordinate; T and T^{-1} are the BDCT and inverse BDCT (IBDCT) operators, respectively. In addition, P denotes a Poisson variable scaled by a with a mean value µ, and Q denotes a quantization table. The acquired value P(x) thus follows a scaled Poisson distribution. To better observe the visual degradation of the received image z, a residual image can be obtained by subtracting the GT image from z, as shown in the top right of Figure 2. We note here that the decoded image suffers from complicated degradations, including compressed Poisson noises as well as the well-known blocking and ringing artifacts. It can also be observed that the original near-random Poisson noises in P(x) were deformed into annoying patterns that had a strong spatial correlation, even in smooth regions.
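As a minimal sketch of the coding and decoding steps of this model, the following Python code quantizes and dequantizes one 8 × 8 block in the BDCT domain, assuming that the decoded block takes the form z = T^{-1}(Q · round(T(P) / Q)) with element-wise operations. This form, the orthonormal DCT-II normalization, and the helper names (`bdct`, `ibdct`, `code_and_decode`) are assumptions made for illustration and are not taken from the original implementation.

```python
import math

N = 8  # block size used by JPEG-style BDCT coding

# Orthonormal DCT-II basis: C[u][x] = c(u) * cos((2x + 1) * u * pi / (2N))
C = [[(math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)] for u in range(N)]
CT = [list(row) for row in zip(*C)]  # transpose of C


def matmul(a, b):
    """Plain-list matrix product, sufficient for 8 x 8 blocks."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def bdct(block):
    """T: 2-D DCT of one block, computed as C * block * C^T."""
    return matmul(matmul(C, block), CT)


def ibdct(coef):
    """T^-1: inverse 2-D DCT, computed as C^T * coef * C."""
    return matmul(matmul(CT, coef), C)


def code_and_decode(noisy_block, q_table):
    """Assumed model: z = T^-1(Q * round(T(P) / Q)), element-wise in Q."""
    coef = bdct(noisy_block)
    dequantized = [[q_table[u][v] * round(coef[u][v] / q_table[u][v])
                    for v in range(N)] for u in range(N)]
    return ibdct(dequantized)
```

Under these assumptions, every BDCT coefficient of the decoded block differs from the coefficient of the noisy input by at most half of the corresponding quantization step, which is how a larger Q at low bit rates translates into stronger distortion.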


Secondary Domain Intercorrelation Enhanced Network
To reflect the defined image degradation model in the neural network adequately, we propose a secondary domain intercorrelation enhanced network (SCENet), which is designed to address compressed Poisson noises, as illustrated in Figure 3. Inspired by our preliminary examination, we adapted and extended the secondary domain approach [17], which is still valuable for recovering the intercorrelation distortion. Note that the proposed algorithm clearly differs from the existing algorithm [17] in terms of method and application. Specifically, while classical edge-preserving total variation (TV) filtering was applied in the secondary domain to remove well-known blocking artifacts in [17], we instead utilized a deep neural network specially trained for reducing compressed Poisson noises. We also exploited the variance-stabilizing model to deal with signal-dependent noise characteristics, unlike in [17]. In other words, we combined key elements of the VST-based secondary domain intercorrelation model with a deep neural network that was trained using the defined compressed Poisson noise model. The proposed SCENet architecture has two major parallel phases: restoration of the three LF coefficients and restoration of the high-band (HB) image.
In one of the parallel phases, we increased the intercorrelation of DC, AC(1,0), and AC(0,1) in the secondary domain, as shown at the top of Figure 3. In particular, the network had 20 layer architectures, each composed of five operations: VST, convolution, inverse VST (IVST), batch normalization (BN), and rectified linear units (ReLU). For an input image z, we first generated three secondary images: S_DC, S_AC(1,0), and S_AC(0,1). To this end, we computed the three LF BDCT coefficients in each 8 × 8 block by shifting the block pixel-by-pixel with overlapping and then merged them into the respective secondary images. After that, to remove the signal dependency of the compressed noises, the secondary image pixel value s was stabilized to have homoscedastic variance via the Anscombe transformation [15]. As a subsequent procedure, the convolution was undertaken with K pre-trained filters, a_{L,F}, with a size of W × H. Next, destabilization based on the IVST was applied in order to retrieve the original heteroskedastic variance. The IVST step was then followed by BN and ReLU for fast and stable convergence in the training process. The iterative layer architectures were performed on the three secondary images separately for the adaptive restoration of each coefficient, which had a different quantization amount. The final feature maps were reshaped to the original input tensor size via a fully-connected layer, and an output low-band (LB) image L_out was then obtained by applying T^{-1} to the three filtered coefficients, S_DC,out, S_AC(1,0),out, and S_AC(0,1),out, in each block without overlapping. The images from the first to the fourth column of Figure 4 show that the restoration network for the three LF coefficients successfully recovered the secondary images and the LB image, making them similar to their corresponding GT images by addressing the artifacts in the degraded images.
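The stabilize, filter, and destabilize order within one layer can be sketched as follows. The sketch assumes the standard Anscombe transformation 2·sqrt(s + 3/8) and its plain algebraic inverse; whether the network uses this exact pair, and how signed AC values are handled, is not stated here, so the code is restricted to non-negative inputs such as S_DC. The 1-D filter and the function names are hypothetical simplifications of the 2-D learned filters a_{L,F}.

```python
import math


def anscombe(s):
    """Forward VST: maps Poisson-like values to roughly unit variance."""
    return 2.0 * math.sqrt(s + 3.0 / 8.0)


def inverse_anscombe(t):
    """Algebraic inverse of the forward transform (not the unbiased inverse)."""
    return (t / 2.0) ** 2 - 3.0 / 8.0


def filter_in_vst_domain(signal, kernel):
    """VST -> 1-D filtering -> IVST on a list of non-negative values.

    The filtering is a cross-correlation, as in CNN layers, and indices are
    clamped at the borders so the output has the same length as the input.
    """
    stabilized = [anscombe(s) for s in signal]
    radius = len(kernel) // 2
    last = len(stabilized) - 1
    output = []
    for i in range(len(stabilized)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # replicate edge samples
            acc += weight * stabilized[j]
        output.append(inverse_anscombe(acc))
    return output
```

Filtering in the stabilized domain means that a single set of filter weights sees approximately the same noise variance in dark and bright regions, which is the motivation for placing the convolution between the VST and the IVST.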
In the other parallel phase, for restoring the HB image, we first obtained the input LB image L by applying T^{-1} to the three LF coefficients in each 8 × 8 block. The input HB image H was then acquired by subtracting L from the input image z. Note that H corresponds to the remaining 61 HF coefficients in each block. Next, the filtered output H_out was computed via the same iterative layer architectures as in the restoration of the three LF coefficients, using convolution filters b_L instead of a_{L,F}. Note in the fifth column of Figure 4 that the network effectively recovered H_out quite close to the GT. The final restoration result was obtained by adding the two output images, L_out and H_out. All of the above steps are also described in Algorithm 1.
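The split of an input image into the LB image L and the HB image H can be sketched as below. The code assumes an orthonormal 8 × 8 DCT, image dimensions that are multiples of 8, and that AC(1,0) and AC(0,1) index the coefficients at (row, column) positions (1, 0) and (0, 1); the index convention and the function names are assumptions for illustration.

```python
import math

N = 8
C = [[(math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)] for u in range(N)]
CT = [list(row) for row in zip(*C)]

# DC, AC(1,0), AC(0,1); the (row, column) ordering is an assumption.
LF = [(0, 0), (1, 0), (0, 1)]


def _mm(a, b):
    """Plain-list matrix product for small blocks."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def dct2(block):
    return _mm(_mm(C, block), CT)


def idct2(coef):
    return _mm(_mm(CT, coef), C)


def split_bands(z):
    """Return (L, H): L keeps the three LF coefficients of each 8 x 8 block,
    and H = z - L carries the remaining 61 HF coefficients."""
    height, width = len(z), len(z[0])
    low = [[0.0] * width for _ in range(height)]
    for by in range(0, height, N):
        for bx in range(0, width, N):
            block = [row[bx:bx + N] for row in z[by:by + N]]
            coef = dct2(block)
            kept = [[coef[u][v] if (u, v) in LF else 0.0 for v in range(N)]
                    for u in range(N)]
            rec = idct2(kept)
            for i in range(N):
                low[by + i][bx:bx + N] = rec[i]
    high = [[z[i][j] - low[i][j] for j in range(width)]
            for i in range(height)]
    return low, high
```

Because the BDCT is linear, subtracting L from z in the spatial domain is equivalent to zeroing the three LF coefficients of each block, so L and H add back to z exactly up to floating-point rounding.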

Experiments
To train our networks, we used 400 images from the BSDS500 dataset [18] and 800 images from the DIV2K dataset [19]. Given a GT image, we synthesized three JPEG degraded images with different noise levels of {quality factor q, peak} = {10, 200}, {20, 400}, and {30, 600}. To generate Poisson noises, the maximum intensity of the GT image was first normalized to the defined peak value. Next, the noise corruption was performed on the normalized image, and the corrupted image was then denormalized to the original maximum intensity. Therefore, the lower the peak value, the higher the Poisson noise level. In the restoration network of DC, AC(1,0), and AC(0,1), the sizes W × H of the filter parameters a_{L,F=1}, a_{L,F=2}, and a_{L,F=3} were set to 3 × 3, 3 × 1, and 1 × 3, respectively, by considering the dominant pattern of compression artifacts in each secondary image. For example, the secondary image of AC(1,0) (or AC(0,1)) included only the vertical (or horizontal) artifacts affected by the BDCT basis function, as shown in the second row of Figure 4. In addition, the number of filters K in all networks was set to 64, and W × H for the HB image restoration network parameters b_L was empirically set to 3 × 3 in every layer architecture. Given a set of secondary images and HB images computed from GT images and their corresponding degraded images, we used the mean squared error (MSE) as the loss function. To minimize the loss function, we adopted the Adam optimizer [20] with a batch size of 32. The learning rate was set to drop exponentially from 1e-3 to 1e-5. The proposed network was trained separately for each noise level on one NVIDIA GTX 1080 GPU, under MATLAB R2017b with the MatConvNet package, for about 16 h. Figure 5 shows an example of the convolution filters that were obtained via the network training. The whole inference time was about 120 ms for a 512 × 512 image, and the time could be further reduced via parallel processing.
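The peak-based noise synthesis described above (normalize to the peak, corrupt, denormalize) can be sketched as follows. The sampler is Knuth's multiplication method, chosen only to keep the example dependency-free; it is not the generator used for the experiments, and the function names and the `seed` parameter are hypothetical.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) count with Knuth's multiplication method.

    Adequate for lam below roughly 700, where exp(-lam) is still
    representable as a positive double.
    """
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def add_poisson_noise(image, peak, seed=0):
    """Normalize the maximum intensity to `peak`, draw Poisson counts,
    then denormalize back to the original intensity range."""
    rng = random.Random(seed)
    max_value = max(max(row) for row in image)  # assumed to be positive
    scale = peak / max_value
    return [[poisson_sample(value * scale, rng) / scale for value in row]
            for row in image]
```

For a pixel with original intensity v, the noisy value has a variance of v · max / peak after denormalization, where max is the maximum intensity of the GT image, which is why a lower peak value yields a higher noise level.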

Meanwhile, in order to validate our trained networks, we used eight classical images given in Table 1 and 29 images from the LIVE1 dataset [21]. Figures 6 and 7 show several restoration results for the JPEG degraded images with different noise levels of q and peak values to evaluate the subjective performance of the proposed network. We also compared the performance with a general compression artifact reduction algorithm [2] and two existing state-of-the-art restoration algorithms [4,5] based on CNN. Two existing denoising algorithms [22,23] that were not based on CNN were additionally used for the comparison. Open-source code from the first authors' websites was used for the comparison. The pre-trained models for MWCNN [5] were kindly provided by P. Liu because they were not accessible via the website. It can easily be noted in Figures 6 and 7 that the results of the existing algorithms were not satisfactory because the undesirable compressed Poisson noises still remained, especially in many flat regions such as the wing, the face, the pepper, the calendar, the sky, and the wall. In contrast, the proposed SCENet provided more visually pleasing images than the existing algorithms by successfully alleviating the annoying compressed Poisson noises while preserving the image details. This noticeable visual improvement was achieved by the proposed VST-based secondary domain intercorrelation prior that was enforced in the neural network. The full resolution image results and an executable program for reproducing the results are also available on our website [24].
Sensors 2019, 19, 1939
In addition to the subjective comparison, a quantitative comparison was conducted. Table 1 summarizes the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [25] values computed from the processed results of the eight classical test images. It can be seen in the table that the proposed SCENet provided the best objective quality in all but one case, by restoring the GT pixel values well. An objective comparison on the LIVE1 dataset was additionally conducted, as given in Table 2. The average PSNR and SSIM values were calculated from the luminance channels of the 29 images in the dataset. This demonstrates that the proposed network overall outperformed the existing compressed image restoration algorithms and provided a significant quality improvement over the input degraded images.
In addition, to evaluate the restoration performance of the algorithms on actual sensor noises, we used the smartphone image denoising dataset (SIDD) [26] because smartphone images tend to have notably severe Poisson noises due to the small aperture and sensor size. Figures 8 and 9 show the comparison of the algorithms for two images from the SIDD dataset, Books and Desk, respectively. The two images were acquired using an iPhone 7 with different camera settings, ISO, and exposure time, and they included real camera sensor noises, as shown in Figures 8a and 9a. The JPEG compression of the sensor noises generates compressed Poisson noises, as shown in Figures 8b and 9b. We note that the proposed SCENet alleviated the compressed Poisson noises well, as shown in Figures 8h and 9h, while the results of the existing algorithms still included the noises in the book, the phone, the paper, and the box, as shown in Figures 8c-g and 9c-g. As their original GT images were not available, we conducted an objective comparison using a no-reference image quality metric, the blind/referenceless image spatial quality evaluator (BRISQUE) index [27]. The lower the BRISQUE value, the better the image quality. As expected from the visual results in Figures 8 and 9, SCENet outperformed the existing algorithms on the real noise data in terms of the BRISQUE values.
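For reference, the PSNR used in the full-reference comparisons of Tables 1 and 2 follows the standard definition, sketched below for single-channel images stored as nested lists. An 8-bit intensity range (`max_value=255`) is assumed, and the function name is hypothetical; SSIM and BRISQUE are more involved and are not reproduced here.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_errors = [(ref - res) ** 2
                      for ref_row, res_row in zip(reference, restored)
                      for ref, res in zip(ref_row, res_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR indicates that the restored pixel values are closer to the GT values, which is the opposite direction of the BRISQUE index.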

Conclusions
Compressed Poisson noises are critical and troublesome issues generated in real image coding systems. To address sensor noises effectively, we analyzed the intercorrelation distortion process via our preliminary examination and proposed a new multi-band intercorrelation increment network that exploits the secondary domain instead of the typical spatial domain. Additionally, to increase robustness to the signal-dependent noise characteristics, we designed a layer architecture composed of five operations and trained the network parameters under the challenging image degradation model. The superior performance of the proposed network on three datasets was also validated in terms of both subjective and objective qualities.
