OCT Image Restoration Using Non-Local Deep Image Prior

: In recent years, convolutional neural networks (CNN) have been widely used in image denoising for their high performance. One di ﬃ culty in applying the CNN to medical image denoising such as speckle reduction in the optical coherence tomography (OCT) image is that a large amount of high-quality data is required for training, which is an inherent limitation for OCT despeckling. Recently, deep image prior (DIP) networks have been proposed for image restoration without pre-training since the CNN structures have the intrinsic ability to capture the low-level statistics of a single image. However, the DIP has di ﬃ culty ﬁnding a good balance between maintaining details and suppressing speckle noise. Inspired by DIP, in this paper, a sorted non-local statics which measures the signal autocorrelation in the di ﬀ erences between the constructed image and the input image is proposed for OCT image restoration. By adding the sorted non-local statics as a regularization loss in the DIP learning, more low-level image statistics are captured by CNN networks in the process of OCT image restoration. The experimental results demonstrate the superior performance of the proposed method over other state-of-the-art despeckling methods, in terms of objective metrics and visual quality.


Introduction
Medical images are an important application of digital imaging combined with biology to visually express important medical information. Optical coherence tomography (OCT) is a rapidly evolving imaging model with high resolution, dynamic range, and non-invasive and non-destructive detection, suitable for detecting a variety of biological tissues [1,2]. However, OCT is susceptible to speckle noise during acquisition, resulting in poor imaging results and preventing the physician from making accurate diagnoses [3]. Therefore, OCT image denoising is very important for subsequent medical applications.
In the past few decades, a variety of methods have been developed for reducing the speckle noise in OCT images. These algorithms can be categorized as transformation-based, spatial filtering, and sparse representations. In the transform domain, there are mainly the wavelet transform [4,5], the curvelet transform [6], and the contourlet transform [7]. In the transform domain, the coefficients representing the image are assumed to be distinguishable from the speckle noise by using shrinkage or threshold. However, when the decomposed sub-bands come to recomposition, the transform domain algorithms are prone to producing some artifacts. Spatial filtering techniques are mainly represented by non-local means (NLM) methods [8][9][10][11][12], which have the advantage of the self-similarity of natural images by comparing patches with non-local neighborhoods [8]. Based on the double Gaussian anisotropic kernels [9], the uncorrupted probability of each pixel (PNLM) [10], or the weighted maximum likelihood estimation [11], the non-local means is able to reduce the speckle noise more effectively. Non-local means-based methods take advantage of the self-similarity of images Deep convolutional networks are one of the most commonly-used techniques in image denoising. Traditional deep convolutional networks need to train a large amount of image information, and the training parameters of the deep convolutional network can be determined by: whereθ represents the training parameters, x i label represents the ith original image, z i represents the ith input noisy image, and f represents the deep convolutional network. However, it is difficult to prepare a large amount of high-quality data for the OCT image despeckling. The DIP model indicates that a large amount of training data is not needed, and the deep convolution network itself has the ability to learn the low-dimensional information of noise images. In the DIP [24], the goal of the image restoration process is to recover the original imagex from the corrupted image x. Such problems are often formulated as an optimization task: where E(x; x) is a task-dependent data item and R(x) is a regularizer. Similar to the adaptive dictionary learning, R(x) can be replaced by the implicit prior captured by the deep convolution network, so the DIP model can be written as:θ where the network input z of the DIP network is random noise. The parametersθ which are represented x can be trained by using an optimizer such as gradient descent. In the DIP denoising model, E(x; x) is the reconstruction loss, which measures the error between the constructed image and the input image: Therefore, Equation (3) becomes:θ Figure 1 is the OCT despeckling results of the DIP. Figure 1a is the original noisy OCT image, and Figure 1b-d is the despeckling results after 500, 1000 and 1500 iterations, respectively. It can be seen that when the DIP is trained to reconstruct the OCT image, it generates along the way an intermediate clean version of the image, before overfitting the noise. It can be seen from Figure 1c that the clean version of the deep image prior allows the recovery of most of the image details while containing some speckle noise. In other words, based on the reconstruction loss in Equation (4), the DIP model has difficulty finding a good balance between maintaining details and suppressing speckle noise.

Non-Local Deep Image Prior Model
The convolutional neural networks (CNNs) have the intrinsic ability to solve inverse problems such as denoising without any pre-training. The main reason is that compared with the noise, the convolution layer is better able to recover the signal which has more self-similarity. However, for OCT despeckling, the DIP model has difficulty finding a good balance between maintaining details and suppressing speckle noise. One reason is that the L2 loss in Equation (4) does not distinguish between the differences originating from signal and noise.
The differences between the constructed image and the input image can be defined as: As shown in Figure 2c, in the difference image, the signal shows stronger self-similarity than the noise. A sorted non-local statics is proposed in this section to measure the signal autocorrelation. Minimizing the signal autocorrelation of the difference image can make the network parameters converge more quickly with the OCT signals.
We apply block matching to measure the non-local self-similarity in the difference image by calculating the correlation between the reference patch and each patch in the neighborhood i  . In this paper, the autocorrelation coefficient is applied as the metric: where i d is the reference patch in difference image centered at i , i j  d d is the covariance of i d and Note that the edge details in the difference image are not correlated with all their non-local neighborhood. To solve this problem, a sorted non-local statistic is proposed by sorting and selecting the mth largest autocorrelation coefficient. Based on the non-local statistic, the uncorrelation of difference image is obtained as follows: where It can be seen in Figure 2d, the statistic provides a measure of how independent a patch is to its most similar non-local neighbors in the difference image. The signal uncorrelation of the differences is obtained as follows:

Non-Local Deep Image Prior Model
The convolutional neural networks (CNNs) have the intrinsic ability to solve inverse problems such as denoising without any pre-training. The main reason is that compared with the noise, the convolution layer is better able to recover the signal which has more self-similarity. However, for OCT despeckling, the DIP model has difficulty finding a good balance between maintaining details and suppressing speckle noise. One reason is that the L2 loss in Equation (4) does not distinguish between the differences originating from signal and noise.
The differences between the constructed image and the input image can be defined as: As shown in Figure 2c, in the difference image, the signal shows stronger self-similarity than the noise. A sorted non-local statics is proposed in this section to measure the signal autocorrelation. Minimizing the signal autocorrelation of the difference image can make the network parameters converge more quickly with the OCT signals.
We apply block matching to measure the non-local self-similarity in the difference image by calculating the correlation between the reference patch and each patch in the neighborhood C i . In this paper, the autocorrelation coefficient is applied as the metric: where d i is the reference patch in difference image centered at i, σ d i d j is the covariance of d i and d j , and σ d i and σ d j denote the standard deviation of d i and d j , respectively. Note that the edge details in the difference image are not correlated with all their non-local neighborhood. To solve this problem, a sorted non-local statistic is proposed by sorting and selecting the mth largest autocorrelation coefficient. Based on the non-local statistic, the uncorrelation of difference image is obtained as follows: where c m d i , d j is the mth largest one among c d i , d j . It can be seen in Figure 2d, the statistic provides a measure of how independent a patch is to its most similar non-local neighbors in the difference image. The signal uncorrelation of the differences is obtained as follows: The optimization loss function is therefore: where the L C maximizes the signal uncorrelation or minimizes the signal autocorrelation in the difference image. The weight α is used to balance two terms in the loss function. Therefore, the parameters of the neural network can be trained by: Figure 2e,f shows the denoising results of proposed non-local deep image prior (NLM-DIP) after 450 and 900 iterations, respectively. Compared with the DIP results in Figure 1c,d, it can be seen that the NLM-DIP results are obviously clearer and maintain more details.  (10) where the C L maximizes the signal uncorrelation or minimizes the signal autocorrelation in the difference image. The weight  is used to balance two terms in the loss function. Therefore, the parameters of the neural network can be trained by: Figure 2e,f shows the denoising results of proposed non-local deep image prior (NLM-DIP) after 450 and 900 iterations, respectively. Compared with the DIP results in Figure 1c,d, it can be seen that the NLM-DIP results are obviously clearer and maintain more details.

Network Structure and the NLM-DIP Algorithm
It can be seen from Figure 3 that the network structure employed in this work is based on the same U-Net type [25] 5-layer "hourglass" architecture with skip-connections as DIP. The purpose of using an hourglass network is to repeatedly obtain information contained in images of different scales. The architecture consists of an encoding path (upper side) and a decoding path (lower side). Each scale of the encoding path consists of two 5 × 5 convolution layers, the first convolution layer with stride 2 for down-sampling. Each convolution layer is followed by a batch normalization (BN) and a leaky rectified linear unit (LReLU). Each scale of the decoding path consists of a 5 × 5 convolution layer, a 1 × 1 convolution layer and an up-sampling layer respectively, each convolution layer followed by the BN and the LReLU. As shown in Figure 3, the numbers of convolution kernels in the 5-layer network are 16, 32, 64, 128, and 128, respectively. The parameters of the neural network can be trained by the proposed reconstruction loss in Equation (11), as shown in Figure 3.

Network Structure and the NLM-DIP Algorithm
It can be seen from Figure 3 that the network structure employed in this work is based on the same U-Net type [25] 5-layer "hourglass" architecture with skip-connections as DIP. The purpose of using an hourglass network is to repeatedly obtain information contained in images of different scales. The architecture consists of an encoding path (upper side) and a decoding path (lower side). Each scale of the encoding path consists of two 5 × 5 convolution layers, the first convolution layer with stride 2 for down-sampling. Each convolution layer is followed by a batch normalization (BN) and a leaky rectified linear unit (LReLU). Each scale of the decoding path consists of a 5 × 5 convolution layer, a 1 × 1 convolution layer and an up-sampling layer respectively, each convolution layer followed by the BN and the LReLU. As shown in Figure 3, the numbers of convolution kernels in the 5-layer network are 16, 32, 64, 128, and 128, respectively. The parameters of the neural network can be trained by the proposed reconstruction loss in Equation (11), as shown in Figure 3. Electronics 2020, 9, x FOR PEER REVIEW 6 of 11 Calculate the signal uncorrelation of the differences using (9). Calculate the reconstruction loss using (10). Train the network using (11).

End For Return
In each iteration, the input z are perturbed with an additive normal noise which is used as a noise-based regularization. The proposed sorted non-local static selects 20 l  most correlated patches ( 3 3  ) in the 9 9  neighborhood i  . The  in Equation (10) is used to balance the L1 loss and the proposed signal autocorrelation loss in the reconstruction function. In this paper,  is set to 0.85 according to the experimental results. We set a learning rate of 0.001 and 900 iterations for the network training. Currently the code of the proposed NLM-DIP is not optimized and the network training was done in GPU (NVIDIA GTX1060). The neural network was implemented using PyTorch.
Because the NLM-DIP needs to calculate and sort the non-local correlation coefficients in each iteration, it takes 0.40 s per iteration for a 900 × 450 OCT image, while the DIP takes 0.18 s per iteration. x i = f (θ i |z) Calculate the signal uncorrelation of the differences using (9). Calculate the reconstruction loss using (10). Train the network using (11). End For Returnx = f (θ MaxIt |z) .
In each iteration, the input z are perturbed with an additive normal noise which is used as a noise-based regularization. The proposed sorted non-local static selects l = 20 most correlated patches (3 × 3) in the 9 × 9 neighborhood C i . The α in Equation (10) is used to balance the L1 loss and the proposed signal autocorrelation loss in the reconstruction function. In this paper, α is set to 0.85 according to the experimental results. We set a learning rate of 0.001 and 900 iterations for the network training. Currently the code of the proposed NLM-DIP is not optimized and the network training was done in GPU (NVIDIA GTX1060). The neural network was implemented using PyTorch. Because the NLM-DIP needs to calculate and sort the non-local correlation coefficients in each iteration, it takes 0.40 s per iteration for a 900 × 450 OCT image, while the DIP takes 0.18 s per iteration.

Experimental Results
In this section, the proposed NLM-DIP is compared with TNode [12], SBSDI [15], PNLM [10], and DIP [24] on two types of retinal SDOCT image datasets [15]. The synthetic image dataset contains 18 spectral domain OCT (SDOCT) noise images generated from high resolution images that were then subsampled; each image was obtained by an 840 nm wavelength SDOCT imaging system from Bioptigen with an axial resolution of~4.5 µm per pixel in the tissue. The real image dataset contains 39 SDOCT noise images acquired at a natively low sample rate.

OCT Despeckling Results
In order to facilitate the subjective comparison of the denoising effect, two images were randomly selected from the synthetic and the real image datasets, respectively. The experimental results are shown in Figures 4 and 5. Figures 4a and 5a are the original noisy OCT images. Figures 4b and 5b are the TNode results which contain some noise. Figures 4c and 5c show the SBSDI results. It can be seen that SBSDI allowed the recovery of most of the image information while containing some speckle noise. On the contrary, the PNLM results shown in Figures 4d and 5d removed most speckle noise while losing some details. By contrast, the results of DIP and NLM-DIP are more natural-looking. Compared with DIP, NLM-DIP generally output smoother surfaces in homogeneous regions and maintained layer texture and boundary clarity without introducing artifacts.

Experimental Results
In this section, the proposed NLM-DIP is compared with TNode [12], SBSDI [15], PNLM [10], and DIP [24] on two types of retinal SDOCT image datasets [15]. The synthetic image dataset contains 18 spectral domain OCT (SDOCT) noise images generated from high resolution images that were then subsampled; each image was obtained by an 840 nm wavelength SDOCT imaging system from Bioptigen with an axial resolution of ~4.5 μm per pixel in the tissue. The real image dataset contains 39 SDOCT noise images acquired at a natively low sample rate.

OCT Despeckling Results
In order to facilitate the subjective comparison of the denoising effect, two images were randomly selected from the synthetic and the real image datasets, respectively. The experimental results are shown in Figures 4 and 5. Figures 4a and 5a are the original noisy OCT images. Figures 4b  and 5b are the TNode results which contain some noise. Figures 4c and 5c show the SBSDI results. It can be seen that SBSDI allowed the recovery of most of the image information while containing some speckle noise. On the contrary, the PNLM results shown in Figures 4d and 5d removed most speckle noise while losing some details. By contrast, the results of DIP and NLM-DIP are more naturallooking. Compared with DIP, NLM-DIP generally output smoother surfaces in homogeneous regions and maintained layer texture and boundary clarity without introducing artifacts.   In addition, some interesting areas in the field in the OCT image have been selected in Figures 4 and 5 which are marked with green boxes, and the partial area is enlarged. The results are shown in Figures 6 and 7. The TNode results show that the images containing noise and some local details were not clear. As can be seen from Figures 6c and 7c, the SBSDI algorithm blurred the local layer structure of the image, and the denoised result still contained a slight noise. The PNLM algorithm performed well in low-intensity regions, but the layer structure of the denoised result was unclear. The results suggest the DIP algorithm can be used for OCT denoising, but the layer structure of the image was ambiguous. The NLM-DIP algorithm efficiently removed the speckle noise. In addition, it maintained the details and the layer structure of the image, which was beneficial to further layer segmentation or clinical diagnosis and analysis.
Electronics 2020, 9, x FOR PEER REVIEW 8 of 11 image after PNLM speckle reduction, (e) image after DIP speckle reduction, (f) image after NLM-DIP speckle reduction. In addition, some interesting areas in the field in the OCT image have been selected in Figures 4 and 5 which are marked with green boxes, and the partial area is enlarged. The results are shown in Figures 6 and 7. The TNode results show that the images containing noise and some local details were not clear. As can be seen from Figures 6c and 7c, the SBSDI algorithm blurred the local layer structure of the image, and the denoised result still contained a slight noise. The PNLM algorithm performed well in low-intensity regions, but the layer structure of the denoised result was unclear. The results suggest the DIP algorithm can be used for OCT denoising, but the layer structure of the image was ambiguous. The NLM-DIP algorithm efficiently removed the speckle noise. In addition, it maintained the details and the layer structure of the image, which was beneficial to further layer segmentation or clinical diagnosis and analysis.

Image Quality Metrics
Considering that the experimental images in this chapter have been processed and compared with the clinical images of medical experiments, there is a lack of original images, and it is inconvenient to perform signal-to-noise ratio measurements. In order to make objective quantitative comparisons of various denoising methods, this section uses the average contrast to noise ratio (CNR) and the average equivalent number of looks (ENL) as evaluation measures. CNR measures the contrast level between the selected regions from the object in an image and the background region in the image. It can be calculated as: where m u and m  are the mean and standard deviation of the m th interest foreground region (e.g., the areas in solid rectangles inFigure 4a, b u and b  are the mean and standard deviation of the pixels in the background region (e.g., the dashed rectangle area in Figure 4a ). ENL measures smoothness in areas that should have a homogeneous appearance but are corrupted by speckle: To quantitatively compare different despeckling algorithms, we averaged the CNR and ENL images in the two datasets. It can be seen from Tables 1 and 2 that the NLM-DIP algorithm provided the best despeckling performance in terms of CNR and ENL, compared with the other algorithm.

Image Quality Metrics
Considering that the experimental images in this chapter have been processed and compared with the clinical images of medical experiments, there is a lack of original images, and it is inconvenient to perform signal-to-noise ratio measurements. In order to make objective quantitative comparisons of various denoising methods, this section uses the average contrast to noise ratio (CNR) and the average equivalent number of looks (ENL) as evaluation measures. CNR measures the contrast level between the selected regions from the object in an image and the background region in the image. It can be calculated as: where u m and σ m are the mean and standard deviation of the mth interest foreground region (e.g., the areas in solid rectangles in Figure 4a, u b and σ b are the mean and standard deviation of the pixels in the background region (e.g., the dashed rectangle area in Figure 4a). ENL measures smoothness in areas that should have a homogeneous appearance but are corrupted by speckle: To quantitatively compare different despeckling algorithms, we averaged the CNR and ENL images in the two datasets. It can be seen from Tables 1 and 2 that the NLM-DIP algorithm provided the best despeckling performance in terms of CNR and ENL, compared with the other algorithm.

Conclusions
Current advancements in neural networks show their great applicability for unsupervised signal preprocessing. In this paper, a non-local deep image prior network was proposed for OCT image restoration. By adding a sorted non-local statics as regularization loss in the DIP learning, more low-level image statistics were captured by CNN networks in the process of OCT image restoration. The experimental results show that the proposed NLM-DIP allowed the recovery of most of the OCT signal while getting rid of the speckle noise.