Nonlocal Total Variation Using the First and Second Order Derivatives and Its Application to CT image Reconstruction

We propose a new class of nonlocal Total Variation (TV), in which the first derivative and the second derivative are mixed. Since most existing TV considers only the first-order derivative, it suffers from problems such as staircase artifacts and loss in smooth intensity changes for textures and low-contrast objects, which is a major limitation in improving image quality. The proposed nonlocal TV combines the first and second order derivatives to preserve smooth intensity changes well. Furthermore, to accelerate the iterative algorithm to minimize the cost function using the proposed nonlocal TV, we propose a proximal splitting based on Passty’s framework. We demonstrate that the proposed nonlocal TV method achieves adequate image quality both in sparse-view CT and low-dose CT, through simulation studies using a brain CT image with a very narrow contrast range for which it is rather difficult to preserve smooth intensity changes.


Introduction
Nonlocal Total Variation (TV) [1][2][3][4][5][6] was proposed as an improved version of ordinary TV. Nonlocal TV can use a weighting function (e.g., the weight of nonlocal means filter) by taking the intensity difference between the pixel pair into account, and can obtain higher image quality than the ordinary TV that uses only pairs of adjacent pixels.
H. Kim et al. (2016) [2] applied nonlocal TV to sparse-view CT and showed that nonlocal TV improves image quality over ordinary TV and incorporating the reweighted L1 norm into nonlocal TV further improves tissue contrast and structural details. Following this, K. Kim et al. (2017) [3] applied nonlocal TV to low-dose CT and showed nonlocal TV is effective for low-dose noise (Poisson noise).
D. Lv et al. (2019) [4] proposed a hybrid prior distribution (called NLTG prior) that combines nonlocal TV with Gaussian distribution. Additionally, they showed the possibility of applying NLTG prior to a large class of image reconstruction problems, especially when reference images were available.
However, the most existing nonlocal TV studies [2][3][4] are based on the first-order derivative, and still contain the staircase artifact problem as a potential drawback. Since the first-order derivative is too sensitive to the pixel values, even linear intensity changes are detected as false edges, which leads to staircase artifacts in the same way as local TV does [7][8][9].
On the other hand, higher-order derivatives (the second-order or more) possesses a potential risk that, as the order of differentiation is larger, its ability enhance image edges is smaller leading to an image blurring problem.
In previous studies, to overcome the disadvantage of using only the first-order or only the second-order, Bredies et al. (2010) [10] proposed Total Generalized Variation (TGV) that involves and balances higher-order derivatives and showed impressive results in the denoising problem. After that work, Ranftl et al. (2014) [11] proposed a nonlocal version of TGV.
Our proposed method is also a combined method of the first and second order derivatives. We define the regularization term as a weighted sum of two terms, where the first term is the ordinary nonlocal TV based on the first-order and the second term is based on the second-order. Additionally, in this paper, we introduce a newly discovered idea called nonlocal Total K-Split Variation, which is based on the second-order derivative using nonlocal regularization. This idea allows us to preserve smooth intensity changes well.
Furthermore, to accelerate the iterative algorithm associated with the proposed nonlocal TV, we propose a specially designed proximal splitting based on Passty's framework. In this proximal splitting, as the number of dividing the cost function into small subfunctions increases, the faster convergence is achieved [12][13][14]. The structure of final iterative algorithm is row-action type with respect to both the data-fidelity term and the regularization term [15][16][17], which converges to a minimizer very quickly [18][19][20][21][22].
Finally, in order to demonstrate the performances of combining the first and second order derivatives in reconstructed images, we use a brain CT image, in which a very narrow contrast range is used to display the image, where it is very difficult to preserve smooth intensity changes. Also, simulation studies are performed for both the sparse-view CT and the low-dose CT. We demonstrate that the proposed nonlocal TV method achieves adequate image quality within a small number of iterations.

Problem Definition
We define the following unconstrained cost function: is the data-fidelity term, and u → x = βω W → x 1 1 is the regularization term, and A = a ij is the I × J system matrix, β is the hyper-parameter to control regularization strength, and ω is the weight of regularization term, and W is the sparsifying transform to make W → x sparse. Image reconstruction is an inverse problem to recover the image of attenuation coefficients In the sparse-view CT [23,24], by using the projection data corresponding to less than 100 directions (the conventional CT uses 1000-2000 projection data), the equation A → x = → b becomes severely underdetermined, i.e., the dimension J of unknowns → x is larger than the dimension I of measurements → b (I < J). In this case, the regularization term acts to avoid the ill-posed problem by introducing the prior knowledge that most components of the vector W → x are close to 0.
On the other hand, in the low-dose CT, the equation In this case, the regularization term helps to reduce the effect of Poisson noise → n by a smoothing. First, we begin by the modified anisotropic nonlocal TV based on the first-order derivative expressed as where x j is the intensity in the pixel j, and x j is the intensity in a distant pixel j , and ω jj is the weight of smoothing assigned for each pixel pair x j , x j . Next, we describe a newly discovered regularization term based on the second-order derivative called nonlocal Total K-split Variation (TKV). The main idea is to consider more derivatives around x j and x j as where x j k is the adjacent pixel value of x j , x j k is the adjacent pixel value of x j . We remark that the TKV term is divided into a sum of eight terms dependent of the direction to take the pixel difference as shown in Figure 1.
Sensors 2020, 20, x FOR PEER REVIEW 3 of 17 Next, we describe a newly discovered regularization term based on the second-order derivative called nonlocal Total K-split Variation (TKV). The main idea is to consider more derivatives around and as where is the adjacent pixel value of , is the adjacent pixel value of . We remark that the TKV term is divided into a sum of eight terms dependent of the direction to take the pixel difference as shown in Figure 1.
In the proposed method, we assume that the reconstructed image ⃗ is close to piecewisepolynomial of first order. Under this assumption, our proposed regularization term is designed to include both the first and second order derivatives as where the proposed regularization term is a combination of two terms ( ( ⃗) = ( ⃗) + ( ⃗)).
Additionally is the trade-off parameter between nonlocal TV and TKV. If is large, the reconstructed image becomes closer to that of nonlocal TV. If is small, the reconstructed image becomes closer to that of nonlocal TKV. To match the strength of the first term into that of the second term, we divide the second-order derivative based nonlocal TKV into 8 directions. Figure 1 shows the location of pixels appearing in ( ⃗) and ( ⃗) for = 1, 2, 3, ⋯ , 7, 8.

Accelerated Algorithm Using the Proximal Splitting with Passty's Framework
To accelerate the iterative algorithm to minimize the cost function, we propose a specially designed proximal splitting with Passty's framework. The ordinary proximal splitting is able to minimize a cost function consisting of a sum of only two component terms, where the proximity operator corresponding to each subfunction is applied alternately [12,13]. On the other hand, by using Passty's framework which is not well-known in the image reconstruction community, a cost function can be divided into a number of much simpler subfunctions [14]. This results in considerable benefits for optimization problems appearing in CT reconstruction. By applying the proximity operator In the proposed method, we assume that the reconstructed image → x is close to piecewise-polynomial of first order. Under this assumption, our proposed regularization term is designed to include both the first and second order derivatives as where the proposed regularization term is a combination of two terms (u . Additionally t is the trade-off parameter between nonlocal TV and TKV. If t is large, the reconstructed image becomes closer to that of nonlocal TV. If t is small, the reconstructed image becomes closer to that of nonlocal TKV. To match the strength of the first term into that of the second term, we divide the Sensors 2020, 20, 3494 4 of 18 second-order derivative based nonlocal TKV into 8 directions. Figure 1 shows the location of pixels appearing in u TV → x and u TKV → x for k = 1, 2, 3, · · · , 7, 8.

Accelerated Algorithm Using the Proximal Splitting with Passty's Framework
To accelerate the iterative algorithm to minimize the cost function, we propose a specially designed proximal splitting with Passty's framework. The ordinary proximal splitting is able to minimize a cost function consisting of a sum of only two component terms, where the proximity operator corresponding to each subfunction is applied alternately [12,13]. On the other hand, by using Passty's framework which is not well-known in the image reconstruction community, a cost function can be divided into a number of much simpler subfunctions [14]. This results in considerable benefits for optimization problems appearing in CT reconstruction. By applying the proximity operator corresponding to each subfunction alternately, the proposed algorithm possesses a form of row-action type [15][16][17], which converges to a minimizer very quickly [18][19][20][21][22].
We begin by a brief review the proximal splitting with Passty's framework. Let us consider a convex minimization problem formulated as where J → x is a lower semi-continuous (lsc) convex function.
The proximity operator corresponding to the function J → x is defined as where α is the parameter called step-size. We note that J → x can be a non-differentiable function such as component terms of TV or nonlocal TV. The proximity operator is a non-expansive mapping such that its fixed-points (i.e., proximal algorithm). Next, we are going to explain the proximal splitting with Passty's framework.
[Passty's framework] Let us consider the case where J → x can be divided into a sum of subfunctions as: The iterative algorithm can be constructed by applying the proximity operator corresponding to each subfunction J i Furthermore, let us consider the case where J i → x is a sum of two subfunctions, like our cost function including data-fidelity term and regularization term as Sensors 2020, 20, 3494

of 18
The update can be constructed by applying two operators corresponding to each subfunction alternately as below Finally, we show in Algorithm 1 the optimization model applied to this paper: nsors 2020, 20, x; doi: FOR PEER REVIEW www.mdpi.com/journal/sensors Algorithm 1: Proximal splitting with Passty's framework Give an initial vector ⃗ ( , ) . Execute the following.
( is the number of projection data) 1) Update the data term by Equation (15) 2) Calculate the weight from ⃗ ( , ) ; 3) Update the TV term by Equation (22); 4) Update the TKV term by Equation (25); In Passty's framework, by increasing the number to divide the cost function into smaller subfunctions, the better convergence can be expected. In this paper, this division is performed as follows. First, the data-fidelity term is divided as shown in Equation (7) in such a way that each subfunction f i → x contains only a single term corresponding to projection data b i . Therefore, the final algorithm can be designed in the form of a row-action type algorithm such as ART method.
We mention that u → x can also be divided into a sum of many subfunctions. In this paper, we divide u → x as finely as possible similarly to the case of the data-fidelity term. This idea leads to a significant benefit to simplify the processing of nonlocal TV+TKV term as well as improving the convergence speed. With respect to the regularization term u → x , we perform a division described in Section 2.3.2.

Optimization
In this section, we focus on how to divide the cost function and how to derive the resulting iterative algorithm.

Update the Data-Fidelity Term
The data-fidelity term can be divided into I subfunctions as below

Example conversion error
Example of correct conversion Incorrect of vector symbols (11) where I is the number of projection data, and → a i is i-th row vector of the system matrix A, and b i is i-th component of projection data. Furthermore, we note that f i → x is a subfunction corresponding to the data-fidelity term.
The minimization problem for the data-fidelity term can be defined as The Lagrange function can be defined by where λ is the Lagrange multiplier called the dual variable. Finally, by optimizing the Lagrange function, we obtain the following expression of row-action type iteration.
, (α 0 : initial value of step size, ε : deceleration rate of step-size, R [I] : ordered subsets (15) where α (n) is the step-size parameter to control the convergence, and n is the number of main iterations. After updating all elements of projection data, n is increased by 1. The step-size α (n) is diminished gradually to zero as the iteration proceeds (i.e. diminishing step-size rule). In Passty's framework, it is known that this diminishing contributes to ensuring an exact convergence to a minimizer, thereby, avoiding the so-called limit cycle problem. Furthermore, we introduce a random-access order of projection data R [I] to enable a fast convergence within 20-30 iterations [19,20]. The mathematical detail to derive the update formula in Equation (15) has been already described in our previous studies [19][20][21][22].

Update the Regularization Term
We describe how to divide the regularization term. The modified anisotropic nonlocal TV possesses the following structure and is called L1 based group LASSO here. The L1 based group LASSO possesses a very simple structure, where each absolute value term can be considered a group element.
where g is the group index, and the group index g corresponds to the pixel index j, and the TV term can be divided into G groups (G subfunctions). Furthermore, the group itself becomes a subfunction u TV g → x .
In the case of the TKV term, it can be divided into (G × 8) subfunctions as below The detailed structure of the TV term is shown in Figure 2. Among the elements of the group, the pixel j is common and distant pixel j has different values for each other.
In the case of the TKV term, it can be divided into ( × 8) subfunctions as below The detailed structure of the TV term is shown in Figure 2. Among the elements of the group, the pixel is common and distant pixel ′ has different values for each other. The sequential update is related to raster scanning. In the TV term, when assuming that and are updated simultaneously, is updated sequentially times, and is updated once respectively. In the case of TKV term, when assuming that , , , are updated simultaneously, is updated sequentially ( × 8) times, and is updated sequentially eight times. Following this, and (the adjacent pixels) are updated once respectively. [Update the TV term] First, we consider updating the pixel . The proximity operator for a subfunction can be defined as follows where ( ) is the current updated solution as a constant approximation, which is the image updated from the data-fidelity term. The above L1 norm minimization problems can be solved with the following soft-thresholding.
where ∇ , is the soft-thresholding function. The solution ( ∇ , ) corresponding to otherwise is simply ( ) .
For better convergence, although it is possible to update only pixel , we update the pixel and ′ simultaneously. The sequential update is related to raster scanning. In the TV term, when assuming that x j and x j are updated simultaneously, x j is updated sequentially J times, and x j is updated once respectively.
In the case of TKV term, when assuming that x j , x j k , x j , x j k are updated simultaneously, x j is updated sequentially (J × 8) times, and x j is updated sequentially eight times. Following this, x j k and x j k (the adjacent pixels) are updated once respectively.
[Update the TV term] First, we consider updating the pixel j.
The proximity operator for a subfunction can be defined as follows where x (n) j is the current updated solution as a constant approximation, which is the image updated from the data-fidelity term. The above L1 norm minimization problems can be solved with the following soft-thresholding.
For better convergence, although it is possible to update only pixel j, we update the pixel j and j simultaneously. For updating the pixel j and j , we further divide a subfunction u TV g → x into two subfunctions as below where The proximity operator for each subfunction (prox ) can be defined as follows where x (n,j ) j and x (n) j are the current updated solution as a constant approximation, which is the image updated from the data-fidelity term.
Finally, the above L1 norm minimization problems can be solved with the following soft-thresholding.
where S ∇ x (n,j ) j and S ∇ x

[Update the TKV term]
We update the pixel j, j k , j , j k simultaneously. For updating the pixel j, j k , j , j k , we further divide a subfunction u TKV g → x into four subfunctions as below where The proximity operator for each subfunction (prox α where are the current updated solution as a constant approximation, which is the image updated from the TV term.
Finally, the above L1 norm minimization problems can be solved with the following soft-thresholding.
are soft-thresholding function.

The Weight
In this paper, we used the weight of nonlocal means filter [25] as where B(x j ) − B(x j ) 2 2 means the average Euclidean distance between patches (B x j , B x j ) centered in an interest pixel x j and a distant pixel x j .
Theoretically, the weight must be a fixed value as a hyper-parameter of the regularization term. However, there have been previous studies showing that reweighting at each iteration contributes to better image quality [2,9]. Additionally, the larger the size of weight (Ω), the better the performance of removing artifacts or noise. As long as the computer is capable of processing, we recommend increasing the size of the weight. However, if the size of the weight is too large, the calculation cost will increase enormously as compared with the image quality improvement. Therefore, it is important to decide the size of the weight while paying attention to cost performance considering the level of artifacts or noise. Figure 3 shows the cost performance of changing the size of weight.
Sensors 2020, 20, x FOR PEER REVIEW 9 of 17 will increase enormously as compared with the image quality improvement. Therefore, it is important to decide the size of the weight while paying attention to cost performance considering the level of artifacts or noise. Figure 3 shows the cost performance of changing the size of weight.  Next, we describe the proper estimation of the weight. If the ground truth image is known, the weight can be calculated from the ground truth image. However, in the image reconstruction, only the projection data is given and the reconstructed image and the weight must be estimated Next, we describe the proper estimation of the weight. If the ground truth image is known, the weight can be calculated from the ground truth image. However, in the image reconstruction, only the projection data is given and the reconstructed image and the weight must be estimated simultaneously from the projection data. Therefore, we construct the optimization by alternating the estimation of the reconstructed image and the weight.
We show the reweighting process of optimization including the data-fidelity term and regularization term:  (27) where the weight is calculated once as common to nonlocal TV and TKV and the weight is calculated from the image updated from the data-fidelity term. The span parameter S determines how often regularization is performed in the main iteration n. If S is small, many regularization updates are performed in one iteration. Theoretically, the smaller the S, the more accurate the convergence. However, since nonlocal TV has a large amount of computational complexity, it is desirable to determine an appropriate value of S. Figure 4 shows how the size of weight (Ω) influences image quality and computation time.

Experimental Results
We performed simulation studies using a brain CT image. The reason behind using the brain image is as follows. In the brain CT imaging, the reconstructed images are shown with a compressed gray scale range much larger compared to the other CT imaging like chest imaging or abdominal CT imaging. Therefore, preserving smooth intensity changes and avoiding the staircase artifacts are much important in the brain case. Additionally, simulation studies were performed for both the sparse-view CT (the number of projection data was 64) and the low-dose CT (the number of photons was 3 × 10 ). The reconstructed image consisted of 512 × 512 pixels, where the pixel size was 0.0585 (cm 2 ). We compressed the range showing the reconstructed images to [7.82, 62.30] HU, where this contrast range was determined based on the contrast range used in clinical brain CT imaging. To

Experimental Results
We performed simulation studies using a brain CT image. The reason behind using the brain image is as follows. In the brain CT imaging, the reconstructed images are shown with a compressed gray scale range much larger compared to the other CT imaging like chest imaging or abdominal CT imaging. Therefore, preserving smooth intensity changes and avoiding the staircase artifacts are much important in the brain case. Additionally, simulation studies were performed for both the sparse-view CT (the number of projection data was 64) and the low-dose CT (the number of photons was 3 × 10 6 ). The reconstructed image consisted of 512 × 512 pixels, where the pixel size was 0.0585 (cm 2 ). We compressed the range showing the reconstructed images to [7.82, 62.30] HU, where this contrast range was determined based on the contrast range used in clinical brain CT imaging. To evaluate image quality, standard RMSE, PSNR, SSIM values were used as metrics. The number of iterations in image reconstructions was 20 for nonlocal TV, TKV, and TV+TKV, which was determined by the fact that changes in image were small enough with this iteration number. We also showed the reconstructed images by the standard Filtered Back-Projection (FBP), and differences in image quality by changing values of the hyper-parameter t (i.e., the trade-off parameter between nonlocal TV (first derivative) term and the TKV (second derivative) term). The ground truth image and the FBP reconstructions are shown in Figure 5. The reconstructed images in the case of sparse-view CT are shown in Figure 6. In Figure 7, we show the used brain CT image with three display gray-scale ranges, from which we observe that the staircase artifacts are severe when the range of display gray-scale range is small. The reconstructed images in the case of low-dose CT are shown in Figure 8. In Figures 9-11, we show convergence properties of our iterative algorithm based on Passty's proximal splitting framework. In Figures 12  and 13, to show the effect of acceleration by Passty's proximal splitting, we incorporated the TV+TKV term into SIRT (simultaneous iterative reconstruction technique) which is a non-row-action method (a type of the standard iterative algorithm) and compared this with row-action based on our proposed nonlocal TV+TKV. From these figures, it can be observed that our algorithm converged very quickly. It is well-known that the standard iterative algorithms such as Chambolle-Pock [26] and proximal gradient algorithms require several hundreds of iteration up to the convergence. The benefit of our iterative algorithm mainly originates from the fact that our algorithm is of row-action type.

Discussion
The experimental results Table 1 showed that the reconstructed image by nonlocal TV+TKV was closest to the ground truth image, with good RMSE, PSNR and SSIM values. Furthermore, no isolated points caused by outliers of the soft-thresholding, which often appear in the TV-class reconstruction methods, were visible. In the sparse-view CT, the result of nonlocal TV ( = 1.0) using only the firstorder derivative was of very high-contrast, but the staircase artifacts appeared in the smooth intensity changes. In other words, the region with small intensity changes was over-smoother by the regularization as if the oil-painting. In the low-dose CT, the result of nonlocal TV showed isolated

Discussion
The experimental results Table 1 showed that the reconstructed image by nonlocal TV+TKV was closest to the ground truth image, with good RMSE, PSNR and SSIM values. Furthermore, no isolated points caused by outliers of the soft-thresholding, which often appear in the TV-class reconstruction methods, were visible. In the sparse-view CT, the result of nonlocal TV (t = 1.0) using only the first-order derivative was of very high-contrast, but the staircase artifacts appeared in the smooth intensity changes. In other words, the region with small intensity changes was over-smoother by the regularization as if the oil-painting. In the low-dose CT, the result of nonlocal TV showed isolated points in the region closer to the center of the image. In both the sparse-view CT and the low-dose CT, nonlocal TKV (t = 0.0) using only the second-order derivative was able to preserve fine soft tissues well including textures and low-contrast objects, however, it suffered from the blurring in the edge parts. This is because, by using a large weight in the second-order derivative, the threshold value (τ TKV ) of the soft-thresholding operation becomes very small (i.e., τ TV τ TKV ).

Conclusions
In this paper, we proposed a new concept in nonlocal TV, in which the first and second order derivatives are combined in the regularization term. By combining the two terms, we were able to compensate for each other's weaknesses, i.e., staircase artifact and loss in smooth intensity changes in the first derivative and image blurring in the second derivative. Furthermore, we proposed a specially designed proximal splitting algorithm that is based on Passty's framework. The key idea is to split the original cost function to minimize as finely as possible to accelerate convergence and simplify necessary computations. This allows as to make the final iterative algorithm into a form of row-action type, which is known to converge very quickly compared to other standards such as Chambolle-Pock algorithm and proximal gradient. In our experiments, we experimentally confirmed that the proposed algorithm converges within 20 iterations even for the case of brain CT imaging in which the requirement of image contrast is very severe. The simulation results with the brain CT image were performed for both the sparse-view CT and the low-dose CT. We showed that our proposed algorithm works well in practice.
As future work, our proposed nonlocal TV can be compared with the latest technology e.g., deep leaning [27,28] or other applied methods such as low-rank minimization [29,30].
Recently, image reconstruction methods using deep learning have been actively investigated. Our proposed method can be compared with existing deep leaning [27,28] as advanced compressed sensing. Additionally, our proposed method can be applied to low-rank TV, which can improve image quality by combing low-rank minimization and Total Variation [29,30].