Extra Proximal-Gradient Network with Learned Regularization for Image Compressive Sensing Reconstruction

Learned optimization algorithms are promising approaches to inverse problems by leveraging advanced numerical optimization schemes and deep neural network techniques in machine learning. In this paper, we propose a novel deep neural network architecture imitating an extra proximal gradient algorithm to solve a general class of inverse problems with a focus on applications in image reconstruction. The proposed network features learned regularization that incorporates adaptive sparsification mappings, robust shrinkage selections, and nonlocal operators to improve solution quality. Numerical results demonstrate the improved efficiency and accuracy of the proposed network over several state-of-the-art methods on a variety of test problems.


Introduction
Recent years have witnessed the substantial success of deep neural networks (DNN) in a large variety of real-world applications [1][2][3][4][5][6][7][8][9]. Equipped with proven expressive power, DNNs can be used to approximate highly complicated functions provided a sufficient amount of data [10]. However, training DNNs as end-to-end black-boxes can be extremely data demanding, rendering DNNs difficult to interpret, generalize, and sensitive to noise and outliers. To overcome these issues, learned optimization algorithms (LOAs) have started to gain attention as they are designed to combine the interpretable mechanism of optimization algorithms and the expressive power of DNNs. One of the most important applications of LOAs is solving the inverse problem of general form min x f (x; y) + g(x), where f is the data fidelity term determined by the data formation and noise distribution that relate the target solution x and the given measurement data y, and g is the critical (possibly nonconvex) regularization term that promotes the desired solution x, as f is often underdetermined and the data y can be incomplete and noisy. In classical approaches to inverse problems, the regularization g is often handcrafted based on human heuristics and limited experience, which can be overly simplified and not capable to capture the intrinsic complex features of the solution. LOAs, on the other hand, allow the regularization g to be learned from training data and hence can result in significant improvement over the handcrafted regularizations. Our goal in this paper is to propose an efficient extra proximal gradient algorithm that employs the Nesterov's acceleration technique and the extra gradient scheme, and unroll this algorithm into a deep neural network called the extra proximal gradient network (EPGN) to solve a class of inverse problems (1). Motivated by the least absolute shrinkage and selection operator (LASSO) [11][12][13], our EPGN implicitly adopts an l 1 -type regularization in (1) with a nonlinear sparsification mapping learned from data. The proximal operator of this regularization is elaborated by several linear convolutions, nonlinear activation functions, and shrinkage operations for robust sparse feature selection in EPGN. As our focus application is in image reconstruction, we also incorporate a nonlocal feature selection component into the learned regularization to leverage similar patterns within images and improve reconstruction quality. The proposed EPGN combines the advantages of the accelerated extra gradient scheme, the sparsity promoting nonlinear transforms, and the nonlocal feature selections. As a consequence, our EPGN is efficient, robust, and accurate in a variety of image reconstruction problems as demonstrated by the numerical experiments.

Related Work
One of the early LOAs is the learned iterative shrinkage thresholding algorithm (LISTA) for solving l 1 regularized linear inversion [14]. LISTA maps the standard ISTA optimization algorithm to a recurrent neural network (RNN) with certain layer weights learned from training data to improve the performance. The asymptotic linear convergence rate for LISTA is established in [15,16]. Several variations of LISTA are proposed for image reconstruction with regularizations based on low rank or group sparsity [17], l 0 minimization [18], and learned approximate message passing [19]. These LOA methods employ handcrafted regularizations and require a closed-form solution of the proximal operator of the regularization term. The idea of LISTA is also extended to solve composite problems with linear constraints, called differentiable linearized alternating direction method of multipliers (D-LADMM) [20], which exhibits an asymptotic linear convergence rate.
To learn more general and adaptive regularization function in (1), the other group of LOAs is proposed to solve inverse problem (1) with learnable regularization. A straightforward approach in this group uses deep convolutional neural network (CNN), denoted by h k (·), to replace the proximal operator prox α k g [21] of the unknown regularization term g in the proximal gradient update: where α k > 0 is the step size in the kth iteration, b k := x k − α k ∇ f (x k ; y), and prox g (·) is defined by Therefore, one avoids explicit formation of the regularization g, but creates a neural network with prescribed K phases, where each phase mimics one iteration of the proximal gradient method such as (2) to compute b k as above and x k = h k (b k ). The CNN h k can also be cast as a residual network (ResNet) [22] to represent the discrepancy between b k and the improved x k [23]. Such a paradigm is also embedded into half quadratic splitting [23], ADMM [24], and primal dual methods [25] to replace the proximal operator in the subproblems. To improve over the generic black-box CNNs above, several LOA methods are proposed to unroll numerical optimization algorithms such as deep neural networks so as to preserve their efficient structures with proven efficiency, such as the ADMM-Net [26] and ISTA-Net [27]. These methods also prescribe the phase number K and map each iteration of the corresponding numerical algorithm to one phase of the network, and learn specific components of the network using training data.

Extra Proximal Gradient Network
In this section, we propose a novel parameter-efficient deep neural network architecture to solve the inverse problem (1) with regularization learned from data. To this end, we first introduce the accelerated extra proximal gradient algorithm that combines Nesterov's acceleration technique and the extra proximal gradient update in Section 3.1. In Section 3.2, we mimic this algorithm to construct the proposed extra proximal gradient network (EPGN), where Nesterov's acceleration step corresponds to a simple linear combination layer in EPGN to boost convergence, and the extra proximal gradient structure induces a predictor-corrector update scheme with efficient utilization of network parame-ters in EPGN. For the image reconstruction applications considered in our experiment part, we integrate mixing layers into EPGN to combine the local and nonlocal image features for enhanced reconstruction quality in Section 3.3. Additional details of the EPGN training process are provided in Section 3.4.

Extra Proximal Gradient Algorithm
The extra gradient method proposed in the seminal work [28] has attracted significant interest in optimization in recent years. It has been extended to solve variational inequality problems [29] and convex/nonconvex composite optimization problems [30] with theoretical performance guaranteed. Extra gradient algorithms use an additional gradient step in a first-order optimization algorithm to improve the convergence results. This can also be interpreted as a predictor-corrector scheme to speed up convergence. The following two variants of the original extra gradient algorithm are closely related to our proposed extra proximal gradient algorithm. The first one is the extended extra gradient method in [30], which uses extra proximal gradient steps at each iteration to solve nonconvex composite minimization problem (1) by The second one is the convex accelerated extra gradient algorithm developed in [31], which integrates Nesterov's accelerated gradient method for smooth convex optimization [32] into the extra gradient scheme. Different from the classical extra gradient method, this algorithm evaluates gradients in both steps at an interpolation of the previous two iterates rather than the previous iterate only. Recall that Nesterov's acceleration technique [32] for minimizing smooth convex function f is given by which performs a momentum structure (5a) to improve the convergence rate of standard gradient methods. For nonconvex problems, a monitor mechanism that tunes γ k adaptively can be introduced to remedy convergence issue [33]. Motivated by this acceleration technique, we propose to combine (4) and (5) and introduce the accelerated extra proximal gradient updating scheme summarized in Algorithm 1 to solve inverse problems of form (1). In Algorithm 1, α k and β k are step sizes, and γ k is the momentum coefficient in the kth iteration.
Algorithm 1: Accelerated Extra Proximal Gradient Algorithm. Input: Data y and initialization

Extra Proximal Gradient Network (EPGN)
We now cast Algorithm 1 as an LOA by mapping its iterations to the phases of a deep neural network. To this end, we select a phase number K (value to be specified in our experiment), and construct a deep neural network with K phases where each phase performs the updates described in (6). More specifically, we retain the same updates (6a), (6b), (6d) and (6e) in the kth phase of the network. As a result, (6a) and (6d) are simple linear combination layers to integrate the momentum term for acceleration, and (6b) and (6e) are gradient updates for improved fitting to the data. The parameters α k , β k , γ k are all to be learned for every phase k (we set α k = β k for simplicity). The remaining updates (6c) and (6f) are replaced by a robust implicit ResNet-type update (we will make it explicitly computable later): where l = 1/2, 1, and the residual mapping r k plays a critical role of regularization that improves the quality of output x k+l in each phase k. In this paper, we parameterize r k as a composition of two nonlinear mappings, denoted by G k and G k , such that: In the remainder of this subsection, we show the details of the CNN structures of these two nonlinear mappings G k and G k and how to make the implicit residual update (7) explicit by leveraging the robust shrinkage selection operator.

Nonlinear Feature Extraction Operator G k
We parametrize the nonlinear operator G k as a multilayer convolutional network of the following structure: where D k and A k are two linear convolutional operations that generate and convolve the local features of the input x, σ is a nonlinear activation function set to the rectified linear unit (ReLU) (i.e., σ(x) = max(x, 0)), and B k is another linear convolution that fuses the activated local features. All the linear mappings A k , B k , and D k are realized as 3 × 3 convolutions. Hence, the size of the receptive field (RF) [34] of G k is 7 × 7. The purpose of G k is to extract the main features of its input, such that these features can be easily refined by a robust feature selector. To this end, we employ the soft shrinkage selection operator in LASSO, which is the proximal operator of the l 1 norm and proved to be effective in selecting sparse outstanding features and suppressing noises of its input. More specifically, we consider G k (b) as the features prepared to be further pruned by shrinkage (as b is obtained by direct gradient update (6b) and (6e) which may contain undesired artifacts, and hence the features G k (b) need further refinement), and G k (x) as the refined feature obtained by pruning G k (b) using shrinkage. In other words, where the shrinkage threshold θ k > 0 is also to be learned with A k , B k , and D k . Note that the component-wise shrinkage operator S k in (10) has a closed form solution as To further increase our network capacity, we set the convolution B k in G k to contain N f kernels (N f is set to 32 by default), each of size 3 × 3 in our implementation, hence G k (x) has N f channels at each pixel of x. The shrinkage operator S k in (10) is applied channel-wise with varying θ k,j where j = 1, . . . , N f . Hence, the learnable parameters of G k include one convolution D k with N f kernels of size 3 × 3 and convolutions A k and B k with N f kernels of size 3 × 3 × N f , and those of S k are the shrinkage thresholds θ k = {θ k,j : j ∈ [N f ]}.

Nonlinear Residual Resembling Operator G k
Based on (8), the purpose of the nonlinear operator G is to resemble the residual term using the refined feature G k (x), we can interpret G k and G k respectively as encoder and decoder in a symmetric form. More specifically, we parametrize G k (x) as D k A k σ( B k x), where A k , B k , and D k are all 3 × 3 convolutional operators, and D k compresses the N f channels back to 1 channel according to D k in implementation.
Combining the parametrized nonlinear operators G k and G k and the shrinkage operator S k into (8), we obtain an explicit update rule of (7) given by This update rule is employed in (6c) and (6f) for l = 1/2, 1 respectively in the kth phase.
To summarize, our proposed extra proximal gradient network (EPGN) of a prescribed K phases is constructed by unrolling Algorithm 1, where each phase executes (6) but with (6c) and (6f) substituted by (11). The flowchart of the computation in EPGN is shown in Figure 1. The proposed EPGN not only inherits the advantages of Algorithm 1 but also employs learnable feature selection operations. Hence, EPGN combines the following properties: (i) the simple linear momentum layers (6a) and (6d) for improved convergence; (ii) the extra proximal gradient updates (6b), (6e), and (11) mimic the predictor-corrector scheme for parameter-efficient network structure; (iii) learnable feature extraction, selection, and residual resembling operators (G k , G k , S k ) in (11) to effectively improve solution quality as the input data flows through EPGN.

EPGN with Nonlocal Operator (NL-EPGN)
Nonlocal methods have proven effective for image reconstruction problems, such as in variational methods [35] and nonvariational approaches such as the notable BM3D algorithm [36]. Nonlocal operators can significantly improve image quality as they use image patches located in different regions to exploit the inherent self-similarity of images. Recently, the success of the nonlocal methods has motivated the investigation of the architecture of DNNs that have the ability to capture long-distance dependencies of the image. The deep network architecture for gray-scale and color image denoising in [37] is inspired by the projected gradient algorithm for solving a common variational image restoration model with a learnable nonlocal regularization. The nonlocal neural network proposed in [38] can be viewed as a generalization of the classical nonlocal mean in [35] that computes the response at a position as a weighted average of the image intensities at all positions. The weights implicitly depend on the feature maps in the patches with the size determined by the receptive fields.
To exploit repeated features in images for enhanced reconstruction quality, we adopt the idea in [38]. However, unlike [38] which only relies on nonlocal features, our NL-EPGN fuses local and nonlocal features of images using a combination operator learned through training data. More specifically, we propose NL-EPGN to integrate a nonlocal operator N k into the residual operator in (11), so that the features refined by the shrinkage operation S k can be passed to N k to leverage nonlocal features in images: The operator N k contains two main components: a nonlocal block M k that extracts nonlocal features of the input, and a nonlinear layer that combines the local and nonlocal features. The details of these two components are given as follows.
where the function ϕ computes a representation of the input signal at position j, and w ij is the normalized weight depending on the similarity between [z] i and [z] j . The mapping ϕ corresponds to a learnable matrix W ϕ (implemented as 1 × 1 convolution). The weights are computed by embedded Gaussian: where both W α and W β are implemented as N f /2 convolutional filters of kernel size 1 × 1.
We employ the bottleneck structure to reduce computation [38]. Hence, the nonlocal block

Local and Nonlocal Combination Layer
We propose to use a learnable combination layer of form σ(C k [z, v]) to merge the input local feature z and nonlocal feature v obtained by nonlocal block M k in (15). That is, the nonlocal operator N k is defined by In the kth phase, the inputs of N k are z = S k • G k (b k+l ) for l = 1/2, 1 as shown in (12), [·, ·] stands for the concatenation operator at each pixel, and C k corresponds to a set of learnable weight vectors which project the concatenated vector to a scalar at each pixel (implemented as 1 × 1 convolution). The flowchart of the nonlocal operator is shown in Figure 2.

Network Training
As discussed above, the proposed EPGN (or NL-EPGN) consists of K phases, where each phase imitates one iteration in the accelerated extra gradient Algorithm 1 with proximal steps (6b) and (6e) replaced by (11) (or (12) for NL-EPGN). The flowchart of variables in the kth phase is shown in Figure 3. The parameters in the kth phase are collectively denoted by Θ k , which includes the feature extraction operator , the momentum coefficient γ k , and the shrinkage thresholds θ k . Let Θ = {Θ k : 0 ≤ k ≤ K − 1} be the set of all network parameters. Then, given N training data pairs of form {(x (i) , y (i) ) : 1 ≤ i ≤ N}, where y (i) is the input measurement data and x (i) is the corresponding ground truth of the ith pair, we define the loss function of Θ as where x K (y; Θ) denotes the output of the EPGN (i.e., the output of the last, Kth phase) parametrized by Θ given input data y. The optimal network parameter Θ * is obtained by minimizing the loss function (17) in the training process. After training, the EPGN with parameter Θ * serves as a feed-forward neural network that can reconstruct high-quality image x given new measurement data y on the fly.

Numerical Experiments
In this section, we evaluate the performance of the proposed EPGN and NL-EPGN on several inverse problems in imaging reconstruction applications. We focus on the reconstruction problem in compressive sensing in our experiments, however, the proposed method can be easily adapted to other image reconstruction problems by changing the data-fidelity term accordingly. All the experiments are implemented, trained, and tested in the TensorFlow framework [39] on a desktop with an Nvidia GTX-1080Ti GPU and 11 GB of graphics card memory (NVIDIA Corporation, Santa Clara, CA, USA).In all tests, the network parameters Θ of EPGN/NL-EPGN are initialized using the Xavier method [40] and trained with the Adam optimizer [41] with learning rate 1 × 10 −4 for 200 epochs. To evaluate the reconstruction quality, we use the average peak signal-to-noise ratio (PSNR).

Nature Images Compressive Sensing
We first test EPGN on the compressive sensing (CS) image reconstruction problem. In our experiment we use the 91 Images dataset for training and Set11 for testing [42]. For a fair comparison, we follow the same data preparation and result evaluation procedures in [27]. The ground truth data {x (i) : 1 ≤ i ≤ N} contains N = 88, 912 image patches with luminance components that are all randomly cropped into size 33 × 33 from 91 Images dataset. We then generate a matrix with random Gaussian entries of size 10%n and 25%n, where n = 33 2 , and orthogonalize the rows. Then the measurement data for training is The testing data Set11 preparation follows the same procedure as training data.

Comparison with Existing Methods
We set the phase number K = 9 for EPGN and K = 7 for NL-EPGN in this test (as shown in Figure 4 where the PSNRs of the networks become saturated). Table 1 shows the comparison of the average PSNRs of the images reconstructed by EPGN/NL-EPGN versus several state-of-the-art image reconstruction methods, namely TVAL3 [43], D-AMP [44], IRCNN [23], ReconNet [42], DR 2 -Net [45], ISTA-Net + [27], and DPA-Net [46], where the first two are classical optimization-based methods, and the last five are deep learning-based methods. The PSNR results of the first four methods and ISTA-Net + in Table 1 are quoted from [27]. We observe that EPGN and NL-EPGN outperform all aforementioned algorithms, whereas NL-EPGN obtains the highest accuracy.  Table 1. Natural image CS reconstruction results by existing methods and the proposed EPGN (with 9 phases) and NL-EPGE (with 7 phases) on dataset Set11 with CS ratios of 10% and 25%.

Reconstruction Quality Assessment
Compared to the state-of-the-art ISTA-Net + [27], both EPGN and NL-EPGN obtain better reconstruction results with a similar number of parameters as shown in Table 2. In particular, Figure 5 shows the reconstructed butterfly image with a CS ratio of 10%, from which we can see that the 9-phase EPGN and 7-phase NL-EPGN can both capture the inconspicuous detail of the butterfly wings at the lower left part in the zoomed-in images. Similarly, Figure 6 shows the reconstructed cameraman image with a CS ratio of 25%, where the 9-phase EPGN and 7-phase NL-EPGN have fewer artifacts in the background compared to ISTA-Net + , as observed in the lower right area of the zoomed-in images. Figure 7 presents the reconstruction results of the Barbara image in Set11 from ISTA-Net + ,the 9-phase EPGN, and the 7-phase NL-EPGN with a CS ratio of 10%. We can observe that the texture pattern of the scarf is better preserved by NL-EPGN due to the nonlocal operator.

Parameter Efficiency
The number of network parameters in each phase of ISTA-Net + is 37, 442 [27]. The number of trainable parameters of each phase in EPGN is Similarly, the number of learnable parameters of each phase in EPGN is 41, 571. Therefore, the number of network parameters in each phase of ISTA-Net + , EPGN, and NL-EPGN are very similar (NL-EPGN is about 10.9% more than ISTA-Net + and EPGN). Figure 4 shows the reconstruction PSNR of these three methods versus phase number, from which we observe that NL-EPGN becomes saturated with phase number K ≥ 7, whereas EPGN with K ≥ 9 and ISTA-Net+ with K ≥ 11. Nevertheless, as shown in Table 2, a 7-phase NL-EPGN has fewer network parameters than a 9-phase EPGN but achieves even higher PSNR.
We compare the reconstruction results of EPGN and ISTA-Net + in a range of different phase numbers with a CS ratio of 25%, as shown in Figure 4. We observe that the PSNR values improve as the phase number increases and become saturated after K ≥ 9. EPGN achieves a 0.3 dB higher PSNR on average than ISTA-Net + . To further demonstrate the superiority of the extra proximal-gradient method over extending network depth, we compare the 9-phase EPGN with the 15-phase ISTA-Net + , as shown in Table 2. Compared to the 15-phase ISTA-Net + which extends the depth of the network by simply adding more phases, the 9-phase EPGN achieves better accuracy (0.27 dB higher) using much fewer parameters and similar reconstruction time. We compare the reconstruction performance of EPGN and NL-EPGN with CS ratios of 10% and 25%, the 7-phase NL-EPGN outperforms the 9-phase EPGN by 0.21 dB and 0.15 dB respectively, as shown in Table 1. We also compare the reconstruction results of NL-EPGN and EPGN in a range of different phase numbers with a CS ratio of 25%. The results are shown in Figure 4. We observe that NL-EPGN achieves an average of 0.2 dB PSNR better than EPGN. It is interesting that the PSNR of NL-EPGN shows no significant improvement after K = 7. As shown in Table 2, the reconstruction time of the 7-phase NL-EPGN is approximate 7 to 8 times that of the 9-phase EPGN due to the time complexity of the nonlocal operator. However, the effect of the nonlocal operator is remarkable, NL-EPGN with 7 phases has a 0.15 dB PSNR improvement with 13.7% fewer parameters compared to EPGN with 9 phases. In Figure 8, we show the PSNR versus epoch using the proposed NL-EPGN and the state-of-the-art method ISTA-Net + for image reconstruction with a CS ratio of 10% and phase number 3. While both networks gradually improve PSNR with more epochs, NL-EPGN appears to be significantly more effective than ISTA-Net + during training as the former produces reconstructions with much higher PSNR.

MR Images Compressive Sensing
We also test the performance of EPGN on compressive sensing reconstruction of brain MR images [47] (CS-MRI). We randomly selected 100 and 50 images for training and testing, respectively, and cropped every image to the size of 190 × 190. In the CS-MRI problem, the data fidelity is f (x; y) = Φx − y 2 2 , where Φ = PF , P is a binary selection matrix representating the sampling trajectory, and F is the discrete Fourier transform. We compare EPGN with ISTA-Net + [27] on the same MRI data set. The experimental results on various undersampling ratios of radial masks are summarized in Table 3. Here, we set the phase number of ISTA-Net + and EPGN to 15 and 11 respectively. It is obvious that EPGN outperforms ISTA-Net + for each undersampling ratio.

Concluding Remarks
We presented a novel deep neural network architecture, called the extra proximal gradient network (EPGN), to solve a general class of inverse problems with a focus on image reconstruction applications. EPGN imitates the accelerated extra proximal gradient algorithm and features a learned regularization that incorporates adaptive sparsification mappings, robust shrinkage selections, and the combination of local and nonlocal operators for improved solution quality and network parameter efficiency. Extensive numerical experiments show that EPGN outperforms several existing state-of-the-art methods on a variety of image reconstruction problems.

Conflicts of Interest:
The authors declare no conflict of interest.