An Optimization-Based Meta-Learning Model for MRI Reconstruction with Diverse Dataset

This work aims at developing a generalizable Magnetic Resonance Imaging (MRI) reconstruction method in the meta-learning framework. Specifically, we develop a deep reconstruction network induced by a learnable optimization algorithm (LOA) to solve the nonconvex nonsmooth variational model of MRI image reconstruction. In this model, the nonconvex nonsmooth regularization term is parameterized as a structured deep network where the network parameters can be learned from data. We partition these network parameters into two parts: a task-invariant part for the common feature encoder component of the regularization, and a task-specific part to account for the variations in the heterogeneous training and testing data. We train the regularization parameters in a bilevel optimization framework which significantly improves the robustness of the training process and the generalization ability of the network. We conduct a series of numerical experiments using heterogeneous MRI data sets with various undersampling patterns, ratios, and acquisition settings. The experimental results show that our network yields greatly improved reconstruction quality over existing methods and can generalize well to new reconstruction problems whose undersampling patterns/trajectories are not present during training.


Introduction
Deep learning methods have demonstrated promising performance in a variety of image reconstruction problems. However, deep learning models are often trained for specific tasks and require the training samples to follow the corresponding distribution. In particular, the source-domain/training samples and target-domain/testing samples need to be drawn from the same distribution [1][2][3][4]. In practice, these data sets are often collected at different sources and exhibit substantial heterogeneity, and thus the samples may follow related but different distributions in real-world applications [5,6]. Therefore, the robust and efficient training of deep neural networks using such data sets is theoretically important and practically relevant in the application of deep learning-based methods.
Meta-learning provides a unique paradigm to achieve robust and efficient neural network training [1,4,[7][8][9][10]. Meta-learning is known as learning-to-learn and aims to quickly learn unseen tasks from the experience of learning episodes that cover the distribution of relevant tasks. In a multiple-task scenario, given a family of tasks, meta-learning has been proven to be a useful tool for extracting task-agnostic knowledge and improving the learning performance of new tasks from that family [11,12]. We leverage this feature of meta-learning for network training where the MRI training data are acquired using different under-sampling patterns (e.g., Cartesian mask, Radial mask, Poisson mask), under-sampling ratios, and different settings of the scanning parameters, which result in different levels of contrast (e.g., T1-weighted, T2-weighted, proton-density (PD), and FLAIR). These data are vastly heterogeneous and can be considered as coming from various tasks. The main contributions of this work are as follows:

1. An LOA-inspired network architecture: our network architecture exactly follows a proposed LOA with guaranteed convergence. Thus, the network is more interpretable, parameter-efficient, and stable than existing unrolling networks.

2. Adaptive design of regularization: our adaptive regularizer consists of a task-invariant part and a task-specific part, both of which can be appropriately trained from data.

3. Improved network robustness and generalization ability: we improve the robustness of the network parameter training process by posing it as a bilevel optimization using training data in the lower level and validation data in the upper level. This approach also improves the generalization ability of the trained network so that it can be quickly adapted to image reconstruction with new unseen sampling trajectories and produces high-quality reconstructions.
The remainder of the paper is organized as follows. In Section 2, we discuss related work for both optimization-based meta-learning and deep unrolled networks for MRI reconstructions. We propose our meta-learning model and the neural network in Section 3 and describe the implementation details in Section 4. Section 5 provides the numerical results of the proposed method. Section 6 concludes the paper.

Related Work
In recent years, meta-learning methods have demonstrated promising results in various fields with different techniques [12]. Meta-learning techniques can be categorized into three groups [14][15][16]: metric-based methods [17][18][19], model-based methods [20][21][22][23], and optimization-based methods [8,24,25]. Optimization-based methods are often cast as a bilevel optimization problem and exhibit relatively better generalizability for wider task distributions. We mainly focus on optimization-based meta-learning in this paper. For more comprehensive literature reviews and developments of meta-learning, we refer the readers to the recent surveys [12,16].
Optimization-based meta-learning methods have been widely used in a variety of deep learning applications [8,[24][25][26][27][28][29][30][31]. The network training problems in these meta-learning methods are often cast as a bilevel optimization over a leader variable and a follower variable. The constraint of the bilevel optimization is that the follower variable is optimal for the lower-level problem for each fixed leader variable, and the ultimate goal is to find the optimal leader variable (and often the corresponding optimal follower variable as well) that minimizes the upper-level objective function under this constraint. In many existing optimization-based meta-learning applications, such as Model-Agnostic Meta-Learning (MAML) [8], the lower-level problem is approximated by one or a few gradient descent steps, and a large number of follow-up works of MAML improve generalization using a similar strategy [9,15,27,29,30,[32][33][34]. Deep bilevel learning [35] seeks models that are trained on one task and generalize well to another: it optimizes a regularized loss function to find network parameters from the training set and identifies hyperparameters so that the network performs well on the validation dataset.
When the unseen tasks lie in domains inconsistent with the meta-training tasks, as revealed in [36], the generalization behavior of the meta-learner is compromised. This phenomenon partially arises from meta-overfitting on the already seen meta-training tasks, identified as a memorization problem in [34]. A meta-regularizer based on information theory is proposed in [34] to handle the memorization problem by regulating the information dependency during task adaptation. MetaReg [4] decouples the entire network into a feature network and a task network, where the meta-regularization term is applied only to the task network. It first updates the parameters of the task network on a meta-train set to obtain a domain-aligned task network and then updates the parameters of the meta-regularization term on the meta-test set to learn cross-domain generalization. In contrast to MetaReg, Feature-Critic Networks [37] exploit the meta-regularization term to pursue a domain-invariant feature extraction network. The meta-regularization is designed as a feature-critic network that takes the extracted features as input. The parameters of the feature extraction network are updated by minimizing the new meta-regularized loss, while the auxiliary parameters in the feature-critic network are learned by maximizing the performance gain over the non-meta case. To effectively evaluate the performance of meta-learners, several new benchmarks [38][39][40] were developed under more realistic settings that operate well on diverse visual domains. As noted in [39], generalization to unseen tasks within multimodal or heterogeneous datasets remains a challenge for existing meta-learning methods.
The aforementioned methods pursue domain generalization for classification networks by learning a regularization function that promotes cross-domain generalization. In contrast, our method is developed to solve an inverse problem, and we construct an adaptive regularization that learns not only parameters shared across tasks but also task-aware parameters. The designed adaptive regularizer improves the generalization ability of the deep model so that the well-trained model performs well on heterogeneous datasets of both seen and unseen tasks.

Preliminaries
We first provide the background of compressed sensing MRI (CS-MRI), the image reconstruction problem, and the learned optimization algorithm used to solve it. CS-MRI accelerates MRI data acquisition by under-sampling the k-space (Fourier space) measurements. The under-sampled k-space measurements are related to the image by the following formula [41]:

y = PFx + n, (1)

where y ∈ C^p represents the measurements in k-space with a total of p sampled data points, x ∈ C^{N×1} is the MR image to be reconstructed with N pixels, F ∈ C^{N×N} is the 2D discrete Fourier transform (DFT) matrix, P ∈ R^{p×N} (p < N) is the binary matrix representing the sampling trajectory in k-space, and n is the acquisition noise in k-space. Solving x from (noisy) under-sampled data y according to (1) is an ill-posed problem. An effective strategy to alleviate the ill-posedness is to incorporate prior information into the reconstruction. The variational method is one of the most effective ways to achieve this. Its general framework is to minimize an objective function that consists of a data fidelity term and a regularization term:

min_x (1/2) ||PFx - y||^2 + R(x), (2)

where the first term is the data fidelity, which ensures consistency between the reconstructed x and the measured data y, and the second term R(x) is the regularization, which introduces prior knowledge about the image to be reconstructed. In traditional variational methods, R(x) is a hand-crafted function such as Total Variation (TV) [42]. Advances in optimization techniques have enabled more effective algorithms for solving variational models with theoretical justification. However, hand-crafted regularizers may be too simple to capture subtle details and satisfy clinical diagnostic quality.
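As a concrete illustration of the measurement model (1) and the zero-filling initialization used later in the experiments, the following NumPy sketch simulates under-sampled k-space data; the function names, the random mask, and the toy image are our own illustrative choices, not the paper's code:

```python
import numpy as np

def forward_model(x, mask, noise_std=0.0, rng=None):
    # y = P F x + n: orthonormal 2D DFT of the image, kept only where mask == 1
    rng = rng or np.random.default_rng(0)
    kspace = np.fft.fft2(x, norm="ortho")
    noise = noise_std * (rng.standard_normal(x.shape)
                         + 1j * rng.standard_normal(x.shape))
    return mask * (kspace + noise)

def zero_filled_recon(y, mask):
    # baseline reconstruction: inverse DFT of the zero-filled k-space
    return np.fft.ifft2(mask * y, norm="ortho")

rng = np.random.default_rng(42)
x = rng.random((32, 32))                           # toy "image"
mask = (rng.random((32, 32)) < 0.3).astype(float)  # ~30% random sampling
y = forward_model(x, mask)
x0 = zero_filled_recon(y, mask)                    # aliased initial guess x_0
```

With a full mask and no noise, the zero-filled reconstruction recovers the image exactly; with an under-sampling mask it produces the aliased starting point that the reconstruction network then refines.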
In recent years, we have witnessed the tremendous success of deep learning in solving a variety of inverse problems, but the interpretability and generalization of these deep-learning-based methods remain major concerns. As an improvement over generic black-box deep neural networks (DNNs), several classes of neural networks inspired by learnable optimization algorithms (LOAs), known as unrolling networks, unfold iterative algorithms into multi-phase networks and have demonstrated promising solution accuracy and efficiency empirically [43][44][45][46][47][48][49][50]. However, many of them are only superficial imitations of the iterative algorithms and hence lack the backbone of the variational model and any convergence guarantee.
In light of the substantial success of deep learning and the massive amount of training data now available, we can parameterize the regularization term as a deep convolutional neural network (CNN) that learns from training samples. LOA-induced reconstruction methods have been successfully applied to CS-MRI to solve inverse problems with a learnable regularizer:

arg min_x (1/2) ||PFx - y||^2 + R(x; Θ), (3)

where R(x; Θ) is the regularization parameterized as a deep network with parameter Θ. Depending on the specific parametric form of R(x; Θ) and the optimization scheme used for unrolling, several unrolling networks have been proposed in recent years. For example, the variational network (VN) [51] unrolls the gradient descent algorithm and parameterizes the regularization as a combination of linear filters and nonlinear CNNs. MoDL [52] proposed a weight-sharing strategy in a recursive network that learns the regularization parameters by unrolling the conjugate gradient method. ADMM-Net [53] mimics the celebrated alternating direction method of multipliers, with a regularizer in which the L1 norm is replaced by a piecewise linear function. ISTA-Net [54] takes the regularizer to be the L1 norm of a convolutional network; the network unrolls several phases, each mimicking one iteration of the iterative shrinkage thresholding algorithm (ISTA) [55,56]. However, these networks only superficially mimic the corresponding optimization schemes; they lack a direct relation to the original optimization method or variational model and do not retain any convergence guarantee. In this work, we first develop a learnable optimization algorithm (LOA) for (3) with comprehensive convergence analysis and obtain an LOA-induced network by following the iterative scheme of the LOA exactly.

LOA-Induced Reconstruction Network
In this section, we first introduce a learned optimization algorithm (LOA) to solve (3), where the regularization network parameter Θ is fixed. As Θ is fixed in (3), we temporarily omit it in the derivation of the LOA below and write R(x; Θ) as R(x) for notational simplicity.
In this work, to incorporate sparsity along with the learned features, we parameterize the regularization as R(x) = κ · r(x), where κ > 0 is a weight parameter that needs to be chosen properly depending on the specific task (e.g., noise level, undersampling ratio, etc.), and r is a regularizer parameterized as a composition of neural networks that can be adapted to a broad range of imaging applications. Specifically, we parameterize r as the composition of the l_{2,1} norm and a learnable feature extraction operator g(x). That is, we set r in (11) to be

r(x; θ) := ||g(x)||_{2,1} = Σ_{j=1}^m ||g_j(x)||_2, (4)

where ":=" stands for "defined as", g(·) = (g_1(·), ..., g_m(·)), and each g_j(·) = g_j(·; θ) is parameterized as a convolutional neural network (CNN) for j = 1, ..., m, with θ the learned and fixed network parameter in r(·; θ), as mentioned above. We also consider κ to be learned and fixed like θ for now, and we discuss how to learn both of them in the next subsection. We use a smooth activation function in g, as formulated in (20), which renders g a smooth but nonconvex function. Due to the nonsmooth || · ||_{2,1}, r is therefore a nonsmooth nonconvex function.
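To make the grouping in the l_{2,1} norm concrete, here is a minimal NumPy sketch that sums the l2 norms of the m feature maps, following the per-feature-map grouping described above; the array layout (channels first) is our assumption:

```python
import numpy as np

def l21_norm(features):
    # features: array of shape (m, H, W); r = sum_j ||g_j(x)||_2,
    # i.e., the l2 norm of each of the m feature maps, summed over j
    return sum(np.linalg.norm(f) for f in features)

feats = np.zeros((2, 2, 2))
feats[0] = [[3.0, 4.0], [0.0, 0.0]]   # ||g_1|| = 5
feats[1] = [[1.0, 0.0], [0.0, 0.0]]   # ||g_2|| = 1
r = l21_norm(feats)                   # 6.0
```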
Since the minimization problem in (11) is nonconvex and nonsmooth, we need to derive an efficient LOA to solve it. We first smooth the l_{2,1} norm: for any fixed g(x), we set

r_ε(x) = Σ_{j=1}^m ( sqrt(||g_j(x)||_2^2 + ε^2) − ε ). (5)

We denote R_ε = κ · r_ε. The LOA derived here is inspired by the proximal gradient descent algorithm and iterates the following steps to solve the smoothed problem:

z_{t+1} = x_t − α_t ∇f(x_t), (6a)
x_{t+1} = prox_{α_t R_{ε_t}}(z_{t+1}), (6b)

where ε_t denotes the smoothing parameter ε at iteration t, and the proximal operator in (6b) is defined as prox_{αg}(b) := arg min_x (1/2) ||x − b||^2 + α g(x). A quick observation from (5) is that R_ε → R as ε diminishes, so later we intentionally push ε_t → 0 at Line 16 in Algorithm 1. Then, one can readily show that R_ε(x) ≤ R(x) ≤ R_ε(x) + ε for all x and ε > 0. In Algorithm 1, Line 16 automatically reduces ε, and the iterates converge to the solution of the original nonsmooth nonconvex problem (11); this is made precise in the convergence analysis in Appendix A.
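The effect of the smoothing can be checked numerically. The sketch below uses the surrogate sqrt(t^2 + ε^2) − ε for each group norm t, which is one common smooth choice with the properties used above (it lies below the true norm and converges to it as ε → 0); the exact formula in the paper's Equation (5) may differ in detail:

```python
import numpy as np

def smoothed_l21(features, eps):
    # replace each feature-map norm ||g_j|| by sqrt(||g_j||^2 + eps^2) - eps,
    # which is differentiable everywhere and tends to ||g_j|| as eps -> 0
    return sum(np.sqrt(np.sum(f ** 2) + eps ** 2) - eps for f in features)

feats = np.zeros((2, 2, 2))
feats[0] = [[3.0, 4.0], [0.0, 0.0]]           # ||g_1|| = 5
feats[1] = [[1.0, 0.0], [0.0, 0.0]]           # ||g_2|| = 1
exact = 6.0                                   # true l_{2,1} value
gap_big = exact - smoothed_l21(feats, 1e-1)   # surrogate lies below r ...
gap_small = exact - smoothed_l21(feats, 1e-3) # ... and the gap shrinks with eps
```

Each term satisfies t − ε ≤ sqrt(t^2 + ε^2) − ε ≤ t, so the total gap is at most m·ε, shrinking to zero as the algorithm drives ε_t → 0.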
Since R_{ε_t} is a complex function involving a deep neural network, its proximal operator has no closed form and cannot be computed easily in the subproblem (6b). To overcome this difficulty, we approximate R_{ε_t} near z_{t+1} by its linearization

R̂_{ε_t}(x) := R_{ε_t}(z_{t+1}) + ⟨∇R_{ε_t}(z_{t+1}), x − z_{t+1}⟩. (7)

Then, we update u_{t+1} = prox_{α_t R̂_{ε_t}}(z_{t+1}) in place of (6b); since R̂_{ε_t} is affine, this yields the closed form

u_{t+1} = z_{t+1} − α_t ∇R_{ε_t}(z_{t+1}). (8)

In order to guarantee the convergence of the algorithm, we introduce the standard gradient to serve as a safeguard. Specifically, we set

v_{t+1} = x_t − τ_t ∇φ_{ε_t}(x_t), where φ_{ε_t} := f + R_{ε_t}, (9)

with the step size τ_t determined by the line search in Line 12. Then, we repeat this process. Our algorithm is summarized in Algorithm 1. The prior term with unknown parameters has a residual-type update, which improves the learning and training process [57]. The condition check in Line 5 ensures that the iterate moves in an energy-descending direction. If the condition in Line 5 fails, the process switches to v_{t+1}, and the line search in Line 12 guarantees that an appropriate step size can be found within finitely many steps to decrease the function value. Lines 3 to 14 can be viewed as solving the problem of minimizing φ_{ε_t} with ε_t fixed. Line 15 updates ε_t according to a reduction criterion. The detailed analysis of this mechanism and an in-depth convergence justification are given in Appendix A. The corresponding unrolling network exactly follows Algorithm 1 and thus shares the same convergence properties. Compared to LDA [13], which computes both candidates u_{t+1} and v_{t+1} at every iteration and then chooses the one achieving the smaller function value, our criterion in Line 5 for updating x_{t+1} potentially saves the extra computation of the candidate v_{t+1} and mitigates frequent alternations between the two candidates. Moreover, the smoothing method proposed in this work is more straightforward than the dual-space smoothing in [13] while still retaining provable convergence, as shown in Theorem A5.
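The update scheme above can be sketched schematically on a toy smooth problem. The descent test below is a simple function-value check standing in for the actual condition in Line 5 of Algorithm 1, and the gradients are toy stand-ins for the learned network; all names here are illustrative:

```python
import numpy as np

def loa_step(x, grad_f, grad_R, phi, alpha):
    # one schematic LOA iteration for the smoothed objective phi = f + R_eps
    z = x - alpha * grad_f(x)              # gradient step on the fidelity f
    u = z - alpha * grad_R(z)              # step on the linearized regularizer
    if phi(u) <= phi(x):                   # placeholder for the Line 5 check
        return u
    # safeguard: standard gradient step on the whole smoothed objective
    return x - alpha * (grad_f(x) + grad_R(x))

# toy smooth problem: f(x) = 0.5||x - b||^2, R(x) = 0.5||x||^2
b = np.array([2.0, -1.0])
grad_f = lambda x: x - b
grad_R = lambda x: x
phi = lambda x: 0.5 * np.sum((x - b) ** 2) + 0.5 * np.sum(x ** 2)

x = np.zeros(2)
vals = [phi(x)]
for _ in range(20):
    x = loa_step(x, grad_f, grad_R, phi, alpha=0.2)
    vals.append(phi(x))
```

On this toy problem, the objective value decreases monotonically across iterations, mirroring the energy-descent behavior that the safeguard and line search enforce in the full algorithm.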
The proposed LOA-induced network is a multi-phase network whose architecture exactly follows the proposed LOA (Algorithm 1), with each phase corresponding to one iteration of the algorithm. Specifically, we construct a deep network, denoted by F_Θ, that follows Algorithm 1 exactly for a user-specified number T of iterations. Here, Θ denotes the set of learnable parameters of F_Θ, which includes the regularization network parameter θ, the weight κ, and the other algorithmic parameters of Algorithm 1. Therefore, for any input under-sampled k-space measurement y, F_Θ(y) executes the LOA (Algorithm 1) for T iterations and generates an approximate solution of the minimization problem (11):

F_Θ(y) ≈ arg min_x (1/2) ||PFx − y||^2 + R(x; Θ), (10)

where we use "≈" because F_Θ follows only finitely many steps of the optimization algorithm to approximate the solution. It is worth emphasizing that this approach readily applies to a much broader class of image reconstruction problems, as long as f is (possibly nonconvex and) continuously differentiable with Lipschitz continuous gradient. In the next subsection, we develop a meta-learning-based approach for the robust training of the network parameter Θ.

Bilevel Optimization Algorithm for Network Training
In this section, we consider the parameter training problem of the LOA-induced network F Θ . Specifically, we develop a bilevel optimization algorithm to train our network parameters Θ from diverse data sets to improve network robustness and generalization ability.
Recall that the LOA-induced network F_Θ exactly follows Algorithm 1, which is designed to solve the variational model (11) containing the learnable regularization R(x; Θ). As shown in Section 3.2, we design R(x; Θ) = κ · r(x; Θ), where r is learned to capture the intrinsic properties of the underlying common features across all tasks. To account for the large variations in the diverse training/validation data sets, we introduce a task-specific parameter ω_i to approximate the proper κ for the ith task. Specifically, for the ith task, the weight κ is set to σ(ω_i) ∈ (0, 1), where σ(·) is the sigmoid function. Therefore, κ = σ(ω_i) gives the proper weight of r for the ith task according to its specific sampling ratio or pattern. The parameters ω_i are optimized in conjunction with Θ through the hyperparameter tuning process below.
Suppose that we are given M data pairs {(y_m, x*_m)}_{m=1}^M for training and validation, where y_m is the observation, i.e., the partial k-space data in our setting, and x*_m is the corresponding ground truth image. The data pairs are partitioned into N tasks {D_{τ_i}}_{i=1}^N, where each D_{τ_i} represents the collection of data pairs of the specific task τ_i. In each task τ_i, we further divide the data into a task-specific training set D^{tr}_{τ_i} and validation set D^{val}_{τ_i}. The architecture of our base network exactly follows the LOA (Algorithm 1) developed in the previous section, with learnable parameters θ and a task-specific parameter ω_i for the ith task. More precisely, for one data sample (y^{(j)}_{τ_i}, x^{(j)*}_{τ_i}) in task τ_i, the network produces the reconstruction

x^{(j)}_{τ_i}(θ, ω_i) = F_{(θ, ω_i)}(y^{(j)}_{τ_i}), (12)

where θ denotes the learnable parameters common across different tasks with task-invariant representation, whereas ω_i is a task-specific parameter for task τ_i, and σ(ω_i) is the weight of r associated with task τ_i. In our proposed network, Θ is the collection of (θ, ω_i) for tasks i = 1, ..., N, and we denote ω := {ω_i}_{i=1}^N. The detailed architecture of this network is illustrated in Section 3.2. We define the task-specific loss

ℓ_{τ_i}(θ, ω_i; D_{τ_i}) := (1/|D_{τ_i}|) Σ_{(y, x*) ∈ D_{τ_i}} ℓ( F_{(θ, ω_i)}(y), x* ), where ℓ(x, x*) = (1/2) ||x − x*||^2, (13)

and |D_{τ_i}| represents the cardinality of D_{τ_i}. To prevent the proposed model from overfitting the training data, we introduce a novel learning framework that formulates the network training as a bilevel optimization problem for ω and θ in (12):

min_ω Σ_{i=1}^N ℓ_{τ_i}(θ*(ω), ω_i; D^{val}_{τ_i})  s.t.  θ*(ω) = arg min_θ Σ_{i=1}^N ℓ_{τ_i}(θ, ω_i; D^{tr}_{τ_i}). (15)

In (15), the lower-level optimization learns the task-invariant parameters θ of the feature encoder with the task-specific parameters ω_i fixed on the training dataset, and the upper level adjusts the task-specific parameters {ω_i} so that the task-invariant parameters θ also perform robustly on the validation dataset.
For simplicity, we write L(θ, ω; D) := Σ_{i=1}^N ℓ_{τ_i}(θ, ω_i; D_{τ_i}) and briefly rewrite (15) as

min_ω L(θ*(ω), ω; D^{val})  s.t.  θ*(ω) = arg min_θ L(θ, ω; D^{tr}). (16)

Following [58], we then relax (16) into a single-level constrained optimization in which the lower-level problem is replaced by its first-order necessary condition:

min_{θ, ω} L(θ, ω; D^{val})  s.t.  ∇_θ L(θ, ω; D^{tr}) = 0, (17)

which can be further approximated by an unconstrained problem with a penalty term:

min_{θ, ω} L(θ, ω; D^{val}) + (λ/2) ||∇_θ L(θ, ω; D^{tr})||^2. (18)

We adopt stochastic gradients of the loss functions on mini-batches in each iteration. In our model, we need to include data pairs from multiple tasks in each batch; we therefore propose cross-task mini-batches for training. At each training iteration, we randomly sample training data pairs B^{tr}_{τ_i} and validation data pairs B^{val}_{τ_i} from each task τ_i. The overall training and validation mini-batches B^{tr} and B^{val} used in each training iteration are composed of the sampled data pairs from the entire set of tasks, i.e., B^{tr} = ∪_{i=1}^N B^{tr}_{τ_i} and B^{val} = ∪_{i=1}^N B^{val}_{τ_i}. Thus, in each iteration, we have N · J^{tr} and N · J^{val} data pairs for training and validation, respectively, where J^{tr} and J^{val} denote the per-task batch sizes. To solve the minimization problem (17), we utilize the stochastic mini-batch alternating direction method summarized in Algorithm 2, which is modified from [58].
As analyzed in [58], this penalty-type method has linear time complexity without computing the Hessian of the lower level and only requires constant space, since we only need to store the intermediate θ, ω at each training iteration; this makes it suitable for solving large-scale bilevel optimization problems. Algorithm 2 relaxes the bilevel optimization problem to a single-level constrained optimization problem using the first-order necessary condition, which is not equivalent to the original problem but is much easier and more efficient to solve. In the inner loop (Lines 5-9) of Algorithm 2, we keep minimizing the converted single-level objective (18) with respect to θ for K steps and then ω once, alternately, until the condition with tolerance δ in Line 5 fails. The idea behind the condition in Line 5 arises from the first-order necessary condition, as we would like to push the gradient of L toward 0. Furthermore, at Line 11 of the outer loop (Lines 2-11), we decrease the tolerance δ. Combining Lines 5 and 11 guarantees that each time the inner loop terminates, the gradients of L with respect to θ and ω become increasingly close to 0. The parameter δ_tol controls the accuracy of the entire algorithm, and the outer loop terminates when δ is sufficiently small (i.e., δ ≤ δ_tol). In addition, λ is the weight of the penalty term in (18); we initially set λ to be small to achieve fast initial convergence and then gradually increase its value to emphasize the constraint.
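The penalty formulation (18) and the alternating updates of Algorithm 2 can be sketched on a toy problem as follows. The quadratic losses and numerical gradients below are illustrative stand-ins for the reconstruction losses and back-propagated gradients; none of this is the paper's actual training code:

```python
import numpy as np

def num_grad(f, x, h=1e-5):
    # central-difference gradient, standing in for autodiff
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def penalty(theta, w, L_val, L_tr, lam):
    # objective (18): validation loss + penalty on lower-level stationarity
    g_tr = num_grad(lambda t: L_tr(t, w), theta)
    return L_val(theta, w) + 0.5 * lam * np.dot(g_tr, g_tr)

# toy stand-ins: the lower level pulls theta toward w, the upper level toward 1
L_tr = lambda t, w: 0.5 * np.sum((t - w) ** 2)
L_val = lambda t, w: 0.5 * np.sum((t - 1.0) ** 2) + 0.05 * np.sum(w ** 2)

theta, w, lam, lr = np.zeros(2), np.ones(2), 10.0, 0.05
start = penalty(theta, w, L_val, L_tr, lam)
for _ in range(200):  # alternate theta- and w-steps, as in Algorithm 2
    theta = theta - lr * num_grad(lambda t: penalty(t, w, L_val, L_tr, lam), theta)
    w = w - lr * num_grad(lambda v: penalty(theta, v, L_val, L_tr, lam), w)
end = penalty(theta, w, L_val, L_tr, lam)
```

The alternating steps drive both the validation loss and the stationarity penalty down; in the actual algorithm, λ is increased and the tolerance δ decreased across outer iterations rather than held fixed as here.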

Feature Extraction Operator
We set the feature extraction operator g to be a vanilla l-layer CNN with componentwise nonlinear activation function ϕ and no bias:

g(x) = w_l * ϕ( w_{l−1} * · · · ϕ( w_1 * x ) ), (19)

where {w_q}_{q=1}^l denote the convolution weights, each consisting of d kernels with identical spatial kernel size, and * denotes the convolution operation. Here, ϕ is the smoothed rectified linear unit defined by

ϕ(x) = 0 if x ≤ −δ;  x^2/(4δ) + x/2 + δ/4 if −δ < x < δ;  x if x ≥ δ, (20)

where the prefixed parameter δ is set to 0.001 in our experiments. The default configuration of the feature extraction operator is as follows: g consists of l = 3 convolution layers, and all convolutions use 4 kernels of spatial size 3 × 3.

Setups
As our method introduces an algorithmic unrolling network, there is a one-to-one correspondence between the algorithm iterations and the neural network phases (or blocks). Each phase of the forward propagation can be viewed as one algorithm iteration, which motivates us to imitate the iterating of the optimization algorithm and use a stair training strategy [13]. At the first stage, we train the network parameters using one phase; after the loss converges, we add more phases (one phase at a time) and continue the training process. We repeat this procedure and stop when adding more blocks no longer decreases the loss. We minimize the loss for 100 epochs each time using the SGD-based optimizer Adam [59] with β_1 = 0.9, β_2 = 0.999, an initial learning rate of 10^{-3}, and a mini-batch size of 8. The Xavier initializer [60] is used to initialize the weights of all convolutions. The initial smoothing parameter ε_0 is set to 0.001 and then learned together with the other network parameters. The input x_0 of the unrolling network is obtained by the zero-filling strategy [61]. The deep unrolling network was implemented using the TensorFlow toolbox [62] in the Python programming language.

Dataset
To validate the performance of the proposed method, we used data from the Multimodal Brain Tumor Segmentation Challenge 2018 [63], in which the training dataset contains four modalities (T1, T1c, T2, and FLAIR) scanned from 285 patients and the validation dataset contains images from 66 patients, each with a volume size of 240 × 240 × 155. Each modality consists of two types of gliomas: 75 volumes of low-grade gliomas (LGG) and 210 volumes of high-grade gliomas (HGG). Our implementation involved HGG MRI in two modalities, T1 and T2 images, and we chose 30 patients from each modality in the training dataset to train our network. In the validation dataset, we randomly picked 15 patients as our validation data, and 6 patients in the training dataset served as testing data, distinct from our training and validation sets. We cropped each 2D image to 160 × 180 in the center region and picked 10 adjacent slices in the center of each volume, resulting in a total of 300 training images, 150 validation images, and 60 testing images. These counts are for a single task; since we employed multi-task training, the total number of images in each dataset should be multiplied by the number of tasks. For each 2D slice, we normalized the spatial intensity by dividing by the maximum pixel value.

Experiment Settings
All the experiments were implemented on a Windows workstation with an Intel Core i9 CPU at 3.3 GHz and an Nvidia GTX-1080Ti GPU with 11 GB of graphics card memory via TensorFlow [64]. The parameters in the proposed network were initialized using Xavier initialization [65]. We trained the meta-learning network with four tasks synergistically associated with four different CS ratios (10%, 20%, 30%, and 40%) and tested the well-trained model on the testing dataset with the same masks of these four ratios. We used 300 training images for each CS ratio, amounting to a total of 1200 images in the training dataset. The results for T1 and T2 MR reconstructions are shown in Tables 1 and 2, respectively. The associated reconstructed images are displayed in Figures 1 and 2. We also tested the well-trained meta-learning model on unseen tasks with radial masks at unseen ratios of 15%, 25%, and 35% and random Cartesian masks with ratios of 10%, 20%, 30%, and 40%. The task-specific parameters for the unseen tasks were retrained for each mask and sampling ratio individually with the task-invariant parameters θ fixed. In these experiments, we only needed to learn ω_i for three unseen CS ratios with radial masks and four regular CS ratios with Cartesian masks. This retraining proceeded with less data and fewer iterations: we used 100 MR images and 50 epochs. For example, to reconstruct MR images with a CS ratio of 15% from the radial mask, we fixed the parameter θ and retrained the task-specific parameter ω on 100 raw data points for 50 epochs, then tested with the renewed ω on our testing data set with raw measurements sampled from the radial mask with a CS ratio of 15%. The results associated with radial masks are shown in Tables 3 and 4 and Figures 3 and 4 for T1 and T2 images, respectively. The results associated with Cartesian masks are listed in Table 5, and the reconstructed images are displayed in Figure 5.
We compared our proposed meta-learning method with conventional supervised learning, which was trained with one task at each time and only learned the task-invariant parameter θ without the task-specific parameter ω i . The forward network of conventional learning unrolled Algorithm 1 with 11 phases, which was the same as meta-learning. We merged the training set and validation set, resulting in 450 images for the training of the conventional supervised learning. The training batch size was set as 25 and we applied a total of 2000 epochs, while in meta-learning, we applied 100 epochs with a batch size of 8. The same testing set was used in both meta-learning and conventional learning to evaluate the performance of these two methods.
We made comparisons between meta-learning and the conventional network on the seven different CS ratios (10%, 20%, 30%, 40%, 15%, 25%, and 35%) in terms of two types of random under-sampling patterns: radial sampling mask and Cartesian sampling mask. The parameters for both meta-learning and conventional learning networks were trained via the Adam optimizer [59], and they both learned the forward unrolled task-invariant parameter θ. The network training of the conventional method used the same network configuration as the meta-learning network in terms of the number of convolutions, depth and size of CNN kernels, phase numbers and parameter initializer, etc. The major difference in the training process between these two methods is that meta-learning is performed for multi-tasks by leveraging the task-specific parameter ω i learned from Algorithm 2, and the common features among tasks are learned from the feed-forward network that unrolls Algorithm 1, while conventional learning solves the task-specific problem by simply unrolling the forward network via Algorithm 1, where both training and testing are implemented on the same task. To investigate the generalizability of meta-learning, we tested the well-trained meta-learning model on MR images in different distributions in terms of two types of sampling masks with various trajectories. The training and testing of conventional learning were applied with the same CS ratios; that is, if the conventional method was trained with a CS ratio 10%, then it was also tested on a dataset with a CS ratio of 10%, etc.
Because MR images are complex-valued, we applied complex convolutions [66] for each CNN; that is, every kernel consists of a real part and an imaginary part. Three convolutions were used in g, where each convolution contained four filters with a spatial kernel size of 3 × 3. In Algorithm 1, a total of 11 phases are obtained if we set the termination condition tol = 1 × 10^{-3}, and the parameters of each phase are shared except for the step sizes. For the hyperparameters in Algorithm 1, we chose an initial learnable step size α_0 = 0.01, τ_0 = 0.01, ε_0 = 0.001, and we set prefixed values a = 10^5, σ = 10^3, ρ = 0.9, and γ = 0.9. The choice of these parameters is based on the convergence of the algorithm and the effectiveness of the computation. The parameter 0 < ρ < 1 is the reduction rate of the step size during the line search, used to guarantee convergence. The parameter 0 < γ < 1 in step 15 is the reduction rate for ε. In Algorithm 1, from step 2 to step 14, the smoothing level ε is fixed. When the gradient of the smoothed function is small enough, we reduce ε by the factor γ to find an approximate accumulation point of the original nonsmooth nonconvex problem. We chose a larger a so that more iterations k have u_{k+1} satisfying the conditions in step 5, and hence fewer iterations require the computation of v_{k+1}. Moreover, the scheme for computing u_{k+1} is in accordance with the residual learning architecture that has proven effective in reducing training error.
In Algorithm 2, we set ν_δ = 0.95; the parameter δ was initialized as δ_0 = 1 × 10^-3 and stopped at the value δ_tol = 4.35 × 10^-6, and a total of 100 epochs were performed. To train the conventional method, we ran 2000 epochs with the same number of phases, convolutions, and kernel sizes as used to train the meta-learning approach. The initial λ was set to 1 × 10^-5 with ν_λ = 1.001.
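Reading the quoted values as a geometric decay (our interpretation, not the paper's code), δ is multiplied by ν_δ = 0.95 each epoch until it falls to δ_tol, which takes on the order of a hundred epochs:

```python
def delta_schedule(delta0=1e-3, nu_delta=0.95, delta_tol=4.35e-6):
    # Sketch of the assumed delta decay: multiply by nu_delta each epoch
    # until delta reaches the stopping value delta_tol.
    delta, epochs = delta0, 0
    while delta > delta_tol:
        delta *= nu_delta
        epochs += 1
    return delta, epochs
```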
We evaluated our reconstruction results on the testing data sets using three metrics: peak signal-to-noise ratio (PSNR) [67], structural similarity (SSIM) [68], and normalized mean squared error (NMSE) [69]. The following formulations compute the PSNR, SSIM, and NMSE between the reconstructed image x and the ground truth x*. PSNR is induced by the mean squared error (MSE):

PSNR(x, x*) = 20 log_10 ( max(|x*|) / √MSE ),  where  MSE = (1/N) Σ_{i=1}^{N} |x_i − x*_i|²,

and N is the total number of pixels of the ground truth. SSIM is computed as

SSIM(x, x*) = (2 μ_x μ_{x*} + C_1)(2 σ_{xx*} + C_2) / ((μ_x² + μ_{x*}² + C_1)(σ_x² + σ_{x*}² + C_2)),

where μ_x, μ_{x*} represent local means, σ_x, σ_{x*} denote standard deviations, σ_{xx*} represents the covariance between x and x*, and C_1 = (k_1 L)², C_2 = (k_2 L)² are two constants which avoid a zero denominator, with k_1 = 0.01, k_2 = 0.03, and L the largest pixel value of the MR image. Finally,

NMSE(x, x*) = ‖x − x*‖² / ‖x*‖²

measures the mean relative error. For detailed information on these three metrics, please refer to [67][68][69].
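The three metrics above can be computed directly; a minimal sketch follows. Note that `ssim_global` evaluates the SSIM formula once over the whole image for simplicity, whereas standard implementations average the same formula over local windows.

```python
import numpy as np

def psnr(x, x_star):
    # PSNR(x, x*) = 20 log10( max|x*| / sqrt(MSE) )
    mse = np.mean(np.abs(x - x_star) ** 2)
    return 20.0 * np.log10(np.max(np.abs(x_star)) / np.sqrt(mse))

def nmse(x, x_star):
    # NMSE(x, x*) = ||x - x*||^2 / ||x*||^2
    return np.sum(np.abs(x - x_star) ** 2) / np.sum(np.abs(x_star) ** 2)

def ssim_global(x, x_star, L=1.0, k1=0.01, k2=0.03):
    # single-window SSIM; L is the largest pixel value of the image
    C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), x_star.mean()
    var_x, var_y = x.var(), x_star.var()
    cov = ((x - mu_x) * (x_star - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
```

For example, an image offset from the ground truth by a constant 0.1 (with max(|x*|) = 1) gives PSNR = 20 log10(1/0.1) = 20 dB and NMSE = 0.01.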

Experimental Results with Different CS Ratios in Radial Mask
In this section, we evaluate the performance of the well-trained meta-learning and conventional learning approaches. Tables 1, 2 and 5 report the quantitative results of averaged numerical performance with standard deviations and the associated descaled task-specific meta-knowledge σ(ω_i). From the experiments implemented with radial masks, we observe that the average PSNR value of meta-learning improved by 1.54 dB on the T1 brain image over all four CS ratios compared with the conventional method, and for the T2 brain image, the average PSNR of meta-learning improved by 1.46 dB. Since the general setting of meta-learning aims to take advantage of the information provided by each individual task, with each task associated with an individual sampling mask that may have complementary sampled points, the reconstruction of each task benefits from the other tasks. Smaller CS ratios inhibit reconstruction accuracy, due to the sparsely undersampled trajectories in the raw measurements, while meta-learning exhibits a favorable ability to mitigate this issue even with insufficient amounts of training data.
In general supervised learning, training data need to follow the same or a similar distribution; heterogeneous data exhibit different structural variations of features, which hinders CNNs from extracting features efficiently. In our experiments, raw measurements sampled at different compressed sensing ratios display different levels of incompleteness; these undersampled measurements do not fall in the same distribution, but the distributions are related. Different sampling masks are shown at the bottom of Figures 1 and 3, and these may have complementary sampled points, in the sense that some of the points which a 40% sampling-ratio mask did not sample were captured by other masks. In our experiment, different sampling masks each provided information from their own sampled points, so the four reconstruction tasks helped each other achieve an efficient performance. This explains why meta-learning remains superior to conventional learning even when the sampling ratio is large.
Meta-learning opens a new paradigm for supervised learning: the purpose is to learn multiple tasks quickly. Meta-learning learns the task-invariant parameters only once for common features that can be shared across the four different tasks, and each σ(ω_i) provides task-specific weighting parameters according to the principle of "learning to learn". In conventional learning, the network parameters need to be trained four times with four different masks, since the task-invariant parameters cannot be generalized to other tasks, which is time-intensive. From Tables 1 and 2, we observe that a smaller CS ratio needs a higher value of σ(ω_i). In fact, in our model (11), the task-specific parameters act as weights on the task-specific regularizers, and the tables indicate that lower CS ratios require larger weights on the regularization.
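The weighting role of σ(ω_i) can be sketched as below. The sigmoid descaling map and the function names are assumptions for illustration (the exact form of σ and of model (11) is defined in the paper); the point is that each task's variational objective is data fidelity plus a regularizer scaled by a task-specific weight in (0, 1).

```python
import numpy as np

def sigma(w):
    # assumed descaling map for the task-specific meta-knowledge omega_i:
    # a sigmoid keeps the regularization weight sigma(omega_i) in (0, 1)
    return 1.0 / (1.0 + np.exp(-w))

def task_objective(x, data_fit, regularizer, omega_i):
    # per-task variational objective in the spirit of model (11):
    # data fidelity plus a regularizer weighted by the task-specific
    # parameter; lower CS ratios correspond to larger sigma(omega_i)
    return data_fit(x) + sigma(omega_i) * regularizer(x)
```

A larger ω_i thus imposes a stronger regularization constraint, matching the observation that lower CS ratios learn higher σ(ω_i).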
A qualitative comparison between the conventional and meta-learning methods is shown in Figures 1 and 2, which display the reconstructed MR images of the same slice for T1 and T2, respectively. We label the zoomed-in details of HGG in the red boxes. We observe that the reconstructions from conventional learning are blurrier and lose sharp edges, especially at lower CS ratios. From the point-wise error maps, we find that meta-learning reduces noise better than conventional learning, especially in detailed and complicated regions.
We also tested the performance of meta-learning with two-thirds of the training and validation data for the T1-weighted image used in the previous experiment. For conventional learning, the network was likewise trained using two-thirds of the training samples of the previous experiment. The testing dataset remained the same as before. These results are displayed in Table 1, where we denote the reduced-data experiments as "meta-learning * " and "conventional * ". These experiments reveal that the accuracy on the test data decreases when the training data size is reduced, but it is not a surprise that meta-learning * still outperforms conventional learning *, and even conventional learning itself.
To verify the reconstruction performance of the proposed LOA (Algorithm 1), we compared the proposed conventional learning with ISTA-Net + [54], a state-of-the-art deep unfolded network for MRI reconstruction. We retrained ISTA-Net + with the same training and testing datasets as conventional learning on the T1-weighted image. For a fair comparison, we used the same number of convolution kernels, the same kernel dimensions for each convolution, and the same number of phases as conventional learning. The testing numerical results are listed in Table 1, and the MRI reconstructions are displayed in Figure 1. We observe that the conventional learning that unrolls Algorithm 1 outperforms ISTA-Net + at every CS ratio. The corresponding point-wise absolute error maps show that conventional learning attains a much lower error and much better reconstruction quality.

Experimental Results with Different Unseen CS Ratios in Different Sampling Patterns
In this section, we test the generalizability of the proposed model on unseen tasks. We fixed the well-trained task-invariant parameter θ and only trained ω_i for sampling ratios of 15%, 25%, and 35% with radial masks and sampling ratios of 10%, 20%, 30%, and 40% with Cartesian masks. In this experiment, we only used 100 training data points for each CS ratio and ran a total of 50 epochs. The averaged evaluation values and standard deviations are listed in Tables 3 and 4 for the reconstructed T1 and T2 brain images with radial masks, respectively, and Table 5 reports the quantitative performance for the reconstructed T2 brain image with random Cartesian sampling masks. In the T1 reconstruction results, meta-learning improved PSNR over the conventional method by 1.6921 dB at the 15% CS ratio, 1.6608 dB at 25%, and 0.5764 dB at 35%, showing that reconstruction quality improves more at lower CS ratios than at higher ones. A similar trend was found in the T2 reconstruction results with different sampling masks. The qualitative comparisons are illustrated in Figures 3-5 for T1 and T2 images tested with unseen CS ratios in radial masks and T2 images tested with Cartesian masks at regular CS ratios, respectively. In the experiments conducted with radial masks, meta-learning was superior to conventional learning, especially at the 15% CS ratio: the detailed regions in the red boxes maintain their edges and are closer to the true image, while the conventional reconstructions are hazier and lose details in some complicated tissues. The point-wise error maps also indicate that meta-learning suppresses noise better.
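The adaptation step described above, with θ frozen and only ω_i updated on a few samples of the new sampling ratio, can be illustrated with a deliberately tiny stand-in. Everything here is hypothetical: the fixed vectors `u` and `c` play the role of the frozen task-invariant network output, and a single scalar ω is fit by gradient descent so that the weighted combination matches a new-task target.

```python
import numpy as np

def sigmoid(w):
    return 1.0 / (1.0 + np.exp(-w))

def adapt_omega(u, c, y, omega=0.0, lr=0.5, steps=200):
    # Freeze the task-invariant part (stood in for by u and c) and fit
    # only the scalar omega so that u + sigmoid(omega) * c matches the
    # target y of the unseen task.
    for _ in range(steps):
        s = sigmoid(omega)
        err = u + s * c - y
        grad = 2.0 * np.sum(err * c) * s * (1.0 - s)  # d/d(omega) of ||err||^2
        omega -= lr * grad
    return omega
```

Because only one scalar per task is optimized, adaptation to a new sampling ratio is cheap, which mirrors the 100-sample, 50-epoch budget used in this experiment.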
Training with Cartesian masks is more difficult than with radial masks, especially for conventional learning, since the network is not very deep: it only applies three convolutions, each with four kernels. Table 5 indicates that the average performance of meta-learning improved by about 1.87 dB over the conventional method on T2 brain images. These results further demonstrate that meta-learning benefits from parameter efficiency, and its performance is much better than conventional learning even with a shallow network and a small amount of training data.
The numerical experimental results discussed above show that meta-learning is capable of fast adaptation to new tasks and has more robust generalizability across a broad range of tasks with heterogeneous, diverse data. Meta-learning can be considered an efficient technique for solving difficult tasks by leveraging the features extracted from easier tasks.

Table 1. Quantitative evaluations of the reconstructions of the T1 brain image associated with various sampling ratios of radial masks. Conventional * and meta-learning * were trained with two-thirds of the dataset used in training the conventional and meta-learning approaches, respectively.

Table 3. Quantitative evaluations of the reconstructions of the T1 brain image associated with various sampling ratios of radial masks. Meta-learning was trained with CS ratios of 10%, 20%, 30%, and 40% and tested with unseen ratios of 15%, 25%, and 35%. The conventional method was subjected to regular training and testing with the same CS ratios of 15%, 25%, and 35%.

Table 4. Quantitative evaluations of the reconstructions of the T2 brain image associated with various sampling ratios of radial masks. Meta-learning was trained with CS ratios of 10%, 20%, 30%, and 40% and tested with unseen ratios of 15%, 25%, and 35%. The conventional method was subjected to regular training and testing with the same CS ratios of 15%, 25%, and 35%.

Figure 3. The pictures (from top to bottom) display the T1 brain image reconstruction results, zoomed-in details, point-wise errors with a color bar, and associated radial masks for meta-learning and conventional learning. Meta-learning was trained with CS ratios of 10%, 20%, 30%, and 40% and tested with three different unseen CS ratios of 15%, 25%, and 35% (from left to right). Conventional learning was trained and tested with the same CS ratios of 15%, 25%, and 35%. The top-right image is the ground truth fully-sampled image.

Figure 4. The pictures (from top to bottom) display the T2 brain image reconstruction results, zoomed-in details, point-wise errors with a color bar, and associated radial masks for meta-learning and conventional learning. Meta-learning was trained with CS ratios of 10%, 20%, 30%, and 40% and tested with three different unseen CS ratios of 15%, 25%, and 35% (from left to right). Conventional learning was trained and tested with the same CS ratios of 15%, 25%, and 35%. The top-right image is the ground truth fully-sampled image.

Next, we empirically demonstrate the convergence of Algorithm 1 in Figure 6. The objective function value φ decreases and the PSNR value for the testing data increases steadily as the number of phases increases, which indicates that the learned algorithm is indeed minimizing the learned objective as desired.

Future Work and Open Challenges
Deep optimization-based meta-learning techniques have shown great generalizability, but several open challenges remain that could be addressed in future work. A major issue is the memorization problem; moreover, since the base learner needs to be optimized over a large number of phases and the training algorithm contains multiple gradient steps, the computation is very expensive in terms of time and memory. In addition to reconstructing MRI from different trajectories, another potential application in medical imaging is multi-modality reconstruction and synthesis. Capturing images of the anatomy with multi-modality acquisitions enhances the diagnostic information, and this could be cast as a multi-task problem that benefits from meta-learning.

Conclusions
In this paper, we put forward a novel deep model for MRI reconstruction via meta-learning. The proposed method can solve multiple tasks synergistically, and the well-trained model generalizes well to new tasks. Our baseline network is constructed by unfolding an LOA, which inherits the convergence property, improves the interpretability, and promotes the parameter efficiency of the designed network structure. The designed adaptive regularizer consists of a task-invariant learner and task-specific meta-knowledge. Network training follows a bilevel optimization algorithm that minimizes the task-specific parameters ω in the upper level on the validation data and minimizes the task-invariant parameters θ in the lower level on the training data with fixed ω. The proposed approach is the first model designed to solve the inverse problem by applying meta-training to the adaptive regularization in the variational model. We consider the recovery of undersampled raw data across different sampling trajectories and sampling patterns as different tasks. Extensive numerical experiments on various MRI datasets demonstrate that the proposed method generalizes well across sampling trajectories and adapts quickly to unseen trajectories and sampling patterns. The reconstructed images achieve higher quality than those of conventional supervised learning for both seen and unseen k-space trajectory cases.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Convergence Analysis
We make the following assumptions regarding f and g throughout this work. First, we state the Clarke subdifferential [13] of r(x) in Lemma A1, and then we show that the gradient of r_ε is Lipschitz continuous in Lemma A2.

Lemma A1. Let r(x) be defined in (4); then, the Clarke subdifferential of r at x is

where I_0 = {j ∈ [m] | g_j(x) = 0}, I_1 = [m] \ I_0, and Π(w; C(A)) is the projection of w onto C(A), which stands for the column space of A.
Lemma A2. The gradient of r_ε is Lipschitz continuous with constant m(L_g + 2M²/ε).
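One way this constant can arise (a sketch under assumptions not stated explicitly here: r_ε(x) = Σ_{j=1}^m h_ε(g_j(x)) for a smoothing h_ε whose gradient is bounded by 1 and (2/ε)-Lipschitz, ‖∇g_j‖ ≤ M, and ∇g_j L_g-Lipschitz) is via the chain rule and the triangle inequality:

```latex
\begin{aligned}
\|\nabla(h_\varepsilon\circ g_j)(x)-\nabla(h_\varepsilon\circ g_j)(y)\|
 &\le \|\nabla g_j(x)\|\,\|\nabla h_\varepsilon(g_j(x))-\nabla h_\varepsilon(g_j(y))\|
   + \|\nabla h_\varepsilon(g_j(y))\|\,\|\nabla g_j(x)-\nabla g_j(y)\| \\
 &\le M\cdot\tfrac{2}{\varepsilon}\,\|g_j(x)-g_j(y)\| + 1\cdot L_g\,\|x-y\|
  \;\le\; \Big(L_g+\tfrac{2M^2}{\varepsilon}\Big)\,\|x-y\|.
\end{aligned}
```

Summing over the m components then gives the constant m(L_g + 2M²/ε) of Lemma A2.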

1.2. Otherwise, we compute v_{t+1} = x_t − α_t ∇φ_ε(x_t), where α_t is found through the line search until the criterion holds, and then set x_{t+1} = v_{t+1}. From Lemma A2, we know that the gradient ∇r_ε(x) is Lipschitz continuous with constant m(L_g + 2M²/ε). Furthermore, we assumed in (a1) that ∇f is L_f-Lipschitz continuous. Hence, setting L_ε = L_f + m(L_g + 2M²/ε), we find that ∇φ_ε is L_ε-Lipschitz continuous, which implies (A8). Furthermore, the optimality condition of v_{t+1} = arg min of the line-search subproblem gives (A9). Combining (A8) and (A9) with v_{t+1} = x_t − α_t ∇φ_ε(x_t) in line 8 of Algorithm 1 yields the desired bound. Therefore, it is sufficient that α_t ≤ 1/(2/a + L_ε) for the criterion (A7) to be satisfied. The line search only takes finitely many iterations, since we can find a finite t such that ρ^t α_t ≤ 1/(2/a + L_ε), and through the line search we obtain φ_ε(v_{t+1}) ≤ φ_ε(x_t). Therefore, in either case 1.1. or 1.2., where we take x_{t+1} = u_{t+1} or v_{t+1}, we obtain φ_ε(x_{t+1}) ≤ φ_ε(x_t) for all t ≥ 0.
Theorem A5. Suppose that {x_t} is the sequence generated by Algorithm 1 with any initial x_0, tol = 0, and T = ∞. Let {x_{t_l+1}} be the subsequence that satisfies the reduction criterion in step 15 of Algorithm 1, i.e., ‖∇φ_{ε_{t_l}}(x_{t_l+1})‖ ≤ σ ε_{t_l} γ for t = t_l and l = 1, 2, .... Then {x_{t_l+1}} has at least one accumulation point, and every accumulation point of {x_{t_l+1}} is a Clarke stationary point of min_x φ(x) := f(x) + σ(ω_i) r(x).
Proof. By Lemma A4 and the bound φ(x) ≤ φ_ε(x) + ε for all ε > 0 and x ∈ X, we know that the sequence {φ(x_t)} is bounded. Since φ is coercive, {x_t} is bounded, and thus the selected subsequence {x_{t_l+1}} is also bounded and has at least one accumulation point. Note that ‖∇φ_{ε_{t_l}}(x_{t_l+1})‖ ≤ σ ε_{t_l} γ = σ ε_0 γ^{l+1} → 0 as l → ∞. Let {x_{p+1}} be any convergent subsequence of {x_{t_l+1}}, and denote by ε_p the corresponding ε_t used in Algorithm 1 to generate x_{p+1}. Then there exists x* ∈ X such that x_{p+1} → x* as ε_p → 0, and ∇φ_{ε_p}(x_{p+1}) → 0 as p → ∞.
If j ∈ I_0, we have g_j(x*) = 0; then the corresponding terms can be compared directly, and we obtain (A26). Comparing (A24) and (A26), we see that the first term on the right-hand side of (A26) converges to that of (A24), due to the fact that x_{p+1} → x* and the continuity of ∇f. Together with the continuity of g_j and ∇g_j, the last term of (A24) converges to the last term of (A26) as ε_p → 0 when g_j(x*) > 0. Furthermore, 0 is clearly a special case of the second term in (A26). Hence, dist(∇φ_{ε_p}(x_{p+1}), ∂φ(x*)) → 0 as p → ∞. Since ∇φ_{ε_p}(x_{p+1}) → 0 and ∂φ(x*) is closed, we conclude that 0 ∈ ∂φ(x*).