Benchmarking MRI Reconstruction Neural Networks on Large Public Datasets

Deep learning is starting to oﬀer promising results for reconstruction in MRI. A lot of networks are being developed, but the comparisons remain hard because the frameworks used are not the same among studies, the networks are not properly re-trained and the datasets used are not the same among comparisons. The recent release of a public dataset, fastMRI, consisting of raw k-space data, encouraged us to write a consistent benchmark of several deep neural networks for MR image reconstruction. This paper shows the results obtained for this benchmark allowing to compare the networks and links the open source implementation of all these networks in Keras. The main ﬁnding of this benchmark is that it is beneﬁcial to perform more iterations between the image and the measurement spaces compared to having a deeper per-space network.


Introduction
Magnetic Resonance Imaging (MRI) is an imaging modality used to probe the soft tissues of the human body. As it is non-invasive and non-ionizing (contrary to X-Rays for example), its popularity has grown over the years, for example tripling between 1997 and 2006 according to [2]. This is attributed in part to the technical improvements of this technique. We can for example mention higher field magnets (3 Teslas instead of 1.5), parallel imaging [3] or compressed sensing MRI [4] (CS-MRI). These improvements allow for better image quality and lower acquisition duration.
There is however, still room for improvement. Indeed, an MRI scan may last up to 90 minutes according to the NHS website [5], making it unpractical for some people, because you need to lay still for this long period. Typically, babies or people suffering from Parkinson's disease or claustrophobia could not stay that long in a scanner without undergoing general anesthesia which is a heavy process, making the overall exam less accessible. To extend the accessibility to more people, we should therefore either increase the robustness to motion artifacts, or reduce the acquisition time with the same image quality. On top of that, we should also reduce the reconstruction time with the same image quality to increase the MRI scanners throughput and the total exam time. Indeed, the reconstructed image might show some motion artifacts, and the whole acquisition would need to be re-done [6]. Some other times, based on the first images seen by the physician, they may decide to prescribe complementary pulse sequences if necessary to clarify the image-based diagnosis.
When working in the framework of CS-MRI, the classical methods generally involve solving a convex non-smooth optimisation problem. This problem often involves a data-fitting term and a regularisation term reflecting our prior on the data. The need for regularisation comes from the fact that the problem is ill-posed since the sampling in the Fourier space, called k-space, is under the Nyquist-Shannon limit. However, these classical reconstruction methods exhibit two shortcomings.
• They are usually iterative involving the computation of transforms on large data, they therefore take a lot of time (2 min for a 512 × 512 -500 μm in plane resolution -slice [1], on a machine with 8 cores).
• The regularisation is usually not perfectly suited to MRI data (it is indeed very difficult to come up with a prior that perfectly reflects MR images). This is where learning comes in to play, and in particular deep learning. The promise is that it will solve both the aforementioned problems.
• Because they are implemented efficiently on GPU, and do not use an iterative algorithm, the deep learning algorithms run extremely fast.
• If they have enough capacity, they can learn a better prior of the MR images from the training set.
One of the first neural networks to gain attention for its use in MRI reconstruction was AUTOMAP [7]. This network did not exploit a problem specific property except the fact that the outcome was supposed to be an image. Some more recent works [8,9,10] have tried to inspire themselves from existing classical methods in order to leverage problem specific properties, but also expertise gained in the field. However, they have not been compared against each other on a large dataset containing complex-valued raw data.
A recently published dataset, fastMRI [11], allows this comparison, although it is still to be done and requires an implementation of the different networks in the same framework to allow for a fairer comparison in terms for example of runtime.
Our contribution is exactly this, that is: • Benchmark different neural networks for MRI reconstruction on 2 datasets: the fastMRI dataset, containing raw complex-valued knee data, and the OASIS dataset [12] containing DICOM real-valued brain data.
While our work focuses on classical MRI modalities reconstruction, it is to note that other works have applied deep learning to other modalities like MR fingerprinting [15] or diffusion MRI [16]. The networks studied here could be applied, but would not benefit from some invariants of the problem, especially in the fourth (contrast-related) dimension introduced.

Related works
In this section we briefly discuss other works presenting benchmarks on many different reconstruction neural networks.
In [17], they benchmark their (adversarial training based) algorithms against classical methods and against Cascade-net (which they call Deep Cascade) [10] and ADMM-net (which they call DeepADMM) [18]. They train and evaluate quantitatively the networks on 2 datasets, selecting each time 100 images for train, 100 images for test: • the IXI database 2 (brains), • the Data Science Bowl challenge 3 (chests).
While both these datasets provide a sufficient number of samples to have a trustworthy estimate of the performance of the networks, they are not composed of raw complex-valued data, but of DICOM real-valued data. Still in [17], they do evaluate their algorithms on a raw complex-valued dataset 4 , but it only features 20 acquisitions, and therefore the comparison is only done qualitatively.
In [9], they benchmark their algorithm against classical methods. They train and evaluate their network on 3 different datasets: • the brain real-valued data set provided by the Alzheimer's Disease Neuroimaging Initiative (ADNI) [19], • 2 proprietary datasets with raw complex-valued data of brain data.
Again, the only public dataset they use features real-valued data. It is also to be noted that their code cannot be found online.

Models
In this section, we will first introduce what we call the classical models to do reconstruction in CS-MRI. The models we chose to discuss are in no way an exhaustive list of all the models that can be used without learning for reconstruction in MRI (think of LORAKS [20] for example just to name this one), but they allow us to justify how the subsequent neural networks are built. These models are introduced in a second time.

Idealized inverse problem
In anatomical MRI, the image is encoded as its Fourier transform and the data acquisition is segmented in time in multiple shots or trajectory. This does not take possible gradient errors or B0-field inhomogeneities into account. Because each Fourier coefficient trajectory takes time to acquire, the time separating two consecutive shots, namely the TR or time of repetition, being potentially pretty long, the idea of CS-MRI is to acquire less of them. We therefore have the following idealized inverse problem in the case of single-coil CS-MRI: where y is the acquired Fourier coefficients, also called the k-space data, Ω is the sub-sampling pattern or mask, F Ω is the non-uniform Fourier transform (or masked Fourier transform in the case of Cartesian under-sampling), and x is the real anatomical image. Here we will only deal with Cartesian under-sampling, and there we have F Ω = M Ω F , where M Ω is a mask, and F is the classical Fourier transform. This model is also valid for 3D (volumewise) imaging, but in the following we only consider 2D (slicewise) imaging.

Classic models
The first (although unsatisfactory model) that can be used to perform the reconstruction of an MR image with an under-sampled k-space, is to simply use the inverse Fourier transform with the unknown Fourier coefficients replaced by zeros (zero-filled inverse Fourier transform). This method is called zero-filled reconstruction and we have: The second model we want to introduce makes use of the fact that MR images can be represented in a wavelet basis with only a few non-zero coefficients [4] according to the sparsity principle. The reconstruction is therefore done by solving the following optimization problem: where the notations are the same as in (1), and λ is a hyper-parameter to be tuned, and Ψ is a chosen wavelet transform. This problem can be solved iteratively using a primal-dual optimisation algorithm like Condat-Vù [21] or Primal Dual Hybrid Gradient (PDHG) [22] (also known as Chambolle-Pock algorithm) or, if the wavelet transform is invertible (i.e. non-redundant or decimated), using a proximal algorithm like Fast Iterative Shrinkage Algorithm (FISTA) [23] or Proximal Optimal Gradient Method (POGM) [24]. Since the problem is convex, all these algorithms converge to the same solution, only at different speeds.
The last model we choose to introduce is the dictionary learning model [25,26]. Its assumption is that an MR image is only composed of a few patches, and can therefore be expressed sparsely in a corresponding dictionary. This dictionary can be learned per-image, leading to the following optimisation problem: where the notations are the same as in (3) and I is the fixed set of patches locations, D is the dictionary, λ and T 0 are hyper-parameters to be set, and R ij is the linear operator extracting the patch at location (i, j). This problem is solved in 2 steps: 1. The dictionary learning step, where both the dictionary D and the sparse codes α ij are updated alternatively.
2. The reconstruction step, where x is updated. Since this subproblem is quadratic, it admits an analytical solution, which amounts to averaging patches and then perform a data consistency in which the sampled frequencies are replaced in the patch-average result.

Neural networks
The neural networks introduced here are all derived in a certain way from the classical models introduced before.

Single-domain networks
What we term single-domain networks, are network which only act either in the k-space or in the image (direct) space. They make use of the fact that we have a pseudo-inverse like in (2). They usually use a U-net-like [27] architecture. This network was originally built for image segmentation, but has since been used for a wide-variety of image-to-image tasks, mainly as a strong baseline. In [28], they used a U-net to apply on the under-sampled k-space measurements before performing the inverse Fourier transform. In [29], they used a U-net to apply on the zero-filled reconstruction and correct the output of the U-net with a data consistency step (where they replace sampled values in the k-space). The network we implemented was however vanilla, without this extra data-consistency Figure 1: Illustration of the U-net from [27]. In our case the output is not a segmentation map but a reconstructed image of the same size (we perform zero-padding to prevent decreasing sizes in convolutions).
step. Our implementation however only features the following cascade of number of filters: 16, 32, 64, 128. The original U-net is illustrated in Figure 1 where the number of filters used in each layer is 4 times what we used.

Cross-domains networks
The second class of networks we introduce, we term cross-domain networks. The key intuitive idea is that they correct the data in both the k-space and the image space alternatively, using the Fourier transform to go from one space to the other. They are derived from the optimisation algorithms used to solve the optimisation problems introduced before, using the idea of "unrolling" introduced in [30]. An illustration of this class of networks is presented in Figure 2.
Because these networks work directly on the input data (and not on a primarily reconstructed version of it), they need to handle complex-valued data. In particular, the classical deep learning frameworks (TensorFlow and Pytorch) do not feature the ability to perform complex convolutions off-the-shelf. The way convolution is performed in the original papers is therefore to concatenate the real and imaginary of the image (respectively the k-space), making it a 2-channel image, perform the series of convolutions, and have the output be a 2-channel image then transformed back in a complex image (respectively k-space).
The Cascade-net [10] is based on the dictionary learning optimisation prob- Figure 2: The common backbone between the Cascade net, the KIKI-net and the PD-net. US mask stands for under-sampling mask. DC stands for data consistency. (I)FFT stands for (Inverse) Fast Fourier Transform. N k,d is the number of convolution layers applied in the k-space. N i,d is the number of convolution layers applied in the image space. N C is the total number of alternations between the k-space and the image-space. It is to be noted that in the case of PD-net, the data consistency step is not performed, the Fourier operators are performed with the original under-sampling mask, and a buffer is concatenated along with the current iteration to allow for some memory between iterations and learn the acceleration (in the k-space net -dual net-it is also concatenated with original k-space input). In the case of the Cascade net, N k,d = 0, only the data consistency is performed in the k-space. In the case of the KIKI-net, there is no residual connection in the k-space. However, the k-space nets and image space nets could potentially be any kind of image-to-image neural network. Figure 3: Illustration of the Cascade-net from [10]. Here each C i is a convolutional block of 64 filters (48 in our implementation) followed by a ReLU non-linearity, n d is the number of such convolutional blocks forming a convolutional subnetwork between each data consistency layer DC, and n c is the number of convolutional subnetworks. Figure 4: Illustration of the KIKI-net from [9]. The KCNN and ICNN are convolutional neural networks composed of a number of convolutional blocks varying between 5 and 25 (we implemented 25 blocks for both KCNN and ICNN), each followed by a ReLU non-linearity and featuring between 8 and 64 filters (we implemented 32 filters). For both the varying numbers, the supplementary material of [9] shows that the higher the better. The ICNN also features a residual connection.
lem (4). The idea is to replace the dictionary learning step by convolutional neural networks and still keep the data consistency step in the k-space. The optimisation algorithm is then unrolled to allow to perform back-propagation. [10] shows that we can perform back-propagation through the data consistency step (which is linear), and derive the corresponding Jacobian. The parameters used here for the implementation are the same as those in the original paper, except the number of filters which was decreased from 64 to 48 to fit on a single GPU. This network is illustrated in Figure 3.
The KIKI-net [9] is an extension of the Cascade-net where they additionally perform convolutions after the data consistency step in the k-space. The parameters used here for the implementation are the same as those in the original paper. This network is illustrated in Figure 4.
The Primal-Dual-net (PD-net) was introduced by [8] and applied to Figure 5: Illustration of the PD-net from [8]. Here T denotes the measurement operator, which in our case is the under-sampled Fourier transform, T * its adjoint, g is the measurements, which in our case are the undersampled kspace measurements, and f 0 and h 0 are the initial guesses for the direct and measurement spaces (the image and k-space in our case). The initial guesses are zero tensors. Because we transform complex-valued data into 2-channel real-valued data, the number of channels at the input and the output of the convolutional subnetworks are multiplied by 2 in our implementation.
MRI by [31], is based on the wavelet-based denoising (3), and in particular the resolution of the corresponding optimisation problem with the PDHG [22] algorithm. Here the algorithm is unrolled and the proximity operators (present in the general case of PDHG) are replaced by convolutional neural networks. For our implementation, for a fairer comparison with Cascade-net and the Unet, we used a ReLU non-linearity instead of a PReLU [32]. This network is illustrated in Figure 5.

Training
The training was done with the same parameters for all the networks. The optimiser used was Adam [33], with a learning rate of 10 −3 and default parameters of Keras (β 1 = 0.9, β 2 = 0.999, the exponential decay rates for the moment estimates). The gradient norm was clipped to 1 to avoid the exploding gradient problems [34]. The batch size was 1 (i.e. one slice) for every network except the U-Net where the whole volume was used for each step. For all networks, to maximise the efficiency of the training, the slices were selected in the 8 innermost slices of the volumes, because the outer slices do not have much signal. No early stopping or learning rate schedule was used (except for KIKI-net to allow for a stable training where we used the learning rate schedule proposed by the authors in the supporting information of [9]). The number of epochs used was 300 for all networks trained end-to-end. For the iterative training of the KIKI-net, the total number of epochs was 200 (50 per sub-training). Batch normalization was not used, however, in order to have the network learn more efficiently, a scaling of the input data was done. Both the k-space and the image were multiplied by 10 6 for fastMRI and by 10 2 for OASIS, because the k-space measurements had values of mean 10 −7 (looking separately at the real and imaginary parts) for fastMRI and of mean 10 −3 for OASIS. Without this scaling operation, the training proved to be impossible with bias in the convolutions and very inefficient without bias in the convolutions.

Under-sampling
The under-sampling was done retrospectively using a Cartesian mask described in the data set paper [11], and an acceleration factor of 4 (i.e. only 25% of the k-space was kept). It contains a fully-sampled region in the lower frequencies, and randomly selects phase encoding lines in the higher frequencies.
It is to be noted that different under-sampling strategies exist in CS-MRI. Some of them are listed in [35], like for example spiral or radial. These strategies allow for a higher image quality while having the same acceleration factor or the same image quality with a higher acceleration factor. Typically, the spiral under-sampling scheme was designed to allow fast coronary imaging [36,37]. These under-sampling strategies must take into account kinematic constraints (both physically and safety based), but also should also be with variable density [35]. Recent works even try to optimize the under-sampling strategy under these kinematic constraints [38]. Others have tried to learn the under-sampling strategy in a supervised way. In [39], the under-sampling strategy is learned with a greedy optimisation. In [40], a gradient descent optimisation is used. Some approaches ( [16,41,42] even try to learn jointly the optimal under-sampling strategy along with the reconstruction.

fastMRI
The data used for this benchmark is the emulated single-coil k-space data of the fastMRI dataset [11], along with the corresponding ground truth images. The acquisition was done with a 15-channel phased array coil, in Cartesian 2D Turbin Spin Echo (TSE). The pulse sequences were proton-density weighting, half with fat suppression, half without, some at 3.0 Teslas (T) others at 1.5 T. The sequence parameters were as follows: Echo train length 4, matrix size 320× 320, in-plane resolution 0.5mm×0.5mm, slice thickness 3mm, no gap between slices. In total, there are 973 volumes (34, 742 slices) for the training subset and 199 volumes (7135 slices) for the validation subset.
Since the k-spaces are of different sizes, therefore resulting in images of different sizes, the outputs of the cross-domain networks were cropped to a central 320 × 320 region. For the U-net, the input of the network was cropped.

OASIS
The Open Access Series of Imaging Studies (OASIS) brain database [12] is a database including MRI scans of 1068 participants, yielding 2168 MR sessions. Of these 2168, we select only 2164 sessions which feature T1-weighted sequences. 1878 of these were acquired on a 3.0 T, 236 at 1.5 T and the remaining are undisclosed (50). The slice size is majorly 256 × 256, and sometimes 240 × 256 (rarely it can be some other sizes).The number of slices per scan is majorly 176, and sometimes 160 (rarely it can be smaller).
The data was then separated in a training and a validation set. The split was participant-based, that is a participant cannot have a scan in both sets. The split was of 90% for the training set and 10% for the validation set. We further reduced the training data to make it comparable to fastMRI, to 1000 scans randomly selected for the training subset and 200 scans randomly selected for the validation subset.
Contrarily to fastMRI, the OASIS data is available only in magnitude and therefore is only real-valued.The k-space is computed as the inverse Fourier transform of the magnitude image.

Metrics
The metrics we used to benchmark the different networks are the following: • The Peak Signal-to-Noise Ratio (PSNR) • The Structural SIMilarity index (SSIM) [43] • The number of trainable parameters in the network • The runtime in seconds of the neural network on a single volume The PSNR is computed as follows, on whole magnitude volumes: where x is the ground truth volume,x is the predicted volume (magnitude image), and n is the total number of points in the ground truth volume (same as the predicted volume). Since this metric compares very local differences it does not necessarily reflect the global visual comparison of the images. The SSIM was introduced in [43] exactly to take more structural differences or similarities between images into account. It is computed as in the original paper, per slice then averaged over the volume (the range however is computed volume-wise): where x is the ground truth slice,x is the predicted slice, µ i is the mean of i, σ 2 i is the variance of i, cov ij is the covariance between i and j, c 1 = (k 1 L) 2 , c 2 = (k 2 L) 2 , c 3 = c2 2 , L is the range of the values of the data (given because computed over the whole volume), and k 1 = 0.01 and k 2 = 0.03.
While the 2 aforementioned metrics control the reconstruction quality, it is important to note that this is not the only factor to take into account when designing reconstruction techniques. Because the reconstruction has to happen fast enough for the MR physician to decide whether to re-conduct the exam or not, it is important for the proposed technique to have a reasonable reconstruction speed. For real-time MRI applications or dynamic MRI (e.g. cardiac imaging), it is even more important (for example in the context of monitoring surgical operations [44]). The runtimes were measured on a computer equipped with a single GPU Quadro P5000 with 16GB of RAM.
Concurrently, the number of parameters has to stay relatively low to allow the implementation on the different machines with potentially limited memory, which will probably need to have multiple models (for different contrasts, different organs or different undersampling schemes including different acceleration factors). The quantitative results in Tables 1-4 show that the PD-net [8] outperforms its competitors in terms of image quality metrics but also has the least amount of trainable parameters. It it slightly slower than the Cascade-net [10] though   which can be explained by its higher number of iterations, involving therefore more costly Fourier transform (inverse or direct) operations. These results hold true on the 2 data sets, fastMRI [11] and OASIS [12]. The only exception is that KIKI-net [9] is slightly better than the U-net [27] on the OASIS data set, but still far from the best performers. We can also note that the standard deviation of the image quality metrics is way higher in the fastMRI data set than in the OASIS data set. This higher standard deviation is explained by the fact that the 2 contrasts present in the fastMRI dataset, Proton Density with and without Fat Suprression (PD/PDFS), have widely different image metrics values. The standard deviations when we compute the metrics for each contrast separately are more in-line with the OASIS ones. The range of the image quality metrics is also much higher in the OASIS results.

Qualitative results
The qualitative results shown in Figures 6 -7 confirm the quantitative ones on the image quality aspect. The PD-net [8] is much better at conserving the high-frequency parts of the original image, as can be seen when looking at the reconstruction error, which is quite flat over the whole image.

Discussion
In this work we only considered one scheme of under-sampling. However, it should be interesting to see if the performance obtained on one type of undersampling generalizes to other types of under-sampling, especially if we do a Reference Zero-filled KIKI-net U-net Cascade-net PD-net Figure 6: Reconstruction results for a specific slice (16th slice of file1000196, part of the validation set). The first row represents the reconstruction using the different methods, while the second represents the absolute error when compared to the reference.
Reference Zero-filled KIKI-net U-net Cascade-net PD-net Reconstruction results for a specific slice (15th slice of sub-OAS30367 ses-d3396 T1w.nii.gz, part of the validation set). The top row represents the reconstruction using the different methods, while the bottom row represents the absolute error when compared to the reference. re-gridding step for non-Cartesian under-sampling schemes. On that specific point, the extension of the networks towards non-Cartesian sampling schemes is not easy because the data consistency cannot be performed in the same way, and the measurement space is no longer similar to an image (except if we re-grid). In a recent work [45], some of the authors of the Cascade-net [10] propose a way to extend their approach to the non-Cartesian case, using a re-gridding step. The PD-net [8] also has a straightforward implementation for the non-Cartesian case even without re-gridding, in what is called the learned Primal. In this case, the network in the k-space is just computing the difference (residual) between the current k-space measurements and the initial k-space measurements. Therefore there are no parameters to learn which alleviates the problem of how to learn them.
We also only considered a single-coil acquisition setting. As parallel imaging is primarily used in CS-MRI to allow higher image quality [3], it is important to see how these networks will behave in the multi-coil setting. The difficult part in the extension of these works to the multi-coil setting will be to understand how to best involve the sensitivity maps (or even not involve them [46]).
Regarding the networks themselves, the results seem to suggest that for cross-domain networks, the trade-off between a high number of iterations and a richer correction in a certain domain (by having deeper networks) is in favor of having more iterations (i.e. alternating more between domains). It is however unclear how to best tackle the reconstruction in the k-space, since the convolutional networks make an shift invariance hypothesis which is not true in the Fourier space where the coefficients corresponding to the high frequencies should probably not be treated in the same way as with the low frequencies. This leaves room for improvement in the near future.
Finally, this work has not dealt with recent approaches involving adversarial training for MRI reconstruction networks [17,47]. It would be very interesting to see how the adversarial training could improve each of the proposed networks.