Article

Feasibility of Implementing Motion-Compensated Magnetic Resonance Imaging Reconstruction on Graphics Processing Units Using Compute Unified Device Architecture

Mohamed Aziz Zeroual, Natalia Dudysheva, Vincent Gras, Franck Mauconduit, Karyna Isaieva, Pierre-André Vuissoz and Freddy Odille
1 IADI (U1254), Inserm and Université de Lorraine, F-54000 Nancy, France
2 CEA, Neurospin, Paris-Saclay University and CNRS, F-91190 Gif sur Yvette, France
3 CIC-IT 1433, Inserm, Université de Lorraine, and CHRU Nancy, F-54000 Nancy, France
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 5840; https://doi.org/10.3390/app15115840
Submission received: 10 April 2025 / Revised: 5 May 2025 / Accepted: 20 May 2025 / Published: 22 May 2025
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))

Abstract

Motion correction in magnetic resonance imaging (MRI) has become increasingly complex due to the high computational demands of iterative reconstruction algorithms and the heterogeneity of emerging computing platforms. However, the clinical applicability of these methods requires fast processing to ensure rapid and accurate diagnostics. Graphics processing units (GPUs) have demonstrated substantial performance gains in various reconstruction tasks. In this work, we present a GPU implementation of the reconstruction kernel of the generalized reconstruction by inversion of coupled systems (GRICS), an iterative joint optimization approach that enables 3D high-resolution image reconstruction with motion correction. Three implementations were compared: (i) a C++ CPU version, (ii) a Matlab–GPU version (with minimal code modifications allowing data storage in GPU memory), and (iii) a native GPU version using CUDA. Six distinct datasets, including various motion types, were tested. The results showed that the Matlab–GPU approach achieved speedups ranging from 1.2× to 2.0× compared to the CPU implementation, whereas the native CUDA version attained speedups of 9.7× to 13.9×. Across all datasets, the normalized root mean square error (NRMSE) remained on the order of 10⁻⁶ to 10⁻⁴, indicating that the CUDA-accelerated method preserved image quality. Furthermore, a roofline analysis was conducted to quantify the kernel’s performance on one of the evaluated datasets. The kernel achieved 250 GFLOP/s, representing a 15.6× improvement over the Matlab–GPU version. These results confirm that GPU-based implementations of GRICS can drastically reduce reconstruction times while maintaining diagnostic fidelity, paving the way for more efficient clinical motion-compensated MRI workflows.

1. Introduction

Recent advances in magnetic resonance imaging (MRI) reconstruction programs have led to multiple implementation challenges. The computational demands of MRI scale with advancements in imaging techniques, especially with the transition to real-time imaging [1] and model-based reconstructions [2]. Modern MRI systems face challenges including the need for higher spatial and temporal resolution, the reduction in acquisition time, and the integration of advanced techniques such as dynamic imaging [3], compressed sensing [4], and parallel imaging [5]. For instance, the motion-compensated MRI reconstruction problem relies on accurate modeling of rigid and non-rigid body motion, which implies an exact modeling of the deformation and a local tracking of the displacement fields within each voxel [6]; hence, thousands of variables are required. It is therefore necessary to assess the data-level parallelism in MRI reconstruction methods and employ high-performance computing (HPC) platforms to reduce the reconstruction time to a level acceptable for clinical applicability.
Graphics processing units (GPUs) are suited for the iterative, high-throughput operations critical to MRI workflows, such as Fourier transforms, image interpolation, and gridding [7,8,9]. Unlike traditional central processing units (CPUs), GPUs are designed for data-level parallelism, enabling thousands of threads to be processed concurrently. Moreover, GPUs facilitate memory-intensive processes such as multi-frame parallel reconstruction, leveraging their superior memory bandwidth and multi-core architecture [10]. In parallel with fast GPU-driven reconstruction, a new trend is emerging toward multi-sensory pipelines that fuse imaging with real-time safety monitoring. Smart electronic-device platforms now enable continuous streaming of tissue temperature and local specific absorption rate (SAR) data, which can be co-processed on the same GPU hardware to guide adaptive control of power deposition and reconstruction parameters [11]. Integrating such wearable sensor feedback underscores the broader move toward unified, GPU-centric biomedical systems in which imaging, physiology, and patient safety are handled in a single low-latency loop. Compute unified device architecture (CUDA) [12] has gained particular attention in the transition of MRI software to GPUs for its user-friendly programming model, its accessibility, and its high-level libraries.
In this work, we extend our research reported in [13] by introducing the CUDA implementation of the reconstruction kernel of the generalized reconstruction by inversion of coupled systems (GRICS) method [14], an iterative reconstruction framework that produces 3D high-resolution images by alternately optimizing the reconstructed image and the parameters of a motion model. A key contribution of our study lies in the design of the CUDA implementation of the reconstruction kernel, in particular the optimal design of the sparse matrix–vector and element-wise matrix multiplications to efficiently exploit warp-level parallelism, and the on-the-fly computation of the motion operator. Rather than repeatedly transferring precomputed elements from the CPU to the GPU, the required non-zero entries of the operator are dynamically generated within the target kernels based on the underlying motion model. Despite a higher per-iteration computational cost, this design significantly reduces data-transfer overhead and memory usage, while ensuring that all problem variables remain in GPU memory throughout the iterative process.
Moreover, advanced parallel-reduction strategies using atomic operations further guarantee safe and efficient accumulation of partial results in shared memory. Furthermore, a performance analysis is used to compare the efficiency of three implementations: the CPU-C++, the Matlab–GPU [13], and the CUDA implementations. We evaluated potential discrepancies between the CPU and CUDA images with the normalized root mean square error (NRMSE). A second contribution is the roofline analysis of the reconstruction kernel applied to a specific dataset. The roofline model [15] was used to quantify the performance of the operators involved in the reconstruction step. This analysis provides a performance baseline for iterative reconstruction programs, since they share the same performance bottlenecks as GRICS [6].

2. Theory

2.1. Joint Reconstruction of an Image and a Motion Model

Motion-compensated reconstruction was accomplished using the GRICS method [6,14]. The algorithm alternately and iteratively solves for both a motion-corrected image ρ and the parameters of a motion model α. The software corrects for either rigid or non-rigid body motion. The problem is formulated as follows:
$$(\rho, \alpha) = \arg\min_{\rho,\alpha} \; \| E(u_\alpha)\,\rho - m \|^2 + \lambda \|\rho\|^2 + \mu R(\alpha), \qquad u_\alpha(x,y,z,t) = \begin{cases} [x, y, z, 1]\,\alpha(t) & \text{(rigid model)},\\ \alpha(x,y,z)\, s(t) & \text{(non-rigid model)} \end{cases} \tag{1}$$
where m stands for the acquired MRI k-space data, u_α is the displacement field at each k-space sample time, and E is the encoding operator (a complete description of the encoding operator E can be found in [13]). In the case of a rigid motion model, α(t) represents a rigid transformation matrix, expressed using homogeneous coordinates. In the case of a non-rigid motion model, a separable motion model is used, with a spatial component α(x, y, z) and a temporal component s(t), which is a motion signal acquired from a motion sensor or navigator data. A Tikhonov regularization controlled by the hyper-parameter λ is employed. An additional regularization term, R(α) = ||α||², is added in the non-rigid case. In each iteration of the GRICS algorithm, two coupled least-squares problems are solved in an alternating scheme. The first problem addresses image reconstruction, where the objective is to recover the image ρ by minimizing the discrepancy between the measured k-space data and the forward model, together with the regularization term that stabilizes the solution. Once the image is updated, the motion correction problem is tackled in the second least-squares step. Here, the residual from the reconstructed image is used to refine the motion parameters α (under either rigid or non-rigid motion assumptions). By linearizing the forward model around the current estimates, a Gauss–Newton method efficiently updates α to reduce motion-induced artifacts. The least-squares problems are solved using the conjugate gradient (CG) solver. This iterative interplay between image reconstruction and motion estimation is applied in a multi-resolution fashion: it is first applied to the low-resolution (central) k-space data and then continues to the last resolution level, where only the reconstruction step is performed.
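For illustration, the following C++ sketch outlines this alternating, multi-resolution scheme. The type and function names (Volume, MotionParams, KSpace, solve_image_cg, solve_motion_gauss_newton, crop_kspace) are placeholders introduced here for readability and are not part of the GRICS code base; the stubs are only meant to show the control flow of the joint optimization.

#include <complex>
#include <vector>

struct Volume       { std::vector<std::complex<float>> voxels; };  // reconstructed image rho
struct MotionParams { std::vector<float> coeffs; };                 // motion model parameters alpha
struct KSpace       { std::vector<std::complex<float>> samples; };  // acquired k-space data m

// Solvers for the two coupled least-squares problems (declarations only; placeholders).
Volume       solve_image_cg(const KSpace& m, const MotionParams& alpha, double lambda);
MotionParams solve_motion_gauss_newton(const KSpace& m, const Volume& rho,
                                       const MotionParams& alpha, double mu);
KSpace       crop_kspace(const KSpace& m, int level);  // central (low-resolution) k-space first

void grics_joint_reconstruction(const KSpace& m, Volume& rho, MotionParams& alpha,
                                int n_levels, int outer_iters, double lambda, double mu) {
    for (int level = 0; level < n_levels; ++level) {        // coarse-to-fine resolution pyramid
        KSpace m_level = crop_kspace(m, level);
        for (int it = 0; it < outer_iters; ++it) {
            // Step 1: image update with alpha fixed (CG on the regularized problem).
            rho = solve_image_cg(m_level, alpha, lambda);
            // At the last resolution level, only the reconstruction step is performed.
            if (level == n_levels - 1) break;
            // Step 2: motion update with rho fixed (Gauss-Newton on the linearized residual).
            alpha = solve_motion_gauss_newton(m_level, rho, alpha, mu);
        }
    }
}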

2.2. The Roofline Model

To characterize the performance of our GPU implementation, we employ the roofline model [15], which is a visual framework for understanding how kernel performance is limited by both the available memory bandwidth and the peak computational throughput. A key component of this model is the operational/arithmetic intensity, defined as
$$\mathrm{OI} = \frac{\text{FLOPs}}{\text{bytes transferred}},$$
where FLOPs is the total number of floating-point operations performed and bytes transferred is the total amount of data read from or written to memory. The roofline model states that the maximum achievable performance P cannot exceed either the GPU’s peak floating-point performance P_peak or the product of the operational intensity and the peak memory bandwidth B_peak:
$$P \leq \min\left(P_{\text{peak}},\; \mathrm{OI} \times B_{\text{peak}}\right)$$
This formulation highlights whether a given kernel is compute-bound (limited by P_peak) or memory-bound (limited by B_peak). By plotting the measured performance against OI on the same axes as these theoretical “roofs”, it is possible to identify performance bottlenecks and optimization patterns; see Figure 1.
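As a minimal numerical illustration of this bound, the snippet below evaluates the roofline limit for the HBM operational intensity of 2.94 FLOP/Byte reported in Section 4, using the ERT-measured A100 ceilings listed in Table 2; the code is illustrative only and is not part of the reconstruction software.

#include <algorithm>
#include <cstdio>

int main() {
    const double p_peak = 12526.1; // peak FP32 performance (GFLOP/s, Table 2, A100)
    const double b_peak = 1227.7;  // HBM bandwidth (GB/s, Table 2, A100)
    const double oi     = 2.94;    // operational intensity (FLOP/Byte)

    // Roofline bound: P <= min(P_peak, OI x B_peak)
    const double bound = std::min(p_peak, oi * b_peak);
    std::printf("roofline bound = %.1f GFLOP/s (%s-bound)\n", bound,
                oi * b_peak < p_peak ? "memory" : "compute");
    return 0;
}

With these values, the memory roof (about 3.6 TFLOP/s) lies well below the FP32 compute peak, which is why the E^H E kernel discussed later is classified as bandwidth-bound.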

3. Methods

3.1. Data Acquisition

In this study, we investigated data acquired through two separate MRI examinations conducted on different imaging systems. For the first acquisition, raw MRI data were obtained using a 3T MRI scanner (MAGNETOM Prisma, Siemens Healthineers, Erlangen, Germany). Three volunteers were enrolled, and three datasets (D1, D2, D3) with non-rigid motion were acquired using a T1-weighted 3D fast low-angle shot gradient echo (GRE) sequence covering the breast region in a 90 s free-breathing scan. Additional details regarding this acquisition and its ethical approval are available in [13,16]. For the second acquisition, MRI data were acquired using a 7T MRI scanner (Magnetom 7T, Siemens Healthineers) equipped with an in-house-built head RF coil having 8 transmit and 32 receive channels [17]. Here, three datasets (D4, D5, D6) containing subtle rigid motion from three volunteers were obtained with a 3D magnetization-prepared rapid gradient echo (MPRAGE) [18] sequence covering the head at a resolution of 1 × 1 × 1 mm³. To enhance motion sensitivity, the MPRAGE sequence was modified to randomize the k-space sampling order. This randomization evenly distributes motion-induced artifacts, effectively transforming them into less noticeable, noise-like distortions and making them more suitable for advanced reconstruction techniques [19]. Additionally, it improves temporal resolution by enabling more flexible sampling strategies that adapt to motion dynamics and capture more relevant temporal information. A navigator (the central k-space line, a line scan oriented along the inferior-to-superior (IS) direction) was acquired at the beginning of each RAGE shot and was processed in order to split the k-space data into N_t motion states. All participants in these studies provided written informed consent. Volunteers were not instructed to move. A summary of the dataset parameters is given in Table 1.

3.2. The Reconstruction Kernel

The most computationally intensive part of GRICS lies in the conjugate gradient (CG) linear solvers. In particular, the application of the operator E^H E, where E^H is the Hermitian transpose of the operator E, comprises an outer loop over the motion states N_t and an inner loop over the coil receivers N_c. Algorithm 1 introduces the structure of this kernel. The operator E^H E reads
$$E^H E = \sum_{t=1}^{N_t} T_t^H \left( \sum_{c=1}^{N_c} C_c^H F^H S_t^H S_t F C_c \right) T_t \tag{3}$$
A complete description of the operators in Equation (3) and of the C++ and Matlab–GPU implementations of Algorithm 1 can be found in [13].
Algorithm 1 Main part of the reconstruction kernel (function x ↦ E^H E x).
1: y = 0
2: for t = 1, 2, …, N_t do
3:     x_t = T(u_t) x
4:     y_t = 0
5:     for coil c = 1, 2, …, N_c do
6:         y_t = y_t + C_c^H (F^H S_t^H S_t F) C_c x_t
7:     end for
8:     y = y + T(u_t)^H y_t
9: end for
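A condensed host-side CUDA sketch of this loop structure is given below. The kernels apply_T, apply_coil, apply_mask, apply_coil_adj, and apply_T_adj stand in for the kernels described in Section 3.3 and are assumed to be defined elsewhere; the cuFFT plan is assumed to be a 3D complex-to-complex plan created beforehand, and the 1/n scaling of the unnormalized inverse FFT is omitted. This is a structural sketch, not the production implementation.

#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cufft.h>

// Device kernels described in Section 3.3 (definitions omitted in this sketch).
__global__ void apply_T(const cuComplex*, cuComplex*, int t, int n);
__global__ void apply_coil(const cuComplex*, const cuComplex*, cuComplex*, int c, int n);
__global__ void apply_mask(cuComplex*, const float*, int t, int n);
__global__ void apply_coil_adj(const cuComplex*, const cuComplex*, cuComplex*, int c, int n);
__global__ void apply_T_adj(const cuComplex*, cuComplex*, int t, int n);

// y = E^H E x, following the loop structure of Algorithm 1.
// x, y, xt, yt, tmp: device buffers of n = Nx*Ny*Nz complex samples.
// smaps: coil sensitivity maps; masks: per-motion-state sampling masks.
void apply_EHE(const cuComplex* x, cuComplex* y,
               cuComplex* xt, cuComplex* yt, cuComplex* tmp,
               const cuComplex* smaps, const float* masks,
               cufftHandle plan, int Nt, int Nc, int n) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    cudaMemset(y, 0, n * sizeof(cuComplex));
    for (int t = 0; t < Nt; ++t) {                                  // outer loop: motion states
        apply_T<<<grid, block>>>(x, xt, t, n);                      // x_t = T(u_t) x
        cudaMemset(yt, 0, n * sizeof(cuComplex));
        for (int c = 0; c < Nc; ++c) {                              // inner loop: receive coils
            apply_coil<<<grid, block>>>(xt, smaps, tmp, c, n);      // C_c x_t
            cufftExecC2C(plan, tmp, tmp, CUFFT_FORWARD);            // F
            apply_mask<<<grid, block>>>(tmp, masks, t, n);          // S_t^H S_t
            cufftExecC2C(plan, tmp, tmp, CUFFT_INVERSE);            // F^H (unnormalized)
            apply_coil_adj<<<grid, block>>>(tmp, smaps, yt, c, n);  // y_t += C_c^H (...)
        }
        apply_T_adj<<<grid, block>>>(yt, y, t, n);                  // y += T(u_t)^H y_t (atomic scatter)
    }
}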

3.3. CUDA Implementation

In order to implement Algorithm 1 on a GPU with CUDA, several kernels and mathematical operations were designed. First, the application of the sensitivity map operator C_c and its Hermitian transpose C_c^H required the implementation of an element-wise complex multiplication, where each thread processes one data point, computing the real and imaginary parts of the complex multiplication concurrently for all elements. Another element-wise product kernel was designed to apply the sampling mask operator S_t^H S_t. The kernels for the displacement operator T_t and its Hermitian transpose T_t^H assign threads to individual voxels and compute displacement fields based on rigid or non-rigid motion parameters, warping the image using linear interpolation. In the C++ implementation, the non-zero elements of T_t were precomputed before the CG solver was run and stored in CPU memory. In the Matlab–GPU implementation, due to the GPU memory limitation, the precomputed elements had to be transferred from the CPU to the GPU memory at the beginning of the outer loop (before line 3 of Algorithm 1). To avoid this repeated and costly transfer, in the native CUDA implementation, the non-zero elements of T_t were computed dynamically within the kernel from the motion model definition given in Equation (1). This resulted in a higher computational complexity; however, all the required problem variables were stored in GPU memory before running the CG solver. The fast Fourier transform (FFT) was implemented using the GPU-optimized library cuFFT [20]. Moreover, in order to avoid race conditions, atomicAdd was explicitly employed at multiple locations, mainly in the reduction process (line 8 in Algorithm 1), to safely accumulate partial results into shared memory. Finally, the CG solver was implemented using the cuBLAS library [21].
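To make this kernel design concrete, the following simplified CUDA sketch shows two of the building blocks described above: the one-thread-per-sample complex multiplication used for the sensitivity-map operator, and an adjoint warping kernel that evaluates a rigid displacement on the fly and scatters its contribution with atomicAdd. It is illustrative only: the actual kernels use linear interpolation rather than the nearest-neighbour rounding shown here and also support the non-rigid model.

#include <cuComplex.h>

// Element-wise complex multiplication (one thread per sample),
// used to apply the coil sensitivity operator C_c.
__global__ void complex_mul(const cuComplex* __restrict__ s,
                            const cuComplex* __restrict__ x,
                            cuComplex* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = cuCmulf(s[i], x[i]);
}

// Rigid transform in homogeneous coordinates (3x4 matrix, row-major),
// built on the host from the motion parameters alpha(t).
struct RigidTransform { float m[12]; };

// Simplified adjoint warping: each thread reads one source voxel, computes its
// displaced position on the fly from the rigid model (no precomputed operator
// entries), and scatters the value to the nearest target voxel. atomicAdd
// prevents race conditions when several source voxels map to the same
// destination (the accumulation of line 8 in Algorithm 1).
__global__ void warp_adjoint_rigid(const cuComplex* __restrict__ yt,
                                   RigidTransform A, cuComplex* __restrict__ y,
                                   int nx, int ny, int nz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = nx * ny * nz;
    if (i >= n) return;
    float xf = (float)(i % nx);
    float yf = (float)((i / nx) % ny);
    float zf = (float)(i / (nx * ny));
    // Displaced position: [x', y', z'] = A * [x, y, z, 1]
    float px = A.m[0] * xf + A.m[1] * yf + A.m[2]  * zf + A.m[3];
    float py = A.m[4] * xf + A.m[5] * yf + A.m[6]  * zf + A.m[7];
    float pz = A.m[8] * xf + A.m[9] * yf + A.m[10] * zf + A.m[11];
    int jx = __float2int_rn(px), jy = __float2int_rn(py), jz = __float2int_rn(pz);
    if (jx < 0 || jx >= nx || jy < 0 || jy >= ny || jz < 0 || jz >= nz) return;
    int j = (jz * ny + jy) * nx + jx;
    atomicAdd(&y[j].x, yt[i].x);   // real part
    atomicAdd(&y[j].y, yt[i].y);   // imaginary part
}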

3.4. The Test Machines

The CPU-only implementation of Algorithm 1 was tested on an AMD EPYC 75F3 processor (AMD, Santa Clara, CA, USA). The C++ code benefits from a hybrid parallelization using the Message Passing Interface (MPI) to distribute the N_t motion states over n MPI ranks and Open Multi-Processing (OpenMP) to distribute the set of N_c coils over m software threads. The GPU implementations were executed on an NVIDIA A100 GPU (NVIDIA Corporation, Santa Clara, CA, USA). More details about the two devices can be found in Table 2.
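For reference, a minimal sketch of this hybrid distribution is shown below (loop bodies elided; the function is a placeholder, not the production C++ code): motion states are interleaved across MPI ranks and coils across OpenMP threads, with a final MPI reduction assembling the output image.

#include <mpi.h>
#include <omp.h>

// Motion states distributed over MPI ranks, coils over OpenMP threads.
void apply_EHE_hybrid(int Nt, int Nc) {
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int t = rank; t < Nt; t += size) {   // each rank handles a subset of motion states
        // x_t = T(u_t) x (per-rank warp of the current image estimate)
        #pragma omp parallel for
        for (int c = 0; c < Nc; ++c) {
            // y_t += C_c^H F^H S_t^H S_t F C_c x_t  (one coil per OpenMP thread)
        }
        // y += T(u_t)^H y_t
    }
    // An MPI_Allreduce over the per-rank partial images then assembles y.
}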

3.5. Performance Profiling

In order to evaluate the performance of our CUDA-based implementation, we employed the NVIDIA Nsight Compute profiling tool [23]. Nsight Compute is a performance analysis utility specifically designed for NVIDIA GPUs, providing kernel-level profiling with minimal overhead, with the aim of identifying and optimizing performance bottlenecks. In this context, Nsight Compute was used to collect performance metrics such as the FLOPs and the memory traffic for both the main memory and the cache hierarchy. The set of collected metrics is presented in Table 3.
The operational intensities at each memory level are computed from these metrics, in particular:
$$\mathrm{OI}_{\mathrm{L1}} = \frac{\text{FLOPs}}{\text{L1 cache bytes}}, \qquad \mathrm{OI}_{\mathrm{L2}} = \frac{\text{FLOPs}}{\text{L2 cache bytes}}, \qquad \mathrm{OI}_{\mathrm{HBM}} = \frac{\text{FLOPs}}{\text{HBM bytes}}.$$
Given that T is the elapsed time of the CUDA-based implementation, its achievable performance P is computed as
$$P = \frac{\text{FLOPs}}{T}$$
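As an illustration of how these quantities are derived from the Table 3 counters, the following snippet uses placeholder values (not the actual profiled counters) chosen so that the arithmetic reproduces the HBM operational intensity of about 2.94 FLOP/Byte and the achieved performance of about 250 GFLOP/s reported in Section 4.

#include <cstdio>

int main() {
    // Counters collected with Nsight Compute (placeholder values only).
    const double flops     = 1.0e12;   // sum of FP64/FP32/FP16 operations
    const double l1_bytes  = 2.0e12;   // l1tex__t_bytes.sum
    const double l2_bytes  = 8.0e11;   // lts__t_bytes.sum
    const double hbm_bytes = 3.4e11;   // dram__bytes.sum
    const double elapsed_s = 4.0;      // elapsed time T of the kernel

    std::printf("OI_L1  = %.2f FLOP/Byte\n", flops / l1_bytes);
    std::printf("OI_L2  = %.2f FLOP/Byte\n", flops / l2_bytes);
    std::printf("OI_HBM = %.2f FLOP/Byte\n", flops / hbm_bytes);
    std::printf("P      = %.1f GFLOP/s\n",  flops / elapsed_s / 1e9);
    return 0;
}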

4. Results

To compare the performance of the three implementations, we measured the elapsed time per conjugate gradient (CG) iteration across the six datasets listed in Table 1. Each reported time represents the average of five runs. The results are shown in Figure 2, where the primary y-axis (left) indicates the elapsed time (in seconds) per CG iteration and the secondary y-axis (right) denotes the speedup relative to the CPU reference implementation. The CUDA-based implementation significantly reduced the elapsed time, achieving speedup factors between 9.7 and 13.9 (green dashed line) compared to the CPU-only version. The Matlab–GPU implementation (blue dashed line) also provided a speedup, though less pronounced than that of the CUDA solution. The NRMSE values between the CPU and CUDA images are given in Table 4. They remained within the range of 10⁻⁶ to 10⁻⁴ for all datasets. Since these values are lower than the reconstruction kernel’s tolerance of ε_r = 10⁻³, no detectable image quality degradation was observed. Figure 3 presents a roofline analysis of the E^H E operator, highlighting both the achievable performance and the operational intensity for each operator in Algorithm 1. We relied on the Empirical Roofline Toolkit [22] to measure realistic bandwidth and peak FP32/FP64 performance values rather than theoretical specifications. By computing the motion operator elements on the fly, the performance of the operator T_t was significantly enhanced, reaching 105 GFLOP/s. The fast Fourier transform (FFT) steps, executed via the cuFFT library, achieved 310 GFLOP/s. Overall, the E^H E function is bandwidth-bound, exhibiting a performance of approximately 250 GFLOP/s at a high-bandwidth memory (HBM) operational intensity of 2.94 FLOP/Byte. For comparison, ref. [13] reported a performance of 16 GFLOP/s for the Matlab–GPU implementation of Algorithm 1. In contrast, our native GPU implementation delivered a 15.6× performance gain, narrowing the gap to the roofline limits that define the upper bounds of achievable performance. Figure 4 and Figure 5 illustrate the reconstructed images for each implementation. In both the brain MRI dataset (D4), subject to rigid-body motion, and the breast MRI dataset (D2), subject to non-rigid respiratory and cardiac motion, the three methods effectively suppressed motion-induced artifacts.

5. Discussion

Compared to the CPU-only implementation, the GPU-based solutions significantly decreased computation time and improved the performance of Algorithm 1. In particular, the CUDA solution was approximately 9–13 times faster than the hybrid MPI/OpenMP version, exploiting the GPU micro-architecture without compromising the quality of the reconstructed images. For all datasets, the normalized root mean square error remained within the specified tolerance of the reconstruction kernel. For many years, one of the primary bottlenecks in GPU-based computing was the limited amount of device memory, which often forced data to be shuttled back and forth between the host and the device to fit larger workloads. This frequent transfer not only added latency but also introduced complexity in managing data pipelines for HPC applications. Recently, however, a new generation of GPUs (such as the A100) has emerged with significantly expanded memory capacities, enabling developers to hold entire datasets directly in GPU memory. By offering larger on-device storage and reducing reliance on host-to-device transfers, these GPUs improve the overall performance of such applications [25,26].
The roofline analysis showed substantial performance improvements with the CUDA implementation. However, the actual performance remains noticeably below the theoretical upper bound for the corresponding operational intensity. Potential future optimizations include the following:
  • Data storage in half precision: Reducing the memory footprint by using FP16 could significantly enhance memory bandwidth utilization. An initial test was conducted with this constraint, and the reconstructed image quality was not compromised (a minimal conversion sketch is given after this list).
  • Improvement of cache usage: Achieved by optimizing memory access patterns to benefit from the large L2 cache on the A100 GPU.
  • Utilization of tensor cores: Tensor cores could also be investigated to accelerate matrix-based kernels.
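A minimal sketch of the half-precision storage idea mentioned in the first item is given below (illustrative only, not the tested implementation): complex samples are stored as __half2 (FP16 real and imaginary parts), halving the bytes moved through HBM, while arithmetic is still performed in FP32 after conversion.

#include <cuda_fp16.h>

// Complex samples stored as __half2 (FP16 real + FP16 imaginary): half the
// bytes move through HBM compared with FP32 storage. The computation itself
// (here, a simple scaling) is done in FP32 after conversion.
__global__ void scale_fp16_storage(__half2* __restrict__ data, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 v = __half22float2(data[i]);                     // FP16 -> FP32 (load)
        data[i] = __floats2half2_rn(v.x * scale, v.y * scale);  // FP32 -> FP16 (store)
    }
}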
The CUDA back-end is markedly cheaper to run. In our code, the A100 GPU performed each reconstruction in 5.3–18 s, whereas the 36-core EPYC 75F3 needed 53–180 s, a roughly 10-fold speedup. Even though a data-center A100 draws more instantaneous power than a single-socket EPYC (400 W versus 280 W, per vendor TDPs), the order-of-magnitude shorter runtime means that the energy per study falls by approximately 80%. In a department performing 2000 such motion-compensated scans a year, that difference translates to roughly 1 MWh saved, enough to cover the GPU’s electricity bill for the entire 5-year depreciation window.

6. Conclusions

In this work, we demonstrated the feasibility of a native GPU implementation of the GRICS method. By evaluating the CUDA-based implementation of the reconstruction kernel, which accounts for 70% to 80% of the total GRICS elapsed time, we achieved a speedup ranging from 9.7× to 13.9×, with an elapsed time ranging between 5.3 s and 18 s, across six datasets with different problem sizes and MRI acquisition types. The GPU acceleration did not compromise the reconstruction quality. The roofline analysis highlights an improvement in the overall performance with the CUDA implementation, with an achieved performance of 250 GFLOP/s. These results show the potential applicability of the method in clinical settings.

Author Contributions

Conceptualization, M.A.Z., P.-A.V., F.O., N.D., V.G. and F.M.; methodology, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; software, M.A.Z., K.I., P.-A.V. and F.O.; validation, M.A.Z., F.O., N.D., V.G. and F.M.; formal analysis, M.A.Z., P.-A.V., F.O., N.D., V.G. and F.M.; investigation, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; resources, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; data curation, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; writing—original draft preparation, M.A.Z.; writing—review and editing, K.I., P.-A.V., F.O. and V.G.; visualization, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; supervision, P.-A.V., F.O., V.G. and F.M.; project administration, P.-A.V., F.O., V.G. and F.M. All authors have read and agreed to the published version of the manuscript.

Funding

MOSAR project (ANR-21-CE19-0028), CPER IT2MP, FEDER (European Regional Development Fund).

Institutional Review Board Statement

The study was conducted in accordance with the ethical principles outlined in the Declaration of Helsinki (1975, revised 2013) and was approved by the ethics committee (approval number: [CPP EST-III, 08.10.01]) under the protocol “METHODO” (ClinicalTrials.gov Identifier: NCT02887053). All participants provided written informed consent prior to their inclusion in the study.

Informed Consent Statement

Written informed consent was obtained from the patients involved in this study.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

MRI    Magnetic Resonance Imaging
FP    Floating Point
GRICS    The Generalized Reconstruction by Inversion of Coupled Systems
GPU    Graphics Processing Unit
HPC    High-Performance Computing
CUDA    Compute Unified Device Architecture
MPRAGE    Magnetization-Prepared Rapid Gradient Echo
OpenMP    Open Multi-Processing
MPI    Message Passing Interface

References

  1. Schaetz, S.; Voit, D.; Frahm, J.; Uecker, M. Accelerated Computing in Magnetic Resonance Imaging: Real-Time Imaging Using Nonlinear Inverse Reconstruction. Comput. Math. Methods Med. 2017, 2017, 3527269. [Google Scholar] [CrossRef]
  2. Fessler, J.A. Model-Based Image Reconstruction for MRI. IEEE Signal Process. Mag. 2010, 27, 81–89. [Google Scholar] [CrossRef] [PubMed]
  3. Gordon, Y.; Partovi, S.; Müller-Eschner, M.; Amarteifio, E.; Bäuerle, T.; Weber, M.A.; Kauczor, H.U.; Rengier, F. Dynamic Contrast-Enhanced Magnetic Resonance Imaging: Fundamentals and Application to the Evaluation of the Peripheral Perfusion. Cardiovasc. Diagn. Ther. 2014, 4, 147–164. [Google Scholar] [PubMed]
  4. Murphy, M.; Alley, M.; Demmel, J.; Keutzer, K.; Vasanawala, S.; Lustig, M. Fast 1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime. IEEE Trans. Med. Imaging 2012, 31, 1250–1262. [Google Scholar] [CrossRef] [PubMed]
  5. Pruessmann, K.P.; Weiger, M.; Scheidegger, M.B.; Boesiger, P. SENSE: Sensitivity Encoding for Fast MRI. Magn. Reson. Med. 1999, 42, 952–962. [Google Scholar]
  6. Odille, F. Chapter 13—Motion-Corrected Reconstruction. In Advances in Magnetic Resonance Technology and Applications; Akçakaya, M., Doneva, M., Prieto, C., Eds.; Magnetic Resonance Image Reconstruction; Academic Press: Cambridge, MA, USA, 2022; Volume 7, pp. 355–389. [Google Scholar] [CrossRef]
  7. Wang, H.; Peng, H.; Chang, Y.; Liang, D. A survey of GPU-based acceleration techniques in MRI reconstructions. Quant. Imaging Med. Surg. 2018, 8, 196–208. [Google Scholar] [CrossRef] [PubMed]
  8. Stone, S.S.; Haldar, J.P.; Tsao, S.C.; Hwu, W.-m.W.; Sutton, B.P.; Liang, Z.P. Accelerating Advanced MRI Reconstructions on GPUs. J. Parallel Distrib. Comput. 2008, 68, 1307–1318. [Google Scholar] [CrossRef]
  9. Després, P.; Jia, X. A Review of GPU-Based Medical Image Reconstruction. Phys. Med. 2017, 42, 76–92. [Google Scholar] [CrossRef] [PubMed]
  10. Buck, I.; Hanrahan, P. Data Parallel Computation on Graphics Hardware; Technical Report; Stanford University: Stanford, CA, USA, 2003. [Google Scholar]
  11. Laganà, F.; Bibbò, L.; Calcagno, S.; Carlo, D.D.; Pullano, S.A.; Pratticò, D.; Angiulli, G. Smart Electronic Device-Based Monitoring of SAR and Temperature Variations in Indoor Human Tissue Interaction. Appl. Sci. 2025, 15, 2439. [Google Scholar] [CrossRef]
  12. Cook, S. CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs, 1st ed.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2012. [Google Scholar]
  13. Zeroual, M.A.; Isaieva, K.; Vuissoz, P.A.; Odille, F. Performance Study of an MRI Motion-Compensated Reconstruction Program on Intel CPUs. Appl. Sci. 2024, 14, 9663. [Google Scholar] [CrossRef]
  14. Odille, F.; Vuissoz, P.A.; Marie, P.Y.; Felblinger, J. Generalized Reconstruction by Inversion of Coupled Systems (GRICS) Applied to Free-Breathing MRI. Magn. Reson. Med. 2008, 60, 355–389. [Google Scholar] [CrossRef] [PubMed]
  15. Williams, S.; Waterman, A.; Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 2009, 52, 65–76. [Google Scholar] [CrossRef]
  16. Isaieva, K.; Meullenet, C.; Vuissoz, P.A.; Fauvel, M.; Nohava, L.; Laistler, E.; Zeroual, M.A.; Henrot, P.; Felblinger, J.; Odille, F. Feasibility of Online Non-Rigid Motion Correction for High-Resolution Supine Breast MRI. Magn. Reson. Med. 2023, 90, 2130–2143. [Google Scholar] [CrossRef] [PubMed]
  17. Luong, M.; Ferr, G.; Chazel, E.; Gapais, P.F.; Gras, V.; Boulant, N.; Amadon, A. A Compact 16Tx-32Rx Geometrically Decoupled Phased Array for 11.7 T MRI. In Proceedings of the 31st Annual Meeting of the ISMRM, London, UK, 7–12 May 2023; Volume 707. [Google Scholar]
  18. Mugler, J.P., III; Brookeman, J.R. Three-dimensional magnetization-prepared rapid gradient-echo imaging (3D MP RAGE). Magn. Reson. Med. 1990, 15, 152–157. [Google Scholar] [CrossRef] [PubMed]
  19. Cordero-Grande, L.; Ferrazzi, G.; Teixeira, R.P.A.G.; O’Muircheartaigh, J.; Price, A.N.; Hajnal, J.V. Motion-corrected MRI with DISORDER: Distributed and incoherent sample orders for reconstruction deblurring using encoding redundancy. Magn. Reson. Med. 2020, 84, 713–726. [Google Scholar] [CrossRef] [PubMed]
  20. NVIDIA Corporation. cuFFT Library User Guide; NVIDIA Corporation: Santa Clara, CA, USA, 2023. [Google Scholar]
  21. NVIDIA Corporation. cuBLAS Library User Guide; NVIDIA Corporation: Santa Clara, CA, USA, 2023. [Google Scholar]
  22. GitHub-Ebugger. Empirical Roofline Toolkit. Available online: https://github.com/ebugger/Empirical-Roofline-Toolkit (accessed on 4 March 2025).
  23. Nsight Compute. Available online: https://docs.nvidia.com/nsight-compute/NsightCompute/ (accessed on 7 March 2025).
  24. GitHub-Ebugger. Example Scripts for Plotting Roofline. Available online: https://github.com/cyanguwa/nersc-roofline (accessed on 18 March 2025).
  25. Hong, J.; Cho, S.; Park, G.; Yang, W.; Gong, Y.H.; Kim, G. Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), Edinburgh, UK, 2–6 March 2024; pp. 139–155. [Google Scholar]
  26. Kenyon, C.; Volkema, G.; Khanna, G. Overcoming Limitations of GPGPU-Computing in Scientific Applications. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 24–26 September 2019; pp. 1–9. [Google Scholar] [CrossRef]
Figure 1. Example of a roofline model. The performance characteristics of two computational kernels (k_1 and k_2) with respect to the arithmetic intensity and attainable performance are highlighted. k_1 is bandwidth-bound, limited by memory bandwidth. k_2 is in a compute-bound region, approaching peak computational performance. Cache and main memory bandwidth ceilings are represented to highlight resource utilization boundaries.
Figure 2. Comparison of the elapsed time per CG iteration for the three implementations.
Figure 3. Roofline analysis of the implemented CUDA kernels on the NVIDIA A100. The scripts from [24] were used to plot the model. Dataset D1.
Figure 4. Comparison of GRICS reconstructions with the three implementations for dataset D2. (A) Uncorrected reconstruction. (B) The CPU-only solution. (C) The Matlab–GPU solution. (D) The CUDA solution. The three solutions markedly reduced the motion artifacts visible in the blue and orange squares.
Figure 5. Comparison of GRICS reconstructions with the three implementations for dataset D4. (A) Uncorrected reconstruction. (B) The CPU-only solution. (C) The Matlab–GPU solution. (D) The CUDA solution. Improvements in reconstruction quality were noticed in the frontal (yellow square) and parietal (green square) lobes.
Table 1. N_x, N_y, and N_z are the numbers of samples in the frequency-encoding, phase-encoding, and slice directions, respectively. N_c is the number of coils in the coil array. N_t is the number of motion states.
Dataset           D1     D2     D3     D4     D5     D6
N_x               320    320    320    256    256    256
N_y               470    470    550    208    208    208
N_z               112    112    112    160    160    160
N_c               20     30     26     32     32     32
N_t               12     12     12     12     12     12
Magnetic field    3T     3T     3T     7T     7T     7T
Table 2. The main characteristics of the test machines. SM stands for streaming multiprocessor, which is akin to a CPU core. FP stands for floating point. The bandwidth of the memory levels and the peak FP performance were estimated empirically with the Empirical Roofline Toolkit (ERT) [22], which provides a more realistic benchmark of the machine specifications than the theoretical values.
Device                           AMD EPYC     NVIDIA A100
Main memory bandwidth (GB/s)     221.1        1227.7
L2 bandwidth (GB/s)              3242.2       3738.7
Main memory size (GB)            4000         40
Main memory type                 DDR4         HBM2e
Nb cores/SMs                     36           108
Peak FP32 (GFLOP/s)              1348         12,526.1
Release year                     2021         2020
Table 3. NVIDIA Nsight Compute metrics. FP stands for floating point.
Data           Metric
FP64           sm__sass_thread_inst_executed_op_fp64_pred_on.sum
FP32           sm__sass_thread_inst_executed_op_fp32_pred_on.sum
FP16           sm__sass_thread_inst_executed_op_fp16_pred_on.sum
Tensor Core    sm__inst_executed_pipe_tensor.sum
L1 cache       l1tex__t_bytes.sum
L2 cache       lts__t_bytes.sum
HBM            dram__bytes.sum
Table 4. NRMSE of the CPU image and the CUDA image.
Dataset    NRMSE (CPU image, CUDA image)
D1         10⁻⁵
D2         10⁻⁶
D3         10⁻⁵
D4         10⁻⁴
D5         10⁻⁴
D6         10⁻⁴

