Article

Feasibility of Implementing Motion-Compensated Magnetic Resonance Imaging Reconstruction on Graphics Processing Units Using Compute Unified Device Architecture

Mohamed Aziz Zeroual, Natalia Dudysheva, Vincent Gras, Franck Mauconduit, Karyna Isaieva, Pierre-André Vuissoz and Freddy Odille
1 IADI (U1254), Inserm and Université de Lorraine, F-54000 Nancy, France
2 CEA, Neurospin, Paris-Saclay University and CNRS, F-91190 Gif sur Yvette, France
3 CIC-IT 1433, Inserm, Université de Lorraine, and CHRU Nancy, F-54000 Nancy, France
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(11), 5840; https://doi.org/10.3390/app15115840
Submission received: 10 April 2025 / Revised: 5 May 2025 / Accepted: 20 May 2025 / Published: 22 May 2025
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))

Abstract

Motion correction in magnetic resonance imaging (MRI) has become increasingly complex due to the high computational demands of iterative reconstruction algorithms and the heterogeneity of emerging computing platforms. However, the clinical applicability of these methods requires fast processing to ensure rapid and accurate diagnostics. Graphics processing units (GPUs) have demonstrated substantial performance gains in various reconstruction tasks. In this work, we present a GPU implementation of the reconstruction kernel of the generalized reconstruction by inversion of coupled systems (GRICS), an iterative joint optimization approach that enables 3D high-resolution image reconstruction with motion correction. Three implementations were compared: (i) a C++ CPU version, (ii) a Matlab–GPU version (with minimal code modifications allowing data storage in GPU memory), and (iii) a native GPU version using CUDA. Six distinct datasets, including various motion types, were tested. The results showed that the Matlab–GPU approach achieved speedups ranging from 1.2× to 2.0× compared to the CPU implementation, whereas the native CUDA version attained speedups of 9.7× to 13.9×. Across all datasets, the normalized root mean square error (NRMSE) remained on the order of 10⁻⁶ to 10⁻⁴, indicating that the CUDA-accelerated method preserved image quality. Furthermore, a roofline analysis was conducted to quantify the kernel’s performance on one of the evaluated datasets. The kernel achieved 250 GFLOP/s, representing a 15.6× improvement over the Matlab–GPU version. These results confirm that GPU-based implementations of GRICS can drastically reduce reconstruction times while maintaining diagnostic fidelity, paving the way for more efficient clinical motion-compensated MRI workflows.

1. Introduction

Recent advances in magnetic resonance imaging (MRI) reconstruction programs have led to multiple implementation challenges. The computational demands of MRI scale with advancements in imaging techniques, especially with the transition to real-time imaging [1] and model-based reconstructions [2]. Modern MRI systems face challenges including the need for higher spatial and temporal resolution, the reduction in acquisition time, and the integration of advanced techniques such as dynamic imaging [3], compressed sensing [4], and parallel imaging [5]. For instance, the motion-compensated MRI reconstruction problem relies on accurate modeling of rigid and non-rigid body motion, which implies an exact modeling of the deformation and a local tracking of the displacement fields within each voxel [6]; hence, thousands of variables are required. It is therefore necessary to assess the data-level parallelism in MRI reconstruction methods and employ high-performance computing (HPC) platforms to reduce the reconstruction time to a level acceptable for clinical applicability.
Graphics processing units (GPUs) are suited for the iterative, high-throughput operations critical to MRI workflows, such as Fourier transforms, image interpolation, and gridding [7,8,9]. Unlike traditional central processing units (CPUs), GPUs are designed for data-level parallelism, enabling thousands of threads to be processed concurrently. Moreover, GPUs facilitate memory-intensive processes such as multi-frame parallel reconstruction, leveraging their superior memory bandwidth and multi-core architecture [10]. In parallel with fast GPU-driven reconstruction, a new trend is emerging toward multi-sensory pipelines that fuse imaging with real-time safety monitoring. Smart electronic-device platforms now enable continuous streaming of tissue temperature and local specific absorption rate (SAR) data, which can be co-processed on the same GPU hardware to guide adaptive control of power deposition and reconstruction parameters [11]. Integrating such wearable sensor feedback underscores the broader move toward unified, GPU-centric biomedical systems in which imaging, physiology, and patient safety are handled in a single low-latency loop. Compute unified device architecture (CUDA) [12] has gained particular attention in the transition of MRI software to GPUs for its user-friendly programming model, its accessibility, and its high-level libraries.
In this work, we extend our research reported in [13] by introducing the CUDA implementation of the reconstruction kernel of the generalized reconstruction by inversion of coupled systems (GRICS) method [14], an iterative reconstruction framework that produces 3D high-resolution images by alternately optimizing the reconstructed image and the parameters of a motion model. A key contribution of our study lies in the design of the CUDA implementation of the reconstruction kernel, in particular the optimal design of the sparse matrix–vector and element-wise matrix multiplications to efficiently exploit warp-level parallelism, and the on-the-fly computation of the motion operator. Rather than repeatedly transferring precomputed elements from the CPU to the GPU, the required non-zero entries of the operator are dynamically generated within the target kernels based on the underlying motion model. Despite a higher per-iteration computational cost, this design significantly reduces data-transfer overhead and memory usage, while ensuring that all problem variables remain in GPU memory throughout the iterative process.
Moreover, advanced parallel-reduction strategies using atomic operations further guarantee safe and efficient accumulation of partial results in shared memory. Furthermore, a performance analysis is used to compare the efficiency of three implementations: the CPU-C++, the Matlab–GPU [13], and the CUDA implementations. We evaluated potential discrepancies between the CPU and CUDA images with the normalized root mean square error (NRMSE). A second contribution is the roofline analysis of the reconstruction kernel applied to a specific dataset. The roofline model [15] was used to quantify the performance of the operators involved in the reconstruction step. This analysis provides a performance baseline for iterative reconstruction programs, since they share the same performance bottlenecks as GRICS [6].

2. Theory

2.1. Joint Reconstruction of an Image and a Motion Model

Motion-compensated reconstruction was accomplished using the GRICS method [6,14]. The algorithm alternately and iteratively solves for both a motion-corrected image ρ and the parameters of a motion model α. The software corrects for either rigid or non-rigid body motion. The problem is formulated as follows:
$$(\rho, \alpha) = \arg\min_{\rho,\alpha} \; \| E(u_\alpha)\,\rho - m \|^2 + \lambda \|\rho\|^2 + \mu R(\alpha), \qquad u_\alpha(x,y,z,t) = \begin{cases} [x, y, z, 1]\,\alpha(t) & \text{(rigid model)},\\ \alpha(x,y,z)\, s(t) & \text{(non-rigid model)} \end{cases} \tag{1}$$
where m stands for the acquired MRI k-space data, u_α is the displacement field at each k-space sample time, and E is the encoding operator (a complete description of the encoding operator E can be found in [13]). In the case of a rigid motion model, α(t) represents a rigid transformation matrix, expressed using homogeneous coordinates. In the case of a non-rigid motion model, a separable motion model is used, with a spatial component α(x, y, z) and a temporal component s(t), which is a motion signal acquired from a motion sensor or navigator data. A Tikhonov regularization controlled by the hyper-parameter λ is employed. An additional regularization term, R(α) = ||α||², is added in the non-rigid case. In each iteration of the GRICS algorithm, two coupled least-squares problems are solved in an alternating scheme. The first problem addresses image reconstruction, where the objective is to recover the image ρ by minimizing the discrepancy between the measured k-space data and the forward model, together with the regularization term that stabilizes the solution. Once the image is updated, the motion correction problem is tackled in the second least-squares step. Here, the residual from the reconstructed image is used to refine the motion parameters α (under either rigid or non-rigid motion assumptions). By linearizing the forward model around the current estimates, a Gauss–Newton method efficiently updates α to reduce motion-induced artifacts. The least-squares problems are solved using the conjugate gradient (CG) solver. This iterative interplay between image reconstruction and motion estimation is applied in a multi-resolution fashion: it is first applied to the low-resolution (central) k-space data and then continues to the last resolution level, where only the reconstruction step is performed.
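For illustration, the following C++ sketch outlines this alternating, multi-resolution scheme. The type and function names (Volume, MotionParams, KSpace, solve_image_cg, solve_motion_gauss_newton, crop_kspace) are placeholders introduced here for readability and are not part of the GRICS code base; the stubs are only meant to show the control flow of the joint optimization.

#include <complex>
#include <vector>

struct Volume       { std::vector<std::complex<float>> voxels; };  // reconstructed image rho
struct MotionParams { std::vector<float> coeffs; };                 // motion model parameters alpha
struct KSpace       { std::vector<std::complex<float>> samples; };  // acquired k-space data m

// Solvers for the two coupled least-squares problems (declarations only; placeholders).
Volume       solve_image_cg(const KSpace& m, const MotionParams& alpha, double lambda);
MotionParams solve_motion_gauss_newton(const KSpace& m, const Volume& rho,
                                       const MotionParams& alpha, double mu);
KSpace       crop_kspace(const KSpace& m, int level);  // central (low-resolution) k-space first

void grics_joint_reconstruction(const KSpace& m, Volume& rho, MotionParams& alpha,
                                int n_levels, int outer_iters, double lambda, double mu) {
    for (int level = 0; level < n_levels; ++level) {        // coarse-to-fine resolution pyramid
        KSpace m_level = crop_kspace(m, level);
        for (int it = 0; it < outer_iters; ++it) {
            // Step 1: image update with alpha fixed (CG on the regularized problem).
            rho = solve_image_cg(m_level, alpha, lambda);
            // At the last resolution level, only the reconstruction step is performed.
            if (level == n_levels - 1) break;
            // Step 2: motion update with rho fixed (Gauss-Newton on the linearized residual).
            alpha = solve_motion_gauss_newton(m_level, rho, alpha, mu);
        }
    }
}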

2.2. The Roofline Model

To characterize the performance of our GPU implementation, we employ the roofline model [15], which is a visual framework for understanding how kernel performance is limited by both the available memory bandwidth and the peak computational throughput. A key component of this model is the operational/arithmetic intensity, defined as
$$\mathrm{OI} = \frac{\text{FLOPs}}{\text{bytes transferred}},$$
where FLOPs is the total number of floating-point operations performed and bytes transferred is the total amount of data read from or written to memory. The roofline model states that the maximum achievable performance P cannot exceed either the GPU’s peak floating-point performance P_peak or the product of the operational intensity and the peak memory bandwidth B_peak:
$$P \leq \min\left(P_{\text{peak}},\; \mathrm{OI} \times B_{\text{peak}}\right)$$
This formulation highlights whether a given kernel is compute-bound (limited by P_peak) or memory-bound (limited by B_peak). By plotting the measured performance against OI on the same axes as these theoretical “roofs”, it is possible to identify performance bottlenecks and optimization patterns; see Figure 1.
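As a minimal numerical illustration of this bound, the snippet below evaluates the roofline limit for the HBM operational intensity of 2.94 FLOP/Byte reported in Section 4, using the ERT-measured A100 ceilings listed in Table 2; the code is illustrative only and is not part of the reconstruction software.

#include <algorithm>
#include <cstdio>

int main() {
    const double p_peak = 12526.1; // peak FP32 performance (GFLOP/s, Table 2, A100)
    const double b_peak = 1227.7;  // HBM bandwidth (GB/s, Table 2, A100)
    const double oi     = 2.94;    // operational intensity (FLOP/Byte)

    // Roofline bound: P <= min(P_peak, OI x B_peak)
    const double bound = std::min(p_peak, oi * b_peak);
    std::printf("roofline bound = %.1f GFLOP/s (%s-bound)\n", bound,
                oi * b_peak < p_peak ? "memory" : "compute");
    return 0;
}

With these values, the memory roof (about 3.6 TFLOP/s) lies well below the FP32 compute peak, which is why the E^H E kernel discussed later is classified as bandwidth-bound.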

3. Methods

3.1. Data Acquisition

In this study, we investigated data acquired through two separate MRI examinations conducted on different imaging systems. For the first acquisition, raw MRI data were obtained using a 3T MRI scanner (MAGNETOM Prisma, Siemens Healthineers, Erlangen, Germany). Three volunteers were enrolled, and three datasets (D1, D2, D3) with non-rigid motion were acquired using a T1-weighted 3D fast low-angle shot gradient echo (GRE) sequence covering the breast region in a 90 s free-breathing scan. Additional details regarding this acquisition and its ethical approval are available in [13,16]. For the second acquisition, MRI data were acquired using a 7T MRI scanner (Magnetom 7T, Siemens Healthineers) equipped with an in-house-built head RF coil having 8 transmit and 32 receive channels [17]. Here, three datasets (D4, D5, D6) containing subtle rigid motion from three volunteers were obtained with a 3D magnetization-prepared rapid gradient echo (MPRAGE) [18] sequence covering the head at a resolution of 1 × 1 × 1 mm³. To enhance motion sensitivity, the MPRAGE sequence was modified to randomize the k-space sampling order. This randomization evenly distributes motion-induced artifacts, effectively transforming them into less noticeable, noise-like distortions and making them more suitable for advanced reconstruction techniques [19]. Additionally, it improves temporal resolution by enabling more flexible sampling strategies that adapt to motion dynamics and capture more relevant temporal information. A navigator (the central k-space line, a line scan oriented along the inferior-to-superior (IS) direction) was acquired at the beginning of each RAGE shot and was processed in order to split the k-space data into N_t motion states. All participants in these studies provided written informed consent. Volunteers were not instructed to move. A summary of the dataset parameters is given in Table 1.

3.2. The Reconstruction Kernel

The most computationally intensive part of GRICS lies in the conjugate gradient (CG) linear solvers. In particular, the application of the operator E^H E, where E^H is the Hermitian transpose of the operator E, comprises an outer loop over the motion states N_t and an inner loop over the coil receivers N_c. Algorithm 1 introduces the structure of this kernel. The operator E^H E reads
$$E^H E = \sum_{t=1}^{N_t} T_t^H \left( \sum_{c=1}^{N_c} C_c^H F^H S_t^H S_t F C_c \right) T_t \tag{3}$$
A complete description of the operators in Equation (3) and of the C++ and Matlab–GPU implementations of Algorithm 1 can be found in [13].
Algorithm 1 Main part of the reconstruction kernel (function x ↦ E^H E x).
1: y = 0
2: for t = 1, 2, …, N_t do
3:     x_t = T(u_t) x
4:     y_t = 0
5:     for coil c = 1, 2, …, N_c do
6:         y_t = y_t + C_c^H (F^H S_t^H S_t F) C_c x_t
7:     end for
8:     y = y + T(u_t)^H y_t
9: end for
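A condensed host-side CUDA sketch of this loop structure is given below. The kernels apply_T, apply_coil, apply_mask, apply_coil_adj, and apply_T_adj stand in for the kernels described in Section 3.3 and are assumed to be defined elsewhere; the cuFFT plan is assumed to be a 3D complex-to-complex plan created beforehand, and the 1/n scaling of the unnormalized inverse FFT is omitted. This is a structural sketch, not the production implementation.

#include <cuComplex.h>
#include <cuda_runtime.h>
#include <cufft.h>

// Device kernels described in Section 3.3 (definitions omitted in this sketch).
__global__ void apply_T(const cuComplex*, cuComplex*, int t, int n);
__global__ void apply_coil(const cuComplex*, const cuComplex*, cuComplex*, int c, int n);
__global__ void apply_mask(cuComplex*, const float*, int t, int n);
__global__ void apply_coil_adj(const cuComplex*, const cuComplex*, cuComplex*, int c, int n);
__global__ void apply_T_adj(const cuComplex*, cuComplex*, int t, int n);

// y = E^H E x, following the loop structure of Algorithm 1.
// x, y, xt, yt, tmp: device buffers of n = Nx*Ny*Nz complex samples.
// smaps: coil sensitivity maps; masks: per-motion-state sampling masks.
void apply_EHE(const cuComplex* x, cuComplex* y,
               cuComplex* xt, cuComplex* yt, cuComplex* tmp,
               const cuComplex* smaps, const float* masks,
               cufftHandle plan, int Nt, int Nc, int n) {
    const int block = 256;
    const int grid  = (n + block - 1) / block;
    cudaMemset(y, 0, n * sizeof(cuComplex));
    for (int t = 0; t < Nt; ++t) {                                  // outer loop: motion states
        apply_T<<<grid, block>>>(x, xt, t, n);                      // x_t = T(u_t) x
        cudaMemset(yt, 0, n * sizeof(cuComplex));
        for (int c = 0; c < Nc; ++c) {                              // inner loop: receive coils
            apply_coil<<<grid, block>>>(xt, smaps, tmp, c, n);      // C_c x_t
            cufftExecC2C(plan, tmp, tmp, CUFFT_FORWARD);            // F
            apply_mask<<<grid, block>>>(tmp, masks, t, n);          // S_t^H S_t
            cufftExecC2C(plan, tmp, tmp, CUFFT_INVERSE);            // F^H (unnormalized)
            apply_coil_adj<<<grid, block>>>(tmp, smaps, yt, c, n);  // y_t += C_c^H (...)
        }
        apply_T_adj<<<grid, block>>>(yt, y, t, n);                  // y += T(u_t)^H y_t (atomic scatter)
    }
}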

3.3. CUDA Implementation

In order to implement Algorithm 1 on a GPU with CUDA, several kernels and mathematical operations were designed. First, the application of the sensitivity map operator C_c and its Hermitian transpose C_c^H required the implementation of an element-wise complex multiplication, where each thread processes one data point, computing the real and imaginary parts of the complex multiplication concurrently for all elements. Another element-wise product kernel was designed to apply the sampling mask operator S_t^H S_t. The kernels for the displacement operator T_t and its Hermitian transpose T_t^H assign threads to individual voxels and compute displacement fields based on rigid or non-rigid motion parameters, warping the image using linear interpolation. In the C++ implementation, the non-zero elements of T_t were precomputed before the CG solver was run and stored in CPU memory. In the Matlab–GPU implementation, due to the GPU memory limitation, the precomputed elements had to be transferred from the CPU to the GPU memory at the beginning of the outer loop (before line 3 of Algorithm 1). To avoid this repeated and costly transfer, in the native CUDA implementation, the non-zero elements of T_t were computed dynamically within the kernel from the motion model definition given in Equation (1). This resulted in a higher computational complexity; however, all the required problem variables were stored in GPU memory before running the CG solver. The fast Fourier transform (FFT) was implemented using the GPU-optimized library cuFFT [20]. Moreover, in order to avoid race conditions, atomicAdd was explicitly employed at multiple locations, mainly in the reduction process (line 8 in Algorithm 1), to safely accumulate partial results into shared memory. Finally, the CG solver was implemented using the cuBLAS library [21].
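To make this kernel design concrete, the following simplified CUDA sketch shows two of the building blocks described above: the one-thread-per-sample complex multiplication used for the sensitivity-map operator, and an adjoint warping kernel that evaluates a rigid displacement on the fly and scatters its contribution with atomicAdd. It is illustrative only: the actual kernels use linear interpolation rather than the nearest-neighbour rounding shown here and also support the non-rigid model.

#include <cuComplex.h>

// Element-wise complex multiplication (one thread per sample),
// used to apply the coil sensitivity operator C_c.
__global__ void complex_mul(const cuComplex* __restrict__ s,
                            const cuComplex* __restrict__ x,
                            cuComplex* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = cuCmulf(s[i], x[i]);
}

// Rigid transform in homogeneous coordinates (3x4 matrix, row-major),
// built on the host from the motion parameters alpha(t).
struct RigidTransform { float m[12]; };

// Simplified adjoint warping: each thread reads one source voxel, computes its
// displaced position on the fly from the rigid model (no precomputed operator
// entries), and scatters the value to the nearest target voxel. atomicAdd
// prevents race conditions when several source voxels map to the same
// destination (the accumulation of line 8 in Algorithm 1).
__global__ void warp_adjoint_rigid(const cuComplex* __restrict__ yt,
                                   RigidTransform A, cuComplex* __restrict__ y,
                                   int nx, int ny, int nz) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = nx * ny * nz;
    if (i >= n) return;
    float xf = (float)(i % nx);
    float yf = (float)((i / nx) % ny);
    float zf = (float)(i / (nx * ny));
    // Displaced position: [x', y', z'] = A * [x, y, z, 1]
    float px = A.m[0] * xf + A.m[1] * yf + A.m[2]  * zf + A.m[3];
    float py = A.m[4] * xf + A.m[5] * yf + A.m[6]  * zf + A.m[7];
    float pz = A.m[8] * xf + A.m[9] * yf + A.m[10] * zf + A.m[11];
    int jx = __float2int_rn(px), jy = __float2int_rn(py), jz = __float2int_rn(pz);
    if (jx < 0 || jx >= nx || jy < 0 || jy >= ny || jz < 0 || jz >= nz) return;
    int j = (jz * ny + jy) * nx + jx;
    atomicAdd(&y[j].x, yt[i].x);   // real part
    atomicAdd(&y[j].y, yt[i].y);   // imaginary part
}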

3.4. The Test Machines

The CPU-only implementation of Algorithm 1 was tested on an AMD EPYC 75F3 processor (AMD, Santa Clara, CA, USA). The C++ code benefits from a hybrid parallelization using the Message Passing Interface (MPI) to distribute the N_t motion states over n MPI ranks and Open Multi-Processing (OpenMP) to distribute the set of N_c coils over m software threads. The GPU implementations were executed on an NVIDIA A100 GPU (NVIDIA Corporation, Santa Clara, CA, USA). More details about the two devices can be found in Table 2.
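For reference, a minimal sketch of this hybrid distribution is shown below (loop bodies elided; the function is a placeholder, not the production C++ code): motion states are interleaved across MPI ranks and coils across OpenMP threads, with a final MPI reduction assembling the output image.

#include <mpi.h>
#include <omp.h>

// Motion states distributed over MPI ranks, coils over OpenMP threads.
void apply_EHE_hybrid(int Nt, int Nc) {
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    for (int t = rank; t < Nt; t += size) {   // each rank handles a subset of motion states
        // x_t = T(u_t) x (per-rank warp of the current image estimate)
        #pragma omp parallel for
        for (int c = 0; c < Nc; ++c) {
            // y_t += C_c^H F^H S_t^H S_t F C_c x_t  (one coil per OpenMP thread)
        }
        // y += T(u_t)^H y_t
    }
    // An MPI_Allreduce over the per-rank partial images then assembles y.
}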

3.5. Performance Profiling

In order to evaluate the performance of our CUDA-based implementation, we employed the NVIDIA Nsight Compute profiling tool [23]. Nsight Compute is a performance analysis utility specifically designed for NVIDIA GPUs, providing kernel-level profiling with minimal overhead, with the aim of identifying and optimizing performance bottlenecks. In this context, Nsight Compute was used to collect performance metrics such as the FLOPs and the memory traffic for both the main memory and the cache hierarchy. The set of collected metrics is presented in Table 3.
The operational intensities at each memory level are computed from these metrics, in particular:
$$\mathrm{OI}_{\mathrm{L1}} = \frac{\text{FLOPs}}{\text{L1 cache bytes}}, \qquad \mathrm{OI}_{\mathrm{L2}} = \frac{\text{FLOPs}}{\text{L2 cache bytes}}, \qquad \mathrm{OI}_{\mathrm{HBM}} = \frac{\text{FLOPs}}{\text{HBM bytes}}.$$
Given that T is the elapsed time of the CUDA-based implementation, its achievable performance P is computed as
$$P = \frac{\text{FLOPs}}{T}$$
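As an illustration of how these quantities are derived from the Table 3 counters, the following snippet uses placeholder values (not the actual profiled counters) chosen so that the arithmetic reproduces the HBM operational intensity of about 2.94 FLOP/Byte and the achieved performance of about 250 GFLOP/s reported in Section 4.

#include <cstdio>

int main() {
    // Counters collected with Nsight Compute (placeholder values only).
    const double flops     = 1.0e12;   // sum of FP64/FP32/FP16 operations
    const double l1_bytes  = 2.0e12;   // l1tex__t_bytes.sum
    const double l2_bytes  = 8.0e11;   // lts__t_bytes.sum
    const double hbm_bytes = 3.4e11;   // dram__bytes.sum
    const double elapsed_s = 4.0;      // elapsed time T of the kernel

    std::printf("OI_L1  = %.2f FLOP/Byte\n", flops / l1_bytes);
    std::printf("OI_L2  = %.2f FLOP/Byte\n", flops / l2_bytes);
    std::printf("OI_HBM = %.2f FLOP/Byte\n", flops / hbm_bytes);
    std::printf("P      = %.1f GFLOP/s\n",  flops / elapsed_s / 1e9);
    return 0;
}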

4. Results

To compare the performance of the three implementations, we measured the elapsed time per conjugate gradient (CG) iteration across the six datasets listed in Table 1. Each reported time represents the average of five runs. The results are shown in Figure 2, where the primary y-axis (left) indicates the elapsed time (in seconds) per CG iteration and the secondary y-axis (right) denotes the speedup relative to the CPU reference implementation. The CUDA-based implementation significantly reduced the elapsed time, achieving speedup factors between 9.7 and 13.9 (green dashed line) compared to the CPU-only version. The Matlab–GPU implementation (blue dashed line) also provided a speedup, though less pronounced than that of the CUDA solution. The NRMSE values between the CPU and CUDA images are given in Table 4. They remained within the range of 10⁻⁶ to 10⁻⁴ for all datasets. Since these values are lower than the reconstruction kernel’s tolerance of ε_r = 10⁻³, no detectable image quality degradation was observed. Figure 3 presents a roofline analysis of the E^H E operator, highlighting both the achievable performance and the operational intensity for each operator in Algorithm 1. We relied on the Empirical Roofline Toolkit [22] to measure realistic bandwidth and peak FP32/FP64 performance values rather than theoretical specifications. By computing the motion operator elements on the fly, the performance of the operator T_t was significantly enhanced, reaching 105 GFLOP/s. The fast Fourier transform (FFT) steps, executed via the cuFFT library, achieved 310 GFLOP/s. Overall, the E^H E function is bandwidth-bound, exhibiting a performance of approximately 250 GFLOP/s at a high-bandwidth memory (HBM) operational intensity of 2.94 FLOP/Byte. For comparison, ref. [13] reported a performance of 16 GFLOP/s for the Matlab–GPU implementation of Algorithm 1. In contrast, our native GPU implementation delivered a 15.6× performance gain, narrowing the gap to the roofline limits that define the upper bounds of achievable performance. Figure 4 and Figure 5 illustrate the reconstructed images for each implementation. In both the brain MRI dataset (D4), subject to rigid-body motion, and the breast MRI dataset (D2), subject to non-rigid respiratory and cardiac motion, the three methods effectively suppressed motion-induced artifacts.

5. Discussion

Compared to the CPU-only implementation, the GPU-based solutions significantly decreased computation time and improved the performance of Algorithm 1. In particular, the CUDA solution was approximately 9–13 times faster than the hybrid MPI/OpenMP version, exploiting the GPU micro-architecture without compromising the quality of the reconstructed images. For all datasets, the normalized root mean square error remained within the specified tolerance of the reconstruction kernel. For many years, one of the primary bottlenecks in GPU-based computing was the limited amount of device memory, which often forced data to be shuttled back and forth between the host and the device to fit larger workloads. This frequent transfer not only added latency but also introduced complexity in managing data pipelines for HPC applications. Recently, however, a new generation of GPUs (such as the A100) has emerged with significantly expanded memory capacities, enabling developers to hold entire datasets directly in GPU memory. By offering larger on-device storage and reducing reliance on host-to-device transfers, these GPUs improve the overall performance of such applications [25,26].
The roofline analysis showed substantial performance improvements with the CUDA implementation. However, the actual performance remains noticeably below the theoretical upper bound for the corresponding operational intensity. Potential future optimizations include the following:
  • Data storage in half precision: Reducing the memory footprint by using FP16 could significantly enhance memory bandwidth utilization. An initial test was conducted with this constraint, and the reconstructed image quality was not compromised (a minimal conversion sketch is given after this list).
  • Improvement of cache usage: Achieved by optimizing memory access patterns to benefit from the large L2 cache on the A100 GPU.
  • Utilization of tensor cores: Tensor cores could also be investigated to accelerate matrix-based kernels.
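A minimal sketch of the half-precision storage idea mentioned in the first item is given below (illustrative only, not the tested implementation): complex samples are stored as __half2 (FP16 real and imaginary parts), halving the bytes moved through HBM, while arithmetic is still performed in FP32 after conversion.

#include <cuda_fp16.h>

// Complex samples stored as __half2 (FP16 real + FP16 imaginary): half the
// bytes move through HBM compared with FP32 storage. The computation itself
// (here, a simple scaling) is done in FP32 after conversion.
__global__ void scale_fp16_storage(__half2* __restrict__ data, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 v = __half22float2(data[i]);                     // FP16 -> FP32 (load)
        data[i] = __floats2half2_rn(v.x * scale, v.y * scale);  // FP32 -> FP16 (store)
    }
}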
The CUDA back-end is markedly cheaper to run. In our code, the A100 GPU performed each reconstruction in 5.3–18 s, whereas the 36-core EPYC 75F3 needed 53–180 s, a roughly 10-fold speedup. Even though a data-center A100 draws more instantaneous power than a single-socket EPYC (400 W versus 280 W, per vendor TDPs), the order-of-magnitude shorter runtime means that the energy per study falls by approximately 80%. In a department performing 2000 such motion-compensated scans a year, that difference translates to roughly 1 MWh saved, enough to cover the GPU’s electricity bill for the entire 5-year depreciation window.

6. Conclusions

In this work, we demonstrated the feasibility of a native GPU implementation of the GRICS method. By evaluating the CUDA-based implementation of the reconstruction kernel, which accounts for 70% to 80% of the total GRICS elapsed time, we achieved a speedup ranging from 9.7× to 13.9×, with an elapsed time ranging between 5.3 s and 18 s, across six datasets with different problem sizes and MRI acquisition types. The GPU acceleration did not compromise the reconstruction quality. The roofline analysis highlights an improvement in the overall performance with the CUDA implementation, with an achieved performance of 250 GFLOP/s. These results show the potential applicability of the method in clinical settings.

Author Contributions

Conceptualization, M.A.Z., P.-A.V., F.O., N.D., V.G. and F.M.; methodology, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; software, M.A.Z., K.I., P.-A.V. and F.O.; validation, M.A.Z., F.O., N.D., V.G. and F.M.; formal analysis, M.A.Z., P.-A.V., F.O., N.D., V.G. and F.M.; investigation, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; resources, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; data curation, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; writing—original draft preparation, M.A.Z.; writing—review and editing, K.I., P.-A.V., F.O. and V.G.; visualization, M.A.Z., K.I., P.-A.V., F.O., N.D., V.G. and F.M.; supervision, P.-A.V., F.O., V.G. and F.M.; project administration, P.-A.V., F.O., V.G. and F.M. All authors have read and agreed to the published version of the manuscript.

Funding

MOSAR project (ANR-21-CE19-0028), CPER IT2MP, FEDER (European Regional Development Fund).

Institutional Review Board Statement

The study was conducted in accordance with the ethical principles outlined in the Declaration of Helsinki (1975, revised 2013) and was approved by the ethics committee (approval number: [CPP EST-III, 08.10.01]) under the protocol “METHODO” (ClinicalTrials.gov Identifier: NCT02887053). All participants provided written informed consent prior to their inclusion in the study.

Informed Consent Statement

Written informed consent was obtained from the patients involved in this study.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

MRI    Magnetic Resonance Imaging
FP    Floating Point
GRICS    The Generalized Reconstruction by Inversion of Coupled Systems
GPU    Graphics Processing Unit
HPC    High-Performance Computing
CUDA    Compute Unified Device Architecture
MPRAGE    Magnetization-Prepared Rapid Gradient Echo
OpenMP    Open Multi-Processing
MPI    Message Passing Interface

References

  1. Schaetz, S.; Voit, D.; Frahm, J.; Uecker, M. Accelerated Computing in Magnetic Resonance Imaging: Real-Time Imaging Using Nonlinear Inverse Reconstruction. Comput. Math. Methods Med. 2017, 2017, 3527269. [Google Scholar] [CrossRef]
  2. Fessler, J.A. Model-Based Image Reconstruction for MRI. IEEE Signal Process. Mag. 2010, 27, 81–89. [Google Scholar] [CrossRef] [PubMed]
  3. Gordon, Y.; Partovi, S.; Müller-Eschner, M.; Amarteifio, E.; Bäuerle, T.; Weber, M.A.; Kauczor, H.U.; Rengier, F. Dynamic Contrast-Enhanced Magnetic Resonance Imaging: Fundamentals and Application to the Evaluation of the Peripheral Perfusion. Cardiovasc. Diagn. Ther. 2014, 4, 147–164. [Google Scholar] [PubMed]
  4. Murphy, M.; Alley, M.; Demmel, J.; Keutzer, K.; Vasanawala, S.; Lustig, M. Fast 1-SPIRiT Compressed Sensing Parallel Imaging MRI: Scalable Parallel Implementation and Clinically Feasible Runtime. IEEE Trans. Med. Imaging 2012, 31, 1250–1262. [Google Scholar] [CrossRef] [PubMed]
  5. Pruessmann, K.P.; Weiger, M.; Scheidegger, M.B.; Boesiger, P. SENSE: Sensitivity Encoding for Fast MRI. Magn. Reson. Med. 1999, 42, 952–962. [Google Scholar]
  6. Odille, F. Chapter 13—Motion-Corrected Reconstruction. In Advances in Magnetic Resonance Technology and Applications; Akçakaya, M., Doneva, M., Prieto, C., Eds.; Magnetic Resonance Image Reconstruction; Academic Press: Cambridge, MA, USA, 2022; Volume 7, pp. 355–389. [Google Scholar] [CrossRef]
  7. Wang, H.; Peng, H.; Chang, Y.; Liang, D. A survey of GPU-based acceleration techniques in MRI reconstructions. Quant. Imaging Med. Surg. 2018, 8, 196–208. [Google Scholar] [CrossRef] [PubMed]
  8. Stone, S.S.; Haldar, J.P.; Tsao, S.C.; Hwu, W.-m.W.; Sutton, B.P.; Liang, Z.P. Accelerating Advanced MRI Reconstructions on GPUs. J. Parallel Distrib. Comput. 2008, 68, 1307–1318. [Google Scholar] [CrossRef]
  9. Després, P.; Jia, X. A Review of GPU-Based Medical Image Reconstruction. Phys. Med. 2017, 42, 76–92. [Google Scholar] [CrossRef] [PubMed]
  10. Buck, I.; Hanrahan, P. Data Parallel Computation on Graphics Hardware; Technical Report; Stanford University: Stanford, CA, USA, 2003. [Google Scholar]
  11. Laganà, F.; Bibbò, L.; Calcagno, S.; Carlo, D.D.; Pullano, S.A.; Pratticò, D.; Angiulli, G. Smart Electronic Device-Based Monitoring of SAR and Temperature Variations in Indoor Human Tissue Interaction. Appl. Sci. 2025, 15, 2439. [Google Scholar] [CrossRef]
  12. Cook, S. CUDA Programming: A Developer’s Guide to Parallel Computing with GPUs, 1st ed.; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2012. [Google Scholar]
  13. Zeroual, M.A.; Isaieva, K.; Vuissoz, P.A.; Odille, F. Performance Study of an MRI Motion-Compensated Reconstruction Program on Intel CPUs. Appl. Sci. 2024, 14, 9663. [Google Scholar] [CrossRef]
  14. Odille, F.; Vuissoz, P.A.; Marie, P.Y.; Felblinger, J. Generalized Reconstruction by Inversion of Coupled Systems (GRICS) Applied to Free-Breathing MRI. Magn. Reson. Med. 2008, 60, 355–389. [Google Scholar] [CrossRef] [PubMed]
  15. Williams, S.; Waterman, A.; Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. Commun. ACM 2009, 52, 65–76. [Google Scholar] [CrossRef]
  16. Isaieva, K.; Meullenet, C.; Vuissoz, P.A.; Fauvel, M.; Nohava, L.; Laistler, E.; Zeroual, M.A.; Henrot, P.; Felblinger, J.; Odille, F. Feasibility of Online Non-Rigid Motion Correction for High-Resolution Supine Breast MRI. Magn. Reson. Med. 2023, 90, 2130–2143. [Google Scholar] [CrossRef] [PubMed]
  17. Luong, M.; Ferr, G.; Chazel, E.; Gapais, P.F.; Gras, V.; Boulant, N.; Amadon, A. A Compact 16Tx-32Rx Geometrically Decoupled Phased Array for 11.7 T MRI. In Proceedings of the 31st Annual Meeting of the ISMRM, London, UK, 7–12 May 2023; Volume 707. [Google Scholar]
  18. Mugler, J.P., III; Brookeman, J.R. Three-dimensional magnetization-prepared rapid gradient-echo imaging (3D MP RAGE). Magn. Reson. Med. 1990, 15, 152–157. [Google Scholar] [CrossRef] [PubMed]
  19. Cordero-Grande, L.; Ferrazzi, G.; Teixeira, R.P.A.G.; O’Muircheartaigh, J.; Price, A.N.; Hajnal, J.V. Motion-corrected MRI with DISORDER: Distributed and incoherent sample orders for reconstruction deblurring using encoding redundancy. Magn. Reson. Med. 2020, 84, 713–726. [Google Scholar] [CrossRef] [PubMed]
  20. NVIDIA Corporation. cuFFT Library User Guide; NVIDIA Corporation: Santa Clara, CA, USA, 2023. [Google Scholar]
  21. NVIDIA Corporation. cuBLAS Library User Guide; NVIDIA Corporation: Santa Clara, CA, USA, 2023. [Google Scholar]
  22. GitHub-Ebugger. Empirical Roofline Toolkit. Available online: https://github.com/ebugger/Empirical-Roofline-Toolkit (accessed on 4 March 2025).
  23. Nsight Compute. Available online: https://docs.nvidia.com/nsight-compute/NsightCompute/ (accessed on 7 March 2025).
  24. GitHub-Ebugger. Example Scripts for Plotting Roofline. Available online: https://github.com/cyanguwa/nersc-roofline (accessed on 18 March 2025).
  25. Hong, J.; Cho, S.; Park, G.; Yang, W.; Gong, Y.H.; Kim, G. Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory. In Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA), Edinburgh, UK, 2–6 March 2024; pp. 139–155. [Google Scholar]
  26. Kenyon, C.; Volkema, G.; Khanna, G. Overcoming Limitations of GPGPU-Computing in Scientific Applications. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 24–26 September 2019; pp. 1–9. [Google Scholar] [CrossRef]
Figure 1. Example of a roofline model. The performance characteristics of two computational kernels (k_1 and k_2) with respect to the arithmetic intensity and attainable performance are highlighted. k_1 is bandwidth-bound, limited by memory bandwidth. k_2 is in a compute-bound region, approaching peak computational performance. Cache and main memory bandwidth ceilings are represented to highlight resource utilization boundaries.
Figure 2. Comparison of the elapsed time per CG iteration for the three implementations.
Figure 3. Roofline analysis of the implemented CUDA kernels on the NVIDIA A100. The scripts from [24] were used to plot the model. Dataset D1.
Figure 4. Comparison of GRICS reconstructions with the three implementations for dataset D2. (A) Uncorrected reconstruction. (B) The CPU-only solution. (C) The Matlab–GPU solution. (D) The CUDA solution. The three solutions markedly reduced the motion artifacts visible in the blue and orange squares.
Figure 5. Comparison of GRICS reconstructions with the three implementations for dataset D4. (A) Uncorrected reconstruction. (B) The CPU-only solution. (C) The Matlab–GPU solution. (D) The CUDA solution. Improvements in reconstruction quality were noticed in the frontal (yellow square) and parietal (green square) lobes.
Table 1. N_x, N_y, and N_z are the numbers of samples in the frequency-encoding, phase-encoding, and slice directions, respectively. N_c is the number of coils in the coil array. N_t is the number of motion states.
Dataset           D1     D2     D3     D4     D5     D6
N_x               320    320    320    256    256    256
N_y               470    470    550    208    208    208
N_z               112    112    112    160    160    160
N_c               20     30     26     32     32     32
N_t               12     12     12     12     12     12
Magnetic field    3T     3T     3T     7T     7T     7T
Table 2. The main characteristics of the test machines. SM stands for streaming multiprocessor, which is akin to a CPU core. FP stands for floating point. The bandwidth of the memory levels and the peak FP performance were estimated empirically with the Empirical Roofline Toolkit (ERT) [22], which provides a more realistic benchmark of the machine specifications than the theoretical values.
Device                           AMD EPYC     NVIDIA A100
Main memory bandwidth (GB/s)     221.1        1227.7
L2 bandwidth (GB/s)              3242.2       3738.7
Main memory size (GB)            4000         40
Main memory type                 DDR4         HBM2e
Nb cores/SMs                     36           108
Peak FP32 (GFLOP/s)              1348         12,526.1
Release year                     2021         2020
Table 3. NVIDIA Nsight Compute metrics. FP stands for floating point.
Data           Metric
FP64           sm__sass_thread_inst_executed_op_fp64_pred_on.sum
FP32           sm__sass_thread_inst_executed_op_fp32_pred_on.sum
FP16           sm__sass_thread_inst_executed_op_fp16_pred_on.sum
Tensor Core    sm__inst_executed_pipe_tensor.sum
L1 cache       l1tex__t_bytes.sum
L2 cache       lts__t_bytes.sum
HBM            dram__bytes.sum
Table 4. NRMSE of the CPU image and the CUDA image.
Dataset    NRMSE (CPU image, CUDA image)
D1         10⁻⁵
D2         10⁻⁶
D3         10⁻⁵
D4         10⁻⁴
D5         10⁻⁴
D6         10⁻⁴

