1. Introduction
The fast Fourier transform (FFT) is a fast and exact algorithm for computing the discrete Fourier transform (DFT) when the data are acquired on an equispaced grid. In certain image processing fields, however, the frequency locations are irregularly distributed, which precludes the direct use of the FFT. The alternative nonuniform fast Fourier transform (NUFFT) algorithm offers a fast mapping for computing nonequispaced frequency components.
Python is a fully-fledged and well-supported programming language in data science. The importance of the Python language can be seen in the recent surge of interest in machine learning: developers have increasingly relied on Python to build software, taking advantage of its abundant libraries and active community. Yet the standard Python numerical environments lack a native NUFFT implementation, and an efficient Python NUFFT (PyNUFFT) may fill this gap in the image processing field. However, Python is notorious for its slow execution, which hinders the implementation of an efficient NUFFT.
During the past decade, the speed of Python has been greatly improved by numerical libraries with rapid array manipulations and vendor-provided performance libraries. However, parallel computing using a multi-threading model cannot easily be implemented in Python. This problem is mostly due to the global interpreter lock (GIL) of the Python interpreter, which allows only one core to be used at a time, so the multi-threading capabilities of modern symmetric multiprocessing (SMP) processors cannot be exploited.
Recently, general-purpose graphics processing unit (GPGPU) computing has allowed an enormous acceleration by offloading array computations onto graphics processing units, which are equipped with several thousand parallel processing units. This emerging programming model may enable an efficient Python NUFFT package by circumventing the limitations of the GIL. Two GPGPU architectures are commonly used, i.e., the proprietary Compute Unified Device Architecture (CUDA, NVIDIA, Santa Clara, CA, USA) and the Open Computing Language (OpenCL, Khronos Group, Beaverton, OR, USA). These two similar schemes can be ported to each other, or can be dynamically generated from Python code [1].
The current PyNUFFT package was implemented and has been optimized for heterogeneous systems, including multi-core central processing units (CPUs) and graphics processing units (GPUs). The design of PyNUFFT aims to reduce the runtime while maintaining the readability of the Python program. PyNUFFT provides the following features: (1) algorithms written and tested for heterogeneous platforms (including multi-core CPUs and GPUs); (2) pre-indexing to handle multidimensional NUFFT and image gradients; and (3) several nonlinear solvers.
2. Materials and Methods
The PyNUFFT software was developed on a 64-bit Linux system and tested on a Windows system. PyNUFFT has been re-engineered to improve its performance using the PyOpenCL, PyCUDA [1], and Reikna [2] libraries.
Figure 1 illustrates the code generation and the memory hierarchy on multi-core CPUs and GPUs. The single precision floating point (FP32) version has been released under a dual MIT License and GNU Lesser General Public License v3.0 (LGPL-3.0) [3], which allows it to be used in a variety of projects.
2.1. PyNUFFT: An NUFFT Implementation in Python
The execution of PyNUFFT proceeds through three major stages: (1) scaling; (2) oversampled FFT; and (3) interpolation. The three stages can be formulated as a combination of linear operations:

$\mathbf{A} = \mathbf{V}\mathbf{F}\mathbf{S}$

where $\mathbf{A}$ is the NUFFT, $\mathbf{S}$ is the scaling, $\mathbf{F}$ is the FFT, and $\mathbf{V}$ is the interpolation. A large interpolator can achieve high accuracy for different kernels, at the cost of higher memory usage and lower execution speed; to improve performance, a smaller kernel is preferred. It was previously shown that the min–max interpolator [4] achieves accurate results with a kernel size of 6–7. In the min–max interpolator, the scaling factor and the kernel are designed to minimize the maximum error in k-space.
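The three-stage pipeline can be sketched in a few lines of NumPy/SciPy. The sketch below is only a toy model of a 1D NUFFT: it substitutes an identity scaling vector and a 2-tap linear kernel for the min–max design used by PyNUFFT, and the function names (`plan_nufft_1d`, `nufft_1d`) are hypothetical, not part of the PyNUFFT API.

```python
import numpy as np
from scipy import sparse

def plan_nufft_1d(om, N, K):
    """Plan a toy 1D NUFFT: scaling vector s and sparse interpolator V.

    om : nonuniform frequencies in radians per sample.
    N  : signal length; K : oversampled grid size (e.g., K = 2 * N).
    The min-max scaling/kernel is replaced here by identity scaling
    and a 2-tap linear kernel for brevity.
    """
    s = np.ones(N)                           # stage 1: (trivial) scaling
    gk = om * K / (2 * np.pi)                # fractional grid coordinate
    k0 = np.floor(gk).astype(int)
    frac = gk - k0
    idx = np.arange(len(om))
    rows = np.concatenate([idx, idx])
    cols = np.concatenate([k0 % K, (k0 + 1) % K])
    vals = np.concatenate([1.0 - frac, frac])
    V = sparse.csr_matrix((vals, (rows, cols)), shape=(len(om), K))
    return s, V

def nufft_1d(x, s, V, K):
    """Forward NUFFT: scaling -> oversampled FFT -> interpolation (A = V F S)."""
    xk = np.zeros(K, dtype=complex)
    xk[:len(x)] = s * x                      # scale and zero-pad
    return V @ np.fft.fft(xk)                # oversampled FFT, then interpolate
```

When a requested frequency falls exactly on the oversampled grid, the linear kernel is exact and the output matches the direct DFT; for off-grid frequencies the wider min–max kernel is needed for accuracy.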
The design of the three-stage NUFFT algorithm is illustrated in Figure 2A. To save on data transfer times, variables are created in the device global memory.
Currently, multi-coil computations are realized in loop mode or batch mode. The loop mode is robust, but the computation time is proportional to the number of parallel coils. In batch mode, the variables of multiple coils are created in device memory at once; thus, the use of batch mode can be restricted by the available memory of the device.
2.1.1. Scaling
Scaling was performed by in-place multiplication (cMultiplyVecInplace), after the complex multidimensional array had been offloaded to the device.
2.1.2. Oversampled FFT
This stage is composed of two steps: (1) zero padding, which copies the small array into a larger array; and (2) the FFT. A naive first step would recompute the array indexes on the fly with logical operations; however, GPUs are specialized for floating-point arithmetic, and matrix reshaping is not efficiently supported on GPUs. It is therefore better to replace matrix reshaping with other GPU-friendly mechanisms.
Here, a pre-indexing procedure is implemented to avoid matrix reshaping on the fly: the cSelect subroutine copies array1 to array2 according to the precomputed indexes order1 and order2 (see Figure 3). This pre-indexing greatly simplifies the algorithm on GPU platforms, and it generalizes to multidimensional arrays (with sizes ${N}_{in}$, ${N}_{out}$). Once the indexes (inlist, outlist) are obtained, the input array can be rapidly copied into the larger array; no matrix reshaping is needed during the iterations.
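The pre-indexed copy can be illustrated with NumPy flat indices. The function names below (`plan_copy_indices`, `cselect`) are hypothetical stand-ins that mimic the described cSelect behavior, not the PyNUFFT internals.

```python
import numpy as np

def plan_copy_indices(N_in, N_out):
    """Precompute flat indices (inlist, outlist) that place an N_in array
    into the corner of a larger N_out array, mimicking cSelect pre-indexing."""
    grids = np.meshgrid(*[np.arange(n) for n in N_in], indexing='ij')
    inlist = np.ravel_multi_index(grids, N_in).ravel()
    outlist = np.ravel_multi_index(grids, N_out).ravel()
    return inlist, outlist

def cselect(x, inlist, outlist, N_out):
    """Zero-pad x into an N_out array with a single gather/scatter;
    no reshaping is performed inside the loop that calls this."""
    out = np.zeros(N_out, dtype=x.dtype).ravel()
    out[outlist] = x.ravel()[inlist]
    return out.reshape(N_out)
```

The planning step runs once; during iterations only the flat copy executes, which maps naturally onto a GPU kernel.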
The inverse FFT (IFFT) is based on the same subroutines as the oversampled FFT, but with the order of computations reversed: the IFFT is followed by array copying.
2.1.3. Interpolation
While the current PyNUFFT includes the min–max interpolator [4], other kernels can also be used. The scaling factor of the min–max interpolator is designed to minimize the error of off-grid samples [4]. The interpolation kernel is stored in the Compressed Sparse Row (CSR) format for a C-order (row-major) array. Thus, the indexing is arranged for C-order, and the cSparseMatVec subroutine can quickly compute the interpolation without matrix reshaping. The cSparseMatVec routine is optimized to exploit data coalescence and parallel threads on heterogeneous platforms [5], and it adopts the two-sum algorithm, which costs 6 floating-point operations (FLOPs) [6]. The warp in CUDA (the wavefront in OpenCL) controls the size of the work-groups in the cSparseMatVec kernel. Note that the indexing of a C-ordered array differs from the F-order (column-major) Fortran array used in MATLAB.

Gridding is the conjugate transpose of the interpolation, and it uses the same cSparseMatVec subroutine.
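Interpolation as a CSR sparse matrix-vector product, and gridding as its conjugate transpose, can be sketched with SciPy. The matrix below is a random stand-in for the kernel matrix $\mathbf{V}$, purely for illustration.

```python
import numpy as np
from scipy import sparse

# A stand-in sparse interpolation matrix V (M nonuniform samples x K grid
# points); in PyNUFFT the entries come from the min-max kernel and V is
# stored in CSR format for a C-order (row-major) array.
M, K = 5, 8
V = sparse.random(M, K, density=0.4, random_state=0).tocsr()

rng = np.random.default_rng(0)
Xk = rng.standard_normal(K) + 1j * rng.standard_normal(K)  # oversampled spectrum

y = V @ Xk               # interpolation: grid -> nonuniform samples
grid = V.conj().T @ y    # gridding: the conjugate transpose (V^H)
```

The defining property is the adjoint identity $\langle \mathbf{w}, \mathbf{V}\mathbf{x} \rangle = \langle \mathbf{V}^{\mathbf{H}}\mathbf{w}, \mathbf{x} \rangle$, which holds for any such pair of operations.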
2.1.4. Adjoint PyNUFFT
The adjoint NUFFT reverses the order of the forward NUFFFT stages; each stage is the conjugate transpose (Hermitian transpose) of the corresponding forward stage:

$\mathbf{A}^{\mathbf{H}} = \mathbf{S}^{\mathbf{H}}\mathbf{F}^{\mathbf{H}}\mathbf{V}^{\mathbf{H}}$

which is also illustrated in Figure 4.
2.1.5. SelfAdjoint NUFFT (Toeplitz)
In iterative algorithms, the cost function is used to represent the data fidelity:

$J(\mathbf{x}) = {\Vert \mathbf{A}\mathbf{x} - \mathbf{y} \Vert}_2^2$

The minimization of the cost function finds the solution at $\nabla J(\mathbf{x}) = 0$, which leads to the normal equation composed of interpolation ($\mathbf{A}$) and gridding (${\mathbf{A}}^{\mathbf{H}}$):

${\mathbf{A}}^{\mathbf{H}}\mathbf{A}\mathbf{x} = {\mathbf{A}}^{\mathbf{H}}\mathbf{y}$

Thus, precomputing ${\mathbf{A}}^{\mathbf{H}}\mathbf{A}$ can improve the runtime efficiency [7]. See Figure 5 for the software implementation.
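The benefit of the precomputed self-adjoint operator can be seen with a small dense stand-in for $\mathbf{A}$ (the real NUFFT is never stored densely; this is only to show that the two evaluation orders agree):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 12)) + 1j * rng.standard_normal((20, 12))  # stand-in for the NUFFT
x = rng.standard_normal(12) + 1j * rng.standard_normal(12)

AHA = A.conj().T @ A              # precomputed once, before the iterations
per_iter = A.conj().T @ (A @ x)   # interpolation then gridding, every iteration
precomputed = AHA @ x             # one self-adjoint (Toeplitz) application
```

Both products are identical; the precomputed form trades a one-time setup cost for a cheaper per-iteration application.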
2.2. Solver
PyNUFFT provides solvers to restore multidimensional images (or one-dimensional signals in the time domain) from the nonequispaced frequency samples. A great number of methods exist for non-Cartesian image reconstruction. These methods are usually categorized into three families: (1) density compensation and adjoint NUFFT; (2) least squares regression in k-space; and (3) iterative NUFFT.
2.2.1. Density Compensation and Adjoint NUFFT
The sampling density compensation method introduces a tapering function $\mathbf{w}$ (the sampling density compensation function), which can be calculated from the following stable iterations [8]:

${\mathbf{w}}_{i+1} = {\mathbf{w}}_{i} \oslash \left( \mathbf{V}{\mathbf{V}}^{\mathbf{H}}{\mathbf{w}}_{i} \right)$

where the element-wise division (⊘) compensates for the over- or underestimation between the current ${\mathbf{w}}_{i}$ and the next ${\mathbf{w}}_{i+1}$; the element-wise division tends to drive the denominator toward one. Once $\mathbf{w}$ is prepared, the sampling density compensation reconstruction can be calculated by:

$\widehat{\mathbf{x}} = {\mathbf{A}}^{\mathbf{H}}\left( \mathbf{w} \odot \mathbf{y} \right)$

Here, the element-wise multiplication operator ⊙ multiplies the data ($\mathbf{y}$) by the sampling density compensation function ($\mathbf{w}$).
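The stable iteration can be sketched directly from the interpolation matrix. The function name `pipe_menon_dcf` is a hypothetical label for this sketch, and the small positive matrix used in the usage example is synthetic.

```python
import numpy as np

def pipe_menon_dcf(V, n_iter=20):
    """Sampling density compensation by the stable iteration
    w_{i+1} = w_i ./ |V V^H w_i| (element-wise), after Pipe & Menon.
    V maps the oversampled grid to the M nonuniform samples."""
    w = np.ones(V.shape[0])
    for _ in range(n_iter):
        denom = np.abs(V @ (V.conj().T @ w))
        w = w / np.maximum(denom, 1e-12)    # element-wise division (⊘)
    return w
```

After a few iterations the denominator $\mathbf{V}\mathbf{V}^{\mathbf{H}}\mathbf{w}$ approaches one element-wise, which is the fixed point of the iteration.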
2.2.2. Least Square Regression
Least squares regression is a solution to the inverse problem of image reconstruction. Consider the following problem:

$\widehat{\mathbf{x}} = \underset{\mathbf{x}}{\mathrm{argmin}} \; {\Vert \mathbf{A}\mathbf{x} - \mathbf{y} \Vert}_2^2$

The solution $\widehat{\mathbf{x}}$ is estimated from the above minimization problem. Due to the enormous memory requirements of the large NUFFT matrix $\mathbf{A}$, iterative algorithms are more frequently used than a direct solve.
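A minimal iterative solver for this problem is the conjugate gradient method applied to the normal equations, which only needs matrix-vector products with $\mathbf{A}$ and $\mathbf{A}^{\mathbf{H}}$ (so the operator can stay matrix-free). The sketch below uses a small dense matrix purely for demonstration; `cg_normal` is a hypothetical name, not a PyNUFFT function.

```python
import numpy as np

def cg_normal(A, y, n_iter=50):
    """Conjugate gradient on the normal equations A^H A x = A^H y.
    Only products with A and A^H are needed, so a matrix-free NUFFT
    operator could be substituted for the dense A used here."""
    AH = A.conj().T
    b = AH @ y
    x = np.zeros(A.shape[1], dtype=complex)
    r = b - AH @ (A @ x)                   # initial residual
    p = r.copy()
    rs = np.vdot(r, r)
    for _ in range(n_iter):
        Ap = AH @ (A @ p)
        alpha = rs / np.vdot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r)
        if np.sqrt(abs(rs_new)) < 1e-12:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For an overdetermined full-rank system this converges to the least squares solution.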
2.2.3. Iterative NUFFT Using Variable Splitting
Iterative NUFFT reconstruction solves the inverse problem with various forms of image regularization. Because of the large size of the interpolation and the FFT, iterative NUFFT is computationally expensive; PyNUFFT is therefore also optimized for iterative reconstructions on heterogeneous systems.
Pre-indexing for fast image gradients: Total variation is a basic image regularization that has been extensively used in image denoising and image reconstruction. The image gradient is computed from the difference between adjacent pixels, which is represented as follows:

${\left( {\nabla}_{i} \mathbf{x} \right)}_{{a}_{i}} = {x}_{{a}_{i}+1} - {x}_{{a}_{i}}$

where ${a}_{i}$ is the index along the $i$th axis. Computing the image gradient requires image rolling, followed by a subtraction of the original image from the rolled image. However, multidimensional image rolling on heterogeneous systems is expensive, and PyNUFFT adopts pre-indexing to save runtime. This pre-indexing procedure generates the indexes of the rolled image, and the indexes are offloaded to the heterogeneous platform before the iterative algorithm starts; image rolling is then not needed during the iterations. The acceleration of this pre-indexing method is demonstrated in Figure 3, in which pre-indexing makes the image gradient run faster on both the CPU and the GPU.
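The pre-indexed gradient can be sketched as a one-time index plan followed by a gather and a subtraction. Function names here (`plan_roll_indices`, `image_gradient`) are illustrative only.

```python
import numpy as np

def plan_roll_indices(shape, axis):
    """Precompute flat indices of the image rolled by -1 along `axis`,
    so the gradient needs only a gather and a subtraction at run time."""
    idx = np.arange(np.prod(shape)).reshape(shape)
    return np.roll(idx, -1, axis=axis).ravel()

def image_gradient(x, rolled_idx):
    """Circular forward difference along the planned axis, with no
    rolling of the image itself during the iterations."""
    flat = x.ravel()
    return (flat[rolled_idx] - flat).reshape(x.shape)
```

The result is identical to `np.roll(x, -1, axis) - x`, but the expensive multidimensional roll happens only once, at planning time.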
ℓ1 total variation-regularized ordinary least squares (L1TV-OLS): The ℓ1 total variation-regularized reconstruction includes piecewise smoothing in the reconstruction model:

$\widehat{\mathbf{x}} = \underset{\mathbf{x}}{\mathrm{argmin}} \left\{ \mu {\Vert \mathbf{A}\mathbf{x} - \mathbf{y} \Vert}_2^2 + \lambda \, \mathit{TV}\left(\mathbf{x}\right) \right\}$

where $\mathit{TV}\left(\mathbf{x}\right)$ is the total variation of the image:

$\mathit{TV}\left(\mathbf{x}\right) = {\left\Vert \sqrt{ {\left| {\nabla}_{\mathbf{x}} \mathbf{x} \right|}^2 + {\left| {\nabla}_{\mathbf{y}} \mathbf{x} \right|}^2 } \right\Vert}_1$

Here, ${\nabla}_{\mathbf{x}}$ and ${\nabla}_{\mathbf{y}}$ are directional gradient operators applied in the image domain along the $\mathbf{x}$ and $\mathbf{y}$ axes. The problem is solved by the variable-splitting method, which has already been developed [10,11,12]. The iterations of L1TV-OLS are explicitly shown in Algorithm 1.
Algorithm 1: The pseudocode for the ℓ1 total variation-regularized ordinary least squares (L1TV-OLS) algorithm

ℓ1 total variation-regularized least absolute deviation (L1TV-LAD): Least absolute deviation (LAD) is a statistical regression model that is robust to non-stationary noise distributions [13]. It is possible to solve the ℓ1 total variation-regularized problem with the LAD cost function [14]:

$\widehat{\mathbf{x}} = \underset{\mathbf{x}}{\mathrm{argmin}} \left\{ \mu {\Vert \mathbf{A}\mathbf{x} - \mathbf{y} \Vert}_1 + \lambda \, \mathit{TV}\left(\mathbf{x}\right) \right\}$

where $\mathit{TV}\left(\mathbf{x}\right)$ is the total variation of the image. Note that the LAD term is the ℓ1 norm of the data fidelity. The iterations of L1TV-LAD are shown in Algorithm 2. Note that the shrinkage function (shrink) in Algorithm 2 can be quickly computed on the CPU as well as on heterogeneous systems.
Algorithm 2: The pseudocode for the ℓ1 total variation-regularized least absolute deviation (L1TV-LAD) algorithm
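The shrinkage function used in these iterations is the standard complex soft-thresholding operator, the proximal operator of the ℓ1 norm; a minimal sketch:

```python
import numpy as np

def shrink(x, t):
    """Complex soft-thresholding: reduce |x| by t (to at most zero)
    while keeping the phase; the proximal operator of the l1 norm."""
    mag = np.abs(x)
    return x / np.maximum(mag, 1e-12) * np.maximum(mag - t, 0)
```

It is a purely element-wise operation, which is why it parallelizes trivially on the CPU and on heterogeneous devices.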

Multi-coil image reconstruction: In multi-coil regularized image reconstruction, the self-adjoint NUFFT of Section 2.1.5 is extended to multi-coil data:

$\sum_{i} {\mathbf{c}}_{i}^{*} \odot {\mathbf{A}}^{\mathbf{H}}\mathbf{A}\left( {\mathbf{c}}_{i} \odot \mathbf{x} \right)$

where the coil sensitivities (${\mathbf{c}}_{i}$) of the multiple channels multiply each channel before the NUFFT ($\mathbf{A}$) and after the adjoint NUFFT (${\mathbf{A}}^{\mathbf{H}}$). Sensitivity profiles can be estimated either from the magnitude of the smoothed coil images divided by the root-mean-squared image, or by the dedicated eigenvalue decomposition method [15]. See Figure 6 for a visual example of the estimation of coil sensitivity profiles.
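The first estimation method can be sketched as smoothing each coil image and normalizing by the root-sum-of-squares combination. This is a rough sketch under assumed choices (a uniform smoothing filter of arbitrary size, root-sum-of-squares normalization); `estimate_sensitivities` is a hypothetical name.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def estimate_sensitivities(coil_images, size=5):
    """Rough coil-sensitivity estimate: smoothed coil images divided by
    their root-sum-of-squares combination. The filter type and size are
    arbitrary choices for this sketch."""
    smoothed = np.stack([
        uniform_filter(img.real, size) + 1j * uniform_filter(img.imag, size)
        for img in coil_images])
    rss = np.sqrt(np.sum(np.abs(smoothed) ** 2, axis=0))
    return smoothed / np.maximum(rss, 1e-12)
```

By construction, the estimated sensitivities satisfy $\sum_i |\mathbf{c}_i|^2 = 1$ at every pixel.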
2.2.4. Iterative NUFFT Using the PrimalDual Type Method
In Cartesian magnetic resonance imaging (MRI), the $\mathbf{K}$ matrix of the ℓ2 subproblem can be quickly (and exactly) solved with a diagonal matrix in the Fourier domain, i.e., ${\mathbf{F}}^{-1}\mathbf{K}\mathbf{F}$ is strictly diagonal. This diagonal matrix is very convenient for compressed sensing MRI on the Cartesian grid [16]. In some non-Cartesian k-spaces, however, the ${\mathbf{F}}^{-1}\mathbf{K}\mathbf{F}$ matrix is not strictly diagonal, which makes variable-splitting methods prone to numerical errors. These errors may accumulate during the image reconstruction, causing instabilities.

Alternatively, the primal-dual hybrid gradient type of algorithm [17,18] offers a simpler solution to the ℓ1-ℓ2 regularized problems. The primal-dual hybrid gradient algorithms eliminate the ℓ2 subproblem, which overcomes one of the major shortcomings of variable splitting for ℓ1-ℓ2 problems.
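The structure of a primal-dual hybrid gradient iteration can be seen on a toy 1D TV denoising problem, where no ℓ2 subproblem has to be inverted. This is a generic Chambolle-Pock-style sketch on an assumed toy problem, not the PyNUFFT solver; `pdhg_tv_denoise` and its step sizes are illustrative choices satisfying the usual condition $\sigma\tau\Vert D\Vert^2 \le 1$.

```python
import numpy as np

def pdhg_tv_denoise(y, lam, n_iter=200, tau=0.25, sigma=0.5):
    """Primal-dual hybrid gradient for the toy problem
    min_x 0.5*||x - y||^2 + lam*||D x||_1,
    with D the circular first difference."""
    D = lambda x: np.roll(x, -1) - x          # forward difference
    DT = lambda p: np.roll(p, 1) - p          # its adjoint
    x = y.copy()
    x_bar = x.copy()
    p = np.zeros(len(y))
    for _ in range(n_iter):
        # dual ascent + projection onto the l-infinity ball of radius lam
        p = np.clip(p + sigma * D(x_bar), -lam, lam)
        # primal descent; the data term 0.5*||x - y||^2 has a closed-form prox
        x_new = (x - tau * DT(p) + tau * y) / (1 + tau)
        x_bar = 2 * x_new - x
        x = x_new
    return x
```

Each iteration needs only the forward operator, its adjoint, and element-wise proximal steps, which is what makes this family attractive for non-Cartesian problems.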
2.3. Applications to Brain MRI
A Periodically Rotated Overlapping ParallEL Lines with Enhanced Reconstruction (PROPELLER) k-space [22] was used in the simulation study. Non-Cartesian data were retrospectively generated from a fully sampled 3T brain MRI template [23]. In the conventional methods, the data were regridded onto a 512 × 512 k-space by linear and cubic spline interpolation, and the gridded data were processed by the IFFT. Three PyNUFFT algorithms were compared with the conventional IFFT-based method. The matrix size was 512 × 512, the oversampling ratio was 2, and the kernel size was 6. The parameters of L1TV-OLS and L1TV-LAD were $\mu = 1$ and $\lambda = 1$, with a maximum of 500 iterations.
2.4. 3D Computational Phantom
The pre-indexing mechanism of PyNUFFT allows multidimensional NUFFT and reconstruction to be carried out. In this simulation study, a three-dimensional (3D) computational phantom [24] (Figure 7A) was used to generate data randomly scattered in the 3D k-space (Figure 7B). The image parameters were: matrix size = 64 × 64 × 64, subsampling ratio = 57.7%, oversampled grid = 1×, and kernel size = 1. The parameters of L1TV-OLS and L1TV-LAD were $\mu = 1$ and $\lambda = 0.1$; the number of outer iterations was 500.
2.5. Benchmark
PyNUFFT was tested on a Linux system. All the computations were completed with complex single precision floating point (FP32). The configurations of the CPU and GPU systems were as follows.
Multi-core CPU: The CPU instance (m4.16xlarge, Amazon Web Services) was equipped with 64 vCPUs (Intel E5 2686 v4) and 61 GB of memory. The number of vCPUs could be dynamically controlled by the CPU hotplug functionality of the Linux system, and computations were offloaded onto the Intel OpenCL CPU device with 1 to 64 threads. The single-thread CPU computations were carried out with Numpy compiled with the FFTW library [25]. PyNUFFT was executed on the multi-core CPU instance with 1–64 threads. The PyNUFFT transforms were offloaded to the OpenCL CPU device and were executed 20 times, and the runtimes of the transforms were compared with the runtimes on the single-thread CPU.
Iterative reconstructions were also tested on the multi-core CPU. The matrix size was 256 × 256, and the kernel size was 6. The execution times of the conjugate gradient method, the ℓ1 total variation-regularized reconstruction (L1TV-OLS), and the ℓ1 total variation-regularized LAD (L1TV-LAD) were measured on the multi-core system.
GPU: The GPU instance (p2.xlarge, Amazon Web Services) was equipped with 4 vCPUs (Intel E5 2686 v4) and one Tesla K80 (NVIDIA, Santa Clara, CA, USA) with two GK210 GPUs. Each GPU was composed of 2496 parallel processing cores and 12 GB of memory. Computations were preprocessed and offloaded onto one GPU by CUDA or OpenCL APIs. Computations were repeated 20 times to measure the average runtimes. The matrix size was 256 × 256, and kernel size was 6.
Iterative reconstructions were also tested on the K80 GPU, and the execution times of the conjugate gradient method, L1TV-OLS, and L1TV-LAD on the GPU were compared with the iterative solvers on the single-thread CPU.
Comparison between PyNUFFT and Python nfft: It was previously shown that the min–max interpolator yields a better estimation of the DFT than the Gaussian kernel for kernel sizes of less than 8 [4]. Thus, a similar test was carried out, and the amplitudes at 1000 randomly scattered nonuniform locations computed by PyNUFFT and nfft were compared with the amplitudes of the DFT. The input 1D array length was 256, and the kernel size was 2–7.
The computation times of PyNUFFT and Python nfft were also measured on a single CPU core. This testing used a Linux system equipped with an Intel Core i76700HQ running at 2.6–3.1 GHz (Intel, Santa Clara, CA, USA) with a system memory of 16 GB.
Comparison between PyNUFFT and gpuNUFFT: This testing used a Linux system equipped with an Intel Core i7-6700HQ running at 2.6–3.1 GHz (Intel, Santa Clara, CA, USA), 16 GB of system memory, and an NVIDIA GeForce GTX 965M (945 MHz) (NVIDIA, Santa Clara, CA, USA) with 2 GB of video memory (driver version 387.22), using CUDA toolkit version 9.0.176. The gpuNUFFT was compiled with FP16. The parameters of the testing were: image size = 256 × 256, oversampling ratio = 2, and kernel size = 6. The GPU version of PyNUFFT was executed with parameters identical to gpuNUFFT. A radial k-space with 64 spokes [26] was used for both gpuNUFFT and PyNUFFT.
Scalability of PyNUFFT: A study of the scalability of PyNUFFT was carried out to compare the runtimes of different matrix sizes and the number of nonuniform locations. The system was equipped with a CPU (Intel Core i7 6700HQ at 3500 MHz, 16 GB system memory) and a GPU (NVIDIA GeForce GTX 965m at 945 MHz, 2 GB device memory).
3. Results
3.1. Applications to Brain MRI
The data fidelity of L1TV-OLS and L1TV-LAD converged between 100 and 500 iterations (Figure 8). Some visual results of the MRI reconstructions can be seen in Figure 9. The error of cubic spline interpolation (mean-squared error (MSE) = 12.8%) was greater than the error of linear interpolation (MSE = 5.45%). In comparison, the NUFFT-based algorithms obtained lower errors than the conventional IFFT and gridding methods. Iterative NUFFT using L1TV-OLS (MSE = 2.40%) and L1TV-LAD (MSE = 2.38%) yielded fewer ripples in the brain structure than the sampling density compensation method (MSE = 2.87%).
3.2. 3D Computational Phantom
The results of the 3D phantom study can be seen in Figure 7. While the conjugate gradient (CG) method generated an image volume with artifacts and blurring, L1TV-OLS and L1TV-LAD restored the image details and preserved the edges of the phantom.
3.3. Benchmarks
Multi-core CPU: Table 1 lists the profile of each stage of the forward NUFFT, adjoint NUFFT, and self-adjoint NUFFT (Toeplitz). The overall execution of PyNUFFT was faster on the multi-core CPU platform than on the single-thread CPU, yet the acceleration factor of each stage varied from 0.95 (no acceleration) to 15.6. Compared with computations on a single thread, 32 threads accelerated interpolation and gridding by a factor of 12, and the FFT and IFFT by a factor of 6.2–15.6.

The benefits of 32 threads are limited for certain computations, including scaling, rescaling, and interpolation-gridding (${\mathbf{V}}^{\mathbf{H}}\mathbf{V}$). In these computations, the acceleration factors of 32 threads range from 0.95 to 1.85. This limited performance gain is due to the high efficiency of the single-thread versions, which leaves little room for improvement. In particular, the integrated interpolation-gridding (${\mathbf{V}}^{\mathbf{H}}\mathbf{V}$) is already 10 times faster than the separate interpolation and gridding sequence: on a single-thread CPU, ${\mathbf{V}}^{\mathbf{H}}\mathbf{V}$ requires only 4.79 ms, whereas the separate interpolation ($\mathbf{V}$) and gridding (${\mathbf{V}}^{\mathbf{H}}$) require 49 ms. In this case, 32 threads only deliver an extra 83% of performance to ${\mathbf{V}}^{\mathbf{H}}\mathbf{V}$.
Figure 10 illustrates the acceleration on the multi-core CPU against the single-thread CPU. The performance of PyNUFFT improved by a factor of 5–10 when the number of threads increased from 1 to 20, and the software achieved peak performance with 30–32 threads (equivalent to 15–16 physical CPU cores); more than 32 threads brought no substantial improvement.

The forward NUFFT, adjoint NUFFT, and self-adjoint NUFFT (Toeplitz) were accelerated on 32 threads by a factor of 7.8–9.5. The acceleration factors of the iterative solvers (conjugate gradient method, L1TV-OLS, and L1TV-LAD) were 4.2–5.
GPU: Table 1 shows that the GPU delivers a generally faster PyNUFFT transform, with acceleration factors ranging from 2 to 31.

Scaling and rescaling showed a moderate degree of acceleration. The most significant acceleration occurred in the interpolation-gridding (${\mathbf{V}}^{\mathbf{H}}\mathbf{V}$), in which the GPU was 26–31 times faster than the single-thread CPU. This exceeds the acceleration factors for the separate interpolation ($\mathbf{V}$, with 6× acceleration) and gridding (${\mathbf{V}}^{\mathbf{H}}$, with 4–4.6× acceleration).

The forward NUFFT, adjoint NUFFT, and self-adjoint NUFFT (Toeplitz) were accelerated on the K80 GPU by factors of 5.4–13. The iterative solvers on the GPU were 6.3–8.9 times faster than on a single thread, and about twice as fast as with 32 threads.
Comparison between PyNUFFT and Python nfft: The comparison between PyNUFFT and nfft (Figure 11) evaluated (1) the accuracy of the min–max interpolator and the Gaussian kernel; and (2) the runtimes on a single-core CPU. The min–max interpolator in PyNUFFT attains a lower error than the Gaussian kernel. PyNUFFT also requires less CPU time than nfft, because nfft recalculates the interpolation matrix on each nfft or nfft_adjoint call.
Comparison between PyNUFFT and gpuNUFFT: Figure 12 compares the runtimes of the different GPU implementations. In the forward NUFFT, the fastest is PyNUFFT (batch), followed by PyNUFFT (loop) and gpuNUFFT. In the single-coil adjoint NUFFT, the performance of PyNUFFT (loop), PyNUFFT (batch), and gpuNUFFT is similar. Multi-coil NUFFT increases the runtimes, and PyNUFFT (loop) is the slowest adjoint transform in the case of three coils.
Scalability of PyNUFFT: Figure 13 evaluates the performance of the forward and adjoint NUFFT versus the number of nonuniform locations (M) for different matrix sizes. The condition M = 300,000 is close to a fully sampled 512 × 512 k-space (with 262,144 samples). The values at M = 0 (the y-intercept) indicate the runtimes of scaling ($\mathbf{S}$) and FFT ($\mathbf{F}$), which change with the matrix size. The slope reflects the runtime versus M, which is due to the interpolation ($\mathbf{V}$) or gridding (${\mathbf{V}}^{\mathbf{H}}$).
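This intercept-plus-slope decomposition amounts to fitting a linear runtime model $t(M) = t_0 + sM$ per matrix size. The sketch below shows the fit with invented runtime numbers; the values are not measurements from the paper.

```python
import numpy as np

# Hypothetical runtimes (ms) versus number of nonuniform samples M for one
# matrix size; the numbers are invented purely to illustrate the model
# t(M) = intercept + slope * M, where the intercept reflects scaling + FFT
# and the slope reflects interpolation/gridding.
M = np.array([1e3, 1e4, 1e5, 3e5])
t = np.array([2.1, 2.4, 5.0, 10.9])       # invented measurements
slope, intercept = np.polyfit(M, t, 1)    # least squares line fit
```

The fitted intercept estimates the M-independent cost (scaling and FFT), while the slope estimates the per-sample cost of interpolation or gridding.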
For a large problem size (matrix size = 512 × 512, M = 300,000), GPU PyNUFFT requires less than 10 ms in the forward transform, and less than 15 ms in the adjoint transform. For a small problem size (matrix size = 128 × 128, M = 1000), GPU PyNUFFT requires 830 ns in the forward transform, and 850 ns in the adjoint transform.
4. Discussion
4.1. Related Work
Different kernel functions (interpolators) are available in previous NUFFT implementations, including: (1) the min–max interpolator [4]; (2) fast radial basis functions [27,28]; (3) the least squares interpolator [29]; (4) the least mean squared error interpolator [30]; (5) fast Gaussian summation [31]; (6) the Kaiser–Bessel function [26]; and (7) the linear system transfer function or inverse reconstruction [32,33].
NUFFTs have been implemented in different programming languages: (1) MATLAB (Mathworks Inc., MA, USA) [4,26,30,34]; (2) C++ [35]; (3) CUDA [26,35]; (4) Fortran [31]; and (5) OpenACC using a PGI compiler (PGI Compilers & Tools, NVIDIA Corporation, Beaverton, OR, USA) [36]. Several Python implementations of the NUFFT came to our attention during the preparation of this manuscript: a one-dimensional nonequispaced fast Fourier transform (the nfft package in pure Python) and the multidimensional Python nfft (a Python wrapper of the NFFT C library). A Python-based MRI reconstruction toolbox (mripy) based on nfft was accelerated using the Numba compiler. The NUFFT has also been accelerated on single and multiple GPUs: fast iterative NUFFT using the Kaiser–Bessel function was accelerated on a GPU with total variation regularization [35] and total generalized variation regularization [26]. Real-time inverse reconstructions were developed by Sebastian et al. [37] and Murphy et al. [38], in which a pre-gridding procedure saves interpolation and gridding during the iterations. The patent of Nadar et al. [39] describes a custom multi-GPU buffer to improve memory access for image reconstruction with a nonuniform k-space.

In addition to the NUFFT, the iterative DFT can also be accelerated on GPUs [40,41].
4.2. Discussions of PyNUFFT
The current PyNUFFT has the advantage of high portability across different hardware and software systems. The NUFFT transforms (forward, adjoint, and Toeplitz) have been accelerated on multi-core CPUs and GPUs. In particular, the benefits of fast iterative solvers (including least squares and iterative NUFFT) have been shown in the benchmark results: the image reconstruction time (with 100 iterations) for one 256 × 256 image is less than 4 s on a 32-thread CPU platform and less than 2 s on a GPU platform.
The current PyNUFFT has been tested with computations using single precision floating point numbers (FP32). However, the number of double precision floating point (FP64) units on a GPU is only a fraction of the number of FP32 units, which would slow down PyNUFFT if FP64 were used.
In the future, a GPU NUFFT library written in pure C would allow researchers to use powerful hardware accelerators from different high-level languages. However, the innate complexity of heterogeneous platforms tends to lower the portability of software, which requires considerable development and testing effort. Recently, computer scientists have proposed several emerging GPGPU initiatives to simplify this task, such as the Low Level Virtual Machine (LLVM), OpenACC, and OpenMP 4.0, and these standards are likely to mature in the next few years.
5. Conclusions
An open-source PyNUFFT package was implemented to accelerate non-Cartesian image reconstruction on multi-core CPU and GPU platforms. The acceleration factors were 6.3–9.5× on a 32-thread CPU platform and 5.4–13× on a Tesla K80 GPU. The iterative solvers with 100 iterations could be completed within 4 s on the 32-thread CPU platform and within 2 s on the GPU.
Acknowledgments
The author acknowledges the open-source MATLAB NUFFT software [4], which was essential for the development of PyNUFFT. The author thanks Ciuciu and the research group at Neurospin, CEA, Paris, France for their inspirational suggestions about the Chambolle–Pock and Condat–Vu algorithms, which will be included in a future release of PyNUFFT. This work was supported by the Ministry of Science and Technology, Taiwan (in 2016–2017), and partially by the Cambridge Commonwealth, European and International Trust, Cambridge, UK, and the Ministry of Education, Taiwan (2012–2016). The benchmarks were carried out on Amazon Web Services, supported by an Educate credit.
Author Contributions
The author developed the PyNUFFT in 2012–2018. Since 2016, PyNUFFT has been reengineered to improve its runtime efficiency.
Conflicts of Interest
The author received an NVIDIA GPU grant. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Abbreviations
The following abbreviations are used in this manuscript:
CG  conjugate gradient 
CPU  central processing unit 
DFT  discrete Fourier transform 
GBPDNA  generalized basis pursuit denoising algorithm 
GPU  graphic processing unit 
IFFT  inverse fast Fourier transform 
MRI  magnetic resonance imaging 
MSE  mean squared error 
NUFFT  nonuniform fast Fourier transform 
TV  total variation 
References
 Klöckner, A.; Pinto, N.; Lee, Y.; Catanzaro, B.; Ivanov, P.; Fasih, A. PyCUDA and PyOpenCL: A scriptingbased approach to GPU runtime code generation. Parallel Comput. 2012, 38, 157–174. [Google Scholar] [CrossRef]
 Opanchuk, B. Reikna, A Pure Python GPGPU Library. Available online: http://reikna.publicfields.net/ (accessed on 9 October 2017).
 Free Software Foundation. GNU General Public License. 29 June 2007. Available online: http://www.gnu.org/licenses/gpl.html (accessed on 7 March 2018).
 Fessler, J.; Sutton, B.P. Nonuniform fast Fourier transforms using min-max interpolation. IEEE Trans. Signal Proc. 2003, 51, 560–574. [Google Scholar] [CrossRef]
 Danalis, A.; Marin, G.; McCurdy, C.; Meredith, J.; Roth, P.; Spafford, K.; Tipparaju, V.; Vetter, J. The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite. In Proceedings of the 3rd Workshop on GeneralPurpose Computation on Graphics Processors (GPGPU 2010), Pittsburgh, PA, USA, 14 March 2010. [Google Scholar]
 Knuth, D.E. The Art of Computer Programming, Volume 1 (3rd ed.): Fundamental Algorithms; Addison Wesley Longman Publishing Co., Inc.: Redwood City, CA, USA, 1997. [Google Scholar]
 Fessler, J.; Lee, S.; Olafsson, V.T.; Shi, H.R.; Noll, D.C. Toeplitzbased iterative image reconstruction for MRI with correction for magnetic field inhomogeneity. IEEE Trans. Signal Proc. 2005, 53, 3393–3402. [Google Scholar] [CrossRef]
 Pipe, J.G.; Menon, P. Sampling density compensation in MRI: Rationale and an iterative numerical solution. Magn. Reson. Med. 1999, 41, 179–186. [Google Scholar] [CrossRef]
 Jones, E.; Oliphant, T.; Peterson, P. SciPy: Open source scientific tools for Python. 2001. Available online: http://www.scipy.org (accessed on 9 October 2017).
 Lin, J.M.; Patterson, A.J.; Chang, H.C.; Chuang, T.C.; Chung, H.W.; Graves, M.J. Whitening of Colored Noise in PROPELLER Using Iterative Regularized PICO Reconstruction. In Proceedings of the 23rd Annual Meeting of International Society for Magnetic Resonance in Medicine, Toronto, ON, Canada, 30 May–5 June 2015; p. 3738. [Google Scholar]
 Lin, J.M.; Patterson, A.J.; Lee, C.W.; Chen, Y.F.; Das, T.; Scoffings, D.; Chung, H.W.; Gillard, J.; Graves, M. Improved Identification and Clinical Utility of PseudoInverse with Constraints (PICO) Reconstruction for PROPELLER MRI. In Proceedings of the 24th Annual Meeting of International Society for Magnetic Resonance in Medicine, Singapore, 7–13 May 2016; p. 1773. [Google Scholar]
 Lin, J.M.; Tsai, S.Y.; Chang, H.C.; Chung, H.W.; Chen, H.C.; Lin, Y.H.; Lee, C.W.; Chen, Y.F.; Scoffings, D.; Das, T.; et al. PseudoInverse Constrained (PICO) Reconstruction Reduces Colored Noise of PROPELLER and Improves the GrayWhite Matter Differentiation. In Proceedings of the 25th Annual Meeting of International Society for Magnetic Resonance in Medicine, Honolulu, HI, USA, 22–28 April 2017; p. 1524. [Google Scholar]
 Wang, L. The penalized LAD estimator for high dimensional linear regression. J. Multivar. Anal. 2013, 120, 135–151. [Google Scholar] [CrossRef]
 Lin, J.M.; Chang, H.C.; Chao, T.C.; Tsai, S.Y.; Patterson, A.; Chung, H.W.; Gillard, J.; Graves, M. L1LAD: Iterative MRI reconstruction using L1 constrained least absolute deviation. In Proceedings of the 34th Annual Scientific Meeting of ESMRMB, Barcelona, Spain, 19–21 October 2017. [Google Scholar]
 Uecker, M.; Lai, P.; Murphy, M.J.; Virtue, P.; Elad, M.; Pauly, J.M.; Vasanawala, S.S.; Lustig, M. ESPIRiT–an eigenvalue approach to autocalibrating parallel MRI: Where SENSE meets GRAPPA. Magn. Reson. Med. 2014, 71, 990–1001.
 Goldstein, T.; Osher, S. The split Bregman method for ℓ1-regularized problems. SIAM J. Imaging Sci. 2009, 2, 323–343.
 Chambolle, A.; Pock, T. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 2011, 40, 120–145.
 Boyer, C.; Ciuciu, P.; Weiss, P.; Mériaux, S. HYR2PICS: Hybrid Regularized Reconstruction for Combined Parallel Imaging and Compressive Sensing in MRI. In Proceedings of the 9th IEEE International Symposium on Biomedical Imaging (ISBI), Barcelona, Spain, 2–5 May 2012; pp. 66–69.
 Loris, I.; Verhoeven, C. Iterative algorithms for total variation-like reconstructions in seismic tomography. GEM Int. J. Geomath. 2012, 3, 179–208.
 Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202.
 Knoll, F.; Bredies, K.; Pock, T.; Stollberger, R. Second order total generalized variation (TGV) for MRI. Magn. Reson. Med. 2011, 65, 480–491.
 Lin, J.M.; Patterson, A.J.; Chang, H.C.; Gillard, J.H.; Graves, M.J. An iterative reduced field-of-view reconstruction for periodically rotated overlapping parallel lines with enhanced reconstruction (PROPELLER) MRI. Med. Phys. 2015, 42, 5757–5767.
 Lalys, F.; Haegelen, C.; Ferre, J.C.; El-Ganaoui, O.; Jannin, P. Construction and assessment of a 3T MRI brain template. Neuroimage 2010, 49, 345–354.
 Schabel, M. MathWorks File Exchange: 3D Shepp-Logan Phantom. Available online: https://uk.mathworks.com/matlabcentral/fileexchange/9416-3d-shepp-logan-phantom (accessed on 9 October 2017).
 Frigo, M.; Johnson, S. The Design and Implementation of FFTW3. Proc. IEEE 2005, 93, 216–231.
 Knoll, F.; Schwarzl, A.; Diwoky, C.; Sodickson, D. gpuNUFFT—An open-source GPU library for 3D gridding with direct Matlab Interface. In Proceedings of the 22nd Annual Meeting of ISMRM, Milan, Italy, 20–21 April 2013; p. 4297.
 Potts, D.; Steidl, G. Fast summation at nonequispaced knots by NFFTs. SIAM J. Sci. Comput. 2004, 24, 2013–2037.
 Keiner, J.; Kunis, S.; Potts, D. Using NFFT 3—A software library for various nonequispaced fast Fourier transforms. ACM Trans. Math. Softw. 2009, 36, 19.
 Song, J.; Liu, Y.; Gewalt, S.L.; Cofer, G.; Johnson, G.A.; Liu, Q.H. Least-square NUFFT methods applied to 2D and 3D radially encoded MR image reconstruction. IEEE Trans. Biomed. Eng. 2009, 56, 1134–1142.
 Yang, Z.; Jacob, M. Mean square optimal NUFFT approximation for efficient non-Cartesian MRI reconstruction. J. Magn. Reson. 2014, 242, 126–135.
 Greengard, L.; Lee, J.Y. Accelerating the nonuniform fast Fourier transform. SIAM Rev. 2004, 46, 443–454.
 Liu, C.; Moseley, M.; Bammer, R. Fast SENSE Reconstruction Using Linear System Transfer Function. In Proceedings of the International Society of Magnetic Resonance in Medicine, Miami Beach, FL, USA, 7–13 May 2005; p. 689.
 Uecker, M.; Zhang, S.; Frahm, J. Nonlinear inverse reconstruction for real-time MRI of the human heart using undersampled radial FLASH. Magn. Reson. Med. 2010, 63, 1456–1462.
 Ferrara, M. Implements 1D-3D NUFFTs via Fast Gaussian Gridding. 2009. Matlab Central. Available online: http://www.mathworks.com/matlabcentral/fileexchange/25135-nufft-nfft-usfft (accessed on 2 March 2018).
 Bredies, K.; Knoll, F.; Freiberger, M.; Scharfetter, H.; Stollberger, R. The Agile Library for Biomedical Image Reconstruction Using GPU Acceleration. Comput. Sci. Eng. 2013, 15, 34–44.
 Cerjanic, A.; Holtrop, J.L.; Ngo, G.C.; Leback, B.; Arnold, G.; Moer, M.V.; LaBelle, G.; Fessler, J.A.; Sutton, B.P. PowerGrid: An open source library for accelerated iterative magnetic resonance image reconstruction. Proc. Intl. Soc. Mag. Reson. Med. 2016, 24, 525.
 Schaetz, S.; Voit, D.; Frahm, J.; Uecker, M. Accelerated computing in magnetic resonance imaging: Real-time imaging using nonlinear inverse reconstruction. Comput. Math. Methods Med. 2017.
 Murphy, M.; Alley, M.; Demmel, J.; Keutzer, K.; Vasanawala, S.; Lustig, M. Fast ℓ1-SPIRiT compressed sensing parallel imaging MRI: Scalable parallel implementation and clinically feasible runtime. IEEE Trans. Med. Imaging 2012, 31, 1250–1262.
 Nadar, M.S.; Martin, S.; Lefebvre, A.; Liu, J. Multi-GPU FISTA Implementation for MR Reconstruction with Non-Uniform k-space Sampling. U.S. Patent 14/031,374, 27 March 2014.
 Stone, S.S.; Haldar, J.P.; Tsao, S.C.; Hwu, W.M.W.; Liang, Z.P.; Sutton, B.P. Accelerating Advanced MRI Reconstructions on GPUs. J. Parallel Distrib. Comput. 2008, 68, 1307–1318.
 Gai, J.; Obeid, N.; Holtrop, J.L.; Wu, X.L.; Lam, F.; Fu, M.; Haldar, J.P.; Wen-Mei, W.H.; Liang, Z.P.; Sutton, B.P. More IMPATIENT: A gridding-accelerated Toeplitz-based strategy for non-Cartesian high-resolution 3D MRI on GPUs. J. Parallel Distrib. Comput. 2013, 73, 686–697.
© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).