Toward Real-Time Giga-Voxel Optoacoustic/Photoacoustic Microscopy: GPU-Accelerated Fourier Reconstruction with Quasi-3D Implementation

We propose a GPU-accelerated implementation of frequency-domain synthetic aperture focusing technique (SAFT) employing truncated regularized inverse k-space interpolation. Our implementation achieves sub-1s reconstruction time for data sizes of up to 100 M voxels, providing more than a tenfold decrease in reconstruction time as compared to CPU-based SAFT. We provide an empirical model that can be used to predict the execution time of quasi-3D reconstruction for any data size given the specifications of the computing system.


Introduction
Raster-scan optoacoustic (OA) angiography is a hybrid technique allowing for the imaging of blood vessels using focused ultrasonic detectors due to local, laser-generated thermoelastic expansion of hemoglobin [1]. High optical absorption by hemoglobin near the 532 nm laser wavelength, and wideband~100 MHz ultrasound detectors provide commercial OA microscopes (OAMs) [2,3] with the ability to perform volumetric optoacoustic angiography of a~6 × 6 × 3 mm 3 volume with mega-voxel resolution in several minutes. Modern laboratory OAM prototypes equipped with higher-powered lasers and faster-scanning electronics [4,5] provide giga-voxel images within the same scanning time.
To improve the lateral resolution of OAM angiography above or below the focal plane (Figure 1), the acquired OAM datasets require processing by reconstruction algorithms utilizing the synthetic aperture focusing technique (SAFT) [6]. Similar to other mechanical scanning techniques (such as cross-sectional CT and MRI), OAM SAFT reconstruction algorithms are typically not applied to full-3D datasets (XYZ), but to sequential 2D datasets (XZ) called B-scans, which are obtained in the course of mechanical zig-zag scanning. While XZ-reconstructions can often be done in parallel with B-scan acquisition, YZ-reconstructions usually require prior compensation for motion artifacts [7] characteristic of most in vivo subjects due to their respiration and heartbeat during the scanning. Although sequential 2D reconstruction is computationally less complex than its full-3D counterparts [8], the 2D SAFT reconstruction algorithms performed in the time domain (TD) may require relatively large numbers of computations, especially if performed using standard central processing units (CPUs). For example, 2D realization of the SAFT-TD reported in [9] and performed on a standard CPU could require more than one minute even for mega-voxel datasets.
Significant reduction in the computation time was achieved by performing twodimensional SAFT reconstruction in the frequency domain (FD) [10][11][12]. The 2D fast Fourier transform (FFT) simplifies the computational complexity by the factor [ × ] [ × ] ⁄ , thus providing the performance necessary for online reconstruction of mega-voxel resolution at the speed of volume scanning. However, realtime SAFT-FD processing of giga-voxel datasets still remained an unmet challenge.
This manuscript focuses on the comparison of algorithms for accelerated reconstruction of large volumes of OAM data using parallel computing [13]. Since most SAFT-FD calculations are arithmetically simple and not co-dependent, they fit well with the architecture of modern graphics processing units (GPUs), which are optimized for parallel computations. The algorithm implemented in this manuscript is described in [10].

GPU-FD Accelerated Reconstruction
The GPU-FD algorithm used in this study is based on the SAFT-FD algorithm developed in [10]. In summary, this algorithm relates the spatial frequencies ( , ) of the discrete Fourier transform of the to-be-reconstructed image to the spatial/time frequencies ( , ) of the discrete Fourier transform of the B-scan. This is nearly a one-toone relation given that is a function of the modulus of ( , ), which is the reason why the computational complexity of SAFT-FD is dominated by the discrete FT. Due to the discreteness of the frequency grids, however, crosstalk exists between different , and interpolation is needed to invert this relation. In [10], complex-valued interpolation weights were determined based on a regularized pseudo-inverse approach. Previous work [10] also showed that the number of interpolation nodes ( , ) per ( , ) can be truncated to a small value (typically 5 to 10) without a significant change to image quality, and that this is independent of the total grid sizes.
A simplified block diagram of the GPU-FD algorithm implemented in MATLAB is shown in Figure 2. While the initialization of data-independent parameters is performed on the CPU, processing of the data is performed as parallelized tasks on the GPU. The data-independent parameters are: the definition of interpolation node indices, Although sequential 2D reconstruction is computationally less complex than its full-3D counterparts [8], the 2D SAFT reconstruction algorithms performed in the time domain (TD) may require relatively large numbers of computations, especially if performed using standard central processing units (CPUs). For example, 2D realization of the SAFT-TD reported in [9] and performed on a standard CPU could require more than one minute even for mega-voxel datasets.
Significant reduction in the computation time was achieved by performing twodimensional SAFT reconstruction in the frequency domain (FD) [10][11][12]. The 2D fast Fourier transform (FFT) simplifies the computational complexity by the factor [N X × N Z ]/log 2 [N X × N Z ], thus providing the performance necessary for online reconstruction of mega-voxel resolution at the speed of volume scanning. However, real-time SAFT-FD processing of giga-voxel datasets still remained an unmet challenge.
This manuscript focuses on the comparison of algorithms for accelerated reconstruction of large volumes of OAM data using parallel computing [13]. Since most SAFT-FD calculations are arithmetically simple and not co-dependent, they fit well with the architecture of modern graphics processing units (GPUs), which are optimized for parallel computations. The algorithm implemented in this manuscript is described in [10].

GPU-FD Accelerated Reconstruction
The GPU-FD algorithm used in this study is based on the SAFT-FD algorithm developed in [10]. In summary, this algorithm relates the spatial frequencies (k X , k Z ) of the discrete Fourier transform of the to-be-reconstructed image to the spatial/time frequencies (k X , k t ) of the discrete Fourier transform of the B-scan. This is nearly a one-to-one relation given that k t is a function of the modulus of (k X , k Z ), which is the reason why the computational complexity of SAFT-FD is dominated by the discrete FT. Due to the discreteness of the frequency grids, however, crosstalk exists between different k t , and interpolation is needed to invert this relation. In [10], complex-valued interpolation weights were determined based on a regularized pseudo-inverse approach. Previous work [10] also showed that the number of interpolation nodes (k X , k Z ) per (k X , k Z ) can be truncated to a small value α (typically 5 to 10) without a significant change to image quality, and that this is independent of the total grid sizes.
A simplified block diagram of the GPU-FD algorithm implemented in MATLAB is shown in Figure 2. While the initialization of data-independent parameters is performed on the CPU, processing of the data is performed as parallelized tasks on the GPU. The data-independent parameters are: the definition of interpolation node indices, interpolation interpolation weights, and a filter matrix for combined processing of focal plane selection; compensation of limited angle detection; and Hilbert transform. The implemented pipeline consists of the simplest arithmetic operations with the exception of forward/inverse FFT 2D functions ( Figure 2) [14]. The number of spatial frequencies ( , ) is chosen to be identical to the number of pixels × . However, in the case of using zero padding, the number of frequencies would be double the number of pixels in both the X-and the Z-dimensions.
While the characteristics of the GPU-FD algorithm can be determined according to certain classifications [15], the following advantages of the pipeline (Figure 2) should be mentioned: First, GPU-FD realization has massive (>75%) data parallelism. Next, the number of threads exceeds the number of spatial frequencies, and therefore the thread count is relatively high. Further, the number of synchronizations used for one B-scan after each arithmetic operation between the FFT 2D and the inverse FFT 2D is relatively small (6 pcs). Finally, the implemented pipeline ( Figure 2) does not have branch divergence.
The minimum memory required for implementation of the GPU-FD algorithm is: where is the required number of points in the A-scan defined by the ultrasound probing depth and the axial resolution (AR) of the OAM system (usually ~ 200), and ≥ 2 ⁄ is the required number of scanning lines in B-scan defined by the scanning range and the lateral resolution (LR) of the detector (usually > 200). For enhanced OA image quality, oversampling by a factor of two, as dictated by the Nyquist sampling theorem, can often be replaced by a factor of five, leading to = 5 ⁄ and = 5 ⁄ evaluations. For enhanced reconstruction quality within the depth range , it is also important that the acquisition range is wider than , so that the diagonals corresponding to the maximum receiving angles of the detector are fully covered. However, for simplicity we will assume that the The implemented pipeline consists of the simplest arithmetic operations with the exception of forward/inverse FFT 2D functions ( Figure 2) [14]. The number of spatial frequencies (k X , k Z ) is chosen to be identical to the number of pixels N X × N Z . However, in the case of using zero padding, the number of frequencies would be double the number of pixels in both the X-and the Z-dimensions.
While the characteristics of the GPU-FD algorithm can be determined according to certain classifications [15], the following advantages of the pipeline (Figure 2) should be mentioned: First, GPU-FD realization has massive (>75%) data parallelism. Next, the number of threads exceeds the number of spatial frequencies, and therefore the thread count is relatively high. Further, the number of synchronizations used for one B-scan after each arithmetic operation between the FFT 2D and the inverse FFT 2D is relatively small (6 pcs). Finally, the implemented pipeline ( Figure 2) does not have branch divergence.
The minimum memory required for implementation of the GPU-FD algorithm is: 2(α + 1)N X × N Z , where N Z ≥ 2∆Z/AR where N Z is the required number of points in the A-scan defined by the ultrasound probing depth ∆Z and the axial resolution (AR) of the OAM system (usually N Z~2 00), and N z ≥ 2∆X/LR is the required number of scanning lines in B-scan defined by the scanning range ∆X and the lateral resolution (LR) of the detector (usually N X > 200). For enhanced OA image quality, oversampling by a factor of two, as dictated by the Nyquist sampling theorem, can often be replaced by a factor of five, leading to N Z = 5∆Z/AR and N X = 5∆X/LR evaluations. For enhanced reconstruction quality within the depth range ∆Z, it is also important that the acquisition range is wider than ∆Z, so that the diagonals corresponding to the maximum receiving angles of the detector are fully covered. However, for simplicity we will assume that the number of axial pixels N Z in the raw and the reconstruction volumes is the same throughout the manuscript.
The computation time T exe of the GPU-FD reconstruction algorithm can be estimated as: where T 0 is the execution time of the preliminary preparation of indices and coefficients, T rec is the execution time of the 2D reconstruction per B-scan ( Figure 2) and N Y is the number of B-scans. The algorithmic complexity of both the preparation T 0 and the reconstruction T rec stages are dependent on the number of spatial frequencies in the Xand Z-dimensions. In the considered case (Figure 2) of minimally required spatial sampling periods (2π/N X , 2π/N Z , the number of spatial frequencies is equal to the number of pixels in each B-scan O([N X × N Z ]). However, denser linear sampling in the Fourier domain quadratically increases the algorithmic complexity. For example, a factor-of-two zero-padding quadruples the number of spatial frequencies with respect to the original number of pixels.
The reconstruction stage primarily depends on the FFT execution, with a computational complexity of O(N X × N Z × log 2 [N X × N Z ]). In the case of fully parallelized execution of the algorithm [16], the execution time can be approximated as: where f GPU,CPU are the clock speeds; p GPU,CPU are the effective number of processing cores (cores in the case of CPU; stream processors/CUDA cores in the case of GPU); and the parameters A, B and C are scaling factors dependent on the algorithm implementation (and independent of the computing system configuration and the data volume).

Experimental Data
A fiber-based version [17] of dark-field OAM [18] with LR = 50 µm was used to perform a ∆X = ∆Y = 8 mm lateral OA scan of an experimental tumor CT-26 [19] with δX = δY = 10 µm scanning steps at a laser wavelength of 532 nm. The superficial tumor vasculature was aligned with a 6.7 ± 0.75 mm depth-of-field, a custom-made [20,21], 6.7 ± 0.75 mm depth-of-field, wideband 1-100 MHz spherical ultrasonic PVDF detector. Each OA A-scan with 10.24 µs duration was converted to 2048 samples by a 16-bit Ra-zor16 (GaGe, Lockport, IL, USA) digitizer at a sampling rate of 200 MHz. The acquired XYZ-volume was then cropped above and below the focal plane within the 1 µs time interval corresponding to a depth range of~6.7 ± 0.75 mm. The initial size of the raw OA dataset thus contained {N X × N Y × N Z = 800 × 800 × 200 = 128 M} voxels. To evaluate the voxel-count dependent characteristics of the GPU-FD algorithm, OAM angiograms with a larger/smaller number of voxels were synthesized by resampling the original OAM angiogram. Figure 3 shows the maximum intensity projections (MIP) of a raw angiographic image (128 M voxels) and the procedure of consecutive application of the GPU-FD algorithm ( Figure 2) in two perpendicular orientations, leading to quasi-3D reconstruction. While the first reconstruction in the XZ-plane improves the resolution in the X-direction, the second reconstruction in the YZ-plane also improves the resolution in the Y-direction, resulting in the OA MIP angiogram representing the vascular tumor microenvironment down to 50-micron resolution in both the X-and Y-directions. Figure 3 shows the maximum intensity projections (MIP) of a raw angiographic image (128 M voxels) and the procedure of consecutive application of the GPU-FD algorithm ( Figure 2) in two perpendicular orientations, leading to quasi-3D reconstruction. While the first reconstruction in the XZ-plane improves the resolution in the X-direction, the second reconstruction in the YZ-plane also improves the resolution in the Y-direction, resulting in the OA MIP angiogram representing the vascular tumor microenvironment down to 50-micron resolution in both the X-and Y-directions. Since the mouse was immobilized using gas anesthesia, and the thigh with the tumor located on it was fixed against the OAM immersion chamber, the mouse respiratory movements did not have a visible effect on the image quality in the Y-direction. In the presence of motion artifacts, motion compensation algorithms [22] would have to be applied between XZ-and YZ-reconstruction. Figure 4 compares the execution times of the CPU-TD [9] and CPU-FD [11], and the GPU-based counterpart of the latter. The numerical tests evaluating voxel-dependent execution time for different reconstruction algorithms were performed on different processing units characterized by different clock speeds and different thread counts (Table 1). Since the mouse was immobilized using gas anesthesia, and the thigh with the tumor located on it was fixed against the OAM immersion chamber, the mouse respiratory movements did not have a visible effect on the image quality in the Y-direction. In the presence of motion artifacts, motion compensation algorithms [22] would have to be applied between XZ-and YZ-reconstruction. Figure 4 compares the execution times of the CPU-TD [9] and CPU-FD [11], and the GPU-based counterpart of the latter. The numerical tests evaluating voxel-dependent execution time for different reconstruction algorithms were performed on different processing units characterized by different clock speeds and different thread counts (Table 1).

Results and Discussion
In comparison to the CPU-FD reconstruction (blue circles in Figure 4), the decrease in execution time for the GPU-FD algorithm (green circles in Figure 4) is more than one order of magnitude over the whole range of considered OAM data volumes. While the largest data volume (4.2G voxels) required~7 h execution time for the CPU-TD algorithm and 13 min for the CPU-FD algorithm, the accelerated GPU-FD algorithm obtained the same reconstructed 3D-dataset in less than 40 s (Figure 4). A more recent GPU (Nvidia GeForce RTX 3090) executed the GPU-FD algorithm for the standard 128M-voxel data volume in less than a second.
The execution times measured for different data sets using three different GPUs are plotted as solid curves in Figure 5. The dotted curves were fitted according to Equation (2), thus providing empirical evaluation of constants A, B and C.   In comparison to the CPU-FD reconstruction (blue circles in Figure 4), the decrease in execution time for the GPU-FD algorithm (green circles in Figure 4) is more than one order of magnitude over the whole range of considered OAM data volumes. While the largest data volume (4.2G voxels) required ~7 h execution time for the CPU-TD algorithm and ~13 min for the CPU-FD algorithm, the accelerated GPU-FD algorithm obtained the same reconstructed 3D-dataset in less than 40 s (Figure 4). A more recent GPU (Nvidia GeForce RTX 3090) executed the GPU-FD algorithm for the standard 128M-voxel data volume in less than a second.
The execution times measured for different data sets using three different GPUs are plotted as solid curves in Figure 5. The dotted curves were fitted according to Equation (2), thus providing empirical evaluation of constants A, B and C.   The precise analytical prediction of the reconstruction time ( , , , , ) is difficult due to the uncertainty in , which is an estimate of the effective number of cores participating in the computation. This is necessary because a large number of cores can only be used efficiently if the voxel count is comparatively large. While Formula (2) The expanded model as described in Formula (3) is thus capable of providing an accurate estimation of the GPU-FD algorithm computation time for any dataset size based on the technical specifications of a given computing system.
In comparison to the 3D-SAFT developed in [13], our quasi-3D implementation reaches equal performance even for the relatively small data size the authors provided as an example. This comparison is also true for a full-3D implementation of GPU-FD, due to the independence of the FFT algorithm on the dimensionality of the data. It is important to note, however, that our implementation does not employ the same level of optimization, instead prioritizing ease of development and platform-independence over performance by keeping the implementation purely as MATLAB code. Implementation as an OpenCL or CUDA kernel instead, as was done in [13], is expected to yield a significant performance advantage.
While this manuscript focuses solely on the application in optoacoustic microscopy, the same algorithm for quasi-3D reconstruction can also be used for tomography. For 2D tomography based on a linear detector array, the cyclic part of the pipeline (Figure 2) can be directly applied to individual B-scans. Since the speed of operation in 2D mode can be several orders of magnitude higher than in quasi-3D mode, 2D real-time reconstruction at more than 100 Hz B-scan acquisition rates is feasible.

Conclusions
The proposed GPU-accelerated implementation of a frequency-domain, 2D synthetic aperture focusing technique shows great promise in speeding up optoacoustic image reconstruction in both 2D and 3D. The developed numerical methods reduce the time for quasi-3D processing of giga-voxel optoacoustic data volumes to several seconds. To study the applicability of the developed algorithm for solving certain practical problems, it is advisable to use the empirical model proposed in this manuscript, which allows estimating the expected reconstruction time depending on the volume of processed data and the parameters of the computing system.