Accelerating 3-D GPU-based Motion Tracking for Ultrasound Strain Elastography Using Sum-Tables: Analysis and Initial Results

Now, with the availability of 3-D ultrasound data, a lot of research efforts are being devoted to developing 3-D ultrasound strain elastography (USE) systems. Because 3-D motion tracking, a core component in any 3-D USE system, is computationally intensive, a lot of efforts are under way to accelerate 3-D motion tracking. In the literature, the concept of Sum-Table has been used in a serial computing environment to reduce the burden of computing signal correlation, which is the single most computationally intensive component in 3-D motion tracking. In this study, parallel programming using graphics processing units (GPU) is used in conjunction with the concept of Sum-Table to improve the computational efficiency of 3-D motion tracking. To our knowledge, sum-tables have not been used in a GPU environment for 3-D motion tracking. Our main objective here is to investigate the feasibility of using sum-table-based normalized correlation coefficient (ST-NCC) method for the above-mentioned GPU-accelerated 3-D USE. More specifically, two different implementations of ST-NCC methods proposed by Lewis et al. and Luo-Konofagou are compared against each other. During the performance comparison, the conventional method for calculating the normalized correlation coefficient (NCC) was used as the baseline. All three methods were implemented using compute unified device architecture (CUDA; Version 9.0, Nvidia Inc., CA, USA) and tested on a professional GeForce GTX TITAN X card (Nvidia Inc., CA, USA). Using 3-D ultrasound data acquired during a tissue-mimicking phantom experiment, both displacement tracking accuracy and computational efficiency were evaluated for the above-mentioned three different methods. Based on data investigated, we found that under the GPU platform, Lou-Konofaguo method can still improve the computational efficiency (17–46%), as compared to the classic NCC method implemented into the same GPU platform. However, the Lewis method does not improve the computational efficiency in some configuration or improves the computational efficiency at a lower rate (7–23%) under the GPU parallel computing environment. Comparable displacement tracking accuracy was obtained by both methods.

Now FDA-approved 3-D automated breast ultrasound (ABUS) systems (e.g., GE InveniaTM; Siemens Acuson S2000 ABVS) have been released. It is our vision that with the availability of GPU-acceleration, bringing 3-D breast USE into the clinical workflow becomes possible. In the last decade, GPU-based parallel computing has been utilized to accelerate ultrasound-based motion tracking in several important applications including USE [21,[29][30][31], shear wave elastography [32] and color Doppler imaging [33]. In particular, the work by Peng et al. [21] demonstrated that high quality strain data can be obtained for a reasonably large volume (e.g., 2.5 × 2.5 × 2.5 cm 3 ) in 20 s. Consequently, investigating whether or not the concept of ST can be used to further improve 3-D motion tracking is a logical next step.
To this end, our primary goal is to investigate the feasibility of using ST in conjunction with GPUs to further accelerate 3-D motion tracking. Thus, two above-mentioned ST methods (i.e., Lewis' and Luo's methods) were analyzed and implemented for parallel (GPU) and serial (CPU) computing settings. The standard protocol of NCC calculation without the use of ST was used as the baseline in both computing settings for a systematic comparison.

A Description of GPU-Accelerated Block-Matching Algorithm in USE
As shown in Figure 1, in order to estimate one displacement vector (dx, dy, dz) for an arbitrary location (x, y, z) from a pair of pre-and post-deformation ultrasound echo volumes, the block-matching algorithm can be largely divided into three steps: (1) selecting a pair of reference and target ultrasound signals from two successive ultrasound radio frequency (RF) fields based on a predetermined search range and a tracking kernel; (2) calculating an NCC value between the pair of selected reference and target echo signals, generating a 3-D resultant NCC map for a given search range, and finding a peak from the NCC map; and (3) fitting NCC values around the peak NCC map to a 3-D quadratic surface [34] to find the estimated displacement vector. During the process, the tracking-kernel size and search range must be determined prior to the start of motion tracking. The location (x, y, z) here is referred to a location in the pre-deformation ultrasound each volume (i.e., the reference volume). Thus, the (dx, dy, dz) represents the displacement vector by which the medium moved to the post-deformation ultrasound echo volume (i.e., the target volume) from the reference volume. In the block-matching algorithm, time-domain ultrasound echo signals are used to calculate correlation.
As can be seen from the schematic diagram in Figure 1, Lewis' and Luo's methods both need to create sum tables prior to the calculation of NCC values in Step 1. Those sum-tables are used as "lookup-tables" to replace the standard NCC calculations in order to reduce the computing time. The details of Step 1 for three different ways of calculating NCC values are discussed below. Steps 2 and 3 are exactly the same for all three methods. Specifically, Step 2 is to find the maximum on the NCC map. Thus, an integer displacement vector can be determined for the location (x, y, z). Finally, in Step 3, the subs-sample displacement vector can be obtained by fitting a 3 × 3 × 3 matrix containing NCC values surrounding the maximum NCC value into a quadratic surface. Through Steps 2 and 3, the final axial, lateral and elevational displacements with sub-sampling precision can be obtained.

A Standard Protocol for
Calculating NCC-Given one reference signal f and one target signal g, the NCC function over a search range (τ x , τ y , τ z ) can be calculated as follows:  (1), the origin of the reference window (tracking kernel) is (u, v, w) and [τ x , τ y , τ z ] is the search range defined below, In the block-matching algorithm, [τ 1 , τ 2 ], [τ 3 , τ 4 ] and [τ 5 , τ 6 ] are pre-determined by the algorithm. For each 3-D shift (τ x , τ y , τ z ), Equation (1) can be used to obtain one NCC value.
Thus, looping through the entire search range yields a 3-D NCC map. Table Method-When the block matching algorithm is used, significant overlaps among adjacent tracking kernels exist, resulting in a lot of redundant computation. Thus, Lewis proposed to use pre-computed tables (also known as Sum-tables) to partially "memorize" some correlation between f and g [23]. Together with Fourier Transform (FT), Lewis' method can conceptually reduce the computational demands. More specifically, the numerator in Equation (1) is a convolution of the tracking kernel within the reference signal f with the corresponding tracking kernel of the reversed target signal g(−m, −n, −k) and can be computed by Fourier Transform (FT). The process is defined as follows:

Lewis' Sum-
where ℱ stands for an FT operator. The complex conjugate accomplishes reversal of the template via the FT's time reversal property [23]. Equation (3) was implemented via fast Fourier Transform (FFT) and thus, f and g were zero-padded to a common power of two. Now referring to the calculation of the denominator in Equation (1), two sum tables were used: one for the reference signal f and the other for the target signal g. Let s f 2 u, v, w and s g 2 u, v, w denotes the created sum-tables for f and g, respectively. More details about setting up these two sum-table can be found in Appendix A. Once these two tables become available, the calculation of denominator can be conducted through a very efficient manner as follows, Peng et al. Luo and Konofagou method [35], both the numerator and denominator of Equation (1) are calculated using sum-tables. In addition to Equations (4) and (5), the numerator can be calculated as follow using a set of sum-tables s f, g (u, v, w) to look up pre-computed numbers,

Luo-Konofagou Sum-Table Method-In
According to Equations (4)- (6), the calculation of NCC can be simplified to addition and subtraction operations. More formal analyses of algorithmic complexity and memory requirements can be found in Appendices B and C.

A Brief Description of GPU Computing-CUDA (Compute Unified Device
Architecture) is a common parallel computing architecture launched by Nvidia (Nvidia Inc., Santa Clara, CA, USA). This architecture enables GPUs to solve complex computing problems in parallel. In CUDA, a KERNEL function is defined as a function that performs multi-threaded parallel computation. Similarly, a DEVICE function is defined as a singlethreaded function called by a KERNEL function on GPU. According to the memory structure defined in CUDA, off-chip memory (global memory and texture memory) has a higher access delay than on-chip memory (register, shared memory and constant memory).
Consequently, the on-chip memory should be given priority during programming in order to improve memory access efficiency. It is worth noting that the lower-case "kernel" is used for tracking kernels in this study, while the upper-case "KERNEL" is referred as to a GPU KERNEL function.

Block-Matching Using GPU Parallel Computing-
The block-matching algorithm implemented in this study is the classic block-matching algorithm. Because of memory limitation, full 3-D block-matching tracking was first performed in a slice-by-slice manner. Thus, a full displacement vector field for a single slice consists of M × N (Axial × Lateral) displacement estimation locations; each displacement estimation location obtains one 3-D displacement vector after the block-matching process. Because tracking displacements using the classic block-matching algorithm for each location is independent (i.e., no data dependency and requirements regarding communication), in the CUDA programming structure, the M × N thread can be launched through the KERNEL function (see Section 2.2.1). As shown in Figure 1, in order to construct 3-D NCC map for each displacement estimation location, one NCC value needs to be calculated at a specified search location (τ x , τ y , τ z ). In the Step 1 (see Figure  In the subsequent Steps 2 and 3 (see Figure 1), the number of CUDA threads in the KERNEL functions are consistent with the total number of displacement estimation locations M × N. Basically, each CUDA thread invokes a DEVICE function to search the peak NCC value of the corresponding 3-D NCC map (Step 2). Then, the same DEVICE function uses a quadratic fitting to obtain sub-sample estimates (Step 3) [36].
In the process of implementation, a few notable strategies were used. First, one-dimensional thread structure was adopted to ensure the consistency of memory access. In other words, we attempted to ensure that adjacent threads read adjacent memory regions and try to satisfy the requirement of coalesced memory access. Second, RF data were stored in TEXTURE memory. Nvidia GPU TEXTURE memory technology can avoid the delay caused by crossline reading. Third, some important variables (such as axial and lateral search range, tracking kernel size) are stored in GPU constant memory for rapid accesses.

Implementing A Standard NCC Calculation on CUDA-
Under the abovementioned parallel implementation framework, improving the efficiency for memory access was the most important consideration. This is because the calculation of a 3-D NCC map will read each RF sample multiple times. In order to avoid this frequent data reading from TEXTURE memory, we load each small part of RF echo data into on-chip memory during CUDA programming to improve memory access efficiency. The detailed GPU implementation of this method is described in our early work [21].

Implementing Lewis'
Method on CUDA-In this study, the FFT function library cuFFT available in CUDA (Nvidia Inc., CA, USA) was used to calculate the numerator of Equation (1). The cuFFT library provides a simple interface for computing FFTs on Nvidia GPUs. Also, the cuFFT library has been highly optimized and systematically tested for the GPU parallel computing environment. In contrast, FFTW (http://www.fftw.org/) is a state-of-art fast FFT toolbox on CPU and was applied to implement Lewis's method on CPU.
The calculation of the denominator in Equation (1) needs two sum-tables: one for the reference RF signal and the other one for the target RF signal. In this study, a parallel scan algorithm proposed by Sengupta et al. [37] was adopted to perform rapid prefix sum (i.e., cumulative sum) computation for each direction. The scan algorithm is based on an idea of balanced tree proposed by Blelloch [38]. Equations can be found in Appendix A.
More specifically, the prefix sum builds a balanced binary tree on the input data and scans it from the top to the root for calculating the prefix sum. Figure 2 illustrates the process of setting up a 3-D sum-table. The calculation of a 3-D sum-table can be divided into the following three stages: (1) constructing a prefix sum array along with the X-axis direction; (2) constructing the prefix sum array along with the Y-axis direction based on the result of (1); and (3) using the prefix scan algorithm to build the prefix sum array along with the Zaxis direction. A for-loop operation involving the above-mentioned three stages is sufficiently fast to establish the final 3-D sum-table. Consequently, the prefix scan algorithm was only invoked once for each sum-table and twice during the entire process. If the size of the 3-D RF signal is (Rows × Columns × Slices), the corresponding KERNEL function will launch (Rows × Columns × Slices) CUDA threads in order to create a sum-table.
After setting up 3-D sum-tables, the calculations of numerator and denominator (see Equation (1)) are accomplished by FFT and through two sum-tables, respectively.

A Tissue-Mimicking Phantom Experiment-Volumetric ultrasound data
acquired from a 100 mm × 100 mm × 70 mm oil-in-gelatin phantom were also used to test the above-mentioned three different NCC calculation methods. The inclusion in the ultrasound data was 5 times stiffer than the background. As shown in Figure 3, a 9-MHz CMUT transducer connected to a Siemens Antares (Siemens Health Care, Inc. Ultrasound Division, Mountain View, CA, USA) was used to acquire radio frequency (RF) echo data. A robotic arm (see Figure 3) was used to move the CMUT transducer downward to compress the phantom. The volume-to-volume deformation was approximately 1.5%. Then, ultrasound echo data before and after the compression were first acquired using ultrasound research interface (URI, Siemens Health Care, Inc. Ultrasound Division, Mountain View, CA, USA) installed on the Siemens scanner. Each echo data represented a 40 mm (axial) × 37 mm (lateral) × 30 mm (elevation) volume. The RF sample size, line spacing and distance between two parallel image planes were 0.0193-mm, 0.119-mm and 0.214-mm, respectively. More details can be found elsewhere [19].

Data Analysis-
The GPU hardware used in this study is Nvidia GeForce GTX TITAN X (Nvidia Corp., Santa Clara, CA, USA). The GPU card comes along with 80 Stream Multiprocessors, along with a total of 5120 CUDA computing cores along with 12 GB of Memory. The GPU card is installed on a desktop workstation with a Ubuntu operating system (version 16.04), Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, and 16 GB of host memory. ANSI C and CUDA 9.0 were adopted for implementing all CPU and GPU algorithms. All testing was done under the MATLAB platform (Version 2016b, Mathworks Inc., Natick, MA, USA) and all implemented algorithms were invoked in the MATLAB environment through the MEX interface.
In total, six different implementations were done in this study: standard NCC in CPU, standard NCC in GPU, Lewis' method in CPU and Lewis' method in GPU, Luo-Konofagou method in CPU and Luo-konofagou method in GPU. Hereafter, those six implementations are referred to as NCC-CPU, Lou-Konofagou-CPU, Lewis-CPU, NCC-GPU, Luo-Konofagou-GPU, and Lewis-GPU, respectively. In this paper, the performance is mainly compared from the following two aspects: (1) Evaluating whether or not a different implementation yields substantial errors and (2) comparing computational efficiency of those three methods, given motion tracking parameters and the computing environment (i.e., CPU or GPU).

Comparisons of Accuracy Among Different Implementations
Displacement estimates obtained from using the NCC-CPU of the block-matching algorithm were compared to other five above-mentioned implementations: Lou-konofagou-CPU, Lewis-CPU, NCC-GPU, Lou-Konofagou-GPU, Lewis-GPU. In Figure 4, the standard-NCC-CPU implementation yielded exactly same displacements as compared to the Lou-Konofagou-CPU implementation (see the second row). The differences between the standard-NCC-CPU and Lewis-CPU implementations were neglectable (<10 −15 mm), as shown in the third row of Figure 4. However, when comparing three GPU implementations (NCC-GPU, Lou-Konofagou-GPU and Lewis-GPU) to the standard-NCC-CPU implementation, small differences exist (see the fourth to sixth rows in Figure 4). The differences are quantitatively analyzed from 30 (elevational) slices of ultrasound displacement data and the result is shown in Figure 5.
It is clear from Figure 5 that all three GPU implementations produced slightly different displacement results. However, the difference was small (10 −6 -10 −4 mm). Such a small difference did not result in visible differences on respective axial strain images (see the fourth column in Figure 4).
Computational Efficiency-The influence of the tracking kernel size was investigated for six implementations and the results are shown in Figure 6. The search range was set to ( ). The computational efficiency was also examined when the search range had been varied. It is common to change the search range in USE to accommodate different frame-to-frame (2D) or volume-to-volume (3-D) strain levels occurring in vivo. As shown in Figure 7, the required time to complete the motion tracking as the increase of search range. However, the trend of time reduction between the standard NCC method and the Lou-Konofagou method remained the same. In contrast, with the increase of the search range, time needed for Lewis' method to complete the required tracking remains relatively steady ( Figure 7).
As shown in Table 1, the standard deviation values among 30 slices were low as compared to the mean value. This indicated that the time required to calculate NCC values was stable regardless of the implementation used.

Discussion and Summary
In this paper, the motion tracking accuracy and computational efficiency of the three methods in the CPU serial computing environment and GPU parallel computing environment are compared and systematically analyzed. Under the CPU environment, the Lou-Konofagou method can substantially improve computational efficiency (by 95% or more). In contrast, using Luo-Konofagou method, the 3-D tracking can still be accelerated under the GPU parallel environment. However, the rate of improvements only ranged between 17% and 46% (see Figure 6). The rate of improvements obtained by the Lewis' method was considerably less (7-23%; see Figure 6).
In order to further improve the GPU-based 3-D motion tracking, one strategy is to enhance the memory access efficiency. That requires us to make full use of on-chip memory. Recall that, according to Nvidia's GPU specifications, the latency of on-chip memory is significantly better than that of off-chip memory. However, on-chip memory capacity is limited on GPUs. With the development of GPU hardware technology, the on-chip memory capacity may continue to increase. At the same time, the current GPU implementation can also be further optimized. In multi-instruction and multi-data (MIMD) mode, when one instruction is waiting for loading data, another segment of data can be used to perform some computing tasks at the same time. Therefore, the MIMD mode may further improve the computational efficiency of GPU implementation. We expect that with the improvements of GPU hardware, real-time 3-D ultrasound motion tracking may become a reality in the near future.
As the tissue deformation increases, the required search range becomes inevitably large. Consequently, the creation of those sum-tables requires a longer time (see Appendix B). It is also found that the Lou-Konofagou method demands a high usage of memory (see Appendix C). Also, when the search range increases, the number of required sum-tables becomes bigger accordingly. In this sense, in order to deal with large tissue deformation, the Luo-Konofagou method can be used in conjunction with a multiple-compression tracking strategy [39,40]. This is because the multi-compression method can effectively reduce the required search range at the expense of performing motion tracking multiple times. Alternatively, the Luo-Konofagou method can be used together with a 3-D region-growing motion tracking method [20]. The advantage of using a region-growing motion tracking method is that the search region would be fairly small. Both directions will be explored in our future work.
Our preliminary results demonstrated that the Lewis method accelerated the 3-D motion tracking at a slow rate (see Figures 6 and 7) in both CPU and GPU computing environments. This is largely because the FFT requires a large number of calculations in order to estimate the numerator in Equation (1) regardless of the computing environment.
of the tracking kernel and the corresponding search area. When the search range is small, Lewis' method has considerably lower algorithmic complexity as compared to the standard NCC method. When the search area becomes larger, the benefit of using Lewis' method diminishes. In contrast, the algorithmic complexity of Lou-Konofagou method is not dependent on the tracking kernel size and search range. A summary of algorithmic complexity for computing a single NCC value under the blockmatching algorithm. The costs involving the construction of sum-tables is not included and analyzed below in Table A2. The size of the 3-D reference signal f are f x , f y , and f z for lateral, axial and elevation directions, respectively. The size of target signal g along the axial and lateral directions is the same as that of the reference signal f. However, the size of the 3-D target signal g for elevation direction is g z , which could differ from f z . The required number of operations in terms of setting up the sum tables is shown in Table A2. Recall that the standard NCC method has no need for using the sum-tables. A Summary of Algorithmic Complexity for Setting up Sum-tables needed for Lewis' and Luo-Konofagou methods.

Numerator
Addition None 3 f x f y ( f z + g z ) − f y ( f z + g z ) − f x f y

Multiplication
None f x f y ( f z + g z )

Subtraction
None None None None we need to access to both the reference RF volume data f (1024 × 128 × 3) and the target RF data g (1024 × 128 × 5). According to Equations (4)- (6), the RF volume data required to set up the sum-tables of f and g are 1024 × 128 × (3 + 1) and 1024 × 128 × (5 + 1). Therefore, the memory consumption for the corresponding sum-   An illustrative diagram show how the calculation of NCC values with and without sumtables are integrated into ultrasound-based motion tracking. RF signal stands for radiofrequency ultrasound echo signal. An illustration of calculating the 3-D sum-table under the GPU parallel computing environment.

Figure 3.
A photograph showing the CMUT ultrasound transducer and the tissue-mimicking phantom. The transducer is attached to a fixture that can be moved in the axial direction in order for the transducer to compress the phantom. The original picture was published in [19] and reuse permission has been granted for work presented in this paper.     A summary of computing time (mean ± one standard deviation) using 6 different implementations. 3D Displacement fields were first calculated for 30 slices. Then, the mean values (± one standard deviation) were derived and displayed below. The tracking kernel size and search region were 69 × 9 × 3 and 18 × 5 × 3, respectively.