Efficient Massive Computing for Deformable Volume Data Using Revised Parallel Resampling

In this paper, we propose an improved parallel resampling technique. Parallel resampling is a deformable object generation method based on volume data applied to medical simulations. Existing parallel resampling is not suitable for massive computing, because the number of samplings is high and floating-point precision problems may occur. This study addresses these problems to obtain improved user latency when performing medical simulations. Specifically, instead of interpolating values after volume sampling, the efficiency is improved by performing volume sampling after coordinate interpolation. Next, the floating-point error in the calculation of the sampling position is described, and the advantage of barycentric interpolation using a reference point is discussed. The experimental results showed a significant improvement over the existing method. Volume data comprising more than 600 images used in clinical practice were deformed and rendered at interactive speed. In an Internet of Everything environment, medical imaging systems are an important application, and simulation image generation is also valuable in the overall system. Through the proposed method, the performance of the whole system can be improved.


Introduction
Virtual medical procedures are being applied to clinical education and surgical planning, contributing to the improvement in medical services on the Internet of Everything (IoE). Generally, virtual medical procedures concern performing simulations of deforming human body data using volume-based and surface-based methods for volume deformation and rendering. Surface-based methods are advantageous in that they require fewer computational data and are faster compared to other methods. This is because the computation considers only the surface of the object. However, it has a disadvantage in that it is difficult to apply to topological changes such as cutting or merging. Although cutting has been studied extensively [1][2][3], merging is a difficult problem.
Medical simulation involves operations such as incision and suturing and, thus, it is appropriate to apply a volume-based deformation method that is independent of topology changes. Due to the large size of volume data, calculating the deformation is time consuming, and more efficient methods are desired. Recently, parallel deformation algorithms using GPU (chainmail [4], mass-spring [5], position-based dynamics [6], etc.) have been well used.
It is important to visualize the deformed data along with the deformation calculation. To render more precise results, the volume resolution needs to be expressed as large as 512 3 and, thus, visualization becomes time consuming. In this study, we visualized high-resolution deformed volumes by utilizing GPU parallelization.
The visualization of volume deformation is largely divided into two types (assuming that the general ray-casting method is used). The first method generates deformed viewing rays while leaving the volume data unchanged, and the second method generates new deformed volume data using straight viewing rays.
The first method performs iterative sampling for each ray in the original data. To calculate each sampling position in a deformed viewing ray, the inverse transform of the user deformation needs to be calculated, which becomes difficult when calculating a large number of inverse transforms. It becomes difficult to handle exceptions, such as the cut region, where inverse transformations do not exist, and the process is slow and potentially erroneous due to the use of iterative numerical solutions such as the Newton-Raphson or gradient-descent-like methods [7]. This issue has been addressed in the literature. When only a portion of the entire volume data was deformed, the screen area corresponding to the deformed portion was calculated in advance, and supersampling was applied to that area during rendering in [8]. In [9], an efficient approach to decompose the entire volume unequally was proposed. However, since the inverse transformation vector was simply defined as the opposite vector of the forward transformation, a large error may occur when the amount of change in the transformation vector is large. Since the inverse transformation does not exist in the cut region, this exception can be handled by making a special mark in the region where the inverse transformation does not exist. In [10], this problem was solved by introducing an alpha volume indicating the cut region. In [11], it was considered that when sampling was performed through inverse transform, and errors occur in the interpolation and gradient calculation process. In their study, image quality improvement in fine areas was realized by interpolation using a nonlinear higher-order function and by considering deformation in the gradient calculation.
The second method is to directly create deformed volume data from original volume data. If point-based forward mapping is applied, holes or overlapping problems generally occur. Although image-based backward mapping is a possible solution [12], it is time consuming to identify corresponding particles for each output grid position. Therefore, we used tetrahedron-based forward mapping using rasterization to address this problem. This technique was not considered feasible previously due to the high computation times, but recent advances in GPUs have made it possible. Parallel resampling is a typical tetrahedronbased forward mapping method for volume deformation using GPU parallelization.
For reference, the term parallel resampling is also used for particle filter techniques [13,14], which are common methods used to estimate the evolving state of nonlinear, non-Gaussian time-variant systems. However, the parallel resampling used in this study was different from the above studies, as it is a volume-based sampling technique. The basic parallel resampling method [15,16] cannot handle many tetrahedra due to the fact of its performance limitations. To implement more sophisticated deformations, new methods should be explored. In this study, we generated deformed volume data at a high speed by considering parallel resampling. By improving the resolution of the deformation, 512 3 data were deformed and visualized in real time.
The contributions of this paper are as follows:

1.
We propose an efficient volume deformation computing for massive data; 2.
User latency was improved through a high-speed deformable object creation algorithm; 3.
We present a more reliable barycentric interpolation method suitable for GPUs.

Overall Flow of Our System
The overall flow of this study is illustrated in three steps as shown in Figure 1. (Step 1) The entire volume data that needed to be transformed were composed of cells that were hexahedrons of a fixed size. Each cell was decomposed into five tetrahedra. The vertex matrices, X 0 and X 1 , were created using the coordinate values of the four vertices constituting one tetrahedron as column vectors. For reference, it was assumed that the coordinate values after deformation corresponding to X 1 were already calculated through simulation methods such as 3D chainmail [4] or mass-spring [5], and physical simulation was not within the scope of this study. (Step 2) The goal of this study was to store the resampling value for each grid point inside every tetrahedron to generate the entire deformed volume data. Therefore, first, in order to quickly extract the area inside the tetrahedron, the axisaligned bounding box (AABB) of the tetrahedron was calculated in deformed coordinates (Step 3). For each grid point belonging to the AABB region, it was tested whether the grid point was inside the tetrahedron. The resampling value was calculated at the grid points that passed the test. matrices, X0 and X1, were created using the coordinate values of the four vertices constituting one tetrahedron as column vectors. For reference, it was assumed that the coordinate values after deformation corresponding to X1 were already calculated through simulation methods such as 3D chainmail [4] or mass-spring [5], and physical simulation was not within the scope of this study. (Step 2) The goal of this study was to store the resampling value for each grid point inside every tetrahedron to generate the entire deformed volume data. Therefore, first, in order to quickly extract the area inside the tetrahedron, the axis-aligned bounding box (AABB) of the tetrahedron was calculated in deformed coordinates (Step 3). For each grid point belonging to the AABB region, it was tested whether the grid point was inside the tetrahedron. The resampling value was calculated at the grid points that passed the test. The structure of this paper is as follows: Section 2.1 describes Steps 1 and 2 of Figure  1 as a related study. In this study, Step 3 of Figure 1 is efficiently performed by applying the two proposed methods. Section 2.2 describes how to efficiently calculate resampling values, and Section 2.3 describes how to efficiently calculate the resampling position using the coordinate system. Next, in Section 3, the experimental results are presented, and in Section 4, the conclusions are drawn.

Related Work-Parallel Resampling
Parallel resampling is a method of storing forward mapping results in new volume data. When forward mapping is performed in a point-based manner, as shown in Figure  2a, problems such as overlaps or holes occur. If a kernel filter is used instead of a point, as shown in Figure 2b, overlap occurs a different number of times for each pixel, and parallelization becomes difficult. The structure of this paper is as follows: Section 2.1 describes Steps 1 and 2 of Figure 1 as a related study. In this study, Step 3 of Figure 1 is efficiently performed by applying the two proposed methods. Section 2.2 describes how to efficiently calculate resampling values, and Section 2.3 describes how to efficiently calculate the resampling position using the coordinate system. Next, in Section 3, the experimental results are presented, and in Section 4, the conclusions are drawn.

Related Work-Parallel Resampling
Parallel resampling is a method of storing forward mapping results in new volume data. When forward mapping is performed in a point-based manner, as shown in Figure 2a, problems such as overlaps or holes occur. If a kernel filter is used instead of a point, as shown in Figure 2b, overlap occurs a different number of times for each pixel, and parallelization becomes difficult. matrices, X0 and X1, were created using the coordinate values of the four vertices constituting one tetrahedron as column vectors. For reference, it was assumed that the coordinate values after deformation corresponding to X1 were already calculated through simulation methods such as 3D chainmail [4] or mass-spring [5], and physical simulation was not within the scope of this study. (Step 2) The goal of this study was to store the resampling value for each grid point inside every tetrahedron to generate the entire deformed volume data. Therefore, first, in order to quickly extract the area inside the tetrahedron, the axis-aligned bounding box (AABB) of the tetrahedron was calculated in deformed coordinates (Step 3). For each grid point belonging to the AABB region, it was tested whether the grid point was inside the tetrahedron. The resampling value was calculated at the grid points that passed the test. The structure of this paper is as follows: Section 2.1 describes Steps 1 and 2 of Figure  1 as a related study. In this study, Step 3 of Figure 1 is efficiently performed by applying the two proposed methods. Section 2.2 describes how to efficiently calculate resampling values, and Section 2.3 describes how to efficiently calculate the resampling position using the coordinate system. Next, in Section 3, the experimental results are presented, and in Section 4, the conclusions are drawn.

Related Work-Parallel Resampling
Parallel resampling is a method of storing forward mapping results in new volume data. When forward mapping is performed in a point-based manner, as shown in Figure  2a, problems such as overlaps or holes occur. If a kernel filter is used instead of a point, as shown in Figure 2b, overlap occurs a different number of times for each pixel, and parallelization becomes difficult. If rasterization is performed by connecting these points to a triangle, holes and overlaps can be avoided, and parallelization is also possible. The four vertices constituting the rectangle in the undeformed space are transformed into two adjacent triangles in the deformed space (Figure 3a,b). Since the output occurs only when the center of the pixel in the deformed space is in the triangle, the resampling operation occurs only once at each output coordinate.  If rasterization is performed by connecting these points to a triangle, holes and overlaps can be avoided, and parallelization is also possible. The four vertices constituting the rectangle in the undeformed space are transformed into two adjacent triangles in the deformed space (Figure 3a,b). Since the output occurs only when the center of the pixel in the deformed space is in the triangle, the resampling operation occurs only once at each output coordinate. In three dimensions, eight vertices constitute a hexahedral cell. As shown in Figure  4, the cell is divided (Figure 4a  Each tetrahedron is transformed by changing the coordinates of each vertex comprising the cell. The process of resampling inside each tetrahedron is as follows. For example, it is assumed that 2 3 voxels constitute one cell. Volume data with a size of L × M × N voxels comprise (L − 1) × (M − 1) × (N − 1) cells. If the cell comprises B 3 voxels, the volume data comprise approximately L/B × M/B × N/B cells. A cell is decomposed into five tetrahedra regardless of the cell size, and the size of each tetrahedron is proportional to the cell size.
Whether the output voxels are inside the transformed tetrahedron is determined using the barycentric coordinates, and resampling is performed at each voxel position inside the tetrahedron. The area near the tetrahedron is defined by the AABB of the tetrahedron, as shown in Figure 3c. For each candidate voxel inside the AABB, the barycentric coordinates (for the four vertices of the tetrahedron) are calculated. Since the calculated value (b) is the barycentric coordinates in the three-dimensional space, it is In three dimensions, eight vertices constitute a hexahedral cell. As shown in Figure 4, the cell is divided (Figure 4a  If rasterization is performed by connecting these points to a triangle, holes and overlaps can be avoided, and parallelization is also possible. The four vertices constituting the rectangle in the undeformed space are transformed into two adjacent triangles in the deformed space (Figure 3a,b). Since the output occurs only when the center of the pixel in the deformed space is in the triangle, the resampling operation occurs only once at each output coordinate. In three dimensions, eight vertices constitute a hexahedral cell. As shown in Figure  4, the cell is divided (Figure 4a  Each tetrahedron is transformed by changing the coordinates of each vertex comprising the cell. The process of resampling inside each tetrahedron is as follows. For example, it is assumed that 2 3 voxels constitute one cell. Volume data with a size of L × M × N voxels comprise (L − 1) × (M − 1) × (N − 1) cells. If the cell comprises B 3 voxels, the volume data comprise approximately L/B × M/B × N/B cells. A cell is decomposed into five tetrahedra regardless of the cell size, and the size of each tetrahedron is proportional to the cell size.
Whether the output voxels are inside the transformed tetrahedron is determined using the barycentric coordinates, and resampling is performed at each voxel position inside the tetrahedron. The area near the tetrahedron is defined by the AABB of the tetrahedron, as shown in Figure 3c. For each candidate voxel inside the AABB, the barycentric coordinates (for the four vertices of the tetrahedron) are calculated. Since the calculated value (b) is the barycentric coordinates in the three-dimensional space, it is Each tetrahedron is transformed by changing the coordinates of each vertex comprising the cell. The process of resampling inside each tetrahedron is as follows. For example, it is assumed that 2 3 voxels constitute one cell. Volume data with a size of L × M × N voxels comprise (L -1) × (M -1) × (N -1) cells. If the cell comprises B 3 voxels, the volume data comprise approximately L/B × M/B × N/B cells. A cell is decomposed into five tetrahedra regardless of the cell size, and the size of each tetrahedron is proportional to the cell size.
Whether the output voxels are inside the transformed tetrahedron is determined using the barycentric coordinates, and resampling is performed at each voxel position inside the tetrahedron. The area near the tetrahedron is defined by the AABB of the tetrahedron, as shown in Figure 3c. For each candidate voxel inside the AABB, the barycentric coordinates (for the four vertices of the tetrahedron) are calculated. Since the calculated value (b) is the barycentric coordinates in the three-dimensional space, it is expressed as a four-dimensional vector. When each component of the vector is between 0 and 1, it is determined to be inside a tetrahedron.
Representative existing studies using this approach include [15,16]. A tetrahedron was generated using a relatively large cell in [15], and a cell with the same voxel size was generated in [16]. This parallel resampling method can be performed in real time using a touch screen [17], and it can be applied by generating a tetrahedral mesh in an intermediate step [18] when generating volume data from a general mesh.

Efficient Sampling Using Coordinate Interpolation
In this study, we aimed to improve the sampling performance by combining the advantages of parallel resampling, Gascon's method [15], and Aguilera's method [16]. Moreover, the characteristics of the two previous studies are explained, and the differences from this study are shown. For convenience, each thread of the GPU is expressed as a thread, and the combined bundle of threads is expressed as a thread block.
Gascon's method assumes that the size of each tetrahedron is suitably large (more than a few tens of voxels in size). Therefore, one or several thread blocks correspond to each tetrahedron. Threads belonging to one thread block can share the information of the tetrahedron (X 0 , bounding box). The transformation of the tetrahedron is performed in the CPU, because the total number of tetrahedra was fewer than 5000 in Gascon's study. However, the number of tetrahedra has to be significantly increased in order to achieve smooth movement of deformation. In this study, one tetrahedron was assumed to be as small as the voxel size and, thus, there was no reason to configure one tetrahedron as a thread block and activate hundreds of threads.
Aguilera's method defines a vertex as a coordinate in deformed space and a density value. Coordinates in deformed space use the precalculated simulation results. A cell is a hexahedron with eight voxels as vertices, and the eight density values are obtained by performing texture sampling at each voxel position. Resampling concerns interpolating the density values stored at vertex positions. The resampling value at the grid points in the transformed space is calculated and stored in the deformed volume data. Aguilera's method [16] is different from Gascon's method [15]. In Gascon's method, one large tetrahedron contains several cells, whereas in Aguilera's method, one cell is decomposed into five very small tetrahedra. Each thread is used to process one cell, i.e., five tetrahedra.
In our study, we constructed a high-speed algorithm to generate precise results by combining only the advantages of the two previous studies. Aguilera's method was used for each thread processing one cell, which was decomposed into five small tetrahedra. Gascon's method was used for texture sampling in the rasterization step instead of sampling eight times for each cell in the modeling step. Our method is efficient because the tetrahedron is small, and the actual resampling in a small tetrahedron is infrequent.
Each thread is in charge of one cell to perform parallel processing. One cell is decomposed into five tetrahedra, and calculation is performed for each tetrahedron. The resampling is calculated at every grid point inside the AABB of each tetrahedron in deformed space (output volume data). The coordinates are obtained by the weighted average of the four tetrahedron vertices. Aguilera's method calculates the weighted average of the brightness values (Figure 5a 3 ), while Gascon's method calculates the weighted average of the coordinates in the undeformed space (Figure 5b 2 ). In this study, texture sampling was performed at the interpolated coordinates according to Gascon's method (Figure 5b 3 ). This method reduces the number of texture sampling compared to Aguilera's method. Since the cell size is 1 in undeformed space, on average, texture writing will occur only once for each cell, although one cell comprises five tetrahedrons.
As many threads as the number of hexahedral cells are launched, if we assume that the size of volume data is (volx, voly, and volz): number of threads = number of cells = (volx -1)·(voly -1)·(volz -1) Note that five tetrahedra are created for each cell: number of tetrahedra = 5·(volx -1)·(voly -1)·(volz -1) As many threads as the number of hexahedral cells are launched, if we assume that the size of volume data is (volx, voly, and volz): Note that five tetrahedra are created for each cell: The number of output voxels is volx•voly•volz with the same size as the input voxel. Since the tetrahedra are adjacent to each other without overlapping, the maximum number of resampling and writing occurs in volx•voly•volz, which is the size of the output data. The number of outputs for one cell is approximately 1. Compared to Aguilera's method, where texture sampling occurs eight times for one cell, the proposed method is more efficient.
We store both positions before and after deformation for the eight vertices of the cell, because resampling is performed after the deformation. The data required for each thread are 8 (vertices per cell) × 2 (before and after movement) × 3 (x, y, z) × 4 (size of float) = 192 bytes. In Aguilera's method, it is 8 (vertices per cell) × (3 (x, y, z) × 4 (size of float) + 2 (size of density value)) = 112 bytes. The proposed method uses slightly more memory than the existing method. For reference, Gascon's method shares the coordinates of a tetrahedron before deformation for each block; thus, the coordinates before deformation can be read from a precalculated memory. Gascon's method seems to require less memory, but it can be used only when the number of tetrahedra is small.
The last step of generating deformation data is to perform sampling and store each sampling value in the target volume data. Equation (4) is a matrix comprising the coordinates of the four vertices of a tetrahedron, where is generated with coordinates in undeformed space, and is generated with coordinates in deformed space. Sampling is performed at the coordinates, x0, before deformation, which is obtained from the coordinates x1 of the grid point after deformation. Since the transformation of one tetrahedron is assumed to be an affine transform, the barycentric coordinates of x0 and the barycentric coordinates of x1 are the same as in Equations (5)   The number of output voxels is volx·voly·volz with the same size as the input voxel. Since the tetrahedra are adjacent to each other without overlapping, the maximum number of resampling and writing occurs in volx·voly·volz, which is the size of the output data. The number of outputs for one cell is approximately 1. Compared to Aguilera's method, where texture sampling occurs eight times for one cell, the proposed method is more efficient.
We store both positions before and after deformation for the eight vertices of the cell, because resampling is performed after the deformation. The data required for each thread are 8 (vertices per cell) × 2 (before and after movement) × 3 (x, y, z) × 4 (size of float) = 192 bytes. In Aguilera's method, it is 8 (vertices per cell) × (3 (x, y, z) × 4 (size of float) + 2 (size of density value)) = 112 bytes. The proposed method uses slightly more memory than the existing method. For reference, Gascon's method shares the coordinates of a tetrahedron before deformation for each block; thus, the coordinates before deformation can be read from a precalculated memory. Gascon's method seems to require less memory, but it can be used only when the number of tetrahedra is small.
The last step of generating deformation data is to perform sampling and store each sampling value in the target volume data. Equation (4) is a matrix comprising the coordinates of the four vertices of a tetrahedron, where X 0 is generated with coordinates in undeformed space, and X 1 is generated with coordinates in deformed space. Sampling is performed at the coordinates, x 0 , before deformation, which is obtained from the coordinates x 1 of the grid point after deformation. Since the transformation of one tetrahedron is assumed to be an affine transform, the barycentric coordinates of x 0 and the barycentric coordinates of x 1 are the same as in Equations (5) and (6).
To obtain the barycentric coordinates, the coordinates x 1 of the voxel (Equation (6)) are defined in the form of a four-dimensional homogeneous coordinate and multiplied by the inverse of the matrix comprising four column vectors of the tetrahedral vertex coordinates. As shown in Equation (7), x 0 is obtained by multiplying the barycentric coordinates b by X 0 , which is the column vector matrix using four vertices of a tetrahedron in undeformed coordinates. The output data are generated with the value obtained by texture sampling on the undeformed coordinates.

Efficient Barycentric Interpolation for a Massive Number of Tetrahedra
As described in Equation (7), the calculation of the inverse matrix occurs for each tetrahedron. Since we considered the large number of 786 M (= 512 2 × 600 × 5) tetrahedra (Aguilera used 65 M tetrahedra [16]), efficient computation is required. Here, we explain the importance of efficient inverse matrix computation and discuss the numerical instability that occurs when the number of tetrahedra increases.

Barycentric Interpolation and Inverse Matrix
In this study, the coordinates in undeformed space were calculated using barycentric coordinates, and sampling was performed for each output grid point. As expressed by com-puteBaryCoords in Algorithm 1, the barycentric coordinates are calculated using the inverse matrix (Equation (6)). It is necessary to calculate the inverse matrix for each tetrahedron, but as the number of tetrahedra increases and the size decreases, it becomes numerically unstable. In the following example, the coordinates of the four points constituting a tetrahedron are A (255.9, 256.7, and 133.1), B (256.7, 255.9, and 133.4), C (256.7, 256.7, and 132.3), and D (255.9, 255.9, and 132.3). Using each point as a column vector, the inverse of the matrix X 1 is obtained as: However, if the inverse matrix of X 1 is calculated with a single-precision floating point (float), an error occurs. Appendix A shows finding the determinant, which is the first step to finding the inverse matrix. The correct determinant value is 1.216, but the calculation value using float is 2.010187, which shows an obvious error. The reason for this is that the formulas in the form of a·b·c-d·e·f are repeated to calculate the inverse matrix. Both a·b·c and d·e·f, which are the result of multiplying the coordinate values, are respectively large values (>10 6 ). However, since the result of a·b·c-d·e·f is small (<10), an error easily occurs when using float. In our study, since each cell was small, the coordinate values of the adjacent vertices constituting a tetrahedron were similar. As the number of cells is increased, this error becomes more prominent, and the result becomes unusable. The basic approach is to use the double-precision floating point (double). However, the double operation is significantly slower than the float operation, because a typical GPU contains less double-precision computing hardware. To solve this problem, we used a reference point for the barycentric coordinates described in next section.

Calculation of the Barycentric Coordinates Using the Reference Point
In this study, the inverse matrix calculation was used only to obtain the barycentric coordinates, b, to determine whether each point was inside the tetrahedron. Even when translating every point of the tetrahedron, the barycentric coordinates do not change. To keep the coordinate values as small as possible, we translated each point so that it was close to the origin (Figure 6a).
For convenience, the last vertex D among the four vertices of the tetrahedron was translated to the origin (Figure 6b), i.e., we calculated the barycentric coordinates with respect to D [19]. Since the size of the tetrahedron was very small, the coordinates of all points inside the AABB of the tetrahedron were located very close to the origin. Each value of a·b·c and d·e·f becomes smaller, and the error is negligible when float is used.
In this study, the inverse matrix calculation was used only to obtain the barycentric coordinates, b, to determine whether each point was inside the tetrahedron. Even when translating every point of the tetrahedron, the barycentric coordinates do not change. To keep the coordinate values as small as possible, we translated each point so that it was close to the origin (Figure 6a). For convenience, the last vertex D among the four vertices of the tetrahedron was translated to the origin (Figure 6b), i.e., we calculated the barycentric coordinates with respect to D [19]. Since the size of the tetrahedron was very small, the coordinates of all points inside the AABB of the tetrahedron were located very close to the origin. Each value of a·b·c and d·e·f becomes smaller, and the error is negligible when float is used.
Equation (5) can be rewritten as follows: Now, if all of the points, A, B, C, D, and X, are moved in parallel by -D, it can be expressed as: Therefore, if only the 3 × 3 submatrix is observed, the following is obtained: Here, b 3 (bA, bB, and bC) can be obtained by calculating only the inverse of the 3 × 3 matrix of (Equation (13)) instead of the 4 × 4 matrix. Moreover, the bD value is calculated using bA + bB + bC + bD = 1. In the process of matrix inversion, the form a·b-c·d is used instead of a·b·c-d·e·f; therefore, we can use float without errors. Although calculation of the barycentric coordinates using the reference point is not a new proposal, it is worth highlighting that float can be used instead of double. Equation (5) can be rewritten as follows: Now, if all of the points, A, B, C, D, and X, are moved in parallel by -D, it can be expressed as: Therefore, if only the 3 × 3 submatrix is observed, the following is obtained: Here, b 3 (b A , b B , and b C ) can be obtained by calculating only the inverse of the 3 × 3 matrix of (Equation (13)) instead of the 4 × 4 matrix. Moreover, the b D value is calculated using b A + b B + b C + b D = 1. In the process of matrix inversion, the form a·b-c·d is used instead of a·b·c-d·e·f ; therefore, we can use float without errors. Although calculation of the barycentric coordinates using the reference point is not a new proposal, it is worth highlighting that float can be used instead of double.

Experimental Setup
The program was developed using C++ based on Visual Studio. Rendering was performed using ray casting [20] and parallelized with CUDA [21] on a laptop equipped with a GeForce GTX 1650 Mobile and a desktop computer equipped with a GeForce RTX 2080. The volume data used in the experiment were anonymized medical image CT data, and the details are shown in Table 1. In this study, the transformation of the volume data is not of concern. We assumed that the transformation results existed and paid attention to generating volume data by parallel resampling. Therefore, we did not perform physics-based deformation separately but transformed the data with simple precalculated formulas. The three transformations generated for testing are as follows.
The Wave transform (Figure 7b) performs transverse translation in the x-axis direction. The Twist transform ( (Figure 7c) twists and rotates about the z-direction axis. The Bubble transform (Figure 7d) is expressed in the form of a sphere, and regional expansion and contraction occur. Assuming that the undeformed position is p, the center point of the volume data is c, the current time is t, and the deformed position is p , and each transformation is expressed as follows. In addition, x, y, and z are unit vectors in each axis direction. Twist Here, we show the effect of the proposed method using various testing transformations. In order to effectively reveal the experimental results, the experimental sequence was performed in the reverse order of the main sections. First, the results of the efficient barycentric interpolation method, described in Section 2.3, are presented, and then the effect of the sampling method proposed in Section 2.2 is shown.

Efficient Barycentric Interpolation
In order to show the efficiency of the barycentric interpolation method, described in Section 2.3, various methods for the inverse matrix calculation were analyzed. The abdomen data used and the average time of running 500 frames were measured. In Table 2, only the resampling time is indicated, and the rendering time was measured separately. The average rendering time was measured to be less than 1 ms on the desktop computer, and approximately 14 ms on the laptop. This is fast enough to enable real-time visualization. The Wave transform (Figure 7b) performs transverse translation in the x-axis direction. The Twist transform ( (Figure 7c) twists and rotates about the z-direction axis. The Bubble transform (Figure 7d) is expressed in the form of a sphere, and regional expansion and contraction occur. Assuming that the undeformed position is p, the center point of the volume data is c, the current time is t, and the deformed position is p′, and each transformation is expressed as follows. In addition, x, y, and z are unit vectors in each axis direction. Bubble: Here, we show the effect of the proposed method using various testing transformations. In order to effectively reveal the experimental results, the experimental sequence was performed in the reverse order of the main sections. First, the results of the efficient barycentric interpolation method, described in Section 2.3, are presented, and then the effect of the sampling method proposed in Section 2.2 is shown.

Efficient Barycentric Interpolation
In order to show the efficiency of the barycentric interpolation method, described in Section 2.3, various methods for the inverse matrix calculation were analyzed. The abdomen data used and the average time of running 500 frames were measured. In Table  2, only the resampling time is indicated, and the rendering time was measured separately. The average rendering time was measured to be less than 1 ms on the desktop computer, and approximately 14 ms on the laptop. This is fast enough to enable real-time visualization.  The execution time according to different inverse matrix calculations was measured, and as shown in the Table 2, the execution speed of test (a) was the slowest. This was because the inverse of a 4 × 4 matrix requires significant computation. Calculating the barycentric coordinates with respect to a vertex [19] reduces the matrix size 4 × 4 to 3 × 3. Experiment (b) calculated the inverse of a 3 × 3 matrix using the double and, thus, the amount of computation was reduced significantly. As a result, compared to experiment (a), the speed improved by more than double. As explained in Section 2.3 on the relationship between data types and error, in the case of the experiment (c) using the float operation, the speed was improved by 4 to 6 times compared to experiment (b). It was surprising that the performance of the float on the GPU was higher compared to the double. Comparing the execution times of the proposed method (c) with the brute-force method (a), the performance improved by 10 to 22 times.

Desktop Notebook
Although the proposed method improved the inverse matrix calculation speed, a different pattern was observed in the degree of improvement for the desktop and laptop computers. It improved by 14 to 22 times on the desktop and by 10 to 16 times on the laptop. In general, the graphics memory of laptops has a lower performance compared to desktops due to the fact of energy and heat problems. Therefore, the memory read/write bottleneck is more severe in a laptop. In the case of experiment (a), most of the time was consumed in calculating the inverse matrix comprising arithmetic operations. Further, in the case of experiment (c), the memory operations, such as resampling, become important, because the inverse matrix calculation has been optimized and reduced. Therefore, the performance improvement of (c) was partially reduced in the notebook. It was expected that the effect of the proposed method will be greater in hardware with high memory speed.
In the case of Wave, it was slightly faster than Twist or Bubble in experiment (a), because the calculation of Wave (Equation (14)) is simpler than that of the other equations. When we added some complex instructions to the Wave, the speed of Wave became similar to that of Twist and Bubble. For reference, the total execution time included the predefined deformation calculation time, inverse matrix calculation time, and data generation time using texture sampling.
In order to confirm that there were no errors in the output image of the proposed method, the output images using the existing method and the proposed method were compared. In the case of the output image using the proposed method and the existing method, the values of the image difference were exactly equal to 0. As described in Section 2.3, if the coordinate values are kept small by moving the tetrahedron to the origin, the inverse matrix can be calculated with sufficient precision, even when we use small cells and float operations.

Efficient Sampling Using Coordinate Interpolation
The performance improvement was analyzed by applying the sampling method proposed in Section 2.2. The 3 × 3 float matrix calculation method, which showed the best performance in Table 2 column (c), was commonly applied to generate Table 3. The average time was measured for 500 frames. Aguilera's method (a) obtained a weighted average from the eight sampled values for a cell, but the proposed method (b) sampled only once per output voxel by weighted averaging the coordinates in the original volume. When the proposed method was executed, it can be seen that the speed improved by 10-20% depending on the transform. Although the performance improvement was not significant, it was meaningful in that it provided additional performance improvement to the already optimized operation. In the above experiment, the degree of performance improvement varied according to the transform. In the case of Bubble, since only a part of the data was transformed, the data reusability of the nonmoving part was high. Moreover, the Twist was complicated, as shown in Table 2. In addition, when multiple threads stored the deformed points using the Twist, the memory addresses to be stored were not contiguous due to the rotation, thereby reducing the locality and efficiency of memory access. In summary, the proposed method was more effective with local range deformation. In medical simulations, deformation occurs locally such as pulling or incising a part of human body data. The proposed method is more suitable for general medical surgery simulation.
In this study, the execution time was independent of the distribution of density values such as bone and soft tissue arrangement, because we performed the same operation for each cell. However, the execution time was related to the transformation pattern and size of the volume data. Applying each transformation to various data, the execution time showed a proportional relationship with the size of the data. Table 4 presents the results of applying Wave transform to various data. The rendering results on the lung, colon, and legs data used are shown in Figure 8. In the above experiment, the degree of performance improvement varied according to the transform. In the case of Bubble, since only a part of the data was transformed, the data reusability of the nonmoving part was high. Moreover, the Twist was complicated, as shown in Table 2. In addition, when multiple threads stored the deformed points using the Twist, the memory addresses to be stored were not contiguous due to the rotation, thereby reducing the locality and efficiency of memory access. In summary, the proposed method was more effective with local range deformation. In medical simulations, deformation occurs locally such as pulling or incising a part of human body data. The proposed method is more suitable for general medical surgery simulation.
In this study, the execution time was independent of the distribution of density values such as bone and soft tissue arrangement, because we performed the same operation for each cell. However, the execution time was related to the transformation pattern and size of the volume data. Applying each transformation to various data, the execution time showed a proportional relationship with the size of the data. Table 4 presents the results of applying Wave transform to various data. The rendering results on the lung, colon, and legs data used are shown in Figure 8.

Conclusions
We proposed a novel efficient method of parallel resampling by developing volumebased parallel resampling. In modern deformation modeling, the number of cells increases significantly as the size of cells decreases in order to implement sophisticated movements. Thus, considering the fact that texture sampling and inverse matrix calculation are time consuming in existing methods, we highlighted the cause of the problems and suggested new methods to solve them.
In order to calculate the inverse transformation in the deformation simulation, the inverse matrix calculation using the vertex positions of the tetrahedron is frequently used. Considering that the error is related to the size of the elements of the matrices when calculating the inverse matrices, a method of maintaining smaller matrix elements was described. When the number of cells increased, the correct calculation could be performed solely by using the double in the existing method. Since modern GPUs are optimized for floats rather than doubles, this leads to performance degradation.

Conclusions
We proposed a novel efficient method of parallel resampling by developing volumebased parallel resampling. In modern deformation modeling, the number of cells increases significantly as the size of cells decreases in order to implement sophisticated movements. Thus, considering the fact that texture sampling and inverse matrix calculation are time consuming in existing methods, we highlighted the cause of the problems and suggested new methods to solve them.
In order to calculate the inverse transformation in the deformation simulation, the inverse matrix calculation using the vertex positions of the tetrahedron is frequently used. Considering that the error is related to the size of the elements of the matrices when calculating the inverse matrices, a method of maintaining smaller matrix elements was described. When the number of cells increased, the correct calculation could be performed solely by using the double in the existing method. Since modern GPUs are optimized for floats rather than doubles, this leads to performance degradation.
In this study, we set a reference point for each tetrahedron, and all vertices inside the tetrahedron were moved in parallel closer to the origin, as the reference point moved to the origin. The first advantage of the proposed method is that each element of the matrix becomes close to zero. It was possible to calculate a large number of cells without deterioration of the output data, even when using float. The second advantage is that the number of operations required for the inverse matrix can be reduced, because a 3 × 3 matrix can be used instead of a 4 × 4 matrix. As a result, a 10 to 20 time improvement was seen in the results of the experiments on a laptop and a desktop computer.
In addition, we proposed a method to efficiently perform texture sampling. In a previous study, to process one cell, texture sampling was performed at each vertex for a total of eight samplings. The output values were calculated by interpolating the sampled values. As the number of cells increases, the required texture sampling also increases proportionally, and performance degradation occurs. In this study, texture sampling was performed at the output voxel by interpolating the coordinates in the undeformed space. As a result, the execution speed was improved by reducing the number of texture samplings.
In the experiments in this study, three different deformation patterns were proposed, and their performances were observed accordingly. We observed a performance improvement for all deformations when the deformation pattern was relatively simple. Therefore,