Computational Optimization of the 3D Least-Squares Matching Algorithm by Direct Calculation of Normal Equations

3D least-squares matching is an algorithm for measuring subvoxel-precise displacements between two data sets of computed tomography voxel data. The determination of precise displacement vector fields is an important tool for deformation analyses in in-situ X-ray micro-tomography time series. The goal of the work presented in this publication is the development and validation of an optimized algorithm for 3D least-squares matching that saves computation time and memory. 3D least-squares matching is a gradient-based method to determine geometric (and optionally also radiometric) transformation parameters between corresponding cuboids in voxel data. These parameters are obtained by an iterative Gauss-Markov process. Herein, the most crucial point concerning computation time is the calculation of the normal equations using matrix multiplications. In the paper at hand, a direct normal equation computation approach is proposed, minimizing the number of computation steps. A theoretical comparison shows that the number of multiplications is reduced by 28% and the number of additions by 17%. In a practical test, the computation time of the 3D least-squares matching algorithm was shown to be reduced by 27%.


Introduction
Volume data analysis in a voxel space representation is a logical extension of 2D image data processing. In the fields of photogrammetry and computer vision, image correlation techniques are often used for fine measurement of corresponding points in image sequences or stereo image pairs. Early contributions applied cross-correlation to image data [1]. In the early 1980s, gradient-based algorithms were developed using an iterative least-squares method [2] that also offers subpixel precision. Lucas and Kanade [2] only introduced two translation parameters, whereas Ackermann [3] and Grün [4] extended the model toward an affine and radiometric transformation, which also considers rotation, scaling and shear to obtain a better adaption to perspective distortions. As an example for the least-squares matching (LSM) algorithm, an image section of a textured concrete surface with two views is illustrated in Figure 1. Figure 1a shows a section of the reference image with the patch (subimage) consisting of the pixels included in the computation. The pixels are illustrated as red squares with red circles in the centers. In Figure 1b, an image section of the second view as well as the transformed patch (red squares with red circles) are depicted. In this example, the transformation includes translation, rotation, scale and shear, and the corresponding parameters are obtained by the LSM algorithm. The most important parameters are the shifts, which typically show an accuracy of 0.02 to 0.05 pixels in this kind of application. It is also possible to use more complex geometric models for the perspective transformation [5]. Grün and Baltsavias [6] introduced geometrically constrained multiphoto matching, where the solution can inherently be forced onto the epipolar line. Other contributions combined digital image matching and object surface reconstruction. In addition to a geometric 3D model of the surface, they used a radiometric model for the surface and the images.
The unknowns of the surface densities, 3D points and image orientation parameters were computed in an iterative least-squares adjustment [7,8].
Maas et al. [9] transferred the idea from 2D image matching to 3D voxel data (digital volume correlation, DVC). They applied subvoxel-precise cuboid tracking (3D least-squares matching) with the 12 parameters of a 3D affine transformation to a sequence of multi-temporal voxel representations of mixing fluids. Bay et al. [10] extended the cross-correlation technique to 3D, combined it with a subvoxel-precise refinement and applied it to X-ray tomography data for strain analysis. Liebold et al. [11] also introduced a radiometric transformation to 3D cuboid matching.
In recent years, free DVC code has also become available. For instance, TomoWarp2 enables the computation of 3D displacement fields using 3D cross-correlation [12]. It also implements a further step to calculate subvoxel positions considering the maximum cross-correlation integer position and the 26 neighboring voxels around it. The free software package SPAM is based on TomoWarp2 [13] and also provides a tool for coarse registration that can consider a rotation between the reference and the deformed volume. In ALDVC [14], a Fast Fourier Transform (FFT) is applied for the estimation of initial shift values, followed by an algorithm including a 12-parameter affine transformation similar to 3D least-squares matching for computing subvoxel positions.
The least-squares matching algorithm contains an adjustment procedure where normal equations are computed. To speed up the process, Sutton et al. [15] showed a direct computation of the normal equations for 2D least-squares matching including affine parameters, and Jin et al. [16] also included radiometric parameters.
The publication at hand is based on the work of Maas et al. [9] and Liebold et al. [11]. It presents an optimized algorithm for 3D LSM, similar to what has been shown for 2D [15,16]. The central goal of the work is the acceleration of the 3D algorithm in order to save computation time and memory, without restricting the universality of the algorithm. The paper is structured as follows: in the next section, an overview of the computation of 3D least-squares matching is given, including the optimization. In addition, a comparison to the standard method is shown. The publication closes with a conclusion.

Mathematical Model
In the following, some formulae of Liebold et al. [11] are reused and extended. 3D least-squares matching is a gradient-based method and is applied to two voxel data sets. In the reference volume, a small cuboid is defined containing the voxel of the cuboid's center (point to be matched) and its neighborhood. The dimensions dim_x, dim_y and dim_z are odd numbers, and a typical cuboid size is for example 13 × 13 × 13 = 2197 voxel. The aim is to find the corresponding cuboid in the second volume data set. The coordinates of the center of the cuboid in the reference state x_Ref,c, y_Ref,c, z_Ref,c are integer values, whereas the coordinates of the voxel in the second state x_Def, y_Def, z_Def are floating-point values.
Similar to Figure 1 for 2D, Figure 2 illustrates the 3D case with a voxel grid (gray). In the reference state, a subvolume consisting of e.g., 5 × 5 × 5 voxel is defined (green cuboid with red edges and red center points, Figure 2a). In the second state, depicted in Figure 2b, the affine-transformed cuboid as well as the faded reference subvolume are shown. The shift vector that connects the center points of the subvolumes is colored in magenta. 3D LSM is used to determine the transformation parameters, including the three translations. The range of the cuboid in the reference volume is:

x_Ref,c − (dim_x − 1)/2 ≤ x_Ref ≤ x_Ref,c + (dim_x − 1)/2
y_Ref,c − (dim_y − 1)/2 ≤ y_Ref ≤ y_Ref,c + (dim_y − 1)/2   (1)
z_Ref,c − (dim_z − 1)/2 ≤ z_Ref ≤ z_Ref,c + (dim_z − 1)/2

Equation (2) shows the relationship between the gray values of the first (reference) and second (deformed) volume, also taking into account radiometric parameters (r_0, r_1). This relationship is valid for each voxel of the cuboid:

f(x_Ref, y_Ref, z_Ref) = r_0 + r_1 · g(x_Def, y_Def, z_Def)   (2)

where f = gray value function of the reference volume and g = gray value function of the deformed volume.
Between the reference and the deformed state, a coordinate transformation is performed. Similar to [3] in 2D and [9] in 3D, an affine transformation is used as the mathematical model, containing displacements, rotations, scaling and shear, see Equation (3), where a_0, b_0 and c_0 are the translation parameters:

x_Def = a_0 + a_1·x̃ + a_2·ỹ + a_3·z̃
y_Def = b_0 + b_1·x̃ + b_2·ỹ + b_3·z̃   (3)
z_Def = c_0 + c_1·x̃ + c_2·ỹ + c_3·z̃
where a_i, b_i, c_i = parameters of the affine transformation and x̃, ỹ, z̃ = reduced reference coordinates. The reduced reference coordinates are computed as follows:

x̃ = x_Ref − x_Ref,c,  ỹ = y_Ref − y_Ref,c,  z̃ = z_Ref − z_Ref,c   (4)

Thus, the vector of unknowns is:

p = (a_0, a_1, a_2, a_3, b_0, b_1, b_2, b_3, c_0, c_1, c_2, c_3, r_0, r_1)^T   (5)

The parameter vector p consists of 14 unknowns, and a cuboid with an exemplary size of 13 × 13 × 13 = 2197 voxel (observations) leads to an over-determination. Thus, the parameters of p are computed in a least-squares adjustment process. Therefore, residuals are added to Equation (2), which results in the observation equations:

f(x_Ref, y_Ref, z_Ref) + v(x_Ref, y_Ref, z_Ref) = r_0 + r_1 · g(x_Def, y_Def, z_Def)   (6)

where v(x_Ref, y_Ref, z_Ref) = residuum. The first terms of the Taylor series expansion are used to linearize the observation equation, see Equation (7), where initial values of the parameters are included. They are marked with an additional zero subscript. Section 2.2 shows a way to obtain initial values for 3D LSM.
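The reduced coordinates of Equation (4) and the affine transformation of Equation (3) can be sketched as follows. This is an illustrative implementation; the function and variable names are ours, not taken from the paper's code:

```cpp
#include <array>

// Sketch of Equations (3) and (4): reduced reference coordinates and the
// 12-parameter affine transformation into the deformed volume.
// p = {a0,a1,a2,a3, b0,b1,b2,b3, c0,c1,c2,c3} (parameter names as in the text).
std::array<double, 3> affineTransform(const std::array<double, 12>& p,
                                      double xRef, double yRef, double zRef,
                                      double xc, double yc, double zc) {
    const double xt = xRef - xc; // x~ = x_Ref - x_Ref,c (Equation (4))
    const double yt = yRef - yc; // y~
    const double zt = zRef - zc; // z~
    return { p[0] + p[1] * xt + p[2] * yt + p[3] * zt,   // x_Def
             p[4] + p[5] * xt + p[6] * yt + p[7] * zt,   // y_Def
             p[8] + p[9] * xt + p[10] * yt + p[11] * zt }; // z_Def
}
```

For a pure "no motion" case, a_0, b_0, c_0 equal the reference center coordinates and a_1 = b_2 = c_3 = 1, so every voxel maps onto itself.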
f(x_Ref, y_Ref, z_Ref) + v(x_Ref, y_Ref, z_Ref) ≈ r_0,0 + dr_0 + (r_1,0 + dr_1) · g(x_Def,0, y_Def,0, z_Def,0) + g_x · dx_Def + g_y · dy_Def + g_z · dz_Def   (7)

where r_0,0 = initial value of the brightness parameter r_0 and r_1,0 = initial value of the contrast parameter r_1. The volume gradients of the gray values are written with their short forms in Equation (7). The full notation is:

g_x = ∂g/∂x_Def,  g_y = ∂g/∂y_Def,  g_z = ∂g/∂z_Def,  evaluated at (x_Def,0, y_Def,0, z_Def,0)   (8)

There are different ways to obtain these gradients. One possibility is the computation of central differences g_x,i, g_y,i, g_z,i (numerical differentiation), see Equation (9):

g_x,i = [g(x+1, y, z) − g(x−1, y, z)] / 2
g_y,i = [g(x, y+1, z) − g(x, y−1, z)] / 2   (9)
g_z,i = [g(x, y, z+1) − g(x, y, z−1)] / 2

where g_x,i, g_y,i, g_z,i = derivatives in x, y and z direction at integer positions. Another approach is the use of tri-cubic spline interpolation to compute the derivatives [17] (derivatives of cubic polynomials). A further possibility is the calculation of the derivatives in the reference volume as an approximation; the advantage of this option is that the derivatives need to be computed only once in the iterative least-squares process.
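The central differences of Equation (9) can be sketched as below for a volume stored as a flat array (x-fastest layout is an assumption of this sketch). On a linear ramp of gray values, the central difference reproduces the gradient exactly:

```cpp
#include <vector>
#include <cstddef>

struct Gradient { double gx, gy, gz; };

// Sketch of Equation (9): central differences at integer voxel position
// (x, y, z) of a volume with dimensions dimX x dimY x dimZ, stored x-fastest.
// The caller must keep (x, y, z) at least one voxel away from the border.
Gradient centralDiff(const std::vector<double>& g, int dimX, int dimY,
                     int x, int y, int z) {
    auto at = [&](int i, int j, int k) {
        return g[static_cast<std::size_t>((k * dimY + j) * dimX + i)];
    };
    return { (at(x + 1, y, z) - at(x - 1, y, z)) / 2.0,
             (at(x, y + 1, z) - at(x, y - 1, z)) / 2.0,
             (at(x, y, z + 1) - at(x, y, z - 1)) / 2.0 };
}
```

For a test volume with g = 2x + 3y + 5z, the function returns exactly (2, 3, 5) at any interior voxel.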
Thus, the differentials dx_Def, dy_Def and dz_Def are:

dx_Def = da_0 + da_1·x̃ + da_2·ỹ + da_3·z̃
dy_Def = db_0 + db_1·x̃ + db_2·ỹ + db_3·z̃   (10)
dz_Def = dc_0 + dc_1·x̃ + dc_2·ỹ + dc_3·z̃

The vector of corrections to the unknowns is:

dp = (da_0, da_1, da_2, da_3, db_0, db_1, db_2, db_3, dc_0, dc_1, dc_2, dc_3, dr_0, dr_1)^T   (11)

The vector of unknowns results from the sum of the vector of the initial values p_0 and the vector of the corrections to the unknowns dp:

p = p_0 + dp   (12)

The linearized observation equations can be written in matrix notation:

l + v = A · dp   (13)

where A = Jacobian matrix, l = reduced observation vector and v = vector of residuals. The Jacobian matrix is built up from the derivatives of the observation equations with respect to the unknowns. With n = dim_x · dim_y · dim_z observations, row i of A reads:

A_i = (g_x,i, g_x,i·x̃_i, g_x,i·ỹ_i, g_x,i·z̃_i, g_y,i, g_y,i·x̃_i, g_y,i·ỹ_i, g_y,i·z̃_i, g_z,i, g_z,i·x̃_i, g_z,i·ỹ_i, g_z,i·z̃_i, 1, g_i)   (14)–(17)

The reduced observation vector l contains the elements:

l_i = f_i − r_0,0 − r_1,0 · g_i   (18), (19)

where f_i, g_i = gray values of the reference and the deformed cuboid at voxel i. In the Gauss-Markov model, the weighted sum of the squared residuals is minimized:

v^T · W · v → min over dp   (20)

where W = matrix of weights. In the following, only the case of equally weighted observations (W = I, I: identity matrix) is considered:

v^T · v → min over dp   (21)
The solution of this problem can be obtained using the normal equations:

A^T · A · dp = A^T · l   (22)

As the computation of the normal matrix A^T · A and the right-hand side A^T · l is done directly, it is not necessary to build up and store the whole Jacobian matrix A and the reduced observation vector l. The detailed computation process is shown in Section 2.3.
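For a well-textured cuboid, the normal matrix A^T · A is symmetric and positive definite, so the normal equations can be solved e.g. by a Cholesky decomposition. The paper does not prescribe a particular solver; the following is our minimal sketch for a small dense system:

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Solve N * dp = b for a symmetric positive definite matrix N (row-major,
// n x n) via Cholesky decomposition N = L * L^T and two substitution passes.
std::vector<double> solveNormal(std::vector<double> N, std::vector<double> b) {
    const std::size_t n = b.size();
    // Decompose in place: the lower triangle of N becomes L.
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t k = 0; k < j; ++k) N[j*n + j] -= N[j*n + k] * N[j*n + k];
        N[j*n + j] = std::sqrt(N[j*n + j]);
        for (std::size_t i = j + 1; i < n; ++i) {
            for (std::size_t k = 0; k < j; ++k) N[i*n + j] -= N[i*n + k] * N[j*n + k];
            N[i*n + j] /= N[j*n + j];
        }
    }
    // Forward substitution: L * y = b (y overwrites b).
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t k = 0; k < i; ++k) b[i] -= N[i*n + k] * b[k];
        b[i] /= N[i*n + i];
    }
    // Back substitution: L^T * dp = y (dp overwrites b).
    for (std::size_t i = n; i-- > 0; ) {
        for (std::size_t k = i + 1; k < n; ++k) b[i] -= N[k*n + i] * b[k];
        b[i] /= N[i*n + i];
    }
    return b;
}
```

Since only dp is needed, the decomposition can run in place on the accumulated normal matrix, which fits the memory-saving goal of the direct method.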
After setting up A^T · A and A^T · l, Equation (22) is solved for dp and the initial values of the unknowns are updated:

p_0 ← p_0 + dp   (23)

The steps of the interpolation of the gray values and their derivatives, the computation of A^T · A and A^T · l, the determination of dp and the update of the unknowns are repeated until the process converges.
After the iterative adjustment routine, the residuals are determined by rearranging Equation (6). The gray values g_i are obtained by applying Equation (3) with the updated affine parameters and tri-linear (or tri-cubic) interpolation. The individual residual v_i is:

v_i = r_0 + r_1 · g_i − f_i   (24)

The standard deviation of the unit weight is:

s_0 = sqrt( Σ_i v_i² / (n − u) )   (25)

where s_0 = standard deviation of the unit weight, n = number of voxel of the cuboid and u = number of unknowns.
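The residual computation of Equation (24) and the standard deviation of unit weight of Equation (25) amount to a few lines; a minimal sketch (names are ours):

```cpp
#include <vector>
#include <cmath>
#include <cstddef>

// Sketch of Equations (24) and (25): residuals of the radiometric observation
// model v_i = r0 + r1*g_i - f_i and the a posteriori standard deviation of
// unit weight s0 = sqrt(sum(v_i^2) / (n - u)), with u = number of unknowns.
double sigma0(const std::vector<double>& f, const std::vector<double>& g,
              double r0, double r1, int u) {
    double vv = 0.0; // sum of squared residuals
    for (std::size_t i = 0; i < f.size(); ++i) {
        const double v = r0 + r1 * g[i] - f[i];
        vv += v * v;
    }
    return std::sqrt(vv / (static_cast<double>(f.size()) - u));
}
```

s_0 is a useful quality indicator per matched cuboid: poorly textured or decorrelated cuboids show markedly larger values.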

Initial Values
The model in Equation (2) is not linear, and since the movements between the volume data sets may exceed the dimensions of the cuboid, initial values have to be obtained for the 3D LSM algorithm.
Often, mainly translations and only small rotations occur between different epochs of measurements. In these cases, 3D cross-correlation can be used to obtain initial shifts [10]. An overview of the optimized algorithm can be found in Appendix A.

Direct Computation of the Normal Equations
As mentioned above, the normal matrix A^T · A and the right-hand side vector A^T · l are computed directly, so that building up the Jacobian matrix A and the reduced observation vector l as well as the matrix multiplications can be avoided in order to gain efficiency and save memory. To compute the terms of the normal equations, the dot products of the columns of the Jacobian matrix (Equation (14)) with each other (each combination of columns) as well as with the reduced observation vector are calculated. Equation (26) shows the normal matrix, which is organized in submatrices. G_11,i, G_12,i, G_13,i, G_22,i, G_23,i, G_33,i and G_44,i as well as the whole normal matrix are symmetric, and only the upper triangular parts are shown. In Equation (28), the right-hand side of the normal equations is depicted.
A^T · A = Σ_i [G_11,i  G_12,i  G_13,i  G_14,i;  ·  G_22,i  G_23,i  G_24,i;  ·  ·  G_33,i  G_34,i;  ·  ·  ·  G_44,i]   (26)

With m_i = (1, x̃_i, ỹ_i, z̃_i)^T, the submatrices are:

G_11,i = g_x,i² · m_i·m_i^T,   G_12,i = g_x,i·g_y,i · m_i·m_i^T,   G_13,i = g_x,i·g_z,i · m_i·m_i^T,
G_22,i = g_y,i² · m_i·m_i^T,   G_23,i = g_y,i·g_z,i · m_i·m_i^T,   G_33,i = g_z,i² · m_i·m_i^T   (27)

The coupling terms with the radiometric columns (1, g_i) are:

G_14,i = g_x,i · m_i · (1, g_i),   G_24,i = g_y,i · m_i · (1, g_i),   G_34,i = g_z,i · m_i · (1, g_i),
G_44,i = [1, g_i; g_i, g_i²]

The right-hand side of the normal equations is:

A^T · l = Σ_i l_i · A_i^T   (28)

In fact, the geometric and radiometric parameters are usually independent of each other and are therefore often solved independently in LSM. If only the affine parameters are unknowns and the radiometric parameters are excluded from the model, the last two columns and the last two rows of A^T · A as well as the last two values of A^T · l are omitted. Appendices B and C consider the two cases concerning this point.
For each voxel, the following terms are computed and saved in temporary variables because they are used several times:

gxx_i = g_x,i·x̃_i,  gxy_i = g_x,i·ỹ_i,  gxz_i = g_x,i·z̃_i,
gyx_i = g_y,i·x̃_i,  gyy_i = g_y,i·ỹ_i,  gyz_i = g_y,i·z̃_i,
gzx_i = g_z,i·x̃_i,  gzy_i = g_z,i·ỹ_i,  gzz_i = g_z,i·z̃_i,
l_i = f_i − r_0,0 − r_1,0·g_i   (29)

Substituting the terms of Equation (29), the submatrices of the normal matrix and the right-hand side are shown in Equations (30) and (31). This reduces the number of computation steps, too.
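The per-voxel accumulation can be sketched as follows. This illustrative version forms each Jacobian row on the stack and sums only the upper triangle of A^T · A, so the n × 14 matrix A is never stored; the paper's fully optimized variant additionally reuses the temporary products of Equation (29), which this sketch recomputes for clarity. Struct and function names are ours:

```cpp
#include <array>

// Direct accumulation of the upper triangle of the 14x14 normal matrix
// N = A^T * A and of the right-hand side b = A^T * l, one voxel at a time.
// Assumed row layout (Equation (14)): the 12 affine columns, then the
// radiometric columns 1 and g_i; l_i = f_i - r0 - r1*g_i (Equation (29)).
struct VoxelObs { double gx, gy, gz, xt, yt, zt, g, f; };

void addVoxel(const VoxelObs& v, double r0, double r1,
              std::array<double, 14 * 14>& N, std::array<double, 14>& b) {
    const double a[14] = { v.gx, v.gx * v.xt, v.gx * v.yt, v.gx * v.zt,
                           v.gy, v.gy * v.xt, v.gy * v.yt, v.gy * v.zt,
                           v.gz, v.gz * v.xt, v.gz * v.yt, v.gz * v.zt,
                           1.0,  v.g };
    const double l = v.f - r0 - r1 * v.g;
    for (int j = 0; j < 14; ++j) {
        // upper triangle only; the lower triangle follows from symmetry
        for (int k = j; k < 14; ++k) N[j * 14 + k] += a[j] * a[k];
        b[j] += a[j] * l;
    }
}
```

After the loop over all voxels of the cuboid, N and b contain the sums of Equations (26) and (28) and the system of Equation (22) can be solved directly.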

Computational Effort
The computational costs are compared between the direct computation and the standard way, which includes building up A and l as well as the matrix multiplications A^T · A and A^T · l. The detailed determination of the number of multiplication steps and addition steps is shown in the following.

Computational Costs of the Direct Method
A total of 100 terms have to be computed to form the normal equations directly, see Equation (32). This results from 60 terms of the matrices G 11,i , G 12,i , G 13,i , G 22,i , G 23,i , G 33,i and 26 terms of the matrices G 14,i , G 24,i , G 34,i , G 44,i as well as 14 further terms of A T · l. 26 terms (sums) of A T · l are already computed in G 14,i , G 24,i , G 34,i and G 44,i .
To evaluate the computational effort, the numbers of additions and multiplications are counted: altogether, for the direct computation of A^T · A and A^T · l, 95 multiplications and 100 additions per voxel are required. The remaining multiplications and additions for assembling A^T · A and A^T · l can be neglected due to the large number of voxel.

Computational Costs of the Standard Method
The computational effort of the standard method contains building up the Jacobian matrix A and the reduced observation vector l as well as the matrix-matrix multiplication and the matrix-vector multiplication.
The following terms are calculated and saved in temporary variables for each row of the matrix A (for each voxel), i.e. the nine products of the gradients with the reduced coordinates:

gxx_i = g_x,i·x̃_i,  gxy_i = g_x,i·ỹ_i,  gxz_i = g_x,i·z̃_i,
gyx_i = g_y,i·x̃_i,  gyy_i = g_y,i·ỹ_i,  gyz_i = g_y,i·z̃_i,
gzx_i = g_z,i·x̃_i,  gzy_i = g_z,i·ỹ_i,  gzz_i = g_z,i·z̃_i

For each element of l, there are one multiplication and two additions (subtractions): l_i = f_i − r_0,0 − r_1,0·g_i. Altogether, for building up A and l and for the matrix multiplications A^T · A and A^T · l, 132 multiplications and 121 additions per voxel are required.

Theoretical Comparison
Table 1 summarizes these considerations and shows the number of multiplications and additions per voxel for the standard method and the direct computation of the normal equations. The direct method reduces the number of multiplications by 28% and the number of additions by 17%. Appendices B and C also consider the cases that the radiometric parameters are fixed and that these parameters are omitted. In addition to the reduced number of multiplications and additions, the direct method also saves memory because the n × 14 Jacobian matrix A and the n × 1 observation vector l do not have to be stored.

Practical Comparison
The comparison was done on a desktop machine with an AMD Ryzen 7 3800X 8-core (16 threads) processor at 3.9 GHz. In the test, 3D LSM was applied to 146,933 cuboids defined in the first two data sets of an in-situ tension test of Lorenzoni et al. [18]. The patch dimension was set to 15 × 15 × 15 vx. The algorithm is implemented in C++ (using the GNU Compiler Collection, https://gcc.gnu.org) and compiled with the optimization option -O3. The loop over the matching points with the computation of the 3D LSM is parallelized with the OpenMP library (16 parallel units), and the time measurement considers only this loop. The measurement thus covers the whole 3D LSM process, not only the calculation of the normal matrix A^T · A and the right-hand side vector A^T · l; the computation of the normal equations is, however, one of the most time-consuming steps herein.

Different calculation modes are tested: method M1 stands for the direct computation. M2 represents the standard computation with building up the Jacobian matrix A and the reduced observation vector l as well as the matrix multiplications (only the upper triangular part of the normal matrix) with our own implementation. M3 and M4 are similar to M2, but the matrix multiplications are done using the Eigen library, a C++ template library for linear algebra (release 3.4.0, https://eigen.tuxfamily.org/). For M3, the full matrix A^T · A is computed; for M4, only the upper triangular part. For all modes, the interpolated gray values and derivatives are computed and saved in separate arrays before the calculation of the normal matrix begins. All modes lead to the same numerical results; only the calculation times differ. Table 2 shows the calculation times for the different modes. The direct method is the fastest variant and almost twice as fast as our own standard computation M2. Compared to the best matrix multiplication variant M4 using Eigen (upper triangular matrix only), the direct method reduces the computation time by 27%. This reduction approximately agrees with the theoretical decrease of the multiplications of 28% (Section 2.4.3). The computation time difference between M3 and M4 is very small (only a 9% decrease), although significantly more matrix elements (91) have to be computed for M3 (M3: 223 multiplications and 212 additions per voxel, M4: 132 multiplications and 121 additions per voxel).

Conclusions
The paper presents an optimized algorithm to compute 3D least-squares matching in voxel data sequences including geometric and radiometric parameters. Compared to the standard method of computing the normal equations, the number of multiplications is reduced by 28% and the number of additions by 17%. In a practical test, the computation time was decreased by 27%.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

LSM    least-squares matching
vx     voxel
DVC    digital volume correlation
SHCC   strain-hardening cement-based composites
FFT    Fast Fourier Transform

Appendix A. Optimized 3D Cross-Correlation
In the following, an optimized method of the 3D cross-correlation is presented. In our approach, the normalized cross-correlation coefficient is used, see Equation (A2). t is the template, i.e. the subvolume of the reference volume to be matched; the number of voxel of the template is n_cc (A1). With the template half-sizes h_x, h_y, h_z, the coefficient at search position (x, y, z) is:

γ_{x+h_x, y+h_y, z+h_z} = Σ_{i,j,k} ĝ_{x,y,z,i,j,k} · t̂_{i,j,k} / sqrt( Σ_{i,j,k} ĝ²_{x,y,z,i,j,k} · Σ_{i,j,k} t̂²_{i,j,k} )   (A2)

where
ĝ_{x,y,z,i,j,k} = g_{x+i−1, y+j−1, z+k−1} − ḡ_{x+h_x, y+h_y, z+h_z}
ḡ_{x+h_x, y+h_y, z+h_z} = (1/n_cc) · Σ_{i,j,k} g_{x+i−1, y+j−1, z+k−1}  (local mean intensity)
t̂_{i,j,k} = t_{i,j,k} − t̄
t̄ = (1/n_cc) · Σ_{i,j,k} t_{i,j,k}  (mean intensity of the template)

The terms of Equation (A2) are transformed as follows:

Σ_{i,j,k} ĝ_{x,y,z,i,j,k} · t̂_{i,j,k} = Σ_{i,j,k} [g_{x+i−1,y+j−1,z+k−1} · t_{i,j,k} − g_{x+i−1,y+j−1,z+k−1} · t̄ − t_{i,j,k} · ḡ_{x+h_x,y+h_y,z+h_z} + t̄ · ḡ_{x+h_x,y+h_y,z+h_z}]
= Σ_{i,j,k} g_{x+i−1,y+j−1,z+k−1} · t_{i,j,k} − n_cc · t̄ · ḡ_{x+h_x,y+h_y,z+h_z}   (A3)

Σ_{i,j,k} ĝ²_{x,y,z,i,j,k} = Σ_{i,j,k} g²_{x+i−1,y+j−1,z+k−1} − n_cc · ḡ²_{x+h_x,y+h_y,z+h_z}   (A4)

The values of Σ_{i,j,k} g_{x+i−1,y+j−1,z+k−1} · t_{i,j,k} are computed for the whole search area using the 3D Fast Fourier transform, as also done by Lewis [19] for 2D and Yang et al. [14] for 3D:

F^{−1}{ F(g) ∘ F*(t′) }

where F* are the complex conjugate values (∘: Hadamard product, element-wise multiplication). t′ is a volume data set with the same size as the search volume containing t and is filled up with zeros for indices exceeding the dimensions of t.

The values of ḡ_{x+h_x,y+h_y,z+h_z} and Σ_{i,j,k} g²_{x+i−1,y+j−1,z+k−1} are computed using the integral volume method [20]. The terms Σ_{i,j,k} t²_{i,j,k} and t̄ are computed straightforwardly. The position of the maximum coefficient max_{x,y,z} γ_{x+h_x,y+h_y,z+h_z} can be used as initial position for the 3D LSM.
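The integral volume method (a 3D summed-area table) provides any box sum in eight lookups via inclusion-exclusion, which is what makes the local means ḡ and the local sums of g² cheap over the whole search area. A minimal sketch, assuming x-fastest storage (names are ours):

```cpp
#include <vector>

// 3D integral volume: S(x,y,z) = sum of g over all voxels with indices below
// (x,y,z). S has dimensions (dimX+1) x (dimY+1) x (dimZ+1), zero-padded at 0.
std::vector<double> integralVolume(const std::vector<double>& g,
                                   int dimX, int dimY, int dimZ) {
    std::vector<double> S((dimX + 1) * (dimY + 1) * (dimZ + 1), 0.0);
    auto s = [&](int x, int y, int z) -> double& {
        return S[(z * (dimY + 1) + y) * (dimX + 1) + x];
    };
    for (int z = 1; z <= dimZ; ++z)
        for (int y = 1; y <= dimY; ++y)
            for (int x = 1; x <= dimX; ++x)
                s(x, y, z) = g[((z - 1) * dimY + (y - 1)) * dimX + (x - 1)]
                           + s(x - 1, y, z) + s(x, y - 1, z) + s(x, y, z - 1)
                           - s(x - 1, y - 1, z) - s(x - 1, y, z - 1)
                           - s(x, y - 1, z - 1) + s(x - 1, y - 1, z - 1);
    return S;
}

// Sum over the half-open box [x0,x1) x [y0,y1) x [z0,z1), 8 lookups (O(1)).
double boxSum(const std::vector<double>& S, int dimX, int dimY,
              int x0, int y0, int z0, int x1, int y1, int z1) {
    auto s = [&](int x, int y, int z) {
        return S[(z * (dimY + 1) + y) * (dimX + 1) + x];
    };
    return s(x1, y1, z1) - s(x0, y1, z1) - s(x1, y0, z1) - s(x1, y1, z0)
         + s(x0, y0, z1) + s(x0, y1, z0) + s(x1, y0, z0) - s(x0, y0, z0);
}
```

Building the table once over the search volume (and a second table over the squared gray values) then yields ḡ and Σ g² for every template position at constant cost per position.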

Appendix B. 3D Least-Squares Matching with Fixed Radiometric Parameters
Appendix B.2. Computational Effort

Appendix B.2.1. Direct Method
As above, for each voxel, the following terms are computed and saved in temporary variables because they are used several times: the nine gradient-coordinate products gxx_i, gxy_i, gxz_i, gyx_i, gyy_i, gyz_i, gzx_i, gzy_i, gzz_i (as in Equation (29)) and l_i = f_i − r_0 − r_1·g_i. 60 terms of the matrices G_11, G_12, G_13, G_22, G_23 and G_33 (Equation (30)) and 12 terms of A^T · l are computed. To evaluate the computational effort of the direct method, the numbers of additions and multiplications are counted. Only the operations per voxel are considered; the other operations can be neglected.

Appendix B.2.2. Comparison
Table A1 shows the number of computations per voxel for the direct and the standard method. Using the direct method, 20.4% fewer multiplications and 19.6% fewer additions are performed.

Table A1. Comparison of the computational effort between the direct computation of the normal equations and the standard method (affine mathematical model with fixed radiometric parameters).

Method     Multiplications    Additions
Standard   103                92
Direct     82                 74

Appendix C. 3D Least-Squares Matching without Radiometric Parameters

The right-hand side of the normal equations simplifies to:

A^T · l = Σ_i df_i · A_i^T

Similar to the case above, for each voxel, the following terms are computed and saved in temporary variables because they are used several times:

gxx_i = g_x,i·x̃_i,  gxy_i = g_x,i·ỹ_i,  gxz_i = g_x,i·z̃_i,
gyx_i = g_y,i·x̃_i,  gyy_i = g_y,i·ỹ_i,  gyz_i = g_y,i·z̃_i,
gzx_i = g_z,i·x̃_i,  gzy_i = g_z,i·ỹ_i,  gzz_i = g_z,i·z̃_i,
df_i = f_i − g_i   (A22)

60 terms of the matrices G_11, G_12, G_13, G_22, G_23 and G_33 (Equation (30)) and 12 terms of A^T · l are computed. To evaluate the computational effort of the direct method, the numbers of additions and multiplications are counted. Only the operations per voxel are considered; the other operations can be neglected. For the standard method, 99 multiplications and 91 additions have to be computed per voxel.
Appendix C.2.3. Comparison
Table A2 shows the number of computations per voxel for the direct and the standard method. Using the direct method, 18.2% fewer multiplications and 19.8% fewer additions are performed.

Table A2. Comparison of the computational effort between the direct computation of the normal equations and the standard method (affine mathematical model without radiometric parameters).

Method     Multiplications    Additions
Standard   99                 91
Direct     81                 73