Spatiotemporal Correlation-Based Accurate 3D Face Imaging Using Speckle Projection and Real-Time Improvement

The reconstruction of 3D face data is widely used in the fields of biometric recognition and virtual reality. However, the rapid acquisition of 3D data is plagued by problems of reconstruction accuracy, slow speed, excessive scenes and contemporary reconstruction technology. To solve this problem, an accurate 3D face-imaging implementation framework based on coarse-to-fine spatiotemporal correlation is designed, improving the spatiotemporal correlation stereo matching process and accelerating the processing using a spatiotemporal box filter. The reliability of the reconstruction parameters is further verified in order to resolve the conflict between measurement accuracy and time cost. A binocular 3D data acquisition device with a rotary speckle projector is used to continuously and synchronously acquire an infrared speckle stereo image sequence for reconstructing an accurate 3D face model. Based on face mask data obtained by a high-precision industrial 3D scanner, the relationship between the number of projected speckle patterns, the matching window size, the reconstruction accuracy and the time cost is quantitatively analysed. An optimal combination of parameters is used to achieve a balance between reconstruction speed and accuracy. Furthermore, to overcome the long acquisition time caused by the switching of the rotary speckle pattern, a compact 3D face acquisition device using a fixed three-speckle projector is designed. Using the optimal combination parameters of the three speckles, a parallel pipeline strategy is adopted between the core processing units to maximise system resource utilisation and data throughput. The most time-consuming step, spatiotemporal correlation stereo matching, is accelerated by the graphics processing unit. The results show that the system achieves real-time image acquisition, as well as 3D face reconstruction, while maintaining acceptable systematic precision.
speckle patterns yields better measurement accuracy, while the optimal matching window will continue to shrink and the computation time will increase slightly. To meet the requirement of rapid 3D face imaging under slight motion, this study provides a compact 3D face acquisition device with three fixed speckle projectors. Using the optimal combination parameters for three-speckle stereo images, the spatiotemporal correlation stereo-matching process, which takes the longest time among the components, is accelerated by the GPU, and a parallel pipeline strategy is used between the core processing units, for the purpose of real-time data acquisition and real-time 3D reconstruction. In the future, the authors intend to apply the proposed scheme to 3D face verification and recognition in practical scenarios.


Introduction
In the fields of computer vision and computer graphics, acquiring, modelling and synthesising three-dimensional (3D) human faces has become an active research topic. In particular, 3D face models have been used in various situations, such as medical plastic surgery [1], 3D face recognition [2], entertainment [3] and artistic rendering [4]. Hence, 3D face reconstruction has attracted widespread attention.
Existing image-based 3D face reconstruction methods have two main development directions. One emerges from the perspective of object measurement, which usually requires special image acquisition equipment. The technology includes multiview stereo vision [5], structured light [6] and time-of-flight [7]. Predicting a 3D face model from a single image is another important research area. This method is driven by the prior data of the face, which constructs a 3D model based on a learning method. The 3D deformable model [8], the "shape-from-shading" method [9,10], and the convolutional neural network (CNN) regression [11][12][13][14] are the most common techniques.
Shape-measurement technology based on structured light encoding [6] has been widely used in the field of 3D face reconstruction. There are two mainstream structured can measure slow-moving objects. Meanwhile, temporal and spatial encoding schemes are designed according to different speed and accuracy requirements. Große et al. [31] combined the characteristics of the two methods: a pair of homologous points is found through the temporal correlation of the grey value sequence, and spatial correlation is used to verify the correctness of the homologous points. Under equivalent accuracy, the model can reduce the spatial correlation area, which makes it suitable for the surface reconstruction of objects with large fluctuations. Above all, it reduces the number of images required for 3D reconstruction, which makes it possible to reconstruct dynamic scenes. Harendt et al. [32] proposed a weighted spatiotemporal correlation method to reconstruct static or moving objects, in which a weight value adjusts the spatial and temporal parameters of the matching region. Tang et al. [33] evaluated the relationship between the size of the correlation area and the number of projection patterns. Their experiments showed that selecting an appropriate window and a reasonable number of speckle patterns produced more accurate results. However, they analysed only dumbbell gauges, and their results do not generalise to face modelling. The 3D measurement system proposed by Zhou et al. [34] consists of an expensive industrial camera and a motor-driven projection device with high-density speckles to ensure the integrity and accuracy of the 3D face. However, the device is too large and expensive for routine use, and the matching procedure is not optimised. Its computation speed is slow, and it cannot meet the application requirements of a portable 3D face recognition system. Fu et al. [35] projected a visible-light speckle pattern on the human face and utilised a spatiotemporal stereo matching algorithm to achieve fast 3D face reconstruction. However, only three frames of speckle patterns were projected, and the selected parameters were not optimally evaluated, which limits reconstruction accuracy.
To quickly obtain a high-precision real 3D face model, the authors conducted research on measurement methods and parallel acceleration processing strategies. The contributions of this study are threefold:
1. A framework for implementing a high-precision 3D face imaging technique with a coarse-to-fine spatiotemporal correlation is designed. A spatiotemporal box filter is used to improve the spatiotemporal correlation stereo matching process and accelerate the calculation.
2. The relationship between the number of projected speckle patterns, matching window size, reconstruction accuracy and time cost in the spatiotemporal correlation 3D face imaging method is illuminated through experiments. The optimal combination of parameters that balances reconstruction speed and accuracy is obtained. This provides a reference for the subsequent real-time, high-precision 3D reconstruction.
3. Based on previous research, a compact 3D face acquisition system using a fixed three-speckle projection method is designed. Using the optimal combination parameters for three-speckle stereo images, a parallel pipeline strategy is adopted between each core functional unit to maximise system resource utilisation and data throughput. The most time-consuming spatiotemporal correlation stereo-matching process is accelerated by the graphics processing unit (GPU). Experiments verify that the entire system achieves the goals of real-time data acquisition and real-time 3D reconstruction while maintaining high-precision modelling.

Stereo Matching
Appl. Sci. 2021, 11, 8588
Stereo matching [36] is the core procedure of the 3D reconstruction method for binocular vision. The goal of stereo matching is to find the corresponding homologous point pairs in the two views and to acquire the disparity. The disparity of the corresponding pixels is estimated by establishing a similarity measurement function and optimising the correlation value [37] in order to obtain the depth. According to epipolar geometry [36], stereo matching constrains and simplifies the complex 2-dimensional (2D) search problem into a simple 1-dimensional search. The matching process uses the "winner takes all" (WTA) criterion to estimate the disparity of each pixel. General similarity measurement functions in stereo-matching algorithms include the sum of absolute differences, the sum of squared differences and the zero-mean normalised cross-correlation (ZNCC).
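The one-dimensional search with a WTA decision can be illustrated with a minimal sketch. It assumes rectified images, so that a left pixel at column x corresponds to a right pixel at column x − d; the function names (`sad`, `ssd`, `zncc`, `wta_disparity`) and the toy rows are illustrative assumptions and are not taken from the authors' implementation.

```python
from math import sqrt

def sad(a, b):
    """Sum of absolute differences (lower is better)."""
    return sum(abs(p - q) for p, q in zip(a, b))

def ssd(a, b):
    """Sum of squared differences (lower is better)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def zncc(a, b):
    """Zero-mean normalised cross-correlation in [-1, 1] (higher is better)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = sqrt(sum((p - mean_a) ** 2 for p in a) * sum((q - mean_b) ** 2 for q in b))
    return num / den if den > 0 else 0.0

def wta_disparity(left_row, right_row, x, half_win, d_min, d_max):
    """Winner-takes-all search along one rectified row using ZNCC."""
    ref = left_row[x - half_win : x + half_win + 1]
    best_d, best_score = None, -2.0
    for d in range(d_min, d_max + 1):
        xr = x - d  # candidate column in the right image
        if xr - half_win < 0 or xr + half_win >= len(right_row):
            continue  # window would leave the image
        score = zncc(ref, right_row[xr - half_win : xr + half_win + 1])
        if score > best_score:
            best_d, best_score = d, score
    return best_d, best_score

# Toy data: the left row is the right row shifted by 3 pixels, with a gain and an offset.
right_row = [12, 80, 33, 91, 5, 47, 68, 22, 75, 39, 14, 96, 58, 3, 84,
             27, 61, 9, 72, 44, 18, 88, 36, 53, 7, 66, 29, 93, 41, 16]
left_row = [0, 0, 0] + [2 * v + 10 for v in right_row[:-3]]
best_d, best_score = wta_disparity(left_row, right_row, x=15, half_win=2, d_min=0, d_max=8)
```

Because ZNCC subtracts the window mean and normalises by the window energy, the gain and offset applied to the toy left row do not affect the selected disparity, which is the property that makes it attractive when the two cameras differ in brightness.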

Stereo-Matching Method Based on Spatiotemporal Correlation
A diagram of stereo matching based on the spatiotemporal correlation is shown in Figure 1. A series of time-varying speckle patterns are projected onto the surface of the measured object, and two cameras synchronously acquire a set of speckle-coded stereo image pairs. To implement the speckle spatiotemporal correlation calculation, the projector must project multiple different speckle patterns continuously. Most current solutions use commercial projectors [30-33], but they are too large to meet the requirements of 3D face imaging equipment production. Given these considerations, our group designed a 3D face acquisition device based on a rotating speckle projector, which consists of a light-emitting diode (LED) light source, multiple sets of lenses and speckle diaphragms, and which can project any number of speckle patterns. A schematic of a 3D face acquisition device with a rotary speckle projector is shown in Figure 2. After the binocular camera acquires a pair of stereo images, the speckle mask rotates by a certain angle and stops. The camera then acquires the next pair of images. In the experimental part of this study, the equipment is employed to acquire multiple speckle stereo image pairs to analyse the relationship between the number of projection patterns, 3D reconstruction accuracy and calculation speed.

When performing spatiotemporal stereo matching, the matching positions of homologous points are obtained by calculating the correlation coefficient between two cubes composed of correlation windows in the spatial and temporal domains. The stereo-matching method based on the spatiotemporal correlation is shown in Figure 3. Taking p, a point to be matched in the left image, as the centre, a rectangular region, Ω, having a width of $w_x$ and a length of $w_y$, is extended along the sampling-time axis to form a cube. A cube of the same size from the right camera is chosen to perform the spatiotemporal correlation operation with the volume around p.
According to the epipolar constraint of binocular vision, the corresponding points in the left and right views are searched on a line parallel to the baseline. The similarity measure adopted for spatiotemporal correlation is the spatiotemporal zero-mean normalised cross-correlation (STZNCC) [34]. This function adapts well to the characteristics of speckle images and effectively reduces the impact of inconsistent brightness between the left and right images. The STZNCC expression is defined as follows:

$$C_{STZNCC}(x, y, d) = \frac{\sum_{l,h,t} \left[ L_{(x,y)}(l,h,t) - \bar{L}_{(x,y)} \right] \left[ R_{(x-d,y)}(l,h,t) - \bar{R}_{(x-d,y)} \right]}{\sqrt{\sum_{l,h,t} \left[ L_{(x,y)}(l,h,t) - \bar{L}_{(x,y)} \right]^2 \sum_{l,h,t} \left[ R_{(x-d,y)}(l,h,t) - \bar{R}_{(x-d,y)} \right]^2}} \quad (1)$$

$$\bar{L}_{(x,y)} = \frac{1}{w_x w_y N} \sum_{l,h,t} L_{(x,y)}(l,h,t) \quad (2)$$

$$\bar{R}_{(x-d,y)} = \frac{1}{w_x w_y N} \sum_{l,h,t} R_{(x-d,y)}(l,h,t) \quad (3)$$

where $(x, y)$ denotes the pixel coordinates, $d$ is the disparity and $C_{STZNCC}(x, y, d)$ is the correlation coefficient at disparity $d$ in the right image. $L_{(x,y)}(l,h,t)$ and $R_{(x-d,y)}(l,h,t)$ denote the pixel intensities in the matching windows of the $t$-th left and right images, respectively; $\bar{L}_{(x,y)}$ and $\bar{R}_{(x-d,y)}$ denote the average intensities over the left and right matching windows, respectively; and $N$ is the total number of speckle stereo image pairs. The sums range $l$ over $[-w_x/2, w_x/2]$, $h$ over $[-w_y/2, w_y/2]$ and $t$ over $[1, N]$.
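As an illustration of the quantities just defined, the following sketch evaluates a zero-mean normalised cross-correlation over a spatiotemporal cube. It is a direct, unoptimised evaluation under two assumptions that are ours rather than the paper's: the window sizes are odd, and the means are taken over the whole cube (all N frames). The function name `stzncc` and the frame layout `frame[row][col]` are illustrative.

```python
from math import sqrt

def stzncc(left_seq, right_seq, x, y, d, wx, wy):
    """Correlation between the left cube centred on (x, y) and the right cube
    centred on (x - d, y). Each sequence is a list of N frames, and each frame
    is a list of rows indexed as frame[row][col]."""
    hx, hy = wx // 2, wy // 2
    lv, rv = [], []
    for lf, rf in zip(left_seq, right_seq):      # t = 1 .. N
        for h in range(-hy, hy + 1):             # vertical offset in the window
            for l in range(-hx, hx + 1):         # horizontal offset in the window
                lv.append(lf[y + h][x + l])
                rv.append(rf[y + h][x - d + l])
    n = len(lv)
    l_mean, r_mean = sum(lv) / n, sum(rv) / n
    num = sum((a - l_mean) * (b - r_mean) for a, b in zip(lv, rv))
    den = sqrt(sum((a - l_mean) ** 2 for a in lv) * sum((b - r_mean) ** 2 for b in rv))
    return num / den if den > 0 else 0.0
```

Calling `stzncc` for every candidate `d` on a row and keeping the largest value gives the WTA disparity; a practical implementation would replace the explicit sums with precomputed window sums.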
After rectification, the matching-cost computation of homologous points can be performed on the same line. The WTA criterion is used to obtain the final disparity by selecting the best cumulative matching cost.

Coarse-to-Fine Spatiotemporal Correlation Computation Scheme
As shown in Figure 4, the proposed scheme uses binocular stereo [36] as the basic architecture. The infrared speckle projector continuously projects a set of binary random speckle patterns onto the measured face, and the binocular infrared camera simultaneously collects stereo image pairs. Because the camera parameters are obtained in advance through Zhang's calibration method [38], the image pairs are rectified according to these parameters to ensure that homologous points are searched and matched along the epipolar line. During the disparity computation procedure, the coarse-to-fine spatiotemporal correlation algorithm is adopted, and a spatiotemporal box filter (STBF) is used to accelerate the correlation computation. After the subpixel disparity computation is completed, a fine-grained 3D face model can be obtained.
During the process of disparity computation in this study, a coarse-to-fine two-level spatiotemporal correlation matching strategy based on a grid search is adopted.

Coarse Disparity Estimation
The left images are used as reference images and the right images as targets. The reference images are divided into small horizontal and vertical grids by a fixed step interval, and the disparity at each cross point in the reference image is obtained by searching for the matching point in the target image. After the correlation threshold is set, the pixel with the maximum correlation value is identified as the homologous point. A smaller matching window speeds up the matching-point search, but a larger window greatly reduces the matching error rate because a larger window is easier to match unambiguously. At this stage, the coarse disparity of each grid point is only used as a guide value for the subsequent fine disparity calculation; hence, a relatively large correlation window is chosen to reduce the matching error rate. The grid-point interval is consistent with the size of the coarse matching window in order to ensure full coverage of the initial disparity. The search range $[d^{c}_{\min}, d^{c}_{\max}]$ of the disparity is determined by the depth region of the measured object. Disparity consistency and sequential constraints are employed to remove outliers. To further reduce the disparity computation, a disparity propagation strategy [39] is introduced to narrow the search range to a smaller radius, $r^{c}_{p}$. Because the measurement target is a human face, the disparity satisfies the constraint of being continuous and slowly changing. Therefore, once a valid matching disparity has been obtained, the disparity search range is reduced to a certain interval when searching for adjacent matching points. Note that the correlation computation at the current stage may still leave some disparities missing, thereby creating holes. It is thus necessary to interpolate the disparity map to fill in the missing disparities. According to the aforementioned continuity assumption, once the disparity of a reliable grid point, $p_g$, has been acquired, the disparity of its next adjacent point can be initialised using the disparity of $p_g$ by up-sampling.
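The propagation and hole-filling steps can be sketched for a single row of grid points. This is a simplified illustration under our own assumptions: `score(x, d)` is a hypothetical callable returning the correlation at grid column `x` and disparity `d`, the last valid disparity is propagated only along the row, and holes are filled by linear interpolation (the source does not specify the interpolation scheme).

```python
def propagated_range(prev_d, radius, d_min, d_max):
    """Search range for the next grid point: the full range if there is no
    valid neighbour yet, otherwise prev_d +/- radius clipped to the full range."""
    if prev_d is None:
        return d_min, d_max
    return max(d_min, prev_d - radius), min(d_max, prev_d + radius)

def coarse_row(score, xs, d_min, d_max, radius, threshold):
    """Coarse disparities for grid columns xs; None marks a rejected match (hole)."""
    out, prev = [], None
    for x in xs:
        lo, hi = propagated_range(prev, radius, d_min, d_max)
        best_d = max(range(lo, hi + 1), key=lambda d: score(x, d))
        if score(x, best_d) >= threshold:
            out.append(best_d)
            prev = best_d          # propagate the last valid disparity
        else:
            out.append(None)
    return out

def fill_holes(values):
    """Fill None entries by linear interpolation between valid neighbours;
    holes at either end copy the nearest valid value."""
    valid = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    if not valid:
        return out
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out
```

Narrowing the range from `d_max - d_min + 1` candidates to `2 * radius + 1` is where the saving comes from; the correlation threshold decides which grid points are trusted enough to seed their neighbours.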

Fine Disparity Estimation
Following the computation of the coarse disparity, $d^{c}_{p}$, the square area centred on each grid point in the disparity map is filled with the disparity of that grid point. Because the coarse disparity map provides the initial position for the fine-matching search, it is only necessary to calculate the fine disparity, d, in the neighbourhood of that position. Because the final disparity is determined by the result of fine matching, the fine matching window size directly determines the accuracy of the 3D face reconstruction. To ensure measurement accuracy, a smaller correlation window is used, which also effectively reduces the computation cost.
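The two operations described here, filling squares around grid points and searching near the initial value, might look as follows. The sketch assumes that grid point (i, j) sits at pixel (i·step, j·step) and that the fine search radius is a free parameter; both the layout and the names `upsample_coarse` and `fine_search` are our assumptions, since the source does not give them.

```python
def upsample_coarse(grid, step, width, height):
    """Dense initial disparity map: every pixel takes the coarse disparity of
    its nearest grid point, i.e. a step x step square centred on that point."""
    rows, cols = len(grid), len(grid[0])
    dense = []
    for y in range(height):
        j = min((y + step // 2) // step, rows - 1)
        dense.append([grid[j][min((x + step // 2) // step, cols - 1)]
                      for x in range(width)])
    return dense

def fine_search(score, d_init, radius):
    """Best disparity within d_init +/- radius; score(d) is a hypothetical
    callable returning the fine-window correlation at disparity d."""
    return max(range(d_init - radius, d_init + radius + 1), key=score)
```

For example, `fine_search(lambda d: -(d - 23) ** 2, 22, 2)` evaluates only the five candidates 20 to 24 instead of the full disparity range.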

Disparity Selection and Sub-Pixel Disparity Refinement
The disparity selection is performed using the WTA strategy, wherein the disparity corresponding to the maximum correlation coefficient is selected, as follows:

$$d_{int} = \arg\max_{d} C_{STZNCC}(x, y, d) \quad (4)$$

Once the integer pixel position, $(x - d_{int}, y, N)$, having the largest correlation value has been found in the target image as the best candidate point, a quadratic curve is fitted to five points centred on the matching point. The matched sub-pixel disparity, $d_{sub}$, is computed from the coordinate of the extreme point of the quadratic curve, where a and c are the fitting coefficients, and $(x - d_{sub}, y, N)$, the coordinate of the maximum of the fitted curve, is the new matching position in place of $(x - d_{int}, y, N)$. Finally, the desired disparity map is acquired through post-processing (e.g., filtering out outliers and smoothing the surface).
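A five-point quadratic refinement can be sketched as a least-squares parabola fit. The closed-form coefficients below follow from fitting $C(k) = a k^2 + b k + c$ at offsets $k = -2, \dots, 2$ around $d_{int}$; this notation and the clamping of the offset are our assumptions and need not match the coefficient naming or the exact formula used in the paper.

```python
def subpixel_disparity(d_int, scores):
    """Refine d_int using correlation values sampled at d_int-2 .. d_int+2.

    Least-squares parabola C(k) = a*k^2 + b*k + c over k = -2..2:
        b = sum(k*C) / 10
        a = (sum(k^2*C) - 2*sum(C)) / 14
    The vertex lies at k* = -b / (2a)."""
    if len(scores) != 5:
        raise ValueError("expected five correlation samples")
    ks = (-2, -1, 0, 1, 2)
    s0 = sum(scores)
    s1 = sum(k * v for k, v in zip(ks, scores))
    s2 = sum(k * k * v for k, v in zip(ks, scores))
    a = (s2 - 2.0 * s0) / 14.0
    b = s1 / 10.0
    if a >= 0.0:
        return float(d_int)            # no maximum: keep the integer disparity
    offset = -b / (2.0 * a)
    offset = max(-1.0, min(1.0, offset))  # keep the vertex near the integer peak
    return d_int + offset
```

With samples drawn from a parabola whose peak lies 0.3 pixels to the right of `d_int = 40`, the function returns a value close to 40.3.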

Spatiotemporal Box Filter
A box filter is a fast filtering algorithm computed recursively in 2D space. The basic idea is to make full use of previous results during the loop computation so that the next result can be obtained with only a small computational effort. The typical implementation is a sliding window: the summation is accelerated by adding the pixels that move into the window and subtracting those that move out, so the computation is independent of window size. Because it eliminates redundant computation, the box filter is more efficient than the integral image method [35]. The box filter reduces the complexity of the summation operation from $O(w_x w_y)$ to $O(1)$. The traditional box filter is used only in the spatial domain; here, it is extended to the temporal domain. The resulting STBF is shown schematically in Figure 5.
The accelerated computation of the STBF in the row and column directions is given in Equations (6) and (7), where S(x, y) is the sum of grey values in the spatiotemporal rectangular window centred on (x, y). To perform fast matching-cost computations over the entire disparity cube, the sum and the sum of squares of the grey values in the matching window in Equation (1) must be calculated in advance by the STBF.
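The sliding-window idea behind the STBF can be sketched as follows. This is a minimal illustration under our own assumptions (odd window sizes, a temporal window spanning all N frames, no border handling), not a reproduction of Equations (6) and (7); the function name `stbf_sum` is hypothetical.

```python
def stbf_sum(frames, wx, wy):
    """Window sums S(x, y) over a wx x wy x N spatiotemporal box.

    frames: list of N images indexed as frame[row][col].
    Returns a map of the same size; border pixels without a full window are None."""
    h, w = len(frames[0]), len(frames[0][0])
    hx, hy = wx // 2, wy // 2
    # Temporal pass: accumulate the N frames pixel by pixel.
    acc = [[sum(f[y][x] for f in frames) for x in range(w)] for y in range(h)]
    # Row pass: slide along x, adding the entering pixel and removing the leaving one.
    rows = [[None] * w for _ in range(h)]
    for y in range(h):
        s = sum(acc[y][0:wx])
        rows[y][hx] = s
        for x in range(hx + 1, w - hx):
            s += acc[y][x + hx] - acc[y][x - hx - 1]
            rows[y][x] = s
    # Column pass: slide along y over the row sums.
    out = [[None] * w for _ in range(h)]
    for x in range(hx, w - hx):
        s = sum(rows[y][x] for y in range(wy))
        out[hy][x] = s
        for y in range(hy + 1, h - hy):
            s += rows[y + hy][x] - rows[y - hy - 1][x]
            out[y][x] = s
    return out
```

Each output value after the first in a row or column costs one addition and one subtraction regardless of `wx` and `wy`. Passing squared intensities through the same function yields the sum of squares needed for the normalisation term.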

Real-Time Acquisition and Reconstruction of 3D Face
To achieve the goal of real-time acquisition and real-time reconstruction of 3D faces, the authors consider optimising the system from two aspects: shortening the image acquisition time and increasing the speed of 3D reconstruction. In the previous design, a rotary speckle projector was used to construct a spatiotemporal correlation 3D face acquisition device. Owing to mechanical inertia, it was impossible to ensure that the rotary speckle mask could be switched quickly. At the same time, because the target is a face, the image acquisition time should be shortened as much as possible to limit the loss of measurement accuracy caused by face movement. Therefore, multiple fixed speckle projectors are proposed to replace the rotary speckle projector for fast acquisition of 3D faces; the feasibility of the idea is verified through subsequent experimental analysis.
To improve the computing performance of the system, the authors decompose the steps of 3D face reconstruction into multiple sub-function modules, each of which is executed in parallel in a pipeline. The continuous acquisition and display procedure of the 3D face based on a fixed speckle projection is shown in Figure 6. The entire pipeline is divided into four functional modules: image data acquisition, epipolar rectification and disparity computation, point-cloud generation and point-cloud display. The four modules create threads that work in a synchronous and parallel pipeline mode so that each component of the hardware system is utilised as fully as possible. Three buffer queues are inserted between the four modules: the image, disparity and point-cloud queues. Among the modules, epipolar rectification and disparity computation is the core of the entire operation and costs the most time. Therefore, this function is offloaded to the GPU to accelerate the computation. In cooperation with the high-speed image data acquisition hardware, real-time computation and display of the 3D face is achieved.
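The four-module, three-queue structure can be sketched with threads and bounded queues. This is a structural illustration only: the stage callables (`acquire`, `match`, `to_cloud`, `display`) are hypothetical placeholders, and Python threads do not reproduce the performance of the authors' C++ and GPU implementation.

```python
import queue
import threading

_STOP = object()  # sentinel that tells each stage to shut down

def stage(fn, q_in, q_out):
    """Consume items from q_in, apply fn and forward the result to q_out."""
    while True:
        item = q_in.get()
        if item is _STOP:
            if q_out is not None:
                q_out.put(_STOP)   # pass the shutdown signal downstream
            break
        result = fn(item)
        if q_out is not None:
            q_out.put(result)

def run_pipeline(n_frames, acquire, match, to_cloud, display, maxsize=4):
    """Acquisition -> rectification/disparity -> point cloud -> display."""
    image_q = queue.Queue(maxsize)
    disparity_q = queue.Queue(maxsize)
    cloud_q = queue.Queue(maxsize)

    def producer():
        for i in range(n_frames):
            image_q.put(acquire(i))
        image_q.put(_STOP)

    threads = [
        threading.Thread(target=producer),
        threading.Thread(target=stage, args=(match, image_q, disparity_q)),
        threading.Thread(target=stage, args=(to_cloud, disparity_q, cloud_q)),
        threading.Thread(target=stage, args=(display, cloud_q, None)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Bounded queues apply back-pressure: if disparity computation is the slowest stage, acquisition blocks once the image queue is full instead of accumulating frames without limit.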

Setup
Experiments were conducted on a desktop PC equipped with an Intel Core i7-9700K CPU (3.6 GHz), 8 GB of RAM and an NVIDIA GeForce GTX 1660 Ti GPU. The spatiotemporal correlation matching algorithm was implemented in C++ with Visual Studio 2015. Several function libraries (i.e., OpenMP, OpenCL and OpenGL) were used to develop our algorithm. The setup with a rotary speckle projector for capturing infrared (IR) speckle images is shown in Figure 7. The tested object was placed 500 mm directly in front of the projector. The device has two 200-Hz infrared cameras with a baseline length of 120 mm. To avoid interference with human eyes and reduce the absorption of infrared light by the skin, the device was equipped with a 730-nm near-IR LED as speckle projection illumination, which is not perceptible to humans. The IR cameras capture images simultaneously, whereas the rotary speckle projector projects speckle-encoded patterns onto a facial mask sequentially. The number of projected patterns is consistent with the number of infrared stereo image pairs collected. The projector comprises an infrared light source, micromotor, pattern mask and lens group. The speckle pattern mask was generated by pseudo-random encoding. When the binocular camera acquires images synchronously, it is necessary to ensure that the speckle mask in the rotary projector remains stationary. The motor then drives the mask gear to rotate by a certain angle (e.g., 9°) so that the speckle patterns of two consecutive image pairs are quite different and the consecutively collected speckle-encoded images are uncorrelated. To reduce the motion blur caused by the rotating mask, the next set of image pairs is collected after the mask is completely still. The specific relationship between the rotation angle and the accuracy of 3D reconstruction is being studied by other members of our group and is beyond the scope of this article.
When the experimental device is operating, the acquisition time of infrared images depends heavily on the number of speckle patterns required by the spatiotemporal correlation algorithm. The exposure time of the infrared cameras was 2 ms, and the speckle pattern rotatory switching time was 11 ms for each image pair. Therefore, N × 13 ms is required to acquire N pairs of stereo images. A 3D face acquisition device with a rotating speckle projector is shown in Figure 7.

To overcome the problem of an excessively long acquisition time caused by the switching of the rotary speckle pattern, a compact 3D face acquisition device using a fixed three-speckle projector was designed. A 3D face acquisition device with three fixed speckle projections is shown in Figure 8. The three fixed speckle projectors use different speckle masks and project their speckle patterns in sequence, and the two infrared cameras synchronously collect speckle-coded images in a time-sharing manner. The switching time of the speckle projection was thereby reduced significantly. In this system, the interval for taking a single pair of speckle images was 5 ms, so acquiring three pairs of stereo images cost 15 ms in total. The short exposure and fast acquisition make the system robust to interference and to slight movements of the human face during 3D reconstruction. The reason for adopting the structural design of three fixed speckle projections follows from the subsequent experimental analysis.
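The acquisition-time figures quoted for the two devices can be reproduced with a short calculation. The sketch below only restates the arithmetic given in the text (2 ms exposure plus 11 ms switching per pair for the rotary projector, and a 5 ms interval per pair for the fixed projectors); the function names are hypothetical.

```python
def rotary_acquisition_ms(n_pairs, exposure_ms=2.0, switch_ms=11.0):
    """Acquisition time with the rotary projector: N x (exposure + switching)."""
    return n_pairs * (exposure_ms + switch_ms)


def fixed_acquisition_ms(n_pairs, interval_ms=5.0):
    """Acquisition time with fixed projectors: N x per-pair interval."""
    return n_pairs * interval_ms


if __name__ == "__main__":
    for n in (3, 6, 12):
        print(n, "pairs, rotary projector:", rotary_acquisition_ms(n), "ms")
    print("3 pairs, fixed projectors:", fixed_acquisition_ms(3), "ms")
```

For three image pairs, the rotary device would need 39 ms under these figures, against the 15 ms reported for the fixed three-speckle device.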
Meanwhile, a high-precision industrial 3D scanner (ATOS Core 300, measurement accuracy ±0.02 mm), certified according to VDI/VDE 2634 Part 2 [40], was used to collect 3D face mask data as ground truth. The reconstructed 3D face model data acquired by our device were compared with the Core 300 data to obtain the measurement accuracy of our algorithms and equipment. Figure 9 shows a pair of speckle stereo images taken by our experimental device. Figure 10 shows the 3D scanned data gathered using the ATOS Core 300.
Appl. Sci. 2021, 11, x FOR PEER REVIEW
Figure 9. Pair of infrared images acquired by our device.

Evaluation on 3D Reconstruction Precision and Performance
From the computation theory of spatiotemporal correlation, it can be deduced that capturing more image pairs leads to higher accuracy. Conversely, capturing fewer image pairs shortens the acquisition time, so that object motion has only a weak impact on the measurement accuracy. Consequently, this study focuses on improving the measurement efficiency and optimising the compromise between the accuracy and speed of the algorithm.
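To make the role of the number of image pairs and the window size concrete, the following is a minimal brute-force sketch of a spatiotemporal matching cost. It assumes a zero-mean normalised cross-correlation (ZNCC) computed over a window that spans all N frames, which is one common formulation and not necessarily the exact cost used in this paper. It omits the coarse-to-fine strategy and the spatiotemporal box filter, and the function names are hypothetical. The caller must keep the window inside the image.

```python
import math


def st_zncc(left, right, x, y, d, half):
    """ZNCC between spatiotemporal windows at (x, y) and (x - d, y).

    left, right: lists of N rectified frames, each a 2D list indexed [row][col].
    half: half window size, so the spatial window is (2*half + 1) squared.
    """
    a, b = [], []
    for frame_l, frame_r in zip(left, right):      # temporal dimension
        for r in range(y - half, y + half + 1):    # spatial rows
            for c in range(x - half, x + half + 1):
                a.append(frame_l[r][c])
                b.append(frame_r[r][c - d])
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = math.sqrt(sum((p - mean_a) ** 2 for p in a)
                    * sum((q - mean_b) ** 2 for q in b))
    return num / den if den > 0 else 0.0


def best_disparity(left, right, x, y, half, d_min, d_max):
    """Return the integer disparity in [d_min, d_max] with the highest ZNCC."""
    return max(range(d_min, d_max + 1),
               key=lambda d: st_zncc(left, right, x, y, d, half))
```

In this formulation the number of samples per window is N × (2·half + 1)², which shows why a smaller spatial window can suffice when more frames are available, at the price of a longer acquisition.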
Several IR image pairs were captured under rotary speckle projection using the experimental device shown in Figure 7. The number of speckle stereo image pairs, N, ranged from 1 to 12, and the resolution was 1280 × 1024. The binocular infrared camera was calibrated using Zhang's method [38] to obtain the camera parameters, with which the acquired IR images were rectified to facilitate correlation computation. To obtain a higher measurement accuracy, different parameter combinations were used for different numbers of stereo image pairs: different matching window sizes were adopted for each N during coarse and fine matching. During coarse disparity calculation, usually, coarse window sizes, c
Figure 11. Three-dimensional reconstruction results for different numbers of stereo image pairs, N, and fine matching windows, w_f. The reconstruction accuracy continues to improve as N increases. When N is small (e.g., N ≤ 3), a larger window creates a more precise and complete model.
The impact of the projected pattern number, N, and the fine matching window, w_f, on the accuracy of 3D reconstruction was analysed quantitatively. The average error and standard deviation were used as the error evaluation criteria. The variation of the average error and standard deviation with N and w_f is shown in Figure 12. As N increases, the measurement error decreases accordingly. Meanwhile, the optimal correlation window does not remain static but shrinks as N increases. Figure 13 shows the disparity computation time for different N values as the fine matching window, w_f, is enlarged. When N is small (e.g., N ≤ 3), a larger window creates a more precise and complete model.
Figure 12. Comparison of the reconstruction error for different projected pattern numbers, N, and fine matching windows, w_f. As N increases, the reconstruction error decreases accordingly, and the optimal window size also drops.
Regarding the choice of matching window size, a larger window acts as a low-pass filter in shape measurement and suppresses rapid variations in the measured field, leading to a reduction in spatial resolution or a loss of high-frequency details, as well as an increase in computation cost. A comparison between the reconstruction results obtained by our proposed method and the scanned model using Core 300 is shown in Figure 14.
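The two evaluation criteria can be sketched as follows. The text does not state how the per-point deviations between the reconstructed model and the Core 300 scan are obtained or aggregated, so this sketch assumes that signed point-to-reference deviations (in mm) are already available, and it takes the average error as the mean absolute deviation and the standard deviation as the population standard deviation of the signed values. Both choices, and the function name, are assumptions.

```python
import statistics


def error_stats(signed_deviations_mm):
    """Return (mean absolute error, population standard deviation) in mm.

    signed_deviations_mm: per-point signed distances between the reconstructed
    surface and the reference scan, after the two models have been aligned.
    """
    mean_abs = statistics.fmean(abs(d) for d in signed_deviations_mm)
    std_dev = statistics.pstdev(signed_deviations_mm)
    return mean_abs, std_dev
```

Evaluating such a pair of values for every tested combination of N and w_f, and then picking the window with the smallest average error for each N, would yield curves of the kind summarised in Figure 12.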
From Table 1, it is found that the 3D reconstruction results have the following characteristics:
(1) In the current configuration environment, the reconstruction accuracy of the 3D model continues to improve as the number of speckle stereo image pairs, N, increases, but when the number of projected speckle patterns exceeds a certain range (N ≥ 6), the improvement in accuracy gradually weakens.
(2) To obtain a higher 3D reconstruction accuracy, the smallest fine matching window is not necessarily the best; rather, each number of patterns has its own optimal fine window size. For example, when N = 1, 9 × 9 is the optimal fine window size and gives the smallest measurement error; when N = 3, the optimal fine window size is 7 × 7. When N = 12, the optimal fine window size becomes 3 × 3, the average error is 0.071 mm, and the standard deviation is 0.091 mm. Overall, as the number of speckle stereo image pairs increases, the optimal matching window continues to shrink.
(3) The computation time increases as the matching window increases. The greater the number of projected patterns involved in the calculation, the longer the computation time. As shown in Figure 13, for a fixed number of speckle patterns, the computation time increases proportionally with the matching window size.
(4) There is a trade-off between measurement accuracy and calculation cost. When the average error is kept below 0.15 mm, the overall reconstruction error decreases as the number of projection patterns increases, and the optimal window size continues to shrink. Combining the analyses of Figures 12 and 13, it is found that, when the best measurement accuracy is sought for each number, N, of stereo image pairs, the computation time also increases slightly as the number of speckle patterns increases.
(1) Optimal matching window corresponding to the pattern number. (2) Minimum average error of fitting the face mask between the proposed method and Core 300. (3) Minimum standard deviation of fitting the face mask between the proposed method and Core 300. (4) Computation time on the CPU when the optimal matching window corresponding to N is used. (5) Computation time on the GPU when the optimal matching window corresponding to N is used.


Real-Time Improvement for Three Speckle Patterns Projection
Following the previous optimal parameter analysis, and weighing the measurement accuracy, image acquisition time, cost, volume, power consumption, use environment and other factors, a compact 3D face acquisition device with three fixed speckle projectors was adopted. Based on the optimum parameter combination for three speckle pattern projections obtained from the previous analyses, a coarse matching window of 11 × 11, a grid interval of 11 and a fine matching window of 7 × 7 were selected. These parameters were applied to the GPU-accelerated parallel computation of the spatiotemporal correlation. The multi-core parallel function library OpenMP was employed for dense point-cloud generation and triangulation of 3D surfaces. The dynamic real-time display of the point cloud was achieved using the OpenGL pipeline strategy. Image acquisition, epipolar rectification and disparity estimation, point-cloud generation and 3D display were implemented in parallel through a multithreaded pipeline. With the three-speckle pattern projection equipment, high-speed image acquisition and real-time 3D face reconstruction were achieved with an average error of 0.098 mm, a standard deviation of 0.133 mm, a 15-ms image acquisition time and a 35-ms disparity computation time.
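The benefit of running the stages as a pipeline rather than one after another can be estimated from the two stage times reported above. The sketch below is a back-of-the-envelope bound, not a measured frame rate: it assumes that the stages overlap perfectly and that the unreported stages (point-cloud generation and display) are no slower than the 35-ms disparity stage. The function names are hypothetical.

```python
def sequential_fps(stage_times_ms):
    """Frame rate if all stages run one after another for each frame."""
    return 1000.0 / sum(stage_times_ms)


def pipelined_fps(stage_times_ms):
    """Steady-state frame rate of an ideal pipeline, set by the slowest stage."""
    return 1000.0 / max(stage_times_ms)


if __name__ == "__main__":
    stages = [15.0, 35.0]  # acquisition and disparity computation, in ms
    print("sequential:", sequential_fps(stages), "fps")
    print("pipelined: ", round(pipelined_fps(stages), 1), "fps")
```

Under these assumptions, the throughput is bounded by the disparity stage alone, which is why that stage is the one offloaded to the GPU.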

Discussion
Regarding the implementation of the proposed method and the experimental process, several aspects need to be further explained and discussed.
(1) Measurement accuracy. The results obtained with the STBF-accelerated spatiotemporal correlation matching strategy proposed in this paper differ from the test results in [23]. First, [23] did not take the human face as the analysis object and did not use a coarse-to-fine matching strategy, and its calculation speed was lower than that of the method in this paper. Furthermore, a comparison of the calculation results shows that the relationship between the selected matching window size and the number of speckle patterns follows a different trend. In [23], when the number of projected speckle patterns exceeded three frames, changing the size of the matching window had only a small effect on the reconstruction accuracy. The test results in this study show that, as the number of projection patterns increases, the 3D reconstruction accuracy continues to improve; when the number of patterns is greater than six, the improvement gradually slows down. This difference may be caused by differences in the measurement object.
(2) Balance of measurement accuracy and time cost. A 3D face acquisition device with a rotating speckle projector is more suitable for scenes in which the accuracy requirements are strict and there is no clear limitation on the acquisition time. To obtain better reconstruction accuracy, more stereo image pairs are selected to participate in the spatiotemporal stereo-matching process, which is more suitable for scenes where objects remain stationary. According to the results of this study, six image pairs met the requirements for high-precision modelling. When reconstructing fast-moving objects, the authors reduced the number of stereo image pairs and employed 3D reconstruction equipment with fixed speckle projectors for 3D face image acquisition.
(3) Real-time acquisition, reconstruction and display. Using the single-shot speckle structure [17–19], the acquisition time of the stereo image pair is short, but real-time reconstruction is usually not achieved. Although [27] achieved a real-time reconstruction rate of 30 fps, its accuracy was only 0.55 mm, which is far below that of the proposed method. For 3D face acquisition equipment using fixed speckle projectors, the method in this study implements real-time image acquisition and real-time 3D face reconstruction. During the 3D data display process, the point-cloud structure can be displayed in real time through the OpenGL interface. However, if a texture-mapped triangular facet structure is adopted, the display is limited by OpenGL's utilisation of the low-level image cache, and the 3D data cannot be displayed in real time. As a next step, the authors will study an OpenGL parallel display strategy to achieve a more realistic display of 3D data.

Conclusions
To improve the accuracy and performance of 3D face imaging, an effective coarse-to-fine spatiotemporal stereo matching scheme using speckle pattern projection was proposed and accelerated by STBF. By comparing our 3D face reconstruction results with ground-truth data collected by high-precision industrial 3D scanning equipment, the impact of the number of projected speckle patterns and of the matching window size on measurement accuracy and reconstruction speed was investigated. Through quantitative and qualitative analyses of the experimental data, the optimum combination of parameters that balances reconstruction speed and accuracy was determined. It was demonstrated that the system can achieve an average measurement error of 0.071 mm and a standard deviation of 0.091 mm with N = 12 speckle stereo-image pairs. The experimental results showed that increasing the number of projected speckle patterns yields better measurement accuracy, while the optimal matching window continues to shrink and the computation time increases slightly. To meet the requirement of rapid 3D face imaging under slight motion, this study provides a compact 3D face acquisition device with three fixed speckle projectors. Using the optimal parameter combination for three speckle stereo images, the spatiotemporal correlation stereo-matching process, which takes the longest time among the components, is accelerated by the GPU, and a parallel pipeline strategy is applied across the core processing units, thereby achieving real-time data acquisition and real-time 3D reconstruction. In the future, the authors intend to apply the proposed scheme to 3D face verification and recognition in practical scenarios.
Author Contributions: Conceptualisation, W.X., K.F. and P.Z.; Data curation, J.Z. and P.Z.; Investigation, H.Y., J.Z. and P.Z.; Software, W.X. and P.Z.; Writing-original draft, W.X. and K.F.; Writing-review and editing, W.X. and K.F. All authors have read and agreed to the published version of the manuscript.