Spatiotemporal Matching Cost Function Based on Differential Evolutionary Algorithm for Random Speckle 3D Reconstruction

Abstract: Random speckle structured light can increase the texture information of the object surface, so it is added to the binocular stereo vision system to solve the matching ambiguity caused by surfaces with repetitive patterns or no texture. To improve the reconstruction quality, many current studies utilize multiple speckle patterns for projection and use stereo matching methods based on spatiotemporal correlation. This paper presents a novel random speckle 3D reconstruction scheme, in which multiple speckle patterns are used and a weighted-fusion-based spatiotemporal matching cost function (STMCF) is proposed to find the corresponding points in speckle stereo image pairs. Furthermore, a parameter optimization method based on the differential evolutionary (DE) algorithm is designed to automatically determine the values of all parameters included in STMCF. Since there is no suitable training data with ground truth, we explore a training strategy in which a passive stereo vision dataset with ground truth is used as training data, and the learned parameter values are then applied to the stereo matching of speckle stereo image pairs. Various experimental results verify that our scheme can realize accurate and high-quality 3D reconstruction efficiently, and that the proposed STMCF exhibits superior performance in terms of accuracy, computation time and reconstruction quality compared with the state-of-the-art method based on spatiotemporal correlation, which is widely used in random speckle 3D reconstruction. At present, STMCF cannot meet the requirements of real-time applications.
In future work, we intend to reduce the computation time by hardware acceleration and apply it to 3D reconstruction of the dynamic indoor scene.


Introduction
The three-dimensional (3D) reconstruction technique based on measurement has been widely applied in various fields, such as industrial manufacturing [1], protection of historical relics [2], architectural design [3] and digital entertainment [4]. It is an active and important research direction in the field of computer vision. The key point of this technique is how to acquire the depth information of the measured object. At present, measurement-based 3D reconstruction methods can be mainly divided into two classes: passive and active measurement methods.
The passive measurement methods generally obtain images with cameras from multiple views, and then the 3D spatial information of the measured object is calculated by a specific algorithm. Among them, binocular stereo vision is one of the most widely used methods due to its low cost and ease of implementation. It first finds the corresponding points between stereo image pairs captured by two cameras from different views, and then the depth information is recovered by triangulation. Stereo matching is a critical and challenging part of binocular stereo vision, as it determines the correctness of the corresponding points and the accuracy of the depth information. Many kinds of stereo matching algorithms have been proposed in the last decades [5][6][7][8][9][10]. However, there is still no good solution for repetitive patterns and textureless regions.
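The triangulation step mentioned above can be sketched as follows: for a rectified binocular rig, the depth of a matched point follows directly from its disparity, the focal length and the camera baseline. The function name and all numbers below are illustrative, not taken from the paper.

```python
# Sketch of depth recovery by triangulation in a rectified binocular rig.

def depth_from_disparity(focal_px: float, baseline_mm: float, disparity_px: float) -> float:
    """Depth (mm) of a matched point pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_mm / disparity_px

# e.g. a 1200 px focal length, 120 mm baseline and 40 px disparity:
z = depth_from_disparity(1200.0, 120.0, 40.0)  # 3600.0 mm
```

This also shows why accurate stereo matching matters: any error in the disparity propagates directly into the recovered depth.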
The active measurement methods usually emit optical signals onto the surface of the object. Both time of flight (TOF) [11] and structured light [12] are typical active approaches. The TOF camera transmits continuous near-infrared light to the target, and the reflected light is received by a specific sensor. By measuring the flight time of the light, the depth information is obtained. It has the advantages of fast response and low computational cost. Nevertheless, it suffers from low accuracy and low depth-map resolution.
The structured light technique also works on the principle of triangulation. Unlike passive binocular stereo vision, it improves the 3D reconstruction quality by projecting a pre-designed pattern onto the object. Fringe structured light [13] and random speckle structured light [14] are currently the two mainstream structured light encoding methods. Compared with the former, random speckle structured light does not need accurate phase shifting and the cost of the projection equipment is relatively low. Hence, it is widely applied in various commercial 3D sensors, such as Kinect [4] and RealSense [15]. As shown in Figure 1 (the photo in Figure 1 is a plastic human face mask, as are the photos in Figure 2), random speckle structured light is commonly used in combination with the binocular stereo vision system because it enriches the texture information of the object surface, which helps to find the matching points more accurately. In order to improve the reconstruction quality, many existing studies utilize multiple speckle patterns for projection and use stereo matching methods based on spatiotemporal correlation.
In this paper, a novel random speckle 3D reconstruction scheme is presented. A sequence of different speckle patterns is projected onto the measured object, and two cameras capture images from different views synchronously. The speckle pattern we use is designed by [16]. The schematic diagram of the scheme is shown in Figure 2 and the main contributions are as follows:
(1) A weighted-fusion-based spatiotemporal matching cost function (STMCF) is proposed for accurately searching the matching point pairs between the left and right speckle images.
(2) In order to automatically determine the value of each parameter used in STMCF, a parameter optimization method based on the differential evolutionary (DE) algorithm is designed and a training strategy is explored in this method.
(3) A series of comparative experimental results indicate that our scheme can achieve accurate and high-quality 3D reconstruction efficiently, and that the proposed STMCF provides better performance in terms of accuracy, computation time and reconstruction quality than the state-of-the-art method based on spatiotemporal correlation, which is widely used in random speckle 3D reconstruction.
The remainder of this paper is organized as follows. The related work is discussed in Section 2. Section 3 describes the STMCF. Disparity computation and subpixel refinement are given in Section 4. The parameter optimization method based on DE algorithm is presented in Section 5. Experiments and discussion are provided in Section 6 and finally, Section 7 concludes this paper.

Weighted-Fusion-Based Matching Cost Function
The main task of a matching cost function is to quantify the matching cost of corresponding points between stereo image pairs. Classical matching cost computation methods can be divided into three kinds: pixel-based methods, window-based methods and nonparametric methods. Each kind of method has its own disadvantages. Hence, a commonly used form of matching cost function is the weighted fusion of several different methods.
Mei et al. [17] effectively combined the absolute differences (AD) and the Census transformation [18], known as AD-Census. It provided more accurate matching results than either individual method. Hosni et al. [19] used the pixel-based truncated absolute differences of intensity and gradient as the matching cost, which was robust to illumination changes. Hamzah et al. [20] proposed a new matching cost function that combined three different similarity measures: AD, gradient and Census transformation. It could reduce radiometric distortions and the effect of illumination variations. More recently, Hong et al. [21] proposed a weighted-fusion matching cost function that included AD, squared difference (SD), Census and Rank transformations. In addition, they treated stereo matching as an optimization problem and used an evolutionary algorithm to automatically optimize the parameters in the cost function.
The above-mentioned weighted-fusion-based matching cost functions were all designed for passive binocular stereo vision, so they only considered the spatial feature and could not utilize the temporal feature brought by multiple speckle stereo image pairs.

3D Reconstruction Based on Binocular Stereo Vision with Random Speckle Projection
Since random speckle structured light can effectively improve the accuracy of stereo matching, a lot of research on 3D reconstruction based on binocular stereo vision with random speckle projection has been reported. According to the number of projected patterns, these methods can be mainly classified into two types.
The first type uses a single speckle pattern. Gu et al. [22] proposed an improved semi-global matching algorithm using single-shot random speckle structured light. In this algorithm, a new penalty term was defined to alleviate the ladder phenomenon. Although it could achieve dynamic 3D reconstruction, the measurement accuracy was relatively low. Yin et al. [23] designed an end-to-end stereo matching network for single-shot 3D shape measurement. It first used a multi-scale residual sub-network to extract feature tensors from speckle images, and then a lightweight 3D U-net was proposed to implement efficient cost aggregation. Zhou et al. [14] presented a high-precision 3D measurement scheme that used a single-shot color binary speckle pattern and an extended temporal-spatial correlation matching algorithm for the measurement of dynamic and static objects. However, when the object surface is colored, the measurement may be adversely affected.
The second type uses multiple different speckle patterns. To acquire high-quality results, many current studies employ stereo matching methods based on spatiotemporal correlation. Tang et al. [24] proposed an improved 3D reconstruction method that combines the advantages of both spatial correlation and temporal correlation methods. Meanwhile, by analyzing the relationship between the correlation window size and the number of speckle patterns, the tradeoff between accuracy and efficiency was found. Fu et al. [25] proposed a 3D face reconstruction scheme based on space-time speckle projection. To enhance computing efficiency, several optimization strategies were used, and a popular similarity metric, zero-mean normalized cross-correlation (ZNCC), was used to compute the matching cost. For measuring dynamic scenes that include static, slow and fast moving objects, Harendt et al. [26] proposed an adaptive spatiotemporal correlation method, where weights were introduced to automatically adjust the matched image regions.

STMCF
To find the matching point pairs between the left and right speckle images more accurately and make full use of both spatial and temporal features provided by multiple speckle stereo image pairs, we extend our previously proposed matching cost function [27] to the temporal domain and propose a weighted-fusion-based STMCF. It combines four different similarity metrics: AD, Census transformation [18], horizontal image gradient and vertical image gradient. Each of them is extended to the temporal domain, and their computation methods are detailed as follows.
Since the acquired speckle stereo image pairs are grayscale images, AD is computed using the gray values of two points in the left and right speckle images. For a given point m(x_m, y_m) in the left speckle image, the cost value C_AD(x_m, y_m, d, N) is computed by the following equation:

C_AD(x_m, y_m, d, N) = (1/N) Σ_{t=1}^{N} | I^t_Left(x_m, y_m) − I^t_Right(x_m − d, y_m) |,

where N is the number of speckle stereo image pairs, which is equal to the number of projected speckle patterns, I^t_Left(x_m, y_m) is the gray value of point m in the left image under the t-th speckle pattern projection and I^t_Right(x_m − d, y_m) is the gray value of the corresponding point in the right image under the t-th speckle pattern projection when the disparity is d.
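The temporal AD term described above can be sketched as follows; the array layout (a stack of N grayscale frames per camera) and the function name are assumptions of this sketch, not from the paper.

```python
import numpy as np

def cost_ad(left, right, x, y, d, N):
    """Temporal AD cost for candidate disparity d at point (x, y): the mean
    absolute gray-value difference over the N speckle image pairs, between
    (x, y) in the left images and (x - d, y) in the right images.
    left, right: arrays of shape (N, H, W) holding the N grayscale frames."""
    diffs = [abs(float(left[t, y, x]) - float(right[t, y, x - d])) for t in range(N)]
    return sum(diffs) / N
```

Averaging over the N pattern projections is what carries the temporal information into the cost.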
Census transformation [18] first obtains a binary code by comparing the gray value of the center pixel with those of the other pixels in a given window as follows:

S^t(x_m, y_m) = ⊗_{q ∈ w_m} ξ( I^t(x_m, y_m), I^t(q) ),   ξ(a, b) = 0 if b ≤ a, 1 otherwise,

where S^t(x_m, y_m) is the binary code of point m in the left image under the t-th speckle pattern projection, ⊗ represents the bitwise concatenation operation and w_m is the given window centered on point m.
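The Census transform, together with the Hamming-distance comparison used in the next step, can be sketched as below. The window size and the exact bit convention (which comparison yields a 1) are illustrative assumptions of this sketch.

```python
import numpy as np

def census_code(img, x, y, half):
    """Census transform at (x, y): one bit per neighbour in the
    (2*half+1) x (2*half+1) window, set when the neighbour is darker than
    the centre pixel (the bit convention is an assumption of this sketch)."""
    centre = img[y, x]
    bits = []
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            if dx == 0 and dy == 0:
                continue  # the centre pixel contributes no bit
            bits.append(1 if img[y + dy, x + dx] < centre else 0)
    return bits

def hamming(a, b):
    """Hamming distance: number of differing bits between two census codes."""
    return sum(u != v for u, v in zip(a, b))
```

Because the code only records gray-value orderings, it is insensitive to global brightness differences between the two cameras.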
Afterwards, the Hamming distance is utilized to measure the similarity between the corresponding points. The cost value C_census(x_m, y_m, d, N) is calculated by the following expression:

C_census(x_m, y_m, d, N) = (1/N) Σ_{t=1}^{N} Ham( S^t_Left(x_m, y_m), S^t_Right(x_m − d, y_m) ),

where Ham(·, ·) denotes the Hamming distance, and S^t_Left(x_m, y_m) and S^t_Right(x_m − d, y_m) are the binary codes of point m in the left image and of its candidate corresponding point in the right image under the t-th speckle pattern projection.

The horizontal and vertical image gradients are computed by our proposed method. In contrast to the traditional gradient computing method, besides the gradient of the speckle image itself, our method also takes the gradient of the guided image into account. The guided image is obtained using the guided filter [28] (GF) because of its high computational efficiency, good smoothing effect and edge-preserving property. The corresponding cost values are aggregated in the same manner as C_AD:

C_gx(x_m, y_m, d, N) = (1/N) Σ_{t=1}^{N} | gx^t_Left(x_m, y_m) − gx^t_Right(x_m − d, y_m) |,
C_gy(x_m, y_m, d, N) = (1/N) Σ_{t=1}^{N} | gy^t_Left(x_m, y_m) − gy^t_Right(x_m − d, y_m) |,

where I_L and I_R respectively represent the left and right speckle images, GI_L and GI_R denote the guided images of the left and right speckle images respectively, and gx^t(x_m, y_m) and gy^t(x_m, y_m) respectively indicate the horizontal and vertical grayscale gradients of point m under the t-th speckle pattern projection, computed from both the speckle image and its guided image.

After normalizing the cost values mentioned above, STMCF is acquired by their weighted fusion as follows:

C(x_m, y_m, d, N) = α·T(C_AD) + β·T(C_census) + γ·T(C_gx) + δ·T(C_gy),

where α, β, γ and δ are weight values that control the influence of each cost value, and C(x_m, y_m, d, N) indicates the final matching cost value of point m when the disparity is d and the number of speckle stereo image pairs is N. T represents the truncation function defined as

T(c) = min(c, τ),

where τ denotes the truncation threshold. The purpose of the truncation function is to improve the robustness against outliers.
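The weighted fusion with truncation can be sketched as follows. Whether each cost term carries its own truncation threshold, and where exactly normalization happens, are assumptions of this sketch rather than details confirmed by the paper.

```python
def truncate(cost, tau):
    """Truncation function T: clamp a cost at threshold tau to limit the
    influence of outliers."""
    return min(cost, tau)

def stmcf(c_ad, c_census, c_gx, c_gy, weights, taus):
    """Weighted fusion of the four (already normalized) cost terms.
    weights = (alpha, beta, gamma, delta); taus holds one truncation
    threshold per term, an assumption of this sketch."""
    alpha, beta, gamma, delta = weights
    t_ad, t_cen, t_gx, t_gy = taus
    return (alpha * truncate(c_ad, t_ad)
            + beta * truncate(c_census, t_cen)
            + gamma * truncate(c_gx, t_gx)
            + delta * truncate(c_gy, t_gy))
```

Truncating before weighting means a single wildly wrong term (e.g. from an occluded pixel) cannot dominate the fused cost.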

Disparity Computation and Subpixel Refinement
The winner-take-all [8] (WTA) strategy is used to determine the disparity of each point as follows:

d_int(x_m, y_m, N) = argmin_{d_min ≤ d ≤ d_max} C(x_m, y_m, d, N),

where d_int(x_m, y_m, N) denotes the integer disparity of point m when the number of speckle stereo image pairs is N, and d_min and d_max represent the minimum and maximum disparity, respectively. Afterwards, the subpixel disparity of each point is computed by the interpolation function based on the data histogram proposed in [29]. After simple post-processing (hole filling and surface smoothing), the final disparity map for 3D reconstruction is obtained.
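The WTA search can be sketched as a simple loop over candidate disparities; the toy cost function in the usage line is purely illustrative.

```python
def wta_disparity(cost_fn, x, y, d_min, d_max, N):
    """Winner-take-all: return the candidate disparity in [d_min, d_max]
    with the lowest matching cost at point (x, y)."""
    best_d, best_c = d_min, float("inf")
    for d in range(d_min, d_max + 1):
        c = cost_fn(x, y, d, N)
        if c < best_c:
            best_d, best_c = d, c
    return best_d

# Toy cost with its minimum at d = 7, for illustration only:
d_int = wta_disparity(lambda x, y, d, N: (d - 7) ** 2, 0, 0, 0, 20, 1)  # 7
```

In practice this loop runs once per pixel, so the cost of evaluating C(x_m, y_m, d, N) dominates the matching time.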

Parameter Optimization Method Based on DE Algorithm
There are 12 parameters in total in STMCF, as shown in Table 1. The values of these parameters directly affect the performance of STMCF. Inspired by [21], a parameter optimization method based on the DE algorithm is designed to automatically determine the value of each parameter. The flowchart of the method is illustrated in Figure 3 and each step is detailed in the following sections.
Appl. Sci. 2022, 11, x FOR PEER REVIEW

Population Initialization
According to the process of DE algorithm, the population P is initialized firstly. It includes T solutions. Each solution contains a set of randomly generated parameter values, which are all uniformly distributed within the corresponding possible range. Table 2 shows the data type and value range of all parameters.
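The initialization step can be sketched as below. The parameter ranges are placeholders standing in for the actual data types and ranges of Table 2, which this sketch does not reproduce.

```python
import random

# Placeholder ranges for the 12 STMCF parameters (NOT the values of Table 2):
# four weights, four truncation thresholds, four window/filter sizes.
PARAM_RANGES = [(0.0, 1.0)] * 4 + [(0.0, 100.0)] * 4 + [(1.0, 15.0)] * 4

def init_population(T, ranges=PARAM_RANGES, seed=0):
    """Draw T solutions, each parameter uniformly within its allowed range."""
    rng = random.Random(seed)
    return [[rng.uniform(lo, hi) for lo, hi in ranges] for _ in range(T)]

population = init_population(T=20)
```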

Solution Evaluation
Each solution is evaluated by the fitness function we designed. However, the ground truth needs to be known in advance, and there is no appropriate training data with ground truth. To solve this problem, we explore a training strategy in which a passive stereo vision dataset with ground truth is used as training data, and the learned parameter values are then applied to the stereo matching of speckle stereo image pairs. We select the Middlebury [30] dataset because it contains indoor scenes similar to our image acquisition environment. Note that the stereo image pairs for training are grayscale images, the same as the speckle stereo image pairs.
Based on the above training strategy, the process of solution evaluation is as follows. Disparity maps corresponding to each solution are obtained using the stereo matching method shown in Figure 2. Note that the value of N in STMCF is set to 1 at this time. Afterwards, according to the ground truth and mask images provided by the training sets of the Middlebury 2014 datasets, the average percentage of pixels whose error is greater than K (K = 1) pixels in the non-occluded area is computed by the fitness function as follows:

E(s) = (1/n) Σ_{i=1}^{n} [ (100% / N_val_nocc(i)) Σ_{p ∈ I^i_val_nocc} ( | d^i_s(p) − d^i_GT(p) | > K ) ]    (16)

In Equation (16), N_val_nocc(i) is calculated by:

N_val_nocc(i) = Σ_{p ∈ I^i_val_nocc} 1    (17)

where E(s) denotes the fitness value of solution s, n is the number of disparity maps, N_val_nocc(i) represents the number of valid pixels within the non-occluded area in the i-th ground truth, I^i_val_nocc denotes the set of all valid pixels within the non-occluded area in the i-th ground truth (determined by the mask image), d^i_s is the i-th disparity map corresponding to solution s, d^i_GT represents the i-th ground truth and M_i denotes the i-th mask image.
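The fitness computation (a bad-pixel rate over the non-occluded area, lower is better) can be sketched as follows; how validity is encoded in the masks is an assumption of this sketch.

```python
import numpy as np

def fitness(disp_maps, gts, masks, K=1.0):
    """Fitness of a solution: average percentage of non-occluded pixels whose
    disparity error exceeds K pixels, over the n training pairs."""
    rates = []
    for d, gt, mask in zip(disp_maps, gts, masks):
        valid = mask & np.isfinite(gt)          # valid pixels in the non-occluded area
        bad = np.abs(d[valid] - gt[valid]) > K  # error greater than K pixels
        rates.append(100.0 * bad.mean())
    return float(np.mean(rates))
```

This is the standard Middlebury-style "bad-K" error, so lower fitness values correspond to better parameter sets.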

L-SHADE
L-SHADE [31] (Success-History-based Adaptive DE with Linear Population Size Reduction) is used to determine the value of each parameter. It is an improved DE algorithm that won the Congress on Evolutionary Computation competition for single-objective optimization in 2014 [21]. Moreover, it is one of the current state-of-the-art DE algorithms and is mainly used to solve global optimization problems with continuous variables. The procedure of L-SHADE is shown in Algorithm 1, in which the population size of generation G + 1 is computed using the LPSR (Linear Population Size Reduction) strategy as follows:

N_{G+1} = round( ((N_min − N_init) / MAX_NFE) · NFE + N_init ),

where N_min is the smallest possible population size, N_init is the initial population size, NFE represents the current number of fitness evaluations and MAX_NFE represents the maximum number of evaluations. Similar to the classical DE algorithm, the procedure of L-SHADE includes three main steps: mutation, crossover and selection. They are repeated in the main loop until the termination criterion is met. In this paper, the maximum number of fitness function evaluations is employed as the termination criterion.
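The LPSR schedule can be sketched as below, with illustrative numbers in the usage line.

```python
def lpsr_size(nfe, max_nfe, n_init, n_min):
    """Linear Population Size Reduction: the population size shrinks linearly
    from n_init down to n_min as fitness evaluations are spent."""
    return round(((n_min - n_init) / max_nfe) * nfe + n_init)

# Population size at the start, halfway and at the evaluation budget:
sizes = [lpsr_size(nfe, 1000, 100, 4) for nfe in (0, 500, 1000)]  # [100, 52, 4]
```

Shrinking the population concentrates the remaining evaluation budget on exploitation near the end of the run.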
Algorithm 2 shows how memories are updated in line 31 of Algorithm 1. Due to the length limitation of this paper, readers may refer to [31] for more details about L-SHADE algorithm.

Setup
All experiments are carried out on a PC configured with an Intel Core i7-8750H CPU (2.2 GHz) and 16 GB RAM. The STMCF and the parameter optimization method are implemented in C++ with Visual Studio 2015. The value of each parameter in STMCF is set according to the result obtained by the parameter optimization method, as listed in Table 3. All parameters are kept constant in the following experiments.

Speckle stereo image pairs are acquired by the device presented in Figure 4. A sequence of different speckle patterns is generated and projected onto the measured object by the rotary speckle projector, inside which a pseudo-random speckle pattern mask is mounted. When the two infrared cameras (baseline 120 mm, focal length 8 mm, resolution 1280 × 1024) capture images synchronously, the mask stops rotating. Afterwards, the mask rotates by a certain angle (~10°) to generate another pattern. The switching time between different speckle patterns is 11 ms and the exposure time of the camera is 2 ms, so the acquisition time for each speckle image pair is 13 ms. The number of speckle patterns can be set in the range of 3-12. The two infrared cameras are calibrated by the method proposed in [32] and the obtained camera parameters are used to rectify the speckle stereo image pairs.

Effectiveness of the Proposed Gradient Computing Method
To verify the effectiveness of the proposed gradient computing method, an experiment is conducted to compare the reconstruction accuracy of two gradient computing methods: one includes the gradient of the guided image, the other does not. Note that the two methods use their own optimized parameter values. A dumbbell gauge conforming to the German guideline VDI/VDE 2634 Part 2 [33] is used for accuracy evaluation, as shown in Figure 5. The sphere diameters are d_A = 50.7784 mm and d_B = 50.7856 mm, and the sphere center distance is D = 100.0480 mm. The distance between the dumbbell gauge and the projector is ~550 mm, and the disparity range is −100 to 100, the same as for the other measured objects.

Table 4 shows a comparison of the results obtained by projecting N speckle patterns (the bold figure indicates the result with lower error, the same below). Note that points with large error near the boundary are deleted and the measured results are obtained using the commercial software Geomagic Studio. It can be seen from Table 4 that after including the gradient of the guided image, the errors all decrease to varying degrees. Specifically, the errors of the sphere diameters d_A and d_B and the sphere center distance D are reduced by 39.4%, 74.5% and 18.6% on average, respectively. These data prove that the proposed gradient computing method is effective. (1) Abs.Err = Absolute value of error. (2) Does not include the gradient of the guided image. (3) Includes the gradient of the guided image.

Experimental Comparison with the State-of-the-Art Method Based on Spatiotemporal Correlation
As one of the state-of-the-art methods based on spatiotemporal correlation, the spatiotemporal zero-mean normalized cross correlation (STZNCC) is widely used in random speckle 3D reconstruction [16,25,34,35]. To validate the performance of STMCF, a series of comparative experiments between STZNCC and STMCF are carried out.

Comparison of Reconstruction Accuracy
Firstly, the dumbbell gauge mentioned in Section 6.2 is used for accuracy comparison. The errors of the measured results obtained by projecting N speckle patterns are listed in Table 5. The results indicate that although the error of STMCF in terms of d_B is slightly greater than that of STZNCC when N = 3, all the other errors of STMCF are lower than those of STZNCC. For N = 6, 9 and 12, the advantage is more obvious: the errors of d_A, d_B and D are at least 24.9% (N = 12), 36.6% (N = 9) and 16.0% (N = 9) lower than those of STZNCC, respectively.
To further evaluate the reconstruction accuracy for an object with a complex surface, a plastic human face mask (see Figure 6a) is also tested. Its ground truth (see Figure 6b) is obtained by the industrial scanner ATOS Core 300 (measurement accuracy ±0.02 mm, certified according to the German guideline VDI/VDE 2634 Part 2 [33]). Geomagic Studio is employed to quantitatively compare the differences between the reconstruction result and the ground truth. The error statistics of the results using STZNCC and STMCF, respectively, by projecting N speckle patterns are presented in Figure 7.
To further evaluate reconstruction accuracy of the object with complex surface, a plastic human face mask (see Figure 6a) is also tested. Its ground truth (see Figure 6b) is obtained by the industrial scanner ATOS Core 300 (the measurement accuracy is ±0.02 mm, certified by the Germany guide line VDI VDE2634 Part2 [33]). The Geomagic Studio is employed to compare the differences between the reconstruction result and ground truth quantitatively. The error statistics of results, respectively using STZNCC and STMCF by projecting N speckle patterns, are presented in Figure 7.  It can be easily discovered from Figure 7 that with the increase of N, the errors of both STMCF and STZNCC tend to decrease. In Figure 7a, except for N = 12, the maximum errors of STMCF are all less than STZNCC. In Figure 7b,c, the performance of STMCF is better than STZNCC in all cases. Therefore, in general, the accuracy of reconstructing the human face mask using STMCF is higher than that of STZNCC. Figures 8-11 shows reconstruction results and corresponding error distribution maps of the plastic human face mask, respectively, using STMCF and STZNCC by projecting N(N=3,6,9,12) speckle patterns. It can be easily discovered from Figure 7 that with the increase of N, the errors of both STMCF and STZNCC tend to decrease. In Figure 7a, except for N = 12, the maximum errors of STMCF are all less than STZNCC. In Figure 7b,c, the performance of STMCF is better than STZNCC in all cases. Therefore, in general, the accuracy of reconstructing the human face mask using STMCF is higher than that of STZNCC. Figures 8-11 shows reconstruction results and corresponding error distribution maps of the plastic human face mask, respectively, using STMCF and STZNCC by projecting N (N = 3, 6, 9, 12) speckle patterns.
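Error statistics of the kind shown in Figure 7 (maximum, mean and RMS deviation of the reconstruction from the ground truth) can be computed along the following lines. This is a hedged sketch of the general procedure, not the paper's tooling (the paper uses Geomagic Studio for this step), and the brute force nearest-neighbour search here would be replaced by a k-d tree for real point clouds:

```python
import numpy as np

def error_stats(recon, truth):
    """Deviation of a reconstructed cloud from a dense ground truth:
    the distance from each reconstructed point to its nearest ground
    truth point. recon: (M, 3) array; truth: (K, 3) array."""
    # pairwise distances (brute force, fine for small clouds)
    d = np.linalg.norm(recon[:, None, :] - truth[None, :, :], axis=2)
    nearest = d.min(axis=1)
    return {"max": nearest.max(),
            "mean": nearest.mean(),
            "rms": np.sqrt((nearest ** 2).mean())}
```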
It can be observed from Figures 8-11 that the quality of the results obtained by both STMCF and STZNCC improves as N increases. In addition, by carefully comparing the error distribution maps of STZNCC and STMCF in Figures 8-11, we find that the result of STMCF is more accurate in most areas. These conclusions are consistent with the results shown in Figure 7.

Comparison of Computation Time
The computation time comparison between STZNCC and STMCF by projecting N speckle patterns is illustrated in Figure 12. It is worth mentioning that both methods use the coarse-to-fine matching strategy proposed in [25].
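The shared coarse-to-fine strategy can be sketched as follows. This is a simplified illustration under our own assumptions, not the implementation of [25]: SAD stands in for the actual STZNCC/STMCF cost, and the pyramid and refinement rules are deliberately minimal. The speed benefit comes from searching the full disparity range only on the decimated images and then refining within a small neighbourhood at each finer level:

```python
import numpy as np

def sad(left, right, y, x, d, hw):
    """Sum of absolute differences over a (2*hw+1)^2 window; a stand-in
    for the actual matching cost."""
    wl = left[y-hw:y+hw+1, x-hw:x+hw+1]
    wr = right[y-hw:y+hw+1, x-d-hw:x-d+hw+1]
    return np.abs(wl - wr).sum()

def coarse_to_fine_disparity(left, right, y, x, d_max, hw=3, levels=2):
    # build an image pyramid by simple decimation
    pyr = [(left, right)]
    for _ in range(levels):
        l, r = pyr[-1]
        pyr.append((l[::2, ::2], r[::2, ::2]))
    # full search over the (scaled) disparity range at the coarsest level
    l, r = pyr[-1]
    s = 2 ** levels
    d = min(range(d_max // s + 1),
            key=lambda dd: sad(l, r, y // s, x // s, dd, hw))
    # refine: double the estimate and search a +/-1 neighbourhood per level
    for lev in range(levels - 1, -1, -1):
        l, r = pyr[lev]
        s = 2 ** lev
        d = min(range(max(0, 2 * d - 1), 2 * d + 2),
                key=lambda dd: sad(l, r, y // s, x // s, dd, hw))
    return d
```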
The results show that the computation time of both STZNCC and STMCF rises as N increases. Moreover, although the computation time of STMCF for the dumbbell gauge is slightly higher than that of STZNCC when N = 12, the computation time of STMCF is lower than that of STZNCC in most cases.

Comparison of Real Human Face Reconstruction
To further verify the performance of STMCF in practical application scenarios, the author's face under various expressions is reconstructed by STMCF and compared with the results of STZNCC, as shown in Figure 13. By careful observation, it can be found that the results of STMCF display the contour details of the human face as clearly as those of STZNCC. More importantly, the smoothness of the major areas of the face (such as the forehead, cheeks and chin) is significantly better than with STZNCC. Therefore, the 3D face reconstruction obtained with STMCF is more realistic.

Discussion
This paper reports a novel random speckle 3D reconstruction scheme, in which STMCF is proposed and a parameter optimization method based on the DE algorithm is designed. A series of experiments shows that, compared with STZNCC, STMCF has advantages in accuracy, computation time and reconstruction quality. However, several aspects still need to be discussed.
(1) The errors of d_A, d_B and D shown in Tables 4 and 5 do not decrease with the increase of N, which is not consistent with the results presented in Figure 7. This is because the measured values of d_A, d_B and D are obtained by a least squares sphere fitting algorithm; similar results have also been reported in the literature [24]. In contrast, the error statistics shown in Figure 7 are calculated by comparing the reconstruction result with the ground truth directly.
(2) According to the results in Figure 12, the performance of STMCF is on the whole better than that of STZNCC in terms of computation time. However, the computation time of STMCF grows faster with N than that of STZNCC, so the advantage of STMCF in computation time is more obvious for smaller values of N.
(3) Since there is no suitable training data with ground truth, a training strategy is explored in this paper, and the experimental results validate its effectiveness. This shows that training results acquired from passive stereo vision data can be applied to the stereo matching of speckle stereo image pairs when the image acquisition environment is similar, which can be helpful for other related research.
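The DE-based parameter optimization discussed above can be sketched with a minimal DE/rand/1/bin loop. Everything here is illustrative: the cost function, bounds and hyperparameters are our own assumptions, and in the paper's setting the cost would evaluate the matching error of an STMCF parameter vector on the training stereo pairs against their ground truth disparities:

```python
import numpy as np

def differential_evolution(cost, bounds, pop_size=20, F=0.5, CR=0.9,
                           generations=100, seed=0):
    """Minimal DE/rand/1/bin minimiser over box-constrained parameters."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    dim = len(bounds)
    pop = lo + rng.random((pop_size, dim)) * (hi - lo)
    fit = np.array([cost(x) for x in pop])
    for _ in range(generations):
        for i in range(pop_size):
            # mutation: perturb one random individual by a scaled difference
            a, b, c = pop[rng.choice(pop_size, 3, replace=False)]
            mutant = np.clip(a + F * (b - c), lo, hi)
            # binomial crossover, guaranteeing at least one mutant gene
            cross = rng.random(dim) < CR
            cross[rng.integers(dim)] = True
            trial = np.where(cross, mutant, pop[i])
            # greedy selection
            f = cost(trial)
            if f < fit[i]:
                pop[i], fit[i] = trial, f
    return pop[fit.argmin()], fit.min()
```

Because each cost evaluation requires a full stereo matching pass on the training data, the evaluation budget (pop_size times generations) dominates the training time in practice.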

Conclusions
A novel random speckle 3D reconstruction scheme is presented. In this scheme, STMCF is proposed for stereo matching of speckle stereo image pairs. In addition, to automatically determine the values of all parameters included in STMCF, a parameter optimization method based on the DE algorithm is designed, together with a training strategy. Experimental results demonstrate that our scheme can achieve accurate, high-quality 3D reconstruction efficiently and that STMCF outperforms the state-of-the-art method based on spatiotemporal correlation, which is widely used in random speckle 3D reconstruction. At present, STMCF cannot meet the requirements for real-time application. In future work, we intend to reduce the computation time by hardware acceleration and apply the scheme to 3D reconstruction of dynamic indoor scenes.

Data Availability Statement:
The dataset used to support the findings of this study is included in the article and is cited at the relevant places within the text as [30].