Simultaneous Calibration: A Joint Optimization Approach for Multiple Kinect and External Cameras

Camera calibration is a crucial problem in many applications, such as 3D reconstruction, structure from motion, object tracking and face alignment. Numerous methods proposed over the last few decades solve this problem with good performance. However, few methods target the joint calibration of multiple sensors (more than four devices), which is a common practical issue in real-time systems. In this paper, we propose a novel method and a corresponding workflow framework to simultaneously calibrate the relative poses of a Kinect and three external cameras. By optimizing the final cost function and adding corresponding weights to the external cameras in different locations, an effective joint calibration of multiple devices is constructed. Furthermore, the method is tested on a practical platform, and experimental results show that the proposed joint calibration method achieves satisfactory performance in a real-time system and that its accuracy is higher than that of the manufacturer’s calibration.


Introduction
Camera calibration is the process of estimating the intrinsic parameters (such as focal length, principal point and lens distortion) and the extrinsic parameters (such as rotation and translation) of a camera, which may be a color camera or a depth camera [1]. It is widely used in computer/machine vision, and it makes it possible to measure distances in the real world from their projections on the image plane [2]. Thus, with the continuous development of computer/machine vision, camera calibration has been widely applied in 3D reconstruction [3,4], structure from motion [5], object tracking [6][7][8] and gesture recognition [9,10].
On 4 November 2010, the low-cost Microsoft Kinect sensor (Los Angeles, CA, USA) was launched; its image capture device includes a color camera and a depth sensor that consists of an infrared (IR) projector combined with an IR camera. Since then, 3D depth cameras have increasingly attracted researchers due to their versatile applications in computer vision [11]. However, it is well known that Kinect intrinsics vary from device to device, so the factory presets are not accurate enough for many applications [12]. To deal with this issue, Burrus [13] presented a basic Kinect calibration algorithm that uses the camera calibration process of OpenCV. However, it only calibrated the intrinsic parameters of the infrared camera. On the other hand, Hirotake et al. [14] tried to independently calibrate the intrinsic parameters of the depth sensor and the color camera, and then register both in a common reference frame. Herrera et al. [15] proposed a high-precision color camera calibration method to assist the Kinect calibration, which achieves high accuracy. In addition, Zhang et al. [16] augmented Herrera's work with correspondence matching between the color and depth images, but they did not address distortions in the depth values. Smisek et al. [17] were the first to consider the distortions in the projection and the depth estimation. After calibrating the internal and external parameters of the device, they estimated the depth distortion of each pixel by averaging the metric error. Moreover, focusing on the distortion of depth maps, Herrera et al. [18] proposed a joint depth and color camera calibration, and used the Lambert W function to solve the disparity distortion model. This process improved the calibration accuracy and corrected the depth distortion. However, these methods were generally limited to a single external camera and could not be effectively employed with multiple devices. After that, Carolina et al.
[19] and Guo et al. [20] improved the performance on the basis of the  [21] used two Kinects to form up a depth camera network, and accordingly achieved a fast and robust camera calibration process. Nonetheless, they still did not consider a joint calibration for multiple external cameras.
Current research focuses only on the calibration of a single external camera rather than on the calibration of multiple external cameras. This paper aims to fill this gap. It introduces a novel method and a corresponding workflow framework that can simultaneously calibrate a Kinect, three external cameras, and their relative positions. By optimizing the final cost function and adding corresponding weights to the external cameras in different locations, the joint calibration of the depth sensor in the Kinect and multiple external high-resolution color cameras is realized. The paper is organized as follows: Section 2 introduces the calibration model; Section 3 proposes the approach to jointly calibrate the multiple sensors; Section 4 discusses the comparative experimental results; and the conclusions are presented in the final section.

Color Camera Projection Model
In this paper, the intrinsic model of the color camera is similar to that in [22], which is described by a pinhole model with radial and tangential distortion coefficients. Let a point in the color camera coordinate system be $C = [x_c, y_c, z_c]^T$, which is normalized as $X_n = [x_n, y_n]^T = [x_c/z_c, y_c/z_c]^T$. In the pinhole model, a straight line may bend due to the effect of radial distortion [23], which can be corrected by the following formula:
$$x_{cor} = x_n \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right), \qquad y_{cor} = y_n \left(1 + k_1 r^2 + k_2 r^4 + k_3 r^6\right) \tag{1}$$
Similarly, tangential distortion happens when the camera lens is not perfectly parallel to the image plane, which causes some areas of the image to look closer than expected [24]. It can be corrected by the following formula:
$$x_{cor} = x_n + \left[2 p_1 x_n y_n + p_2 \left(r^2 + 2 x_n^2\right)\right], \qquad y_{cor} = y_n + \left[p_1 \left(r^2 + 2 y_n^2\right) + 2 p_2 x_n y_n\right] \tag{2}$$
where $r^2 = x_n^2 + y_n^2$ and $(x_{cor}, y_{cor})$ represents the corrected coordinate point. $k_1, k_2, k_3$ and $p_1, p_2$ are the radial and tangential distortion coefficients, respectively [25]. Therefore, $K = [k_1, k_2, p_1, p_2, k_3]$ is used to represent the distortion coefficients.
Then, the image coordinates can be obtained by:
$$\begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} f_x x_{cor} + u_0 \\ f_y y_{cor} + v_0 \end{bmatrix} \tag{3}$$
where $f = [f_x, f_y]$ is the focal length and $P_0 = (u_0, v_0)$ is the principal point of the image coordinate $P = (u, v)$. The same model can be applied to the color and external cameras [26]. In this paper, the subscripts $c$ and $d$ are used to distinguish the same parameters for the color camera and the depth camera, respectively. For example, $f_c = [f_{cx}, f_{cy}]$ represents the focal length of the color camera.
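The projection model above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes that the radial and tangential terms are applied together, as in the common OpenCV-style convention, and the function name `project_point` and its argument layout are hypothetical.

```python
def project_point(point_c, f, p0, K):
    """Project a 3D point given in camera coordinates to pixel coordinates.

    point_c: (x_c, y_c, z_c) in the camera frame
    f:       (f_x, f_y) focal length in pixels
    p0:      (u_0, v_0) principal point in pixels
    K:       (k1, k2, p1, p2, k3) distortion coefficients, in the order of K above
    """
    x_c, y_c, z_c = point_c
    k1, k2, p1, p2, k3 = K

    # Normalized coordinates X_n = [x_c / z_c, y_c / z_c]
    x_n, y_n = x_c / z_c, y_c / z_c
    r2 = x_n ** 2 + y_n ** 2

    # Radial factor of Equation (1)
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3

    # Tangential offsets of Equation (2)
    dx = 2.0 * p1 * x_n * y_n + p2 * (r2 + 2.0 * x_n ** 2)
    dy = p1 * (r2 + 2.0 * y_n ** 2) + 2.0 * p2 * x_n * y_n

    # Assumption: both distortions are combined additively
    x_cor = x_n * radial + dx
    y_cor = y_n * radial + dy

    # Pixel coordinates of Equation (3)
    return (f[0] * x_cor + p0[0], f[1] * y_cor + p0[1])
```

With all distortion coefficients set to zero, the function reduces to the plain pinhole projection, which is a convenient sanity check before fitting real coefficients.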

Depth Camera Intrinsic
The transformation between the depth camera coordinates and the depth image coordinates is similar to the model for the color camera. The distortion of the color camera is a forward model (i.e., from the world coordinates to the image coordinates), whereas, for ease of calculation, the geometric distortion of the depth camera uses the backward model [18] (i.e., from the image coordinates to the world coordinates). According to the imaging principle of the depth sensor, the relation between the obtained disparity value $d_k$ and the depth value $z_k$ can be expressed as:
$$z_k = \frac{1}{\frac{1}{f_d b} d_k + \frac{1}{z_0}} = \frac{1}{c_1 d_k + c_0} \tag{4}$$
where $z_0$ is the distance from the reference point to the reference plane, $f_d$ is the focal length of the depth camera, and $b$ is the baseline length, which is the distance between the infrared camera and the laser emitter. $c_1 = 1/(f_d b)$ and $c_0 = 1/z_0$ are part of the intrinsic parameters of the depth camera that are required to be calibrated.
If the measured disparity $d$ is directly substituted into Equation (4) for calibration (i.e., the disparity distortion correction is not performed), the depth information in the observation process contains a fixed error that can be corrected by adding a spatially varying offset $Z_\delta$. This effectively reduces the re-projection error [17], and the depth value $z_{kk}$ can be re-expressed as:
$$z_{kk} = z_k + Z_\delta(u, v) \tag{5}$$
In order to improve the calibration accuracy, the method in [18] is used to directly correct the original disparity $d$. The method in [18] took the errors of all pixels from planes at several distances and normalized them. The normalized error is found to follow an exponential decay [19]. Therefore, a distortion model can be constructed that uses an attenuated spatial offset to counteract the increasing disparity error. It can be expressed as:
$$d_k = d + D_\delta(u, v) \cdot \exp(\alpha_0 - \alpha_1 d) \tag{6}$$
where $d$ is the uncorrected disparity value obtained from the Kinect, $D_\delta$ represents the spatial distortion related to each pixel and is used to eliminate the influence of the distortion, $\alpha_0, \alpha_1$ describe the decay of the distortion effect, and $d_k$ is the corrected disparity value.
Equations (4) and (6) are used to calculate the disparity-to-depth transformation, and the inverses of these equations can be used to calculate the re-projection error. From the inverse of Equation (4), it is known that:
$$d_k = \frac{1}{c_1 z_k} - \frac{c_0}{c_1} \tag{7}$$
Equation (6) has an exponential relationship, so its inverse is much more complex than the inverse of Equation (4). Therefore, we use Guo's method [20], which simplifies Equation (6) by a Taylor expansion to give Equation (8). Hence, the model for the depth camera is described by the parameter set of Equation (9), in which the first three elements represent internal parameters of the depth camera and the last five are used to transform disparity to depth values.
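The disparity-to-depth chain described in this section can be illustrated with a short Python sketch. It assumes the form $z_k = 1/(c_1 d_k + c_0)$ for Equation (4), which follows from the definitions $c_1 = 1/(f_d b)$ and $c_0 = 1/z_0$, and an exponentially attenuated per-pixel offset in the style of [18] for Equation (6). The function names and the numeric values in the usage note are hypothetical.

```python
import math


def correct_disparity(d, d_delta, alpha0, alpha1):
    """Assumed form of Equation (6): add a per-pixel offset D_delta that is
    attenuated exponentially as the raw disparity d grows."""
    return d + d_delta * math.exp(alpha0 - alpha1 * d)


def disparity_to_depth(d_k, c0, c1):
    """Assumed form of Equation (4): corrected disparity to depth."""
    return 1.0 / (c1 * d_k + c0)


def depth_to_disparity(z_k, c0, c1):
    """Inverse of Equation (4), used when computing re-projection errors."""
    return 1.0 / (c1 * z_k) - c0 / c1
```

For instance, with the illustrative values `c0 = 0.2` and `c1 = 0.001`, a corrected disparity of 800 maps to a depth of about 1.0, and `depth_to_disparity` maps that depth back to about 800. Setting `d_delta = 0` disables the distortion correction, which corresponds to substituting the raw disparity directly into Equation (4).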

Joint Calibration for Multi-Sensors
The block diagram of the proposed calibration method is presented in Figure 1. The method consists of three main consecutive steps: (1) all the checkerboard corners are selected and Zhang's method [27] is used to initially estimate the intrinsic parameters of the color cameras, while the four corners of the calibration plane are extracted in the depth map to initially estimate the intrinsic parameters of the depth camera; (2) Herrera's method [18] is used to estimate the relative positions (extrinsic parameters) between the devices; and (3) the disparity distortion parameters are initialized, and all the parameters are then substituted into the newly proposed cost function, with different weights attached, for iterative nonlinear minimization.
In the workflow framework, Step 1 and Step 2 contribute to the initialization of the parameters, which are passed to the cost function in Step 3 for nonlinear minimization. In Step 3, when the disparity distortion function is fitted with the least squares method, the cost function of the disparity distortion is the same as the corresponding intermediate term of the new cost function and does not interact with the other parameters. Therefore, after providing the corresponding initial values, the nonlinear minimization of the parameters can be achieved by iteratively evaluating the new cost function. When all the parameters fall within a predefined range, the joint calibration results are output. Otherwise, the procedure continues to the next loop until the maximum number of iterations is reached.

Platform Setting and Preprocessing
The experimental platform with multiple sensors is shown in Figure 2. The Kinect is located in front of children with Autism Spectrum Disorders (ASD), and an external color camera, called the Middle Camera (External Camera 0), is placed at the same location. Similarly, the other external color cameras are placed in the lower left corner and lower right corner and are called the Left Camera (External Camera 1) and the Right Camera (External Camera 2), respectively. All devices are fixed on the same rigid platform and do not change their relative positions during the course of the experiment, and the color camera of the Kinect is set to coincide with the origin of the experimental frame coordinate system. The direction of the experimental frame coordinate system is also shown in Figure 2.
In the process of selecting the checkerboard corners, Zhang's method [27] is used to initialize the parameters of the color camera in the Kinect and of the three external cameras. A standard checkerboard with a grid width of 0.025 m is used, with nine and six corner points in the x-axis and y-axis directions, respectively. The detection of corners is shown in Figure 3a. When the number of input images is larger than three, a unique solution of Equation (3) can be found by Zhang's method [27]. In this paper, in order to ensure the accuracy of the calibration results, three datasets are recorded at distances of 0.8 m, 1.6 m and 2.4 m from the camera frame plane. Each dataset consists of five pictures: one of the frontal plane, two of the plane rotated about the x-axis and two of the plane rotated about the y-axis. Generally speaking, the corners of the checkerboard cannot be seen in the depth image, so only the four corners of the calibration plate can be selected in the depth image, as shown in Figure 3b. Although the accuracy of the Kinect depth image is at the millimeter level, there is still a lot of noise at these corners. Consequently, the plane formed by the four selected corners can only be used to initially estimate the depth data of the calibration plate plane.
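As a small illustration of the checkerboard geometry described above, the following Python sketch generates the 3D corner coordinates of the pattern in its own plane (z = 0), which is the kind of input a Zhang-style calibration typically consumes. The function name and the row-major ordering are assumptions made for this example, not details taken from the paper.

```python
def checkerboard_points(cols=9, rows=6, spacing=0.025):
    """Return the corner coordinates (in meters) of a planar checkerboard.

    cols, rows: number of corner points along the x-axis and y-axis
    spacing:    grid width in meters
    The pattern lies in the plane z = 0 of its own frame; corners are listed
    row by row (an ordering chosen for this sketch).
    """
    return [(c * spacing, r * spacing, 0.0)
            for r in range(rows)
            for c in range(cols)]


# Acquisition plan stated in the text: 3 distances x 5 poses per distance
DISTANCES_M = (0.8, 1.6, 2.4)
POSES_PER_DISTANCE = 5
TOTAL_IMAGES = len(DISTANCES_M) * POSES_PER_DISTANCE
```

With the default arguments the function returns 54 corner points, and the acquisition plan gives 15 images in total.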

Relative Pose Estimation
In the relative position estimation, the color camera of the Kinect is assumed to be the origin of the experimental frame coordinate system. All of the equipment is fixed on the same rigid frame during the whole experiment. All of the reference frames and transformations are illustrated in the accompanying figure. A transformation between two frames is written as $T = \{R, t\}$, with rotation matrix $R$ and translation vector $t$. For example, $T_{WC}$ represents the transformation from the checkerboard to the color camera coordinate system, and a point $X_W$ in {W} can be transformed into {C} by the equation $X_C = R_{WC} X_W + t_{WC}$.
Transformations of this form cover most of the coordinate systems, such as $T_{WC}$, $T_{WE}$ and $T_{VD}$, but they cannot describe the relationship between {D} and {C}. Here, we use Herrera's method [18]. Since the calibration plate ({V}) and the checkerboard ({W}) are coplanar, and $T_{WC}$ and $T_{VD}$ are known, $T_{DC}$ can be obtained. Specifically, a plane is defined in each reference frame ({W}, {V}) by:
$$n^T x - \delta = 0 \tag{10}$$
where $n$ is the unit normal and $\delta$ is the distance to the origin. If the rotation matrix is written as $R = (r_1, r_2, r_3)$, and the parameters of the plane in both frames are chosen as $n = [0, 0, 1]^T$ and $\delta = 0$, then the plane parameters in the color camera coordinate system ({C}) are $n = r_3$ and $\delta = r_3^T t$, where $R_{WC}, t_{WC}$ are used for the color camera and $R_{VD}, t_{VD}$ for the depth camera [18,20]. The plane parameter vectors for each color image can be concatenated into the matrices $M_C = [n_{c1}, n_{c2}, \cdots, n_{cn}]$ and $b_C = [\delta_{c1}, \delta_{c2}, \cdots, \delta_{cn}]$ [28]. Likewise, the plane parameter vectors in the depth camera are represented by $M_D$ and $b_D$. The relative transformation $T_{CD} = \{R_{CD}, t_{CD}\}$ is then computed from $M_C$, $b_C$, $M_D$ and $b_D$ as in [18]. Finally, the rotation matrix $R_{CD} = UV^T$ is obtained by singular value decomposition (SVD), where $USV^T$ is the SVD of the initial estimate of $R_{CD}$. $T_{DC}$ can also be obtained from $T_{CD}$.
Now, the relative position between the three external cameras and the color camera of the Kinect can be obtained directly.
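The plane parameterization used in this step can be checked numerically. The sketch below, in plain Python, computes $n = r_3$ and $\delta = r_3^T t$ from a pose and stacks the results of several images into $M$ and $b$. It assumes the convention $X_C = R X_W + t$; all function names are hypothetical and the sketch stops before the relative transformation itself.

```python
def transform_point(R, t, x):
    """Apply X_C = R X_W + t, with R given as a 3x3 nested list."""
    return [sum(R[i][j] * x[j] for j in range(3)) + t[i] for i in range(3)]


def plane_from_pose(R, t):
    """Plane parameters of the pattern plane (z = 0 in its own frame),
    expressed in the camera frame: n = r_3 and delta = r_3^T t."""
    n = [R[i][2] for i in range(3)]  # third column of R
    delta = sum(n[i] * t[i] for i in range(3))
    return n, delta


def stack_planes(poses):
    """Concatenate per-image plane parameters into M (3 x n) and b (length n)."""
    normals, deltas = [], []
    for R, t in poses:
        n, delta = plane_from_pose(R, t)
        normals.append(n)
        deltas.append(delta)
    M = [[n[i] for n in normals] for i in range(3)]  # one column per image
    return M, deltas
```

Any pattern point $(a, b, 0)$ transformed into the camera frame satisfies $n^T x - \delta = 0$ with these parameters, which is why the third column of the rotation and the translation are sufficient to describe the plane.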

Nonlinear Minimization
The least squares method is a basic, practical and widely used mathematical tool [29] that finds the best fit by minimizing the sum of squared errors between the samples and their reconstructions. In camera calibration, the core of the method is to minimize the weighted sum of squares of the measurement re-projection errors over all parameters. The re-projection error for the color camera and for the external camera is the Euclidean distance between the measured corner position and its re-projected position. We denote the re-projected positions of the color camera and the external camera by $\hat{p}_c$ and $\hat{p}_e$, and their measured positions by $p_c$ and $p_e$, respectively. For the depth camera, the re-projection error is the difference between the measured original disparity $d$ and its re-projected value $\hat{d}$ (i.e., the estimated value of the original disparity). In Formula (4), $c_0$ and $c_1$ are internal parameters of the depth camera and $z_k$ can be obtained from the depth information, so the estimated original disparity $\hat{d}$ can be computed. The method of [30] can be used to obtain the parameter $z_{kk}$ in Equation (5), and the measured original disparity $d$ can also be obtained. At this point, we have a preliminary cost function:
$$c = \frac{\sum \left\| \hat{p}_c - p_c \right\|^2}{\sigma_c^2} + \frac{\sum \left( \hat{d} - d \right)^2}{\sigma_d^2} + \frac{\sum \left\| \hat{p}_e - p_e \right\|^2}{\sigma_e^2} \tag{14}$$
where $\sigma_c^2$, $\sigma_d^2$ and $\sigma_e^2$ are the variances of the measurement error of the color camera, depth camera and external camera, respectively. Clearly, Formula (14) does not fully meet our requirements. For example, the parameters of some external cameras are not used at all. Hence, Equation (14) needs to be modified.
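To make the structure of the preliminary cost concrete, here is a minimal Python sketch of a variance-normalized sum of squared re-projection errors with one term each for the color camera, the depth camera and a single external camera. It is an illustration under the assumptions stated in the text (Euclidean corner errors, scalar disparity errors); the function names and argument layout are hypothetical.

```python
def squared_corner_error(measured, reprojected):
    """Sum of squared Euclidean distances between matching 2D corners."""
    return sum((mu - ru) ** 2 + (mv - rv) ** 2
               for (mu, mv), (ru, rv) in zip(measured, reprojected))


def preliminary_cost(color, depth, external, var_c, var_d, var_e):
    """Variance-weighted sum of squared re-projection errors.

    color, external: (measured, reprojected) lists of (u, v) corners
    depth:           (measured, estimated) lists of disparity values
    var_c, var_d, var_e: measurement error variances of each device
    """
    cost_color = squared_corner_error(*color) / var_c
    cost_depth = sum((d - d_hat) ** 2 for d, d_hat in zip(*depth)) / var_d
    cost_external = squared_corner_error(*external) / var_e
    return cost_color + cost_depth + cost_external
```

Dividing each term by its variance puts pixel errors and disparity errors on a comparable scale, so that no single device dominates the minimization only because of its units.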
First of all, taking into account the disparity distortion correction of the depth camera, the estimated value $\hat{d}$ of the original disparity is replaced by $\hat{d}_k$ corrected by Equation (7), and the measured value $d$ of the original disparity is replaced by $d_k$ corrected by Equation (8). In Equation (8), the parameters $D_\delta$ and $\alpha = \{\alpha_0, \alpha_1\}$ are independent of all the other parameters; they only depend on the observed values at the pixel $(u, v)$. Therefore, they can be optimized individually through the least squares method, and the cost function of the disparity distortion is given by Equation (15). Initial values of $D_\delta$ and $\alpha$ are provided, and the optimal solution is then reached by iteration.
Secondly, the cost function (14) cannot achieve the simultaneous calibration of all external cameras. We therefore extend the term in Equation (14) that is associated with the external cameras and add different weights to the external cameras, that is, coefficients $\beta_i$ ($i = 0, 1, 2, 3, \ldots$) applied to their corresponding re-projection errors. The additional weights are related to the distance from the external cameras to the Kinect. Figure 5 shows the top view of the experimental framework during the image acquisition process. The checkerboard plane is captured in multiple rotation directions, and the frontal plane is selected for the analysis. The distance between points A and B is the total width of the checkerboard, and the distances between points B and C and between points B and D are the widths of the checkerboard as seen in the pictures taken by external cameras 0 and 1, respectively. The distance between points B and C is longer than the distance between points B and D [31]. In other words, under the same conditions, the checkerboard occupies more pixels in the picture taken by external camera 0; that is, the pictures taken by external camera 0 contain more calibration information [32]. Therefore, external camera 0 should receive a higher weight. In the calibration process, attaching a high weight to camera 0, which is closer to the Kinect, and low weights to cameras 1 and 2, which are farther from the Kinect, makes the calibration results more accurate.
This paper uses $I$ to represent the spatial distance between the color camera and the corresponding external camera on the experimental frame. By analyzing a large number of calibration results, the relationship between the spatial distance $I$ and the corresponding coefficient $\beta$ can be summarized. When the value of $I$ for all of the external cameras is less than 600 mm, the coefficient $\beta$ does not vary with $I$, and $\beta_i = 1$; when the value of $I$ for one or more external cameras is greater than 600 mm, we define $A = (I - 600)/50$ and $\beta = 1 - 0.02 \times A$, where $A$ is rounded up to a natural number (e.g., 1.1 is taken as 2). At the same time, in order to reduce the influence of the external cameras on the calibration of the Kinect internal parameters, we specify $\beta_0 + \beta_1 + \ldots + \beta_i = i + 1$ [33], and the other external cameras, for which the value of $I$ is less than 600 mm, share the same coefficient value; when the value of $I$ for all of the external cameras is greater than 600 mm, all the external camera coefficients are computed by the same formula $A = (I - 600)/50$, $\beta = 1 - 0.02 \times A$. In this paper, the relative position between each external camera and the color camera has been calculated, and the corresponding external camera coefficients are shown in Table 1. In Table 1, X, Y and Z are the corresponding coordinates in the experimental frame coordinate system; $I$ is the spatial distance between the color camera and the corresponding external camera in that coordinate system; and $\beta$ is the corresponding coefficient of the external camera.
After this analysis of the external cameras, the modified cost function, Equation (16), is obtained. Formula (15) is the same as the corresponding intermediate term of the new cost function (16) and does not interact with the other parameters. Therefore, the corresponding initial values can be substituted directly into Equation (16), and the nonlinear minimization of the parameters can be achieved by iteratively evaluating the new cost function. The specific iteration process is as follows. In the first step, $D_\delta$ is kept constant while the coefficients $\beta_0$, $\beta_1$ and $\beta_2$ are set to 1.2, 0.9 and 0.9, respectively. Then, all the other parameters are substituted into Equation (16) to minimize the value of $c$. In the second step, the initial values of $\alpha_0$, $\alpha_1$ and $D_\delta$ in the depth distortion model are set to zero, and they are then substituted into Equation (15) to optimize the disparity distortion parameter $D_\delta$ for each pixel individually. Once the new value of $D_\delta$ is obtained, it replaces the old value used in the first step. Steps 1 and 2 are repeated as many times as necessary until the residuals converge to a minimum.
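The weighting rule for the external cameras can be written as a short Python function. This is a sketch of one reading of the rule: cameras farther than 600 mm receive $\beta = 1 - 0.02 \times A$ with $A = (I - 600)/50$ rounded up, and any cameras within 600 mm share the remainder equally so that the coefficients sum to the number of cameras. The function name, the handling of a distance of exactly 600 mm, and the distances in the usage note are assumptions, not values from Table 1.

```python
import math


def external_camera_weights(distances_mm, threshold=600.0, step=50.0, decay=0.02):
    """Return one coefficient beta per external camera.

    distances_mm: spatial distance I between the Kinect color camera and
                  each external camera, in millimeters
    """
    n = len(distances_mm)
    betas = [None] * n
    near = []
    for idx, dist in enumerate(distances_mm):
        if dist > threshold:
            # A = (I - 600) / 50, rounded up to a natural number
            a = math.ceil((dist - threshold) / step)
            betas[idx] = 1.0 - decay * a
        else:
            # Assumption: a distance of exactly 600 mm counts as "near"
            near.append(idx)

    if near:
        # Near cameras share what is left so that the betas sum to n.
        # If every camera is near, each one gets exactly 1.
        far_total = sum(b for b in betas if b is not None)
        share = (n - far_total) / len(near)
        for idx in near:
            betas[idx] = share
    return betas
```

For hypothetical distances of 100 mm, 820 mm and 820 mm, the two far cameras get $A = 5$ and $\beta = 0.9$, and the near camera receives the remaining 1.2, which matches the coefficients 1.2, 0.9 and 0.9 used in the iteration described above. When all cameras are farther than 600 mm, no renormalization is applied, following the text.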

Experiments
In order to demonstrate the performance of the proposed method in a real project, all of the input images in this experiment come from the same database, which was collected with our existing experimental equipment. All pictures were collected in the way described in Section 3.1 and saved in JPG format. For comparison with Herrera's method, all depth images in this experiment are saved in the same PGM format as in Herrera's method. In addition, Herrera's method depends strongly on the number of input pictures, and its results were random when fewer than 20 pictures were used [19], whereas the joint calibration method proposed in this paper only needs 15 pictures. The devices' intrinsic parameters calculated by our method are shown in Tables 2 and 3, wherein C.C. represents Color Camera and E.C. represents External Camera. For the color and external cameras, the table lists the focal length $(f_{cx}, f_{cy})$, the principal point $(u_{c0}, v_{c0})$ and the distortion coefficients $K_c = [k_{c1}, k_{c2}, p_{c1}, p_{c2}, k_{c3}]$. For the depth camera, the table lists the focal length $(f_{dx}, f_{dy})$, the principal point $(u_{d0}, v_{d0})$, the distortion coefficients $K_d = [k_{d1}, k_{d2}, p_{d1}, p_{d2}, k_{d3}]$, the depth parameters $(c_0, c_1)$ and the depth distortion parameters $(\alpha_0, \alpha_1)$.

Herrera's Method Results for Comparison
In our results, each device corresponds to a unique set of values. In this paper, the Herrera's method results are used to compare with the proposed method. However, Herrera's calibration method is limited to a single external camera and could not be effectively employed in multiple devices. We can only calibrate each external camera one by one. Therefore, in the actual calibration process, each of the different external cameras will correspond to a new set of Kinect data. How to choose from multiple sets of Kinect parameters is also a problem. In the actual comparison process, Herrera's method is still used to calibrate the external camera 0, 1, 2, and there are three different sets of Kinect parameter values.
When selecting the Kinect parameters for Herrera's method, the re-projection errors of the color camera and the depth camera are an important reference: the smaller the values, the more preferable the corresponding set of Kinect parameters. We therefore select the single set of Kinect parameters that gives the lowest re-projection error summed over all three external cameras. Alternatively, each set of Kinect parameter values can be fed into the 3D reconstruction module, and the best set for Herrera's method can be chosen by observing the reconstruction results. However, this second approach is too random, and the choice of Kinect parameters may be affected by observation error. Therefore, this paper selects the Kinect parameters by the first method described above.
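The first selection rule can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation: the function name `select_kinect_parameters`, the dictionary keys and all error values are hypothetical placeholders. In practice, the inputs would be the re-projection errors measured for external cameras 0, 1 and 2 with each candidate set of Kinect parameters.

```python
def select_kinect_parameters(errors_by_candidate):
    """Pick the candidate Kinect parameter set with the lowest summed error.

    errors_by_candidate maps a candidate name to the re-projection errors
    measured for external cameras 0, 1 and 2 with that candidate.
    """
    totals = {name: sum(errs) for name, errs in errors_by_candidate.items()}
    best = min(totals, key=totals.get)
    return best, totals


# Hypothetical error values (pixels), for illustration only.
example_errors = {
    "kinect_from_ec0": [0.50, 0.75, 0.75],
    "kinect_from_ec1": [0.75, 0.50, 0.50],
    "kinect_from_ec2": [1.00, 0.75, 0.50],
}
best_name, summed = select_kinect_parameters(example_errors)
```

Summing over all three external cameras keeps the choice from favoring the one camera that a given Kinect parameter set was originally calibrated with.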
To present the difference between the two methods visually, the corresponding rotation, translation and distortion correction are applied to the original depth maps, which are then overlaid on the corresponding color images [34]. The overlaid depth maps and the corresponding 3D colored point cloud images obtained by the proposed method and Herrera's method are shown in Figures 6 and 7, respectively.
It can be clearly observed that the proposed method gives accurate results in both the overlaid depth maps and the 3D colored point cloud images, whereas Herrera's method is only partially accurate. For example, in Figure 7f, a large black point cloud appears on the white desktop, which is not acceptable.
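The geometric core of the overlay described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: it uses an ideal pinhole model, assumes the depth value is already metric, and omits both the lens distortion and the depth distortion correction. All function names and the intrinsics tuple layout `(fx, fy, u0, v0)` are hypothetical.

```python
def backproject(u, v, z, fx, fy, u0, v0):
    """Lift a depth pixel (u, v) with metric depth z to a 3D point."""
    return ((u - u0) * z / fx, (v - v0) * z / fy, z)


def transform(point, R, t):
    """Apply the rigid transform R * point + t (R is a 3x3 nested list)."""
    return tuple(
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )


def project(point, fx, fy, u0, v0):
    """Project a 3D point onto the image plane of a pinhole camera."""
    x, y, z = point
    return (fx * x / z + u0, fy * y / z + v0)


def depth_pixel_to_color_pixel(u, v, z, depth_K, color_K, R, t):
    """Map a depth pixel to the color image through the relative pose (R, t)."""
    p_depth = backproject(u, v, z, *depth_K)
    p_color = transform(p_depth, R, t)
    return project(p_color, *color_K)
```

Repeating this mapping for every valid depth pixel gives the overlaid depth map. Errors in the estimated intrinsics or in the relative pose appear as misalignment between depth edges and color edges.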
By analyzing the calibration results of the two methods on the same dataset, the standard deviations of the re-projection errors are compared in Table 4. Here, the standard deviation of each re-projection error can be regarded as the actual value of the corresponding intermediate term after the nonlinear minimization of Equation (16). Therefore, the actual value of c in Equation (16) indirectly reflects the accuracy of the calibration and can serve as a reference for evaluating it, but it is by no means a direct standard [35]. The smaller the value, the higher the calibration accuracy of the corresponding device. For Herrera's method, the minimum standard deviations of the color camera and the depth camera are 0.1272 and 0.7343, respectively, and the standard deviation of each of the three external cameras is unique. In this case, the actual value of c for Herrera's method is c_Her = 6.04436. Similarly, the value for the proposed method is c_Pro = 5.93022. The results of the two methods are very close, with c_Pro about 1.9% lower than c_Her. Both methods achieve accurate calibration, and the data show that the proposed joint calibration method is more accurate by this measure. Therefore, our method not only realizes the joint calibration of the depth sensor and multiple external cameras, but also improves the calibration accuracy and reduces the dependence on the number of input images.
Table 4 note: C.C. denotes Color Camera, E.C. denotes External Camera and D.C. denotes Depth Camera; c is the parameter in Equation (16). To compare the data sets, the variances were kept constant (σ_c = 0.02 px, σ_d = 0.75 kud, σ_ei = 0.40 px).

3D Reconstruction
In addition, to provide data support for the 3D reconstruction module, the results of the two methods are each deployed on a real project platform [36]. The overlaid depth maps and the corresponding joint 3D reconstruction results of the proposed method and Herrera's method are shown in Figure 8. Color images captured by different cameras are superimposed in the same space: the color views come from the Kinect color camera and external camera 0, and both are mapped onto the same 3D point cloud. The completeness of the reconstruction between them reflects the accuracy of the joint calibration results. The overlaid depth maps and the joint reconstructed 3D images show that both methods preserve the integrity of the depth information and that the details of the scene are reflected in the reconstructed 3D images. Comparing the details of the two image sets, the proposed method performs better on both the overlaid depth maps and the joint reconstructed 3D images. For example, in Figure 8a,c, the left palm edge of the observed subject shows that the color and depth information are superimposed more accurately by the proposed method; in Figure 8b,d, the contours of the subject's right shoulder are clearer with the proposed method. With Herrera's method, a larger area of the clothing pattern is projected onto the surface of the brown storage locker because of the data deviation.

3D Ground Truth
To further demonstrate the calibration results of the two methods, we also collected a separate set of data as a test set. As shown in Figure 9, the test set contains six sets of standard checkerboard images taken at different angles, and each set contains a checkerboard image from the Kinect view and the corresponding depth image. First, the coordinates of the checkerboard corners in the test set are determined; the corners are numbered as shown in Figure 3a. The actual distance between adjacent corners of the checkerboard is 25 mm. Then, the Kinect intrinsics obtained by each of the two methods are used to reconstruct the test set. The calibration accuracy is evaluated by analyzing the distance error between the reconstructed points. In theory, the closer the calculated distance between checkerboard corners is to the actual distance, the higher the accuracy of the corresponding calibration method [37]. To reduce the relative error, the maximum known distances along the x-axis and the y-axis are measured separately; in other words, the distances between the checkerboard corners numbered 1 and 9 and between those numbered 1 and 46 are calculated. Table 5 shows the distance error between the reconstructed points in the x-axis and y-axis directions. The distances obtained with the proposed method are closer to the true distances, which indicates a higher calibration accuracy.
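The distance-error evaluation can be illustrated with a short Python sketch. It is an assumption-laden example rather than the authors' evaluation script: the function name `distance_errors` and the corner coordinates are hypothetical, and the reconstructed corner positions are assumed to be available as 3D points in millimetres.

```python
import math


def distance_errors(corner_pairs, true_distance_mm):
    """Return per-pair signed errors and their mean absolute value.

    corner_pairs holds pairs of reconstructed 3D corner positions in mm.
    """
    errors = [math.dist(a, b) - true_distance_mm for a, b in corner_pairs]
    mean_abs_error = sum(abs(e) for e in errors) / len(errors)
    return errors, mean_abs_error


# Hypothetical reconstructed corner positions (mm), for illustration only.
pairs = [
    ((0.0, 0.0, 800.0), (26.0, 0.0, 800.0)),  # 1 mm too long
    ((0.0, 0.0, 800.0), (0.0, 24.0, 800.0)),  # 1 mm too short
]
errors, m = distance_errors(pairs, true_distance_mm=25.0)
```

Taking the mean of the absolute errors prevents errors of opposite sign from cancelling, which matches the definition of M given for Table 5.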

Table 5 note: Lx and Ly represent the calculated distances between adjacent checkerboard corners in the x-axis and y-axis directions, respectively. Lx-25 and Ly-25 represent the errors between the calculated and actual distances in the x-axis and y-axis directions, respectively. M represents the arithmetic mean of the absolute values of the distance errors.

Conclusions
Because current research focuses on calibrating a single external camera rather than multiple external cameras, we present a novel method and a corresponding workflow framework that simultaneously calibrates the relative poses of a Kinect and three external cameras. By optimizing the final cost function and assigning corresponding weights to the external cameras in different locations, the joint calibration of multiple devices is constructed efficiently. The validity and accuracy of the method are verified by comparative experiments. The experimental results show that the proposed method improves the calibration accuracy, reduces the dependence on the number of input pictures, and improves the accuracy of joint 3D reconstruction. The calibration results provide data support for a practical real-time project, in which the method has been applied successfully, demonstrating its practical value.