A New Calibration Method for Commercial RGB-D Sensors

Commercial RGB-D sensors such as the Kinect and the Structure Sensor have been widely used in the game industry, where geometric fidelity is not of utmost importance. For applications in which high-quality 3D data are required, e.g., 3D building models of centimeter-level accuracy, accurate and reliable calibration of these sensors is required. This paper presents a new model for calibrating the depth measurements of RGB-D sensors based on the structured light concept. Additionally, a new automatic method is proposed for the calibration of all RGB-D parameters, including the internal calibration parameters of all cameras, the baseline between the infrared and RGB cameras, and the depth error model. When compared with traditional calibration methods, this new model shows a significant improvement in depth precision for both near and far ranges.


Introduction
In recent years, RGB-D sensors have attracted great attention in 3D modeling due to their low cost. Two major concepts, time of flight (ToF) and structured light (SL), are used in RGB-D sensors. Many devices released on the market are based on these concepts; for example, Kinect v1, Structure Sensor [1] and ASUS Xtion Pro Live [2] are based on the SL concept, while Kinect v2 [3] is based on the ToF concept [4]. The RGB-D sensor consists of three different sensors: an RGB camera, an infrared (IR) camera and an IR projector. In the Kinect and ASUS devices, all three sensors are mounted in a fixed frame, whereas the Structure Sensor combines only the two IR sensors and is designed to be attached to a portable device with an RGB camera.
Although RGB-D sensors were originally designed for gaming purposes such as remote controlling, they have recently made an important contribution to surveying and navigation applications, such as building information modeling, indoor navigation, and indoor 3D modeling [5,6]. Although the accuracy required for gaming applications is not high, extending the use of RGB-D sensors to surveying-type applications requires accurate calibration of the device's geometric parameters (i.e., camera focal lengths and baselines between cameras) and modeling of the sensor errors (i.e., lens distortions and systematic depth errors) to produce high-quality spatial data and 3D models (e.g., cm-level precision).
The calibration was carried out in two steps. The first step was to compute the external baseline between the cameras and the distortion parameters of the RGB camera. The second step focused on the depth sensor calibration. First, the in-factory calibration parameters were updated to eliminate the systematic error resulting from the baseline between the IR camera and projector. Second, a combined distortion model was used to compensate for the distortion and systematic errors resulting from both the IR camera and projector. The experimental design and results are discussed in comparison with the conventional calibration method, and concluding remarks are made.

A Distortion Calibration Model for Depth Sensor
Numerous RGB-D sensors on the market are based on the structured light concept and consist of an IR camera and an IR projector. In addition to the IR sensors, an optional RGB camera acquires the color information of the observed scene. The IR projector emits a predefined pattern and the IR camera receives it [28]. The depth image is obtained by triangulation based on the distance between the IR camera and the projector. Figure 1 shows the main elements of two sensors that use the SL concept: Kinect v1 and the Structure Sensor. The main difference between the two sensors is the baseline between the IR camera and projector. The longer the sensor's baseline, the longer the working distance that can be achieved. The working range of Kinect v1 is 0.80 m to 4.0 m, while that of the Structure Sensor is 0.35 m to 3.5 m.
The principle of depth computation for RGB-D sensors is shown in Figure 2, where both IR sensors work together to produce a pixel shift called disparity. The IR projector pattern on a predefined plane ($Z_0$), used in the in-factory calibration [14], is stored in the sensor firmware. While capturing a real feature ($Q_i$), both the location of the standard projector pattern ($x^c_{i,0}$) and the location of the real projected pattern captured by the IR camera ($x^c_i$) are identified. The difference between both locations is called the disparity, $d_i = x^c_i - x^c_{i,0}$. Using the disparity value and the predefined configuration information, including the focal length ($f$) of the IR sensors, the baseline between the IR projector and camera ($w$), and the standard depth of the projected pattern ($Z_0$) [5,29], we can compute the depth of the feature point ($Q_i$).
Using triangle similarity, the relationship between the standard pattern location ($x^c_{i,0}$), the real projected pattern location ($x^c_i$) in the IR camera space, and the IR projector location ($x^p_i$) can be written as:
$$\frac{x^c_{i,0} - x^p_i}{f} = \frac{w}{Z_0}, \qquad \frac{x^c_i - x^p_i}{f} = \frac{w}{Z_i} \tag{1}$$
Applying the disparity $d_i = x^c_i - x^c_{i,0}$, the fundamental equation for computing the depth value of a feature point ($Q_i$) can be written as follows:
$$Z_i = \frac{Z_0}{1 + \dfrac{Z_0}{f w}\, d_i} \tag{2}$$
As noted previously, the disparity value is measured by the firmware of the RGB-D sensor. However, the value output by the sensor is the normalized disparity ($d^n_i$), which ranges from 0 to 2047. The sensor uses two linear factors, $\alpha$ and $\beta$, to normalize the measured disparity ($d_i$), so that $d_i = \alpha d^n_i + \beta$. Replacing $d_i$ in Equation (2), the depth is stated in terms of the normalized disparity as:
$$Z_i = \frac{Z_0}{1 + \dfrac{Z_0}{f w}\left(\alpha d^n_i + \beta\right)} \tag{3}$$
Combining all of the constants into two factors $a$ and $b$, Equation (3) becomes:
$$Z_i = \frac{1}{a + b\, d^n_i} \tag{4}$$
where $a$ and $b$ are constants:
$$a = \frac{1}{Z_0} + \frac{\beta}{f w}, \qquad b = \frac{\alpha}{f w} \tag{5}$$
The final coordinates $X_i$, $Y_i$, and $Z_i$ for the acquired feature ($Q_i$) are computed as:
$$X_i = \frac{x^c_i Z_i}{f}, \qquad Y_i = \frac{y^c_i Z_i}{f}, \qquad Z_i = \frac{1}{a + b\, d^n_i} \tag{6}$$
The formula presented in Equation (4) with the factors $a$ and $b$ is called the manufacturer's equation or the mapping function, which produces the depth information from the normalized disparity. The $a$ and $b$ coefficients are a function of the design parameters of the RGB-D sensor, which are the standard plane distance, the baseline, the focal length, and the linear parameters that convert the measured disparity to the normalized disparity.
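The mapping function can be illustrated with a short sketch. It assumes the linear relation $d = \alpha d^n + \beta$ between the measured and normalized disparity, so that $a = 1/Z_0 + \beta/(fw)$ and $b = \alpha/(fw)$; the class name, the function names, and every numeric value are hypothetical and are not taken from any real sensor.

```python
from dataclasses import dataclass


@dataclass
class SensorDesign:
    """Design parameters of a structured-light depth sensor (illustrative)."""
    f: float      # IR focal length, pixels
    w: float      # IR camera-to-projector baseline, metres
    z0: float     # depth of the stored reference pattern, metres
    alpha: float  # scale in the assumed relation d = alpha * d_n + beta
    beta: float   # offset in the assumed relation d = alpha * d_n + beta

    def mapping_factors(self):
        """Return (a, b) such that Z = 1 / (a + b * d_n)."""
        fw = self.f * self.w
        return 1.0 / self.z0 + self.beta / fw, self.alpha / fw


def depth_from_disparity(d, f, w, z0):
    """Depth from the measured disparity d (pixels) by triangulation."""
    return z0 / (1.0 + z0 * d / (f * w))


def depth_from_normalized(d_n, a, b):
    """Mapping function: depth from the normalized disparity d_n."""
    return 1.0 / (a + b * d_n)


def point_from_pixel(x, y, d_n, f, a, b):
    """3D point for pixel offsets (x, y) measured from the principal point."""
    z = depth_from_normalized(d_n, a, b)
    return x * z / f, y * z / f, z


# Hypothetical values, chosen only so that d_n = 800 maps to the reference plane.
design = SensorDesign(f=580.0, w=0.075, z0=1.0, alpha=-0.125, beta=100.0)
a, b = design.mapping_factors()
for d_n in (600, 800, 1000):
    print(d_n, round(depth_from_normalized(d_n, a, b), 3))
```

Because only $a$ and $b$ enter the depth computation, a calibration can estimate these two factors directly without separating $f$, $w$, $Z_0$, $\alpha$, and $\beta$.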
The baseline between the IR camera and the IR projector has a great influence on depth precision. Applying covariance error propagation to Equation (3) to estimate the variance of the depth produced by an SL RGB-D sensor gives $\sigma_z = z^2 (f w)^{-1} \sigma_d$, where $\sigma_z$ and $\sigma_d$ are the precisions of the depth and the disparity, respectively. For a given SL RGB-D sensor, if the baseline ($w$) between the IR camera and the IR projector were doubled, the depth standard deviation would be halved (a 50% improvement in depth precision), assuming that all other variables remain constant.
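A small numeric sketch of this error propagation follows. The focal length, the baselines, and the disparity noise are hypothetical values used only to show the quadratic growth with depth and the effect of doubling the baseline.

```python
def depth_sigma(z, f, w, sigma_d):
    """Propagated depth standard deviation: sigma_z = z**2 / (f * w) * sigma_d.

    z and w are in metres, f and sigma_d in pixels; the result is in metres.
    """
    return z ** 2 / (f * w) * sigma_d


# Hypothetical sensor: 580 px focal length and 0.1 px disparity noise.
F_PX, SIGMA_D = 580.0, 0.1
for baseline in (0.075, 0.150):
    row = [round(depth_sigma(z, F_PX, baseline, SIGMA_D) * 1000.0, 2)
           for z in (1.0, 2.0, 3.0)]
    print(f"w = {baseline:.3f} m, sigma_z in mm at 1, 2, 3 m: {row}")
```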
The disparity value can be computed from three constants, $f$, $w$, and $Z_0$, and two measured quantities, $x^c_i$ and $x^p_i$. The measured quantities are affected by the distortion of the IR camera and projector, respectively. The systematic error can be assumed to be a function of the distortion parameters. The general expression that combines the effect of the systematic error and the distortion effect can be written as follows:
$$d_{ti} = d_i + d_e \tag{7}$$
where $d_{ti}$ is the true disparity, $d_i$ is the measured disparity, and $d_e$ represents the error resulting from the effect of lens distortion and systematic error for the IR sensors. The disparity error can be expressed as a function of the distortion effect for both the IR camera and projector for both tangential and radial distortions:
$$d_e = \left(\delta^c_{tang} + \delta^c_{rad}\right) - \left(\delta^p_{tang} + \delta^p_{rad}\right) \tag{8}$$
where $\delta^c_{tang}$ and $\delta^p_{tang}$ are the tangential distortion effects for the IR camera and projector, respectively, and $\delta^c_{rad}$ and $\delta^p_{rad}$ are the radial distortion effects for the IR camera and projector, respectively. Radial distortion reflects the lens quality and is caused by the bending of the ray that links the object, image, and focal points. The Brown model [30] with two factors ($K_1$ and $K_2$) is applied to compensate for the radial distortion. The tangential model describes the distortion resulting from the inaccurate location of the lens with respect to the focal point, and its effect is compensated for using another two factors ($P_1$ and $P_2$) [12,31]:
$$\delta_{tang} = P_1\left(r^2 + 2x_t^2\right) + 2P_2\, x_t y_t, \qquad \delta_{rad} = x_t\left(K_1 r^2 + K_2 r^4\right), \qquad r^2 = x_t^2 + y_t^2 \tag{9}$$
where $P_1$ and $P_2$ are the factors of the tangential distortion model, $K_1$ and $K_2$ are the factors of the radial distortion model, and $x_t$ and $y_t$ are the distortion-free coordinates of the image point. As the disparity value is computed from the horizontal shift from the projected pattern to the standard one, the relation between the measured point coordinate and the true point coordinate can be expressed as follows:
$$x_m = x_t + \delta_{tang} + \delta_{rad} \tag{10}$$
where $x_m$ is the measured x coordinate for the sensor and $x_t$ is the true x coordinate for the sensor.
Inserting Equation (9) into Equation (8), the full disparity error model considering both IR camera and projector distortions can be expressed in Equation (11) with eight parameters:
$$d_e = \left[P^c_1\left(r_c^2 + 2x_c^2\right) + 2P^c_2\, x_c y_c\right] - \left[P^p_1\left(r_p^2 + 2x_p^2\right) + 2P^p_2\, x_p y_p\right] + x_c\left(K^c_1 r_c^2 + K^c_2 r_c^4\right) - x_p\left(K^p_1 r_p^2 + K^p_2 r_p^4\right) \tag{11}$$
where $p$ refers to the IR projector and $c$ refers to the IR camera.
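The eight-parameter disparity error can be sketched as the difference between a camera and a projector distortion term, each made of a Brown-type radial part ($K_1$, $K_2$) and a tangential part ($P_1$, $P_2$) in the x direction. The tangential form below follows Brown's ordering of $P_1$ and $P_2$, which is an assumption (some libraries, OpenCV among them, swap the roles of the two coefficients); the function names and the camera-minus-projector sign are also assumptions.

```python
def radial_x(x, y, k1, k2):
    """Radial distortion in x: x * (K1 * r^2 + K2 * r^4)."""
    r2 = x * x + y * y
    return x * (k1 * r2 + k2 * r2 * r2)


def tangential_x(x, y, p1, p2):
    """Tangential distortion in x: P1 * (r^2 + 2 x^2) + 2 * P2 * x * y."""
    r2 = x * x + y * y
    return p1 * (r2 + 2.0 * x * x) + 2.0 * p2 * x * y


def distort_x(x, y, k1, k2, p1, p2):
    """Measured x coordinate obtained from the distortion-free (x, y)."""
    return x + tangential_x(x, y, p1, p2) + radial_x(x, y, k1, k2)


def disparity_error(cam_xy, proj_xy, cam_coeffs, proj_coeffs):
    """Camera-minus-projector distortion difference in x.

    cam_coeffs and proj_coeffs are (K1, K2, P1, P2) tuples, giving the
    eight parameters of the full model.
    """
    (xc, yc), (xp, yp) = cam_xy, proj_xy
    k1c, k2c, p1c, p2c = cam_coeffs
    k1p, k2p, p1p, p2p = proj_coeffs
    cam = tangential_x(xc, yc, p1c, p2c) + radial_x(xc, yc, k1c, k2c)
    proj = tangential_x(xp, yp, p1p, p2p) + radial_x(xp, yp, k1p, k2p)
    return cam - proj
```

With identical coefficients and identical coordinates the two terms cancel, which is why only the relative distortion between the camera and projector is observable from disparity.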
To further simplify Equation (11), we used two parameters ($W_1$ and $W_2$) to describe the tangential distortion by correcting the relative orientation between the IR camera and projector lenses, and another two parameters ($W_3$ and $W_4$) to describe the combined radial distortion of the IR camera and the projector, which can be considered as one lens formed by overlaying the IR camera and projector lenses. As the relative orientation between the camera and projector is rigidly fixed and pre-calibrated (mapping function calibration), the y axes of the IR camera and IR projector can be assumed to be identical. The third and fourth terms of Equation (11) represent the radial distortion in the x direction resulting from the IR camera and projector. Because the projector data are unknown, we used the gross combined radial distortion, known as Seidel aberrations [12], and the IR camera's pixel location to assign the x distortion effect. This gives the two constraints of Equation (12), with which Equation (11) simplifies to Equation (13), the fundamental equation that describes the depth sensor distortion with only four parameters. Because $x_c - x_p = d_t$ and $x_c + x_p = 2x_c - d_t$, Equation (13) can be rewritten as Equation (14), which finally gives the full distortion model for the RGB-D sensor, Equation (15). In Equation (15), the distortion model uses four parameters that account for both the radial and tangential distortions of both the IR camera and projector lenses.
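The algebra behind this simplification can be traced with a short worked derivation. It is only a sketch of one form consistent with the description above, not a reproduction of Equations (12) to (15), which may differ in sign or grouping. It assumes (none of this is stated explicitly in the text) that the x-direction tangential term has the Brown form $P_1(r^2 + 2x^2) + 2P_2xy$, that the camera and projector share the tangential coefficients $W_1$ and $W_2$, that $y_c = y_p = y$, and that the difference of the two radial terms is replaced by a single term evaluated at the camera pixel with coefficients $W_3$ and $W_4$.

```latex
\begin{align*}
d_e &= W_1\left[(r_c^2 + 2x_c^2) - (r_p^2 + 2x_p^2)\right]
       + 2W_2\,y\,(x_c - x_p)
       + x_c\left(W_3 r_c^2 + W_4 r_c^4\right) \\
    % r_c^2 - r_p^2 = x_c^2 - x_p^2 because y_c = y_p
    &= 3W_1\left(x_c^2 - x_p^2\right)
       + 2W_2\,y\,(x_c - x_p)
       + x_c\left(W_3 r_c^2 + W_4 r_c^4\right) \\
    % x_c - x_p = d_t and x_c + x_p = 2x_c - d_t
    &= 3W_1\,d_t\left(2x_c - d_t\right)
       + 2W_2\,y\,d_t
       + x_c\left(W_3 r_c^2 + W_4 r_c^4\right)
\end{align*}
```

Under these assumptions the error depends only on the camera pixel location $(x_c, y)$ and the disparity, so the unknown projector coordinates drop out, and the error is linear in $W_1$ to $W_4$.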

RGB-D Sensor Calibration
The calibration procedure was divided into two steps. The first step handles the calibration of the geometric parameters of the RGB and IR cameras, including the focal length, principal point, and distortion parameters, as well as the calibration of the RGB-IR baseline. The second step deals with the depth sensor calibration, whereby the manufacturer's parameters and the parameters of the proposed combined distortion model are computed. In this section, we describe the methodology used for the two calibration steps.

RGB-D Joint Calibration
The methodology for jointly calibrating the RGB and IR cameras is shown in Figure 3. It starts from stereo images produced by the RGB and IR cameras while the IR projector is switched off. After acquiring the stereo images, automatic corner extraction for a conventional A3 checkerboard was performed to extract the image points in each image. Using the ground coordinates of the checkerboard and the extracted image points, a pinhole camera model was applied based on the calibration method discussed in [27], with five distortion parameters for each camera (three, [$K_1$ $K_2$ $K_3$], representing the radial distortion and two, [$P_1$ $P_2$], representing the tangential distortion). After the external and internal parameters of both cameras were optimized, a global refinement was applied to enhance the estimated geometric parameters of each camera as well as the baseline between the RGB and IR cameras using Equation (16) [32]:
$$\min_{K, R_n, T_n} \sum_{n=1}^{N} \sum_{m=1}^{M} \left\| p_{mn} - \hat{p}\left(K, R_n, T_n, P_{mn}\right) \right\|^2 \tag{16}$$
where $N$ is the total number of images, $M$ is the total number of points, $p_{mn}$ is the pixel coordinates of the point, $P_{mn}$ is the ground coordinates of the point, $\hat{p}(\cdot)$ is the projection of the ground point into image $n$, $R_n$ and $T_n$ are the rotation and translation matrices, and $K$ is the camera intrinsic matrix.
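The cost evaluated by such a refinement can be sketched as follows, assuming that Equation (16) is the usual sum of squared reprojection residuals over all images and checkerboard corners. The sketch uses a plain pinhole projection without lens distortion, and all function names and numeric values are hypothetical; a real refinement would also include the distortion terms and would minimize this cost with a nonlinear optimizer.

```python
def project(K, R, T, P):
    """Project ground point P (3-vector) to pixel coordinates (u, v)."""
    # Transform the ground point into the camera frame: Xc = R * P + T.
    xc = [sum(R[i][j] * P[j] for j in range(3)) + T[i] for i in range(3)]
    x, y = xc[0] / xc[2], xc[1] / xc[2]
    # Apply the intrinsic matrix (focal lengths, skew, principal point).
    u = K[0][0] * x + K[0][1] * y + K[0][2]
    v = K[1][1] * y + K[1][2]
    return u, v


def reprojection_cost(K, poses, ground_points, observations):
    """Sum of squared pixel residuals over N images and M points.

    poses[n] is (R_n, T_n); observations[n][m] is the measured pixel of
    ground_points[m] in image n.
    """
    total = 0.0
    for (R, T), image_obs in zip(poses, observations):
        for P, (u_obs, v_obs) in zip(ground_points, image_obs):
            u, v = project(K, R, T, P)
            total += (u_obs - u) ** 2 + (v_obs - v) ** 2
    return total


# Hypothetical setup: one image of a 3 x 3 board, 2 m in front of the camera.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(IDENTITY, [0.0, 0.0, 2.0])]
board = [(0.04 * i, 0.04 * j, 0.0) for i in range(3) for j in range(3)]
observations = [[project(K, IDENTITY, [0.0, 0.0, 2.0], P) for P in board]]
print(reprojection_cost(K, poses, board, observations))
```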

Error and Distortion Model
The parameters calibrated in this step include the manufacturer constants $a$ and $b$ and the distortion parameters of the depth sensor given in Equation (15). In this stage, the IR projector was switched on and the information produced by the depth sensor, in the form of normalized disparity and depth images, was acquired. The manufacturer constants were then recovered by applying Equation (4): using the true depth and the normalized disparity, the constants were estimated by the least squares method, and the normalized disparity was corrected based on the calibrated $a$ and $b$. The proposed distortion model was adopted to model the remaining distortion error, i.e., the difference between the true disparity and the normalized disparity. Using the least squares method with a weight matrix based on depth information, Equation (15) was solved and the $W$ parameters were computed. The methodology of the error model and distortion effect is shown in Figure 4.
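One simple way to recover $a$ and $b$ from pairs of true depth and normalized disparity is to note that the mapping function is linear in inverse depth, $1/Z = a + b\,d^n$, and to fit a straight line. The sketch below does this with an unweighted closed-form regression; this linearization is an assumption made for illustration, since the text does not specify the exact least-squares formulation, and the function name and sample values are hypothetical.

```python
def fit_mapping_function(d_n, z_true):
    """Least-squares fit of 1/Z = a + b * d_n; returns (a, b).

    d_n: normalized disparities; z_true: reference depths in metres.
    The residuals are minimized in inverse depth, not in depth.
    """
    if len(d_n) != len(z_true) or len(d_n) < 2:
        raise ValueError("need at least two (d_n, z) pairs of equal length")
    n = len(d_n)
    inv_z = [1.0 / z for z in z_true]
    mean_d = sum(d_n) / n
    mean_y = sum(inv_z) / n
    sxx = sum((d - mean_d) ** 2 for d in d_n)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(d_n, inv_z))
    b = sxy / sxx
    a = mean_y - b * mean_d
    return a, b


# Synthetic check with hypothetical constants a = 3.3 and b = -0.0029.
samples_d = [400.0, 550.0, 700.0, 850.0, 1000.0]
samples_z = [1.0 / (3.3 - 0.0029 * d) for d in samples_d]
print(fit_mapping_function(samples_d, samples_z))
```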

Experimental Design and Data Collection
Our experiments were designed to obtain the full set of calibration parameters for the RGB-D sensor. The calibration parameters were divided into three sets. The first set was the baseline between the RGB and IR cameras together with the RGB camera's internal parameters. The second set was the calibration of the mapping function (Equation (4)), which absorbs the systematic error resulting from the baseline between the IR camera and projector, the standard depth, and the focal length of the depth sensor. The third set contained the distortion model parameters, which correct the relative distortion resulting from the IR camera and projector. To obtain the true depth of the observed plane, each sensor was attached to an iPad and set up on a movable tripod, and four control points were fixed on each iPad. Six stations were established at 0.50 m intervals from 0.50 m to 3.00 m. For each station, the control points on the iPad were surveyed using a total station, and the distance between the sensor and the observed plane was computed. Between the six stations, we collected several true depth images to enrich the observed dataset. Based on the three parameter sets described above, the data were collected in two phases. In the first phase, we collected stereo image pairs of an ordinary chessboard with the RGB and IR cameras to obtain the first set of calibration parameters. In the second phase, we collected pairs of depth and disparity images of a planar surface, which we used to calibrate the depth sensor for parameter sets two and three. We conducted the experiments on two different Structure Sensors. Table 1 shows the number of images collected to calibrate each sensor in the two phases. Sensor 1 was attached to an iPad Air with serial number (S.N. 26779), while Sensor 2 was attached to an iPad Air 2 with serial number (S.N. 27414).

Calibration Results
The phase 1 data for each sensor were processed to compute the calibrated baseline between the RGB and IR cameras, while the phase 2 data were processed to calibrate the depth sensor. Tables 2 and 3 give the phase 1 results for sensors 1 and 2. The output data are the internal parameters of the RGB and IR cameras, expressed as the camera focal lengths (Fx, Fy) in pixels, the principal point (Cx, Cy) in pixels, and five distortion coefficients ($K_1$, $K_2$, $P_1$, $P_2$, and $K_3$), where the $K$ coefficients model the radial distortion effect and the $P$ coefficients model the tangential distortion effect. The IR-RGB camera baseline is expressed by six parameters: three translations (dx, dy, and dz) in mm and three Euler rotation angles (Rx, Ry, and Rz) in radians. Table 4 shows the default parameters used by the firmware of the Structure Sensor, stated as the internal parameters of the depth sensor and color camera. The focal length and principal point for both sensors are the same, while the distortion parameters of both the RGB and IR cameras are set to zero.
After calibrating the baseline between the RGB and IR cameras and the internal parameters of the RGB camera, the phase 2 data were processed to calibrate the depth sensor. Two steps, the mapping function calibration and the distortion model estimation, were conducted to deliver high-precision depth measurements from the sensor. Table 5 shows the results of the mapping function calibration, where $a$ and $b$ are the mapping parameters of Equation (4). After computing the calibrated mapping function, we modified the measured depth and disparity information to correct the systematic error resulting from the mapping function error, and then computed the distortion model parameters. Figure 5 shows the distortion model parameters for both Structure Sensors.
Figure 5 also supports the main conclusion of the calibration procedure. Although the distortion parameters ($W_1$, $W_2$, $W_3$, and $W_4$) change with the measured distance, each parameter tends toward a constant value beyond 2.50 m. This means that, for the full set of calibration parameters, it could be sufficient to collect depth data with the corresponding disparity images up to 2.50 m.

Accuracy Assessment of the Calibration Models
To examine our calibration methodology as well as the effect of the distortion model on depth accuracy, we captured a new dataset for each sensor and applied the calibration results. To examine the IR-RGB baseline calibration, two images (depth and RGB) were collected and aligned using the calibrated parameters. Figure 6 shows the effect of the calibration parameters.
To examine the performance of the depth calibration parameters, including the calibration of the mapping function and the distortion model, we collected several depth images of a planar surface and applied our calibration parameters. The examination criterion followed the procedure illustrated in [8,9]. With the planes measured by the total station as reference, the RMSE of the fitted plane surface was used to describe the measured depth precision. Figures 7 and 8 show the depth precision evaluation for sensors 1 and 2, respectively. The left side shows the full range performance, and the right side zooms in on the near range.
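A plane-based precision measure of this kind can be sketched as below. The code fits a least-squares plane $z = px + qy + c$ to a set of 3D points and reports the RMSE of the perpendicular point-to-plane distances. Fitting the plane to the points themselves is a simplification made for illustration; `plane_rmse` accepts any plane, so a reference plane from an independent survey could be passed instead. All names are hypothetical.

```python
import math


def _det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(points):
    """Least-squares plane z = p*x + q*y + c; returns (p, q, c)."""
    sxx = sxy = sx = syy = sy = sxz = syz = sz = 0.0
    for x, y, z in points:
        sxx += x * x
        sxy += x * y
        sx += x
        syy += y * y
        sy += y
        sxz += x * z
        syz += y * z
        sz += z
    # Normal equations of the linear model, solved with Cramer's rule.
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(len(points))]]
    rhs = [sxz, syz, sz]
    det = _det3(m)
    if abs(det) < 1e-12:
        raise ValueError("points do not define a unique plane z = p*x + q*y + c")
    coeffs = []
    for col in range(3):
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(_det3(mc) / det)
    return tuple(coeffs)


def plane_rmse(points, plane):
    """RMSE of perpendicular distances from the points to the plane."""
    p, q, c = plane
    scale = math.sqrt(p * p + q * q + 1.0)
    sq = sum(((p * x + q * y + c - z) / scale) ** 2 for x, y, z in points)
    return math.sqrt(sq / len(points))
```

The model $z = px + qy + c$ suits planes that roughly face the sensor; a plane nearly parallel to the viewing axis would need a different parameterization.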
In addition to the planar surface assessment, two perpendicular planar surfaces (part of a wall and a ceiling) were captured using one of the calibrated sensors; the average distance between the sensor and the observed planes was 2.00 m (minimum 1.20 m, maximum 3.00 m). The data were processed using the default parameters provided by the manufacturer and processed again after applying our distortion model. Figure 9 shows the point cloud produced by the RGB-D sensor before and after applying our distortion model. The warp in the wall was clearly removed after calibration, and the distortion at the corners of the depth image was significantly reduced. A comparison of the computed angle before and after applying our distortion model is shown in Table 6. Using different RANSAC thresholds to extract the planes, the recovered average angle is 89.897° ± 0.37° with our method, compared to 90.812° ± 7.17° with the default depth.
Comparing our results with those given in [9], our calibration method achieved nearly similar accuracy for the near range and a significant improvement in accuracy for the far range. In addition, our calibration method is simpler than the method given in [9], and the error model is based on a mathematical model of the lens distortion effect.
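The wall-ceiling angle test can be illustrated with a minimal RANSAC plane extraction followed by the angle between the two plane normals. This is a generic sketch and not the implementation used in the experiments; the iteration count, the seed, the threshold, and the convention of orienting each normal towards the sensor at the origin are all assumptions.

```python
import math
import random


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_through(p1, p2, p3):
    """Plane (unit normal n, offset d) with n.p + d = 0, or None if degenerate."""
    n = _cross(_sub(p2, p1), _sub(p3, p1))
    norm = math.sqrt(_dot(n, n))
    if norm < 1e-12:
        return None  # the three points are (nearly) collinear
    n = tuple(c / norm for c in n)
    d = -_dot(n, p1)
    if d < 0:  # orient the normal towards the sensor at the origin
        n, d = tuple(-c for c in n), -d
    return n, d


def ransac_plane(points, threshold, iterations=200, seed=0):
    """Return the sampled plane with the most inliers within `threshold`."""
    rng = random.Random(seed)
    best, best_count = None, -1
    for _ in range(iterations):
        plane = plane_through(*rng.sample(points, 3))
        if plane is None:
            continue
        n, d = plane
        count = sum(1 for p in points if abs(_dot(n, p) + d) <= threshold)
        if count > best_count:
            best, best_count = plane, count
    return best


def angle_between_normals_deg(n1, n2):
    """Angle in degrees between two unit normals."""
    c = max(-1.0, min(1.0, _dot(n1, n2)))
    return math.degrees(math.acos(c))
```

Repeating the extraction with several thresholds and averaging the recovered angles gives a mean and spread that can be compared before and after applying a depth correction.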

Conclusions and Future Work
In this study, we proposed a new method for calibrating RGB-D sensors. The method can be applied either to systems with three fixed sensors, such as the Kinect, or to separate systems such as the Structure Sensor. The method was fully automated for both steps, which included calibrating the external RGB-IR baseline and modeling the distortion and depth error of the depth sensor. Based on the structured light concept, we also proposed a new distortion error model to compensate for the systematic and distortion effects of the RGB-D sensor. Better accuracy could be achieved for both the near and far ranges, compared with traditional calibration methods that used ordinary stereo calibration to produce a distortion model for the depth sensor or applied an empirical distortion model. Traditional calibration procedures can achieve a significant improvement in depth accuracy for the near range (up to 50%), which our procedure also achieves. For the far range, traditional methods cannot achieve any significant improvement in depth precision, whereas our method achieves an accuracy improvement of around 40%. The full depth range of the RGB-D sensor can extend up to 9.0 m. Future research should focus on reducing the depth uncertainty resulting from the small baseline between the IR sensors to achieve better depth precision in the far range. The stability of the calibration parameters over time and under different light conditions must also be examined.