Calibrate Multiple Consumer RGB-D Cameras for Low-Cost and Efficient 3D Indoor Mapping

Traditional indoor laser scanning trolley/backpacks with multi-laser scanner, panorama cameras, and an inertial measurement unit (IMU) installed are a popular solution to the 3D indoor mapping problem. However, the cost of those mapping suits is quite expensive, and can hardly be replicated by consumer electronic components. The consumer RGB-Depth (RGB-D) camera (e.g., Kinect V2) is a low-cost option for gathering 3D point clouds. However, because of the narrow field of view (FOV), its collection efficiency and data coverages are lower than that of laser scanners. Additionally, the limited FOV leads to an increase of the scanning workload, data processing burden, and risk of visual odometry (VO)/simultaneous localization and mapping (SLAM) failure. To find an efficient and low-cost way to collect 3D point clouds data with auxiliary information (i.e., color) for indoor mapping, in this paper we present a prototype indoor mapping solution that is built upon the calibration of multiple RGB-D sensors to construct an array with large FOV. Three time-of-flight (ToF)-based Kinect V2 RGB-D cameras are mounted on a rig with different view directions in order to form a large field of view. The three RGB-D data streams are synchronized and gathered by the OpenKinect driver. The intrinsic calibration that involves the geometry and depth calibration of single RGB-D cameras are solved by homography-based method and ray correction followed by range biases correction based on pixel-wise spline line functions, respectively. The extrinsic calibration is achieved through a coarse-to-fine scheme that solves the initial exterior orientation parameters (EoPs) from sparse control markers and further refines the initial value by an iterative closest point (ICP) variant minimizing the distance between the RGB-D point clouds and the referenced laser point clouds. The effectiveness and accuracy of the proposed prototype and calibration method are evaluated by comparing the point clouds derived from the prototype with ground truth data collected by a terrestrial laser scanner (TLS). The overall analysis of the results shows that the proposed method achieves the seamless integration of multiple point clouds from three Kinect V2 cameras collected at 30 frames per second, resulting in low-cost, efficient, and high-coverage 3D color point cloud collection for indoor mapping applications.


Introduction
Driven by the miniaturization and light weight of positioning and remote sensing sensors, as well as the need of fusing indoor and outdoor maps for next-generation navigation, 3D indoor mapping that captures the surface characteristics around the points [30]. Sufficient overlap between depth maps/point clouds should be maintained in those approaches to guarantee the convergence of the ICP algorithm. Matterport3D marketed an RGB-D scanner consisting of three RGB-D sensors pointing in three different directions for indoor modeling and virtual reality applications [31]. The sensor calibration is the basis of and the key to constructing an RGB-D sensor array. This paper thus primarily focuses on the calibration of multiple RGB-D sensors.
The intrinsic calibration of single RGB-D sensors solves the camera geometry and the depth correction. Most of the consumer RGB-D sensors (namely the Kinect and Intel RealSense) are originally designed for human-machine interaction in applications such as gaming, where metric measurements are not the focus. Thus, consumer RGB-D sensors need to be calibrated individually to refine the manufacturer's batch calibration parameters [10]. RGB-D sensors are a new kind of sensor. Fewer studies have been done on their intrinsic calibration [32][33][34], compared to the mature camera calibration research. The main reason is that the measurements derived from SL/ToF based RGB-D cameras are distorted by several phenomena. Considering especially the ToF based KinectV2 used in this paper, the analysis of related error sources is reported in existing surveys [35,36]. The IR and color optical sensors inside the Kinect V2 could be modeled using the pinhole camera model and solved by well-established RGB camera calibration methods [9,34,37]. The calibration of the relative pose parameters between the IR and RGB cameras is similar to stereo-camera calibration that could be solved by measuring geometrical targets in the shared FOV (e.g., 2D planar pattern [38,39], circle grid [40], 1D target [41,42]), appearance-based methods [43], or self-calibration used in robot navigation [44][45][46]. For depth calibration, additional depth error models based on 3D look-up tables [47,48] interpolating the related deviations, or on curve approximation with single/multiple B-splines [49,50] or polynomials [49] to model the distance deviations have been designed to enhance the quality of depth data.
The extrinsic calibration of multiple RGB-D sensors calculates the relative rotation (boresight) and translation (lever-arm) between individual sensors. In the context of the extrinsic calibration of multiple RGB-D sensors, existing methods can be classified into three categories: (1) moving calibration targets-based; (2) shared view-based; and (3) camera tracking-based methods. The method based on moving calibration targets solves the exterior orientation parameters (EoPs) of each sensor in the array by moving calibration targets with known trajectory across the view of individual sensors. The trajectory of the calibration target is provided by external optical tracking systems. Beck and Froehlich [51] employ a tracked checkerboard to establish correspondences in RGB-D camera space and the world coordinate defined by the tracking system to register multiple RGB-D sensors into a joint coordinate system. Avetisyan et al. [52] calibrated an array of multiple stationary color and depth cameras using an optical tracking system tracking a checkerboard target with four additional markers attached. The use of an optical tracking system in those methods limits their application scope to where a tracking device is already available (e.g., 3D telepresence systems). The shared view-based method solves the extrinsic camera calibration by the detection and matching of control points in the overlapping regions of different cameras. OmniKinect [53] obtains the extrinsic parameters using a calibration target with Kinect-visible markers that are simultaneously detected in the view of the mounted Kinects. Fernández-Moral et al. [54] use an overlap of features in the surrounding planar environment (e.g., walls, ceiling) for the extrinsic calibration of the range camera. However, these methods have the disadvantage that the overlap field of view constraint must be fulfilled. Besides, it is more complicated to match features in depth images than in RGB images. The multiple camera calibration approaches in this category are inappropriate for camera arrays designed with barely or not overlapping field of view like the one proposed in this paper. Camera tracking-based methods do not require an optical tracking system or a sufficient overlap in field of view. They rely on the portability of the cameras and track the ego-motion of the cameras independently. SLAM or VO techniques are applied to determine the extrinsic parameters of multiple generic cameras [55,56]. The final calibration results are correlated to the robustness and accuracy of the used SLAM/VO algorithm, which depends highly on the environment. For indoor scenes filled with texture-less regions or repetitive geometry, this kind of approach is not the optimal choice.

Methodology
The proposed RGB-D camera array aims to collect 3D point cloud data with auxiliary information (i.e., color) for indoor mapping in an efficient and low-cost manner. The RGB-D camera array is built upon multiple Kinect V2 cameras, and the calibration method solves intrinsic and extrinsic parameters of each Kinect V2 in the array. Two key components are encompassed in the proposed calibration process: (1) Calibrating the RGB/IR camera geometry of each RGB-D camera by a homography-based method using checkerboard, correcting the infrared ray to obtain the calibrated range-to-3D transformation, and modeling range biases with pixel-wise spline line functions to correct the wiggling error along the range measurement by capturing a planar scene (e.g., a wall) from different depths; (2) Solving the initial EoPs of each RGB-D camera from sparse control markers and further refining the initial value by an ICP variant minimizing the distance between the RGB-D point clouds and the referenced laser point clouds. Figure 1 illustrates the RGB-D camera array setup and the framework of the proposed calibration method. The key components of the sensor array and calibration method are described below.
Remote Sens. 2018, 10, x FOR PEER REVIEW 5 of 29 array is built upon multiple Kinect V2 cameras, and the calibration method solves intrinsic and extrinsic parameters of each Kinect V2 in the array. Two key components are encompassed in the proposed calibration process: (1) Calibrating the RGB/IR camera geometry of each RGB-D camera by a homography-based method using checkerboard, correcting the infrared ray to obtain the calibrated range-to-3D transformation, and modeling range biases with pixel-wise spline line functions to correct the wiggling error along the range measurement by capturing a planar scene (e.g., a wall) from different depths; (2) Solving the initial EoPs of each RGB-D camera from sparse control markers and further refining the initial value by an ICP variant minimizing the distance between the RGB-D point clouds and the referenced laser point clouds. Figure 1 illustrates the RGB-D camera array setup and the framework of the proposed calibration method. The key components of the sensor array and calibration method are described below.

RGB-D Camera Array System Setup
There are three types of RGB-D camera, including stereo camera, structured light (SL) camera, and time-of-flight (ToF) camera, distinguished by the measurement principle. The stereo camera and structured light camera use parallax theory to calculate depth, while the ToF camera is based on a beam distance measurement principle, calculating the distance from the travel time of a modulated beam between sensor and object. Among the mainstream ToF camera models, including Microsoft Kinect V2 [57], CubeEye [58], SwissRanger SR4000 [58], and PMD CamCube [59], the Kinect V2 is selected as the sensor array component in this paper for its wider FOV, higher resolution, longer ranging [35], and easier accessibility.
In contrast to the open spaces of outdoor environments, indoor data collections are often impeded by confined and small spaces; therefore, both horizontal and vertical rotation and translation of the camera is required to achieve full data coverage, which leads to a higher risk of feature tracking failure in VO/SLAM algorithms and more data collection/processing workload. Typical indoor environments consist of orthogonal walls and ceilings which follow the Manhattan World Assumption [60]. According to the "Manhattan World" characteristics of the indoor environment, we proposed to mount the sensors vertically with different pitching angles that orient the sensors forward, obliquely upward and downward, as shown in the Figure 1. The pitch angles of the three Kinect V2 sensors were designed to be −50 • , 0 • and 50 • . The roll angles of the sensors were design to be kept consistent at 0 • . There was 10 • overlap in the FOV of the adjacent sensors. The RGB-D array could theoretically provide 160 • × 70 • FOV. According to this design, there will be 89% vertical coverage of the scene in front of the array, which is considered adequate for most indoor environments. The coverage is 2.7 times that using a single RGB-D camera (33% coverage).
All sensors were locked on a tripod bar using a ball head, and connected to a mobile work station through a USB3.0 interface. Every sensor was allocated 5 Gbps bandwidth with a USB3.0 expansion card plugged in separate PCI-E slots; the connection is shown in Figure 2. Since the official software development kit (SDK) for Kinect V2 does not support more than one sensor, the open source drive OpenKinect [21] was used to power the three Kinect v2s to collect RGB-D data streams simultaneously. Although the Kinect V2 can obtain depth values for distances between 4.5 m to 9 m using the OpenKinect driver, those measurements (range > 4.5 m) are considered as low-quality data and are filtered out in the experiments using the threshold of 4.5 m suggested by the official driver. The intrinsic parameters of the Kinect V2 were provided by the manufacturer. However, according to published studies on the Kinect V2 [61], it is recommended to recalibrate those intrinsic parameters to guarantee the data accuracy. Additionally, sensor installation of the array involves assembly errors. Thus, intrinsic calibration of each individual Kinect and extrinsic calibration of the array is essential before use.

RGB-D Camera Array System Setup
There are three types of RGB-D camera, including stereo camera, structured light (SL) camera, and time-of-flight (ToF) camera, distinguished by the measurement principle. The stereo camera and structured light camera use parallax theory to calculate depth, while the ToF camera is based on a beam distance measurement principle, calculating the distance from the travel time of a modulated beam between sensor and object. Among the mainstream ToF camera models, including Microsoft Kinect V2 [57], CubeEye [58], SwissRanger SR4000 [58], and PMD CamCube [59], the Kinect V2 is selected as the sensor array component in this paper for its wider FOV, higher resolution, longer ranging [35], and easier accessibility.
In contrast to the open spaces of outdoor environments, indoor data collections are often impeded by confined and small spaces; therefore, both horizontal and vertical rotation and translation of the camera is required to achieve full data coverage, which leads to a higher risk of feature tracking failure in VO/SLAM algorithms and more data collection/processing workload. Typical indoor environments consist of orthogonal walls and ceilings which follow the Manhattan World Assumption [60]. According to the "Manhattan World" characteristics of the indoor environment, we proposed to mount the sensors vertically with different pitching angles that orient the sensors forward, obliquely upward and downward, as shown in the Figure 1. The pitch angles of the three Kinect V2 sensors were designed to be −50°, 0° and 50°. The roll angles of the sensors were design to be kept consistent at 0°. There was 10° overlap in the FOV of the adjacent sensors. The RGB-D array could theoretically provide 160° × 70° FOV. According to this design, there will be 89% vertical coverage of the scene in front of the array, which is considered adequate for most indoor environments. The coverage is 2.7 times that using a single RGB-D camera (33% coverage).
All sensors were locked on a tripod bar using a ball head, and connected to a mobile work station through a USB3.0 interface. Every sensor was allocated 5 Gbps bandwidth with a USB3.0 expansion card plugged in separate PCI-E slots; the connection is shown in Figure 2. Since the official software development kit (SDK) for Kinect V2 does not support more than one sensor, the open source drive OpenKinect [21] was used to power the three Kinect v2s to collect RGB-D data streams simultaneously. Although the Kinect V2 can obtain depth values for distances between 4.5 m to 9 m using the OpenKinect driver, those measurements (range > 4.5 m) are considered as low-quality data and are filtered out in the experiments using the threshold of 4.5 m suggested by the official driver. The intrinsic parameters of the Kinect V2 were provided by the manufacturer. However, according to published studies on the Kinect V2 [61], it is recommended to recalibrate those intrinsic parameters to guarantee the data accuracy. Additionally, sensor installation of the array involves assembly errors. Thus, intrinsic calibration of each individual Kinect and extrinsic calibration of the array is essential before use.

Intrinsic Calibration of Single RGB-D Sensor
The optical system of the Kinect V2 is composed of a color camera with a resolution of 1920 × 1080 pixels, an infrared camera with resolution of a 512 × 424 pixels, and an infrared (IR) illuminator, as shown in Figure 3. The specifications of the Kinect V2 are listed in Table 1.

Intrinsic Calibration of Single RGB-D Sensor
The optical system of the Kinect V2 is composed of a color camera with a resolution of 1920 × 1080 pixels, an infrared camera with resolution of a 512 × 424 pixels, and an infrared (IR) illuminator, as shown in Figure 3. The specifications of the Kinect V2 are listed in Table 1.  The Kinect V2 utilizes continuous wave (CW) ToF technology to measure the range. This is one of the two main ToF technologies on the market, while the other is pulsed-based ToF technology. The main idea is to actively illuminate the scene using near-infrared (NIR) intensity-modulated periodic light emitted from the IR illuminator. The emitted light is reflected by the scene, and phase detection is used to measure the time it takes light to travel from the IR illuminator to the object and back to the infrared camera, and distance is calculated from the results. For the detailed depth measurement principles of the Kinect v2, see [63]. There are three data streams output from the Kinect V2; namely, the infrared images from IR camera, the depth maps generated from the phase detection utilizing the IR camera, and the color images from the RGB camera. Two lenses-the IR camera lens and the RGB camera lens-are involved in the imaging and range sensing system; thus, individual geometry calibration of the IR and RGB camera is required. To achieve precise overlay of the depth map derived from the IR camera and the RGB image, the relative pose of the IR camera and RGB camera needs to be solved. Despite the above-mentioned camera geometry-related parameters, the depth measurements generated by ToF sensors suffer from random noise and systematic bias [50]. Thus, the single RGB-D sensor calibration is done in two steps: (1) geometry calibration, and (2) depth calibration.

Geometry Calibration
Separate and conjunct calibration of the RGB camera and IR camera is required to determine the camera intrinsic parameter of each camera as well as the relative pose of the two cameras. Precise intrinsic parameters can correct image distortion, and are important to enhance the accuracy of depth and color image fusion [64]. Checkerboard was used as the calibration pattern. The corners of the checkerboard were detected in both RGB images and IR images, and the intrinsic parameters of the cameras were obtained using a homography-based calibration method [65]. The relative pose was calculated by minimizing the pixel projection errors of the corners detected on RGB/IR images.  The Kinect V2 utilizes continuous wave (CW) ToF technology to measure the range. This is one of the two main ToF technologies on the market, while the other is pulsed-based ToF technology. The main idea is to actively illuminate the scene using near-infrared (NIR) intensity-modulated periodic light emitted from the IR illuminator. The emitted light is reflected by the scene, and phase detection is used to measure the time it takes light to travel from the IR illuminator to the object and back to the infrared camera, and distance is calculated from the results. For the detailed depth measurement principles of the Kinect v2, see [63]. There are three data streams output from the Kinect V2; namely, the infrared images from IR camera, the depth maps generated from the phase detection utilizing the IR camera, and the color images from the RGB camera. Two lenses-the IR camera lens and the RGB camera lens-are involved in the imaging and range sensing system; thus, individual geometry calibration of the IR and RGB camera is required. To achieve precise overlay of the depth map derived from the IR camera and the RGB image, the relative pose of the IR camera and RGB camera needs to be solved. Despite the above-mentioned camera geometry-related parameters, the depth measurements generated by ToF sensors suffer from random noise and systematic bias [50]. Thus, the single RGB-D sensor calibration is done in two steps: (1) geometry calibration, and (2) depth calibration.

Geometry Calibration
Separate and conjunct calibration of the RGB camera and IR camera is required to determine the camera intrinsic parameter of each camera as well as the relative pose of the two cameras. Precise intrinsic parameters can correct image distortion, and are important to enhance the accuracy of depth and color image fusion [64]. Checkerboard was used as the calibration pattern. The corners of the checkerboard were detected in both RGB images and IR images, and the intrinsic parameters of the cameras were obtained using a homography-based calibration method [65]. The relative pose was calculated by minimizing the pixel projection errors of the corners detected on RGB/IR images.
We used the pinhole camera model and the Brown's distortion model [66] to describe the intrinsic parameters of the camera. The normalized coordinate of a 3D point on the RGB/IR image plane is denoted as x n = [x, y, 1] T . The Brown distortion is modeled as: where r 2 = x 2 + y 2 , (x u , y u ) is the corrected coordinate,(k 1 , k 2 , k 3 ) and (p 1 , p 2 ) are vectors containing the radial and tangential distortion coefficients, respectively. The pixel coordinate p = [u, v] T is calculated as:  where K is the intrinsic matrix that includes the focal length f = f x , f y and principal point Utilizing the above-mentioned camera and distortion model, the intrinsic parameters are solved by the homography based calibration method [65] with the checkerboard pattern. To precisely overlay the color texture onto the depth image, the relative pose parameters between the RGB and IR camera need to be calculated accurately. The transformation between the RGB and IR image coordinate systems can be written as: where C ir and C c are the image coordinate system of the RGB and IR cameras, respectively, R c ir ,T c ir are the rotation and translation parameters of the RGB camera with respect to the IR camera. For a 3D point X in the world coordinate, let its projection onto the ith IR image be p ir i and that onto the ith RGB image be p c i . The pinhole imaging process can be formulated as: where K ir and K c are, respectively, the intrinsic camera matrix of the RGB/IR camera as defined in Equation (3). The radial distortion and tangential distortion are modeled as in Equations (1)- (2). In light of Equations (4)-(6), the rotation matrix R ir2c and the translation matrix T ir2c calculated according to the ith image can be written as: where R ir ,T ir ,[R c ,T c ] are assigned as the extrinsic parameters of the ith image. Thus, for each RGB and IR image pair with conjunct checkerboard corner observation, R ir2c ,T ir2c can be calculated.
To obtain the optimized R ir2c ,T ir2c utilizing the whole calibration image set rather than using a single image, the pixel projection errors minimization process is illustrated as in Equation (8): where p i ir and p i c are the detected checkerboard corner in the ith IR and RGB image, respectively. The p i c is calculated as in Equation (9): Remote Sens. 2018, 10, 328 9 of 28 The w ir , w c are the weights of the projection errors of the IR images and RGB images, which are determined by the resolution ratio of the IR and RGB cameras. The RGB image size is 1920 × 1080 pixels and the focal length of the camera ( f c ) is about 1047 pixels. The IR image size is 512 × 424 pixels, and the focal length of the camera ( f ir ) is about 360 pixels. The layout of the IR and RGB camera inside the Kinect V2 is shown in Figure 4. According to the similar triangle principle, the ratio of the Res ir2c and Res ir can be formulated as: Res ir is set to be the baseline; that is, 1. f c and f ir are set as 1047 pixels and 360 pixels, respectively. Res ir2c , which means the number of pixels that one IR pixel projected on the color image, can be calculated as Res ir2c = Res ir f c / f ir = 2.91. The result means that there are 2.91 color pixels corresponding to 1 pixel on the IR image. Thus,w c is set as 1 and w ir is set as 2.91, which means 2.96 pixel error in a color image is considered as the same as 1-pixel error on an IR image. Levenberg-Marquardt optimization is used to minimize Equation (8) to find the best relative placement parameters R ir2c ,T ir2c between the IR and RGB cameras.
The ir w , c w are the weights of the projection errors of the IR images and RGB images, which are determined by the resolution ratio of the IR and RGB cameras. The RGB image size is 1920 × 1080 pixels and the focal length of the camera ( c f ) is about 1047 pixels. The IR image size is 512 × 424 pixels, and the focal length of the camera ( ir f ) is about 360 pixels. The layout of the IR and RGB camera inside the Kinect V2 is shown in Figure 4. According to the similar triangle principle, the ratio of the

Depth Calibration
The ToF-based Kinect V2 is an active imaging system that uses optics to focus the emitted IR lights onto the chip to generate the range measurements. In spite of the typical optical error sources (e.g., shifted optical center and lens distortion) that can be corrected by the abovementioned geometry calibration, there are several other error sources that will affect the accuracy of its 3D measurements. A detailed review is given by [36]. The listed error sources of the ToF camera (e.g., Kinect V2) include temperature drift, multi-path effects, systematic distance error, intensity-related distance error, etc. Major parts of the error sources are environment-related (e.g., intensity-related distance error), and thus cannot be solved through calibration processes. Existing research on depth calibration of the ToF RGB-D camera mainly deals with the systematic range bias, also known as the wiggling error [7,9,49,50]. The depth calibration proposed in this paper is a two-step process: (1) Infrared ray correction solves the correct range measurements to 3D point clouds transformation according to the geometry calibration results to remove the optical-related errors affecting the final point clouds; (2) Range biases are modeled with pixel-wise spline line functions to correct the error along the range measurement.

Depth Calibration
The ToF-based Kinect V2 is an active imaging system that uses optics to focus the emitted IR lights onto the chip to generate the range measurements. In spite of the typical optical error sources (e.g., shifted optical center and lens distortion) that can be corrected by the abovementioned geometry calibration, there are several other error sources that will affect the accuracy of its 3D measurements. A detailed review is given by [36]. The listed error sources of the ToF camera (e.g., Kinect V2) include temperature drift, multi-path effects, systematic distance error, intensity-related distance error, etc. Major parts of the error sources are environment-related (e.g., intensity-related distance error), and thus cannot be solved through calibration processes. Existing research on depth calibration of the ToF RGB-D camera mainly deals with the systematic range bias, also known as the wiggling error [7,9,49,50]. The depth calibration proposed in this paper is a two-step process: (1) Infrared ray correction solves the correct range measurements to 3D point clouds transformation according to the geometry calibration results to remove the optical-related errors affecting the final point clouds; (2) Range biases are modeled with pixel-wise spline line functions to correct the error along the range measurement.
(1) Infrared ray correction A depth map which stores the range measurements in polar coordinate system is the raw output of a ToF RGB-D camera. The 3D point clouds that are required in applications such as indoor mapping Remote Sens. 2018, 10, 328 10 of 28 are actually transformed values represented in 3D Cartesian coordinate system. The transformation applies the perspective projection relationships between the object points in the world coordinate and corresponding imaging pixels on the image plane. Thus, it is related to the camera's intrinsic parameters. This transformation of the Kinect V2 is predefined by the manufacturer in a form of mapping function provided by the official/open source SDK, and is in accord with the factory camera intrinsic calibration results. However, because of the mechanical differences in the manufacturing process, the predefined transformation/camera intrinsic calibration is inaccurate to individual sensors. Therefore, we recalculate the transformation according to the intrinsic parameters estimated by the abovementioned geometry calibration so that its infrared ray direction coincides with the calibrated camera model before further range errors correction.
For a pixel [u, v] T , the ray direction is defined by its normalized undistorted coordinate [x u , y u ] T in the image coordinate system that is calculated by the intrinsic parameters (k 1 , The range measurement is denoted as R. The transformation between the ranges and the 3D point clouds can be written as Equation (11), following the perspective projection principles: where R bias is the range bias and s = 1/ x 2 u + y 2 u + 1 which is a normalization factor. R bias is a small value with respect to the measured range R, and is thus considered as zero in the transformation calculation.
(2) Range bias correction The main systematic errors = of a ToF sensor (including Kinect V2) is the systematic wiggling error that moves the range measurements toward or away from the sensor [49]. The Kinect V2 suffers from this kind of error in its range measurements because it calculates the distances from the phase difference of the returned signal. The reference signal is based on the theoretical assumption of a sinusoidal signal shape. Due to the limited electronics, the shapes of the modulated signals are not exactly sinusoidal, and in reality contain odd harmonic components. As a result, range bias appears as shown in Figure 5 (left column). The systematic wiggling range bias R bias has been proven to be related to the object distance R and spatial distribution of the pixel location (x, y) deriving the range measurement on the image plane [9,50], and thus the range bias can be formulated as: Existing studies on correcting the systematic wiggling error of the ToF sensors R bias mainly focus on modeling the range bias of all the measurements using a single global B-spline function [67] or clustering the measurements into groups in order to model the biases using multiple B-spline functions [50] on different clusters. ToF cameras including the Kinect V2 are working under active IR illumination. Because of the IR light cone, the illumination of the scene is not homogeneous, so the range bias at each pixel is different. In this context, the range biases are considered per-pixel unique and independent. Thus, we proposed to fit the biases with pixel-wise B-spline functions through a training scheme. The range bias of all the pixels on the image plane at different ranges can be written as a range bias matrix R bias Matrixs: bias of pixel location (m, n) at different ranges (R) is fitted with a B-splines function Bspline mn (·), as written by Equation (14): Thus, in light of Equations (12)- (14), taking R as the independent variables, R bias Matrixs can be rewritten as a matrix of continuous B-splines functions (BSM(·)) as in Equation (15).
To solve the w · h B-splines functions in Equation (15), range errors were measured at several discrete distances (R) in a controlled environment to minimize other error sources, such as temperature drift (warm-up time: 30 min). To estimate the range biases, a wall was chosen as the planar observation target, as proposed by most calibration studies [9,50]. The wall was observed by the Kinect V2 at different distances from 0.818 m to 2.674 m, with regular step equal to 0.116 m. At each distance, 100 data frames were captured and averaged to minimize the influence of the random noise on the range measurement to obtain a reliable representative frame of the wall at the observed distance. The acquired range measurements were modeled as a plane through a RANSAC approach [68]. The distance between the range measurement and the fitted plane along the ray was taken as the range errors/range biases. The observed range errors were thus taken as the range biases and then used to fit B-splines functions at each pixel location. Figure 5 shows the range bias correction at different pixel locations (along the diagonal of the depth map) utilizing pixel-wise B-spline functions. The range biases before the correction ( Figure 5, left row) were not uniform and randomly distributed around the statistically estimated distance, showing a wide range of variance. While, after correction ( Figure 5 right row), the range biases were corrected by pixel-wise B-splines functions so that the distributions of the biases were zero-mean Gaussian-like. However, the range bias correction process does not change the case that corner pixels are less accurate than the center ones ( Figure 5 right row).
After the range bias correction, the wiggling range offsets were removed and the error distribution represented characteristics of a zero-mean Gaussian. The correction enhanced the data quality derived from the Kinect V2 ToF sensor and thus develops its potential of being better adopted in applications such as close-range archaeology modeling [9] and indoor SLAM [10], where high depth accuracy is essential. Once the individual geometry and depth calibration of the Kinect V2s were completed, the exterior orientation parameters (EoPs) of the Kinect V2s in the RGB-D sensor array were calculated in a coarse-to-fine approach.

R Matrixs
can be rewritten as a matrix of continuous B-splines functions ( ( ) BSM ⋅ ) as in Equation (15). 11 1 To solve the w h ⋅ B-splines functions in Equation (15), range errors were measured at several discrete distances ( R ) in a controlled environment to minimize other error sources, such as temperature drift (warm-up time: 30 min). To estimate the range biases, a wall was chosen as the planar observation target, as proposed by most calibration studies [9,50]. The wall was observed by the Kinect V2 at different distances from 0.818 m to 2.674 m, with regular step equal to 0.116 m. At each distance, 100 data frames were captured and averaged to minimize the influence of the random noise on the range measurement to obtain a reliable representative frame of the wall at the observed distance. The acquired range measurements were modeled as a plane through a RANSAC approach [68]. The distance between the range measurement and the fitted plane along the ray was taken as the range errors/range biases. The observed range errors were thus taken as the range biases and then used to fit B-splines functions at each pixel location. Figure 5 shows the range bias correction at different pixel locations (along the diagonal of the depth map) utilizing pixel-wise B-spline functions. The range biases before the correction ( Figure 5, left row) were not uniform and randomly distributed around the statistically estimated distance, showing a wide range of variance. While, after correction ( Figure 5 right row), the range biases were corrected by pixel-wise B-splines functions so that the distributions of the biases were zero-mean Gaussian-like. However, the range bias correction process does not change the case that corner pixels are less accurate than the center ones ( Figure 5 right row).

Extrinsic Calibration of RGB-D Sensor Array
The point clouds from individual Kinect V2s are in their own 3D Cartesian coordinate system defined by Equation (11). In order to fuse the three individual data streams from the Kinect V2s, the EoPs of the Kinect V2s with respect to each other need to be solved. The three Kinect V2s of the proposed sensor array were mounted on a rig (shown by Figure 1) through fixing screws on the bases of the sensor. Because of the mechanical error and the flexible mounting base of the Kinect, the relative pose of the sensor cannot be accurately measured beforehand. Thus, we propose a coarse-to-fine extrinsic calibration method to solve the relative poses of the Kinects in the RGB-D camera array. The proposed method is twofold: (1) a calibration field equipped with sparse high-reflectance markers is set up and measured using a high-precision terrestrial laser scanner (TLS), and the coarse EoPs of the Kinects are calculated through control marker pairs in RGB-D and laser point clouds; (2) the coarse EoPs are then refined by a variant of ICP which minimizes the distances between the points clouds from each of the Kinects and their corresponding nearest points in the TLS point clouds to find the optimal transformation parameters. The Cartesian coordinate system of the RGB-D point clouds derived from the horizontally placed RGB-D camera is taken as the reference of the other two Kinects. The coordinate transformation between the upward (C a ), forward (C b ), and downward (C b ) camera in normalized coordinates can be described as in Equation (16). The C b which originated at the principle point of the IR camera of the forward Kinect is taken as the reference coordinate frame of the array: where RT 1 /RT 2 is a 4 × 4 transformation matrix that rotates and translates the coordinate system of the upward and downward RGB-D cameras to the forward RGB-D camera reference frame. At least three pairs of conjugate points are needed to estimate a single coordinate transformation formulated by Equation (16). A typical indoor scene was chosen as the calibration field to make the process practical and easy to set up. Control point patterns made from rough rubber and smooth high-reflective plastic that guarantee high significance and distinctness in both point clouds and images were placed in the scene. The geometry of the calibration field was collected using a TLS Riegl VZ-400 in the form of dense point clouds and taken as the ground truth. Let C t be the reference frame of the TLS point clouds. The transformation of the coordinate system of the three RGB-D cameras to C t is formulated as Equation (17): For point X b in C b , its conjugate point X t in C t is calculated as Equation (18).
To solve the transformation between the C b and C t (RT tb ) in Equation (18), conjugate control point patterns in laser point cloud and RGB-D point clouds are chosen, and singular value decomposition (SVD) [69] is applied to calculate transformation parameters using the picked control point pairs. The transformation between the C a , C t (RT ta ) and C c , C t (RT tc ) is solved in the same manner.
The precision of the registration is not enough with only the manually measured reference target [36,70]. The density of the RGB-D point clouds are low compared to the TLS point clouds. The precision of the manually selected control point is limited, especially in RGB-D point clouds. The number of markers used as the control point targets are confined by the space of the calibration field, and thus are sparsely distributed. Moreover, affected by the multi-path interference effects [71] and intensity-related depth error, the center points of the control point patterns are deformed in RGB-D point clouds derived from the ToF camera. In all, it is difficult to achieve high-precision extrinsic calibration through a small number of handpicked control targets. However, the calculated results (RT c ta , RT c tb , RT c tc ) can be taken as the initial transformation to roughly align the RGB-D point clouds and the TLS point clouds for further refinement.
In this case, redundant measurements are required to reduce the coarse calibration errors. In this study, to refine the coarse transformation, a variant of standard ICP is proposed. It minimizes the distances between the RGB-D point clouds and their corresponding nearest points in the TLS laser point clouds, considering the raw point clouds (P RGB−D , P TLS ) as well as the manually picked control points (CP RGB−D , CP TLS ), and thus finds the optimal transformation parameters. The refinement is written as Equation (19): where η i , η j is the normal vector at P i TLS , CP  (19)) which is a modification of the originally ICP error function is minimized with adjustable weights of the raw point clouds and the control points to acquire the optimal transformation. The weight λ is empirically set to 0.5, which means that the control targets and the raw point clouds contribute equally in the minimization process. The abovementioned two steps are iteratively repeated until convergence, resulting in the desired transformation.
The ICP variant refinement is applied for each of the RGB-D cameras in the array, thus the refined transformations (RT f ta , RT f tb , RT f tc ) are solved. The desired transformation RT 1 , RT 2 can be written as Equation (20): The relative EoPs of the three Kinect V2s in the array are finally calibrated into a common reference frame. The individual three RGB-D data streams are co-registered for efficient data acquisition.

Intrinsic Calibration Results of Single RGB-D Sensor
A 5 × 7 checkerboard pattern with voxel size of 0.03 m was used in the optical-related geometry calibration of the RGB-D camera ( Figure 6) following the proposed method. In total, 1041 pairs of RGB and IR images were collected for the calibration of the three Kinect v2s, as listed in Table 2. The sync images stand for simultaneously collected color and infrared image pairs, and are used for the conjunct calibration of the RGB and IR camera that solves the relative pose between them. Because of their different FOVs, it is difficult to maintain good checkerboard coverage in both the IR and RGB images using the sync images. Thus, separate datasets are used for the IR/RGB camera intrinsic calibration and the estimation of the relative pose parameters between them. The calibrated intrinsic parameters of the three pairs of IR/RGB cameras are shown in Table 3. The pixel projection errors of the detected checkerboard corners are used to evaluate the accuracy of the calibration results. It is also taken as the criteria to determine whether the corresponding images should be included in the next iteration of the calibration process, in order to identify the outlier caused by image blur. The projection error threshold is set to 1 pixel to detect the outliers. The accuracy of the optical-related geometry calibration is shown in Table 4. parameters of the three pairs of IR/RGB cameras are shown in Table 3. The pixel projection errors of the detected checkerboard corners are used to evaluate the accuracy of the calibration results. It is also taken as the criteria to determine whether the corresponding images should be included in the next iteration of the calibration process, in order to identify the outlier caused by image blur. The projection error threshold is set to 1 pixel to detect the outliers. The accuracy of the optical-related geometry calibration is shown in Table 4.        To reduce the range measurement noise inherent to the sensor, an amount of successive data frames from the Kinect were averaged to deliver the depth map used in the following calibration process. Figure 7a shows the original range errors distributions rendered in color of a random selected depth map frame at 1.232 m. The depth map frame generated by averaging 100 successive frames is depicted in Figure 7b. The salt-and-pepper-like noise on the single frame were reduced after the averaging procedure. Figure 7c illustrates the observed standard deviations along the horizontal center line of the depth map by averaging 20 successive frames (blue) and 100 successive frames (red). It is learned from the statistics that increasing the frame amount makes the range errors uniform rather than enhancing the precision. Similar conclusions can be found in existing studies [73,74]. Considering both the data quality and data gathering efficiency, the amount of averaging successive depth map frames was empirically set to 100 frames (3.3 s observation) in the calibration process of this study. center line of the depth map by averaging 20 successive frames (blue) and 100 successive frames (red). It is learned from the statistics that increasing the frame amount makes the range errors uniform rather than enhancing the precision. Similar conclusions can be found in existing studies [73,74]. Considering both the data quality and data gathering efficiency, the amount of averaging successive depth map frames was empirically set to 100 frames (3.3 s observation) in the calibration process of this study.
(a) Range errors in single frame.
(b) Range errors after averaging 100 successive frames.
(c) Color-coded standard deviation calculated using different numbers of successive data frames. After the geometry calibration of the individual Kinect V2, the depth calibration process proposed in Section 3.2 was applied to the three Kinect V2s to correct the range measurements deriving from the sensor. Figure 8 illustrates the range biases distribution in different view directions before and after the depth calibration process of randomly selected averaged data frames from one of the three Kinect V2s. The observation distance was 1.208 m. Range biases greater than 15 mm were taken as outliers and were not included in the statistical analysis. The statistics of the range biases before and after the depth calibration are listed in Table 5. As it can be learned from the calibration After the geometry calibration of the individual Kinect V2, the depth calibration process proposed in Section 3.2 was applied to the three Kinect V2s to correct the range measurements deriving from the sensor. Figure 8 illustrates the range biases distribution in different view directions before and after the depth calibration process of randomly selected averaged data frames from one of the three Kinect V2s. The observation distance was 1.208 m. Range biases greater than 15 mm were taken as outliers and were not included in the statistical analysis. The statistics of the range biases before and after the depth calibration are listed in Table 5. As it can be learned from the calibration results: (1) the range biases were reduced after the calibration; the percentage of low-bias points was much higher than that before calibration. (2) The originally randomly distributed range biases were corrected so that their distributions were more uniform. (3) Seldom outliers appeared on the lower-right section of the depth map. These could be caused by a defect in the infrared image sensor or miscalculations derived from the OpenKinect drive. results: (1) the range biases were reduced after the calibration; the percentage of low-bias points was much higher than that before calibration. (2) The originally randomly distributed range biases were corrected so that their distributions were more uniform. (3) Seldom outliers appeared on the lowerright section of the depth map. These could be caused by a defect in the infrared image sensor or miscalculations derived from the OpenKinect drive.
(a) Absolute range biases distribution in the XY plane.

Extrinsic Calibration Results of the RGB-D Camera Array
After the intrinsic calibration of each sensor in the array, the EoPs of each sensor were solved following the coarse-to-fine calibration method proposed in Section 3.3. Figure 9 illustrates the RGB-D camera array extrinsic calibration process. The color point clouds from the individual Kinects are in their own sensor coordinate reference frame. To fuse the three data streams (Figure 9a), extrinsic calibration of the sensors in the array was applied to generate the final data fusion results (Figure 9c).

Extrinsic Calibration Results of the RGB-D Camera Array
After the intrinsic calibration of each sensor in the array, the EoPs of each sensor were solved following the coarse-to-fine calibration method proposed in Section 3.3. Figure 9 illustrates the RGB-D camera array extrinsic calibration process. The color point clouds from the individual Kinects are in their own sensor coordinate reference frame. To fuse the three data streams (Figure 9a), extrinsic calibration of the sensors in the array was applied to generate the final data fusion results (Figure 9c). The calibration field was built on a typical layout of an indoor scene that was a 2.5 × 2 × 3 m space consisting of two 90 degree intersected vertical walls and horizontal ceiling/floors (Figure 10a). Twenty high-contrast control targets were mounted on different surfaces facing various directions in the scene. The control targets were made of low-reflection black rough rubber and high-contrast smooth white plastic. Thus, the color contrast and the laser reflection intensity contrast were both taken into account for better significance in color RGB-D and laser point clouds. The snapshots of the control targets in the color images, TLS point clouds, and RGB-D point clouds are shown in Figure  10b The calibration field was built on a typical layout of an indoor scene that was a 2.5 × 2 × 3 m space consisting of two 90 degree intersected vertical walls and horizontal ceiling/floors (Figure 10a). Twenty high-contrast control targets were mounted on different surfaces facing various directions in the scene. The control targets were made of low-reflection black rough rubber and high-contrast smooth white plastic. Thus, the color contrast and the laser reflection intensity contrast were both taken into account for better significance in color RGB-D and laser point clouds. The snapshots of the control targets in the color images, TLS point clouds, and RGB-D point clouds are shown in Figure 10b-d.
Each sensor in the array can simultaneously observe more than four high-contrast control targets that are not coplanar. The reference point clouds (Figure 11a) were collected by a Riegl VZ-400 scanner with an angle resolution set to 0.02 degrees. The control targets were measured in both the RGB-D and the TLS point clouds to construct conjugate control target pairs to solve the initial transformation between the individual sensors in the array, according to the method described in Section 3.3. The coarse alignment between the reference laser point clouds and the RGB-D point clouds is illustrated in Figure 11b,e (left column). The details in the overlaying regions between the Kinects in the array after coarse calibration are illustrated in the first and third picture of Figure 11d. From visual inspection, the RGB-D point clouds are not blending in the laser point clouds; thus, showing the registration between the point clouds is not accurate. The misalignments are mainly caused by the imprecise control target selection, which cannot be overcome due to the limitation of the point clouds density, range, and reflectance resolution. Therefore, the coarse calibration of the array needs to be improved.
The coarse calibration parameters were used to convert the sensor data from the sensor coordinate system to the array coordinate system and taken as the initial alignments between the point clouds in the next ICP refinement. This initial calibration was further refined according to the ICP variant algorithm in Section 3.3. Figure 11c,e (right column) show the results of overlaying the RGB-D sensor array point clouds and the reference laser point clouds after refinement. The 2nd and 4th pictures of Figure 11d show that the misalignment in the overlaying regions between the Kinects were eliminated after the refinement. Figure 12 depicts the point clouds registration details before (middle row) and after refinement (bottom row). It can be learned from the registration results that the refining process eliminated the major registration errors existing in the coarse calibration step. The calibration field was built on a typical layout of an indoor scene that was a 2.5 × 2 × 3 m space consisting of two 90 degree intersected vertical walls and horizontal ceiling/floors (Figure 10a). Twenty high-contrast control targets were mounted on different surfaces facing various directions in the scene. The control targets were made of low-reflection black rough rubber and high-contrast smooth white plastic. Thus, the color contrast and the laser reflection intensity contrast were both taken into account for better significance in color RGB-D and laser point clouds. The snapshots of the control targets in the color images, TLS point clouds, and RGB-D point clouds are shown in Figure  10b-d. Each sensor in the array can simultaneously observe more than four high-contrast control targets that are not coplanar. The reference point clouds (Figure 11a) were collected by a Riegl VZ-400 scanner with an angle resolution set to 0.02 degrees. The control targets were measured in both the RGB-D and the TLS point clouds to construct conjugate control target pairs to solve the initial transformation between the individual sensors in the array, according to the method described in Section 3.3. The coarse alignment between the reference laser point clouds and the RGB-D point clouds is illustrated in Figure 11b,e (left column). The details in the overlaying regions between the Kinects in the array after coarse calibration are illustrated in the first and third picture of Figure 11d. From visual inspection, the RGB-D point clouds are not blending in the laser point clouds; thus, showing the registration between the point clouds is not accurate. The misalignments are mainly caused by the imprecise control target selection, which cannot be overcome due to the limitation of the point clouds density, range, and reflectance resolution. Therefore, the coarse calibration of the array needs to be improved. Each sensor in the array can simultaneously observe more than four high-contrast control targets that are not coplanar. The reference point clouds (Figure 11a) were collected by a Riegl VZ-400 scanner with an angle resolution set to 0.02 degrees. The control targets were measured in both the RGB-D and the TLS point clouds to construct conjugate control target pairs to solve the initial transformation between the individual sensors in the array, according to the method described in Section 3.3. The coarse alignment between the reference laser point clouds and the RGB-D point clouds is illustrated in Figure 11b,e (left column). The details in the overlaying regions between the Kinects in the array after coarse calibration are illustrated in the first and third picture of Figure 11d. From visual inspection, the RGB-D point clouds are not blending in the laser point clouds; thus, showing the registration between the point clouds is not accurate. The misalignments are mainly caused by the imprecise control target selection, which cannot be overcome due to the limitation of the point clouds density, range, and reflectance resolution. Therefore, the coarse calibration of the array needs to be improved. The coarse calibration parameters were used to convert the sensor data from the sensor coordinate system to the array coordinate system and taken as the initial alignments between the point clouds in the next ICP refinement. This initial calibration was further refined according to the ICP variant algorithm in Section 3.3. Figure 11c,e (right column) show the results of overlaying the RGB-D sensor array point clouds and the reference laser point clouds after refinement. The 2nd and 4th pictures of Figure 11d show that the misalignment in the overlaying regions between the Kinects were eliminated after the refinement. Figure 12 depicts the point clouds registration details before (middle row) and after refinement (bottom row). It can be learned from the registration results that the refining process eliminated the major registration errors existing in the coarse calibration step. The coarse calibration parameters were used to convert the sensor data from the sensor coordinate system to the array coordinate system and taken as the initial alignments between the point clouds in the next ICP refinement. This initial calibration was further refined according to the ICP variant algorithm in Section 3.3. Figure 11c,e (right column) show the results of overlaying the RGB-D sensor array point clouds and the reference laser point clouds after refinement. The 2nd and 4th pictures of Figure 11d show that the misalignment in the overlaying regions between the Kinects were eliminated after the refinement. Figure 12 depicts the point clouds registration details before (middle row) and after refinement (bottom row). It can be learned from the registration results that the refining process eliminated the major registration errors existing in the coarse calibration step. To quantify the accuracy of the calibration results of the proposed method, the difference in terms of point-to-point distance between the RGB-D array point clouds and the reference TLS point clouds were statistically analyzed. For each point in the TLS point clouds, a local plane patch was fitted according to its neighborhood points, and its normal was calculated. The closest point in the RGB-D array point clouds along the normal direction was considered as the conjugate point of the TLS point. The registration error (the Euclidean distance between these two points) was taken as the calibration accuracy measurement. Figure 13 shows the registration errors rendered in color between the RGB-D array point clouds and TLS point clouds. The color encoding addresses the value of the point clouds' registration error measurements. The error values are represented from blue to red. As can be learned from visual inspection, significant registration errors still existed after coarse calibration (green coded points), and the fine registration eliminated most of the significant errors, resulting in the final calibration parameters. Table 6 lists the statistical results of the registration errors in the coarse and fine calibration stages. The RMSE and mean residual error dropped to 0.01 m and 0.005 m after the refinement. The histogram of the registration errors distribution is shown in Figure 14. The peak of the histogram was found to be close to zero, and 95% of the registration errors were smaller than 0.025 m. The locations where large registration errors existed were the object edges or high-reflectance surfaces (e.g., metal surfaces).The Kinect V2 is known deliver inaccurate depth measurements in those spots [36]. Points on the large areas of the horizontal and vertical plane were in the low error category. The improvement of the fine calibration process was achieved by not only considering the sparse control points, but also the whole RGB-D point clouds.
To quantify the accuracy of the calibration results of the proposed method, the difference in terms of point-to-point distance between the RGB-D array point clouds and the reference TLS point clouds were statistically analyzed. For each point in the TLS point clouds, a local plane patch was fitted according to its neighborhood points, and its normal was calculated. The closest point in the RGB-D array point clouds along the normal direction was considered as the conjugate point of the TLS point. The registration error (the Euclidean distance between these two points) was taken as the calibration accuracy measurement. Figure 13 shows the registration errors rendered in color between the RGB-D array point clouds and TLS point clouds. The color encoding addresses the value of the point clouds' registration error measurements. The error values are represented from blue to red. As can be learned from visual inspection, significant registration errors still existed after coarse calibration (green coded points), and the fine registration eliminated most of the significant errors, resulting in the final calibration parameters. Table 6 lists the statistical results of the registration errors in the coarse and fine calibration stages. The RMSE and mean residual error dropped to 0.01 m and 0.005 m after the refinement. The histogram of the registration errors distribution is shown in Figure  14. The peak of the histogram was found to be close to zero, and 95% of the registration errors were smaller than 0.025 m. The locations where large registration errors existed were the object edges or high-reflectance surfaces (e.g., metal surfaces).The Kinect V2 is known deliver inaccurate depth measurements in those spots [36]. Points on the large areas of the horizontal and vertical plane were in the low error category. The improvement of the fine calibration process was achieved by not only considering the sparse control points, but also the whole RGB-D point clouds.

Indoor Mapping with the Calibrated RGB-D Camera Array
Typical indoor scenes were chosen for the RGB-D indoor mapping with the calibrated sensor array. The RGB-D camera array proposed in this paper is not equipped with any position and orientation sensors such as IMU. To solve the indoor localization problem, the real-time SLAM system of the real-time appearance-based mapping (RTAB-Map) [26,75,76] is adopted in this study. The RTAB-Map is a RGB-D SLAM solution that includes monocular vision odometry based on extracting key-points in key frames, an incremental appearance-based loop closure detector using bag of words (BoW) to distinguish whether the current location has been visited, and a global graph optimizer to achieve the bundle adjustment of the estimated pose parameters according to the found loop. However, RTAB-Map currently lacks support for three-band sensor array data input. Therefore,

Indoor Mapping with the Calibrated RGB-D Camera Array
Typical indoor scenes were chosen for the RGB-D indoor mapping with the calibrated sensor array. The RGB-D camera array proposed in this paper is not equipped with any position and orientation sensors such as IMU. To solve the indoor localization problem, the real-time SLAM system of the real-time appearance-based mapping (RTAB-Map) [26,75,76] is adopted in this study. The RTAB-Map is a RGB-D SLAM solution that includes monocular vision odometry based on extracting key-points in key frames, an incremental appearance-based loop closure detector using bag of words (BoW) to distinguish whether the current location has been visited, and a global graph optimizer to achieve the bundle adjustment of the estimated pose parameters according to the found loop. However, RTAB-Map currently lacks support for three-band sensor array data input. Therefore, only the horizontally mounted RGB-D camera output stream was taken as the input of the SLAM algorithm; that is to say, the array exterior orientation parameters were solved by the horizontally mounted RGB-D data frame. The other two RGB-D data streams were converted to the mapping coordinate system of the RTAB-Map according to the calibration parameters calibrated in the aforementioned process. Typical indoor scenes' point clouds collected by the proposed system are illustrated in Figure 15. It can be seen that the completeness of the point clouds is well maintained (especially the floor and ceiling). Benefiting from the large field view produced by the calibrated three-RGB-D camera structure, the proposed system delivers almost three times the data coverage compared with the single RGB-D mapping system. The workload of the manual operation and the number of the unknown EoPs needing to be solved by SLAM is reduced three times in the task of acquiring full-coverage 3D point clouds with auxiliary information (i.e., color) for indoor mapping.

Indoor Mapping with the Calibrated RGB-D Camera Array
Typical indoor scenes were chosen for the RGB-D indoor mapping with the calibrated sensor array. The RGB-D camera array proposed in this paper is not equipped with any position and orientation sensors such as IMU. To solve the indoor localization problem, the real-time SLAM system of the real-time appearance-based mapping (RTAB-Map) [26,75,76] is adopted in this study. The RTAB-Map is a RGB-D SLAM solution that includes monocular vision odometry based on extracting key-points in key frames, an incremental appearance-based loop closure detector using bag of words (BoW) to distinguish whether the current location has been visited, and a global graph optimizer to achieve the bundle adjustment of the estimated pose parameters according to the found loop. However, RTAB-Map currently lacks support for three-band sensor array data input. Therefore, only the horizontally mounted RGB-D camera output stream was taken as the input of the SLAM algorithm; that is to say, the array exterior orientation parameters were solved by the horizontally mounted RGB-D data frame. The other two RGB-D data streams were converted to the mapping coordinate system of the RTAB-Map according to the calibration parameters calibrated in the aforementioned process. Typical indoor scenes' point clouds collected by the proposed system are illustrated in Figure 15. It can be seen that the completeness of the point clouds is well maintained (especially the floor and ceiling). Benefiting from the large field view produced by the calibrated three-RGB-D camera structure, the proposed system delivers almost three times the data coverage compared with the single RGB-D mapping system. The workload of the manual operation and the number of the unknown EoPs needing to be solved by SLAM is reduced three times in the task of acquiring full-coverage 3D point clouds with auxiliary information (i.e., color) for indoor mapping.

Discussion
The RGB-D camera array proposed in this paper aims to improve the data collection efficiency and completeness of indoor 3D point clouds at low cost. The three Kinect V2s were calibrated and synchronized for simultaneous data collection. Thus, the collection efficiency of the 3D point clouds

Discussion
The RGB-D camera array proposed in this paper aims to improve the data collection efficiency and completeness of indoor 3D point clouds at low cost. The three Kinect V2s were calibrated and synchronized for simultaneous data collection. Thus, the collection efficiency of the 3D point clouds is three times higher than that of a traditional single RGB-D mapping system. The up-forward-down-looking array design ( Figure 1) guarantees a large coverage of the indoor scenes in each data frame. The proposed calibration method solves the unknown intrinsic and extrinsic parameters of each Kinect V2 in the array in an easy-to-implement manner. The used calibration patterns were checkerboard, flat wall, and a calibration field with sparse control point markers. These are easy to obtain or assemble. The optical-related geometry calibration achieved sub-pixel accuracy using a total of 1041 images, including separate and simultaneously collected RGB/IR images. The depth calibration corrected the range biases (wiggling errors) by pixel-wise B-spline functions. The statistics show that the irregularly distributed range biases were corrected to zero-mean Gaussian-like distribution after the calibration, thus improving the data quality. The proposed coarse-to-fine extrinsic calibration process achieved an accuracy of 95% registration errors smaller than 0.025 m when compared with TLS data in the calibration field. Indoor mapping with the proposed Kinect V2 array was accomplished in multiple locations, proving the efficiency and data-completeness of the proposed system. To guarantee a large FOV, there were barely any overlaps between three views of the proposed RGB-D array. Therefore, we tackled the registration problem of three RGB-D point clouds streams directly using 3D point clouds registration techniques instead of a 2D image registration method [28] or a local registration method utilizing overlaps [29,30] between the individual RGB-D point clouds. The advantages are that no specially made calibration board and no overlap of FOV is needed in the extrinsic calibration process. Further, a complex calibration field filled with dense control markers is avoided in our proposed method, which is essential if the calibration is done in the photogrammetric approach. The drawback is that a TLS is needed to setup the reference point clouds.
Despite the advantages that the array improves the collection efficiency and data completeness of RGB-D point clouds, the array data quality and accuracy should be discussed. The low data quality region mainly appears at the four corners in the depth maps and the edges in the scene. The "corner effect" is mainly caused by the nonhomogeneous IR illumination. The RGB-D array is essentially made of three Kinect V2s. The Kinect V2 is built upon the ToF principle, which means that it is a range finder with active IR illumination. The IR illumination follows a light cone distribution [35], which is not uniformly distributed. Thus, less-well-illuminated areas (e.g., corners, far objects) deliver inaccurate depth measurements. Even after the depth calibration in controlled environments proposed in this paper, light diffraction effect cannot be omitted, and may result in a depth error pattern (Figure 8) similar to Fraunhofer diffraction [77].
The "edge-effect" is mainly caused by the shadow and multipath effect. In uncontrolled environments, the Kinect V2 depth measurements are influenced by temperature, material, ambient light, and multipath effect [36]. The high error distribution at the edges of objects with complex shape whose local curvature changes severely is caused by a shadow effect [62] in complex edges. This is the outcome of the different optical axis of the infrared emitter of the infrared camera. The Kinect V2 suffers from multipath interference, which deviates the depth measurements from the actual position [71]. The medium error distribution area appeared in the two plane intersection positions where multipath interference is inevitable, resulting in the loss of accuracy in such scenarios. Additionally, high-reflectance material cannot be detected by infrared cameras, resulting in data blanks.
Although inaccurate points existed in the point clouds collected by our array, the proportion of them was small according to the calibration filed data (<5% with error higher than 0.025 m).
Even though the data accuracy and quality of the Kinect V2 is not as good as laser scanners and is easily influenced by the scene, the low-cost and ready availability of the sensor make it ideal for indoor mapping with a low budget. Due to the nature of lower range measurement accuracy of the ToF sensor, the proposed array cannot be applied to TLS-level data accuracy-demanding tasks such as indoor cadastral map generation. It is known that the frequent movements of the sensor and longer task time increases the complexities of the SLAM problem in terms of more unknown pose parameters and drift, resulting in inaccuracy localization and mapping solution. The proposed RGB-D array only needs to be moved in the horizontal plane to achieve complete point clouds with fine details, while the single RGB-D mapping suite needs to rotate the sensor frequently to complete the point clouds data. Only one data stream in the RGB-D array is currently used for SLAM. However, all three RGB-D cameras provide individual observation data that are inherently connected. So, the calibrated array can be utilized for a more robust solution of the localization and mapping problem. Thus, in future work, RGB-D array SLAM will be addressed to make full use of the three RGB-D data streams.

Conclusions
The low-cost, fast, and efficient collection of 3D point clouds data with auxiliary information (e.g., color, semantic information) is key to the popularization and development of indoor mobile mapping technology. Traditional indoor laser scanning trolley/backpacks with multi-laser scanner and IMUs installed solved the efficient and full-coverage indoor laser point clouds collection to a certain degree. However, the cost of those mapping systems is quite high and can hardly be replicated by consumer electronic components. The RGB-D camera is low-cost, but the FOV of a single sensor is narrow; thus, the collection efficiency and the data coverage are low when compared with laser scanners. Aiming to tackle the low-cost efficient indoor point clouds collection bottleneck with RGB-D sensors, after preliminary work, an RGB-D camera array was designed, and the intrinsic/extrinsic geometry and depth calibration method of the array is presented in this paper. The three RGB-D data streams are synchronized and gathered by the open source driver OpenKinect. The optical-related intrinsic calibration that involves the RGB/IR camera model parameters and the relative pose between the RGB and IR cameras are solved by a homography-based method and by minimizing the weighted RGB/IR pixel projection errors, respectively. The depth calibration process corrects the default IR camera model parameters to the intrinsic calibrated solutions to correct the IR ray, and the range bias of the ToF sensor is solved by pixel-wise spline line functions. The extrinsic calibration is achieved through a coarse-to-fine scheme that solves the initial EoPs based on sparse control markers and further refines the initial value by an ICP variant, minimizing the distance between the RGB-D point clouds and the referenced laser point clouds. The effectiveness and accuracy of the proposed method are evaluated by comparing the point clouds derived from the proposed prototype with ground truth data collected by TLS at high density. The optical-related intrinsic calibration achieves sub-pixel accuracy using a total of 1041 IR/RGB images. The calibrated range biases are zero-mean Gaussian-like distributed, thus improving the data quality (Bias < 1 mm: 36.374% (before) to 81.020% (after) at 1.208 m). The extrinsic calibration of the array achieved an accuracy of only 5% of the points captured in the calibration field are with an error higher than 0.025 m, compared with the TLS data. The overall analysis of the results shows that the proposed method achieves seamless integration of multiple point clouds from different RGB-D cameras collected at 30 frames per second, resulting in low-cost, efficient, and high-coverage 3D color point clouds collection for indoor mapping applications.