A Novel Method for Extrinsic Calibration of Multiple RGB-D Cameras Using Descriptor-Based Patterns

This paper presents a novel method to estimate the relative poses between RGB-D cameras with minimal overlapping fields of view. This calibration problem is relevant to applications such as indoor 3D mapping and robot navigation that can benefit from the wider field of view offered by multiple RGB-D cameras. The proposed approach relies on descriptor-based patterns to provide well-matched 2D keypoints when the overlapping field of view between cameras is minimal. By integrating the matched 2D keypoints with the corresponding depth values, a set of matched 3D keypoints is constructed to calibrate multiple RGB-D cameras. Experiments validated the accuracy and efficiency of the proposed calibration approach.


I. INTRODUCTION
In recent years, indoor scene reconstruction and robot navigation have attracted much attention with the advent of low-cost and efficient depth and color (RGB-D) devices such as the Microsoft Kinect, Intel RealSense, and Structure Sensor. The depth cameras of these devices can provide a depth map with a VGA resolution (640×480) at video rate (e.g., 30 Hz) using efficient light-coding technologies that avoid the challenging task of dense 3D reconstruction from color images. The 3D models reconstructed using these depth cameras have been used to generate more realistic 3D content for virtual reality (VR) [1] and to help align rendered virtual objects with real scenes for augmented reality (AR) [2]. Furthermore, the direct depth sensing capability of these depth cameras is particularly suitable for robots navigating in an unknown environment.
With an RGB-D camera, the simultaneous localization and mapping (SLAM)-based approach is mainly used for fusing point cloud frames to reconstruct indoor scenes [3], [4], [5], [6]. However, hundreds or even thousands of frames must be captured in state-of-the-art SLAM systems to reconstruct a common indoor environment, such as a room or an office [7], because of two problems. (1) The field of view (FoV) of depth cameras is limited; thus, only a small part of the scene is represented in a single frame. The Kinect, for example, has a horizontal FoV of 57°, which is much smaller than the horizontal 240° FoV of the Hokuyo URG-04LX-UG01, a laser scanner with a similar range and measurement accuracy to the Kinect [8]. (2) To track the poses of depth cameras and effectively fuse multiple point cloud frames, consecutive frames must be captured with sufficient scene overlap. Typically, more than ninety percent overlap is required, which further increases the number of frames needed for reconstruction.
One solution to these problems is to use a multi-camera setup in which RGB-D cameras face different directions to sample different sections of the environment [9], [10], [11], [12]. However, complications can occur with the use of multiple cameras, such as more difficult calibration caused by the minimal overlapping FoV between cameras.
Classical extrinsic calibration strategies such as the chessboard-based method and the keypoints-based method cannot be applied to calibrate RGB-D cameras in a multi-camera setup, because their overlap requirement constitutes a very strong constraint. A more general approach that can calibrate multiple cameras in an arbitrary configuration is based on per-camera odometry [13], [14], [15], [16]. Cameras are calibrated by finding all camera odometry transforms based on matched features from frames that are captured along the motion paths of all cameras. In these methods, SLAM or visual odometry techniques are applied to estimate camera trajectories. However, the robustness of SLAM and visual odometry techniques depends heavily on the environment.
Instead of the tedious estimation of useful trajectories for calibration, Fernandez-Moral et al. proposed to calibrate multiple cameras through planes and lines [9], [17]. Planes and lines have large spatial spans; thus, they can be observed by cameras with little or no overlapping FoV. However, to apply the Fernandez-Moral method, the multi-camera setup must be moved around in scenes to extract enough matched planes or lines, which results in low efficiency.
In this work, a new extrinsic calibration method that relies on descriptor-based patterns to provide well-matched 2D keypoints is proposed to estimate the relative poses between RGB-D cameras with minimal overlapping FoV. In our method, a set of matched 3D keypoints is constructed from the extracted 2D keypoints and the corresponding depth values in the depth maps to directly estimate the poses between multiple RGB-D cameras. The estimated poses are then globally optimized through the loop-closure constraint provided by the panoramic camera setup (Fig. 1(a)). Experimental results quantitatively verified the accuracy of this method and demonstrated that it is fast and easy to apply.

II. EXTRINSIC CALIBRATION
In this section, we address the problem of estimating the extrinsic calibration (i.e., relative poses) between RGB-D cameras that have little overlapping FoV. Fig. 1(a) shows the panoramic RGB-D camera setup that we built to evaluate the proposed calibration method. It consists of 12 Kinect v1 cameras, all positioned vertically for a more compact design. The Kinect v1 camera has a vertical angular field of view (FoV) of 43°. The overlapping FoV of two neighboring cameras is only approximately 30 percent of the vertical FoV of each camera.
We propose to solve the calibration problem of multiple RGB-D cameras with little overlapping FoV using feature descriptor-based calibration patterns [18] (see the patterns on the walls in Fig. 1(b)), which can provide robust and accurately matched feature points in this case of minimal overlapping FoV. Based on these matched 2D feature points and the depth maps from the depth cameras, we construct two 3D point sets to estimate poses by bundle adjustment. Then, a pose graph optimization method is used to refine the estimated poses.
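The overlap figure quoted above can be checked with a short calculation. The sketch below assumes, as a simplification not stated in the text, that the 12 cameras are evenly spaced around the ring and share a common optical center, so that neighboring optical axes are 360/12 = 30 degrees apart. The function name `neighbor_overlap` is hypothetical.

```python
def neighbor_overlap(num_cameras, fov_deg):
    """Return (overlap in degrees, overlap as a fraction of the FoV).

    Assumes evenly spaced cameras around a full 360-degree ring that
    share one optical center (an idealization of the physical rig).
    """
    spacing_deg = 360.0 / num_cameras          # angle between neighboring axes
    overlap_deg = max(0.0, fov_deg - spacing_deg)
    return overlap_deg, overlap_deg / fov_deg


overlap_deg, overlap_ratio = neighbor_overlap(num_cameras=12, fov_deg=43.0)
print(f"overlap: {overlap_deg:.1f} deg ({overlap_ratio:.1%} of the FoV)")
```

Under these assumptions the overlap is 13 degrees, or roughly 30 percent of the 43-degree FoV, which agrees with the value given in the text.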

A. Initial Estimation of Poses
The descriptor-based calibration pattern is composed of several noise images at different scales, in accordance with the mechanism of SIFT/SURF. Compared with natural scenes, this pattern contains a large number of features that can be easily detected by a camera at varying distances. Thus, the descriptor-based pattern can provide more, and more accurately matched, keypoints between two cameras in the case of minimal overlapping FoV.
In Fig. 2, the detected SURF keypoints are represented by green dots. Based on the matched keypoints, we find their corresponding depth values in the depth maps generated by the depth camera of the RGB-D camera and construct 3D point sets to estimate poses. Because the depth camera and the color camera in an RGB-D camera have different spatial positions and intrinsic parameters, the depth map is not aligned with the color image. The extrinsic parameters between the depth and color cameras are used to align the depth map with the color image.
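To make the idea of "noise images at different scales" concrete, the following is a minimal sketch of one way such a pattern could be generated. It is an illustrative assumption, not the construction used in [18]: the function name `make_noise_pattern`, the number of levels, the power-of-two cell sizes, and the equal weighting of the levels are all hypothetical choices.

```python
import numpy as np


def make_noise_pattern(size=512, levels=5, seed=0):
    """Sum random noise images whose cell size doubles at each level.

    `size` must be divisible by 2 ** (levels - 1). Returns a uint8
    grayscale image with values spanning 0..255.
    """
    rng = np.random.default_rng(seed)
    pattern = np.zeros((size, size), dtype=float)
    for level in range(levels):
        cell = 2 ** level                       # side length of one noise cell
        coarse = rng.random((size // cell, size // cell))
        # Nearest-neighbor upsampling back to the full resolution.
        pattern += np.repeat(np.repeat(coarse, cell, axis=0), cell, axis=1)
    pattern -= pattern.min()
    pattern /= pattern.max()
    return (255 * pattern).astype(np.uint8)


pattern = make_noise_pattern()
print(pattern.shape, pattern.dtype)
```

Because the image contains structure at several scales, a scale-space detector has distinctive blobs to respond to whether the pattern is viewed from near or far, which is the property the text relies on.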
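The alignment of a depth pixel with the color image can be sketched as three steps: back-project the pixel with the depth camera intrinsics, transform the 3D point into the color camera frame, and project it with the color camera intrinsics. The code below is a minimal sketch under a distortion-free pinhole model; the function name `align_depth_pixel` and the numeric intrinsics are illustrative assumptions, not calibrated Kinect values.

```python
import numpy as np


def align_depth_pixel(u, v, z, depth_intrinsics, color_intrinsics, T_depth_to_color):
    """Map a depth pixel (u, v) with depth z into the color image.

    Intrinsics are (fx, fy, u0, v0) tuples; T_depth_to_color is a 4x4
    homogeneous transform. Returns (u', v', Z') in the color camera.
    """
    fx, fy, u0, v0 = depth_intrinsics
    fxc, fyc, u0c, v0c = color_intrinsics
    # Back-project the depth pixel to a 3D point in the depth camera frame.
    point_depth = np.array([(u - u0) * z / fx, (v - v0) * z / fy, z, 1.0])
    # Move the point into the color camera frame.
    xc, yc, zc = (T_depth_to_color @ point_depth)[:3]
    # Project the point into the color image.
    return fxc * xc / zc + u0c, fyc * yc / zc + v0c, zc


# Hypothetical intrinsics and a 25 mm horizontal offset between the cameras.
K = (580.0, 580.0, 320.0, 240.0)
T = np.eye(4)
T[0, 3] = 0.025
print(align_depth_pixel(400, 300, 2.0, K, K, T))
```

With identical intrinsics and an identity transform the pixel maps to itself, which is a convenient sanity check before plugging in real calibration values.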
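Let $(u_0, v_0)$ denote the coordinates of the principal point of the depth camera, and let $f_x$ and $f_y$ denote the scale factors along the image $u$ and $v$ axes of the depth camera; $u_0$, $v_0$, $f_x$, and $f_y$ are the intrinsic parameters of the depth camera. Let $\mathbf{T}$ represent the transformation matrix from the depth camera coordinate frame to the color camera coordinate frame. The relationship between $[X, Y, Z]^T$, the 3D point back-projected from a depth pixel into the depth camera's coordinate frame, and the transformed 3D point $[X', Y', Z']^T$ in the color camera's coordinate frame can be expressed as

$$[X', Y', Z', 1]^T = \mathbf{T}\,[X, Y, Z, 1]^T.$$

Let $(u'_0, v'_0)$ denote the coordinates of the principal point of the color camera, and let $f'_x$ and $f'_y$ denote the scale factors along the image $u$ and $v$ axes of the color camera. After mapping $[X', Y', Z']^T$ to the color image coordinate system, the aligned depth point $[u', v', Z']$ can be obtained, where $u'$ and $v'$ are calculated according to

$$u' = \frac{f'_x X'}{Z'} + u'_0, \qquad v' = \frac{f'_y Y'}{Z'} + v'_0.$$

We obtain two 3D point sets $\{p_i\}$ and $\{p'_i\}$, $i = 1, 2, \ldots, N$, based on the matched keypoints and the corresponding depth values in the aligned depth maps; $p_i$ and $p'_i$ are $3 \times 1$ column matrices. The relative pose between these two 3D point sets can be found by minimizing

$$\xi^* = \arg\min_{\xi} \frac{1}{2} \sum_{i=1}^{N} \left\| p_i - \exp(\xi^{\wedge})\, p'_i \right\|_2^2,$$

where $\xi \in \mathfrak{se}(3)$ is a six-dimensional vector that represents the camera pose, the operator $^{\wedge}$ maps $\xi$ to a matrix $\xi^{\wedge} \in \mathbb{R}^{4 \times 4}$, and $\xi^{\wedge}$ is mapped to $\mathbf{T} \in SE(3)$ by the exponential map $\exp(\cdot)$; $p'_i$ is written in homogeneous coordinates when multiplied by $\exp(\xi^{\wedge})$. We use bundle adjustment to jointly solve all camera poses [19]. With $e_i = p_i - \exp(\xi^{\wedge})\, p'_i$, a left perturbation $\delta\xi$ whose translational components are ordered first, and $\mathbf{R}$, $\mathbf{t}$ the rotation and translation of $\exp(\xi^{\wedge})$, the derivative of an error element with respect to the camera pose is given by

$$\frac{\partial e_i}{\partial \delta\xi} = -\begin{bmatrix} \mathbf{I}_{3 \times 3} & -(\mathbf{R} p'_i + \mathbf{t})^{\wedge} \end{bmatrix},$$

where $(\cdot)^{\wedge}$ applied to a $3 \times 1$ vector denotes its skew-symmetric matrix.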
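For a single camera pair, the point-to-point cost above also has a well-known closed-form minimizer based on the singular value decomposition (the Kabsch/Arun method). The sketch below uses that closed form instead of the iterative bundle adjustment described in the text, so it should be read as an illustration of the cost function or as a possible initializer, not as the authors' solver. The function name `estimate_rigid_transform` is hypothetical.

```python
import numpy as np


def estimate_rigid_transform(p, p_prime):
    """Return the 4x4 transform T minimizing sum ||p_i - (R p'_i + t)||^2.

    p and p_prime are (N, 3) arrays of matched 3D keypoints, N >= 3,
    not all collinear.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(p_prime, dtype=float)
    p_mean, q_mean = p.mean(axis=0), q.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (q - q_mean).T @ (p - p_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p_mean - R @ q_mean
    return T


# Synthetic check: rotate by 30 degrees about z and translate.
rng = np.random.default_rng(0)
points_prime = rng.random((20, 3))
a = np.radians(30.0)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a), np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 0.05])
points = points_prime @ R_true.T + t_true
T_est = estimate_rigid_transform(points, points_prime)
print(np.round(T_est, 4))
```

On noise-free data the recovered transform matches the one used to generate the points; with real depth noise and outliers, a robust or iterative refinement such as the bundle adjustment in the text would normally follow.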

B. Pose Graph Optimization
Assume that there are $N$ RGB-D cameras in the panoramic 3D vision system. Let $\mathbf{T}_{i,j}$ denote the estimated relative pose between two adjacent RGB-D cameras, where $i = 1, 2, \ldots, N$ and $j = i + 1$, except that $j = 1$ when $i = N$. When the pose of the first RGB-D camera is set to $[0, 0, 0, 1]^T$, we can calculate the pose $\mathbf{x}_i$ of each RGB-D camera based on $\mathbf{T}_{i,j}$. Due to the pose estimation error, the pose calculated as $\mathbf{T}_{N,1}\mathbf{x}_N$ is not equal to $[0, 0, 0, 1]^T$. Fortunately, this panoramic setup of RGB-D cameras provides a definite loop-closure constraint for optimizing the estimated poses. This problem can be solved using the pose graph optimization method that is popular in SLAM [19], [20].
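The loop-closure discrepancy described above can be measured by composing the relative poses around the ring and comparing the result with the identity. The sketch below works with full 4x4 homogeneous transforms and assumes that poses compose by right-multiplication; both are illustrative conventions, and the function names are hypothetical.

```python
import numpy as np


def yaw_transform(angle_deg, translation=(0.0, 0.0, 0.0)):
    """4x4 transform: rotation about the z axis plus a translation."""
    a = np.radians(angle_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a), np.cos(a), 0.0],
                 [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def loop_closure_error(relative_poses):
    """Compose T_{1,2}, T_{2,3}, ..., T_{N,1} and report the residual.

    Returns (rotation error in degrees, translation error in the input
    length unit). A perfect calibration yields (0, 0).
    """
    T_loop = np.eye(4)
    for T in relative_poses:
        T_loop = T_loop @ T
    cos_angle = np.clip((np.trace(T_loop[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)), np.linalg.norm(T_loop[:3, 3])


# Twelve estimates that each overshoot the ideal 30 degrees by 0.5 degrees.
drifted = [yaw_transform(30.5) for _ in range(12)]
print(loop_closure_error(drifted))
```

In this example the individual errors of 0.5 degrees accumulate to a 6-degree gap when the loop is closed, which is the kind of residual the subsequent optimization distributes over all edges.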
According to pose graph optimization theory, the problem can be solved by finding the minimum of a function of the form

$$\mathbf{x}^* = \arg\min_{\mathbf{x}} \sum_{(i,j) \in S} r_{i,j}(\mathbf{x})^T\, \Lambda_{i,j}\, r_{i,j}(\mathbf{x}),$$

where $\mathbf{x} = [\mathbf{x}_1^T, \ldots, \mathbf{x}_n^T]^T$ is a vector of poses, $r_{i,j}$ is the residual between the predicted and observed relative poses of the $i$-th and $j$-th nodes, $\Lambda_{i,j}$ denotes the measurement information matrix, and $S$ represents the set of edges that connect the nodes.
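The effect of this objective is easiest to see in a reduced one-dimensional analogue in which each camera pose is a single yaw angle. The sketch below is such a toy version, not the SE(3) optimization used in the paper: the residual of edge (i, j) is taken as x_j - x_i - m_i, the information matrix reduces to a scalar weight, and the loop edge enforces that the angles wrap around to 360 degrees. With these assumptions the problem is linear and can be solved directly by weighted least squares. The function name `optimize_ring_yaw` is hypothetical.

```python
import numpy as np


def optimize_ring_yaw(measured, weights=None):
    """Optimize yaw angles (radians) of N cameras on a closed ring.

    measured[i] is the relative yaw from camera i to camera (i + 1) % N.
    Camera 0 is fixed at 0. Minimizes sum_i w_i * (x_j - x_i - m_i)^2,
    where the last edge closes the loop at 2 * pi.
    """
    m = np.asarray(measured, dtype=float)
    n = m.size
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    A = np.zeros((n, n - 1))          # unknowns are x_1 .. x_{n-1}
    b = m.copy()
    for i in range(n):
        j = (i + 1) % n
        if i >= 1:
            A[i, i - 1] = -1.0        # -x_i
        if j >= 1:
            A[i, j - 1] = 1.0         # +x_j
        else:
            b[i] -= 2.0 * np.pi       # loop edge: x_0 + 2*pi - x_{n-1} = m_i
    sqrt_w = np.sqrt(w)
    x, *_ = np.linalg.lstsq(A * sqrt_w[:, None], b * sqrt_w, rcond=None)
    return np.concatenate([[0.0], x])


# Each measured step overshoots 30 degrees by 0.5 degrees.
yaw = optimize_ring_yaw(np.radians([30.5] * 12))
print(np.round(np.degrees(yaw), 3))
```

With equal weights the 6-degree loop error is spread evenly, so every step is corrected back to 30 degrees. Unequal weights shift more of the correction onto the less trusted edges, which is the role the information matrices play in the full problem.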

III. EXPERIMENTAL RESULTS
Experiments were performed to evaluate the accuracy of the proposed extrinsic calibration method and to demonstrate the efficiency of reconstructing indoor environments using our panoramic RGB-D camera setup. We designed a camera rig that consists of twelve Kinect v1 RGB-D cameras mounted in a radial configuration (see Fig. 1(a)).
We also designed a distributed system built on top of a local area network (LAN) to capture the RGB-D frames from all twelve cameras in real time. The distributed capturing system (see Fig. 4) consists of twelve Raspberry Pi single-board computers, a gigabit switch, and a PC. Each Raspberry Pi obtains RGB-D frames from one Kinect and sends the data to the PC through the LAN using the User Datagram Protocol (UDP). The PC receives and processes the RGB-D frames. Both the depth and color frames were set to a size of 640×480; a single pixel occupies 2 bytes in the depth frame and 3 bytes in the color frame. Thus, the RGB-D frames from all twelve Kinects total 18,432,000 bytes (about 17.58 MiB), which can be sent to the PC through the gigabit switch at approximately 7 fps. This frame rate can be increased by decreasing the size of the RGB-D frames from the Kinects.
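The frame size and frame rate quoted above follow from simple arithmetic, worked through in the sketch below. The only assumption beyond the figures in the text is a nominal link rate of 10^9 bits per second with protocol overhead ignored, so the resulting frame rate is an upper bound.

```python
WIDTH, HEIGHT = 640, 480
DEPTH_BYTES_PER_PIXEL = 2
COLOR_BYTES_PER_PIXEL = 3
NUM_CAMERAS = 12
LINK_BITS_PER_SECOND = 1_000_000_000   # nominal gigabit link, overhead ignored

bytes_per_camera = WIDTH * HEIGHT * (DEPTH_BYTES_PER_PIXEL + COLOR_BYTES_PER_PIXEL)
bytes_per_panorama = bytes_per_camera * NUM_CAMERAS
mebibytes_per_panorama = bytes_per_panorama / 2**20
max_fps = LINK_BITS_PER_SECOND / 8 / bytes_per_panorama

print(bytes_per_panorama)                 # 18432000
print(f"{mebibytes_per_panorama:.2f}")    # 17.58
print(f"{max_fps:.2f}")                   # 6.78
```

One panoramic frame is therefore 18,432,000 bytes, or about 17.58 MiB, and a gigabit link can carry at most roughly 6.8 such frames per second, which is consistent with the reported rate of about 7 fps.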
To evaluate the accuracy of the proposed extrinsic calibration method, we used a motion capture system to obtain the ground truth relative poses between the cameras. The motion capture system requires at least three reflective markers to track the pose of a rigid body, such as the Kinect and the chessboard in our experiments. We attached four reflective markers to both the Kinect and the chessboard (see Fig. 5). We placed the four markers on the outer corners of the chessboard such that the relative poses between the chessboard and both the motion capture system and the Kinect color camera were known. The motion capture system tracked the poses of the markers attached to the Kinect to determine the poses of the Kinect color camera. For convenience, we attached markers to only one Kinect and placed this Kinect at twelve different positions in the camera rig to capture RGB-D frames. The accuracy of the extrinsic calibration was evaluated using twelve pairs of RGB-D frames formed from the twelve captured RGB-D frames.
While determining the keypoint correspondences for each pair of RGB-D frames, we increased the distance threshold of the keypoints (denoted by dist_thresh) from one and a half times the minimum distance to the maximum distance (ten times the minimum distance on average) to analyze how the calibration error in rotation and translation depends on the keypoint correspondences. TABLE I presents the average residual error in rotation and translation for the initial estimation (denoted by Ini.) and the optimized estimation (denoted by Opti.) using pose graph optimization with loop closure. The average residual error was reduced when the distance threshold was raised to include more keypoint correspondences. The residual error in both rotation and translation for the optimized estimation is generally less than that for the initial estimation. When the distance threshold was set to be larger than four times the minimum distance, the residual error remained at the same level. The aligned point clouds before and after pose graph optimization with loop closure can be observed in Fig. 6. The red circle at the bottom of Fig. 6(a) shows that the point clouds were misaligned; however, these point clouds were well aligned in Fig. 6(b) using the poses optimized with the loop-closure constraint.
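Given an estimated relative pose and its ground truth counterpart, rotation and translation errors can be computed from the residual transform between the two. The text does not spell out the exact error definition, so the sketch below uses one common choice as an assumption: the rotation angle and the translation norm of inv(T_gt) @ T_est. The function names are hypothetical.

```python
import numpy as np


def yaw_pose(angle_deg, translation):
    """4x4 pose: rotation about the z axis plus a translation."""
    a = np.radians(angle_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a), np.cos(a), 0.0],
                 [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def pose_error(T_est, T_gt):
    """Return (rotation error in degrees, translation error) of T_est."""
    T_err = np.linalg.inv(T_gt) @ T_est
    cos_angle = np.clip((np.trace(T_err[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle)), np.linalg.norm(T_err[:3, 3])


# Ground truth pose and an estimate that is off by 0.5 degrees and 1 cm.
T_gt = yaw_pose(30.0, (0.10, 0.0, 0.0))
T_est = T_gt @ yaw_pose(0.5, (0.01, 0.0, 0.0))
rot_err_deg, trans_err_m = pose_error(T_est, T_gt)
print(f"{rot_err_deg:.3f} deg, {trans_err_m * 100:.2f} cm")
```

Averaging these two numbers over all twelve camera pairs would give figures comparable in form to those reported in TABLE I.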
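The thresholding rule described above, keeping a match when its descriptor distance is at most a multiple of the smallest distance, can be sketched as follows. The function name `filter_matches` and the sample distances are hypothetical; in practice the distances would come from a descriptor matcher run on the SURF keypoints.

```python
import numpy as np


def filter_matches(distances, ratio):
    """Return indices of matches with distance <= ratio * min(distances)."""
    d = np.asarray(distances, dtype=float)
    dist_thresh = ratio * d.min()
    return np.flatnonzero(d <= dist_thresh)


# Hypothetical descriptor distances for five candidate matches.
distances = [0.10, 0.14, 0.20, 0.50, 0.90]
for ratio in (1.5, 4.0, 10.0):
    kept = filter_matches(distances, ratio)
    print(ratio, kept.tolist())
```

A small ratio keeps only the most confident matches, while a larger ratio admits more correspondences for pose estimation, which is the trade-off the experiment sweeps over.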
The intrinsic calibration process can be avoided because the intrinsic parameters of the Kinects can be obtained through the Kinect software development kit (SDK). The SDK also provides an application interface to map the depth map to the color camera coordinate frame in real time. The whole calibration process using our setup and method took 800 ms on a computer with a 3.60 GHz CPU, most of which was spent on extracting and matching feature points. In comparison, for the setup and method reported in [9], the panoramic RGB-D camera setup had to be moved around to take about 200 images in order to extract enough planes for all camera pairs to estimate poses, which took more than 5 seconds [9]. In terms of accuracy, the residual error of rotation was 1.60 degrees and the error of translation was 2.5 cm in [9]. Our proposed calibration method resulted in a rotation error of 0.56 degrees and a translation error of 1.80 cm (see TABLE I). The operating range of the Kinect v1 depth camera is between 0.5 m and 5.0 m [21]. A single output panoramic frame can therefore reconstruct scenes within a circle with a radius of five meters, which makes the reconstruction of indoor scenes very efficient using our panoramic RGB-D camera setup. Fig. 7 presents the reconstruction results of a bedroom (3.3 m × 3 m) and a living room (9 m × 3.5 m) using only one output panoramic frame from the system. The panoramic RGB-D camera setup provides a 360° FoV, which leads to better constraints for localization and can potentially be used to reduce localization and mapping errors. Our next step is to investigate direct registration methods such as [22], [23], which do not depend on time-consuming keypoint detectors or descriptors, for large-scale SLAM using this panoramic 3D vision system.

IV. CONCLUSION
In this letter, a new method that relies on well-matched keypoints provided by a feature descriptor-based calibration pattern was proposed to calibrate the extrinsic parameters of the RGB-D cameras in the system. A LAN-based distributed system was developed, which enabled the system to provide panoramic RGB-D frames in real time. The reconstruction of indoor scenes was efficiently and conveniently performed using the panoramic RGB-D 3D vision system. The experiments validated the accuracy and efficiency of the proposed calibration method and the efficiency of the panoramic RGB-D camera setup in 3D reconstruction, and quantitatively demonstrated a higher speed and higher accuracy compared with existing methods.

Fig. 1. (a) The setup used in this study for evaluating the proposed calibration method, composed of 12 Kinect v1 RGB-D cameras. (b) The panoramic 3D color point cloud obtained by the panoramic 3D vision system. Descriptor-based calibration patterns are pasted on the walls to calibrate multiple RGB-D cameras.

Fig. 2. (a) and (b) show the detected keypoints of the image captured in a natural scene and of the image captured in a scene with descriptor-based patterns, respectively. Keypoints are represented by green dots.
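Let $[u, v, Z]$ represent a pixel in the depth map, where $Z$ is the depth value at $[u, v]$, and let $[X, Y, Z]^T$ represent the corresponding 3D point in the depth camera coordinate system. According to the pinhole camera model, the values of $X$ and $Y$ can be calculated as $X = (u - u_0) Z / f_x$ and $Y = (v - v_0) Z / f_y$, where $(u_0, v_0)$ is the principal point and $f_x$, $f_y$ are the scale factors of the depth camera.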

Fig. 3. (a) shows the matched keypoints of the image pairs captured in a natural scene. (b) and (c) show the matched keypoints of the image pairs captured in a scene with descriptor-based patterns. The keypoints in (c) are only detected in the image regions within the blue rectangles. Well-matched keypoints are connected with green lines; poorly matched keypoints are connected with red lines.

Fig. 4. Distributed capturing system consisting of a gigabit switch and 12 low-cost Raspberry Pi single-board computers.

Fig. 6. The aligned point clouds before (a) and after (b) closing the loop. A misalignment can be observed in the red circle at the bottom of (a). After pose graph optimization with loop closure, the misalignment is resolved.

Fig. 7. Reconstruction of a bedroom (3.3 m × 3 m) (a) and a living room (9 m × 3.5 m) (b) using only one output frame from the proposed panoramic 3D vision system.

TABLE I. RESIDUAL ERRORS OF INITIALLY ESTIMATED POSES AND OPTIMIZED POSES