Enhanced RGB-D Mapping Method for Detailed 3D Indoor and Outdoor Modeling

RGB-D sensors (sensors with RGB camera and Depth camera) are novel sensing systems that capture RGB images along with pixel-wise depth information. Although they are widely used in various applications, RGB-D sensors have significant drawbacks including limited measurement ranges (e.g., within 3 m) and errors in depth measurement increase with distance from the sensor with respect to 3D dense mapping. In this paper, we present a novel approach to geometrically integrate the depth scene and RGB scene to enlarge the measurement distance of RGB-D sensors and enrich the details of model generated from depth images. First, precise calibration for RGB-D Sensors is introduced. In addition to the calibration of internal and external parameters for both, IR camera and RGB camera, the relative pose between RGB camera and IR camera is also calibrated. Second, to ensure poses accuracy of RGB images, a refined false features matches rejection method is introduced by combining the depth information and initial camera poses between frames of the RGB-D sensor. Then, a global optimization model is used to improve the accuracy of the camera pose, decreasing the inconsistencies between the depth frames in advance. In order to eliminate the geometric inconsistencies between RGB scene and depth scene, the scale ambiguity problem encountered during the pose estimation with RGB image sequences can be resolved by integrating the depth and visual information and a robust rigid-transformation recovery method is developed to register RGB scene to depth scene. The benefit of the proposed joint optimization method is firstly evaluated with the publicly available benchmark datasets collected with Kinect. Then, the proposed method is examined by tests with two sets of datasets collected in both outside and inside environments. The experimental results demonstrate the feasibility and robustness of the proposed method.


Introduction
Detailed 3D modeling of indoor and outdoor environments is an important technology for many tasks such as indoor mapping, indoor positioning and navigation, and semantic mapping [1]. Traditionally, there are two main approaches to close-range 3D modeling-terrestrial laser scanning be applied only in small workspace mapping [15]. Du et al. (2011) introduced a mobile system that runs in real-time on a laptop. Color and depth are jointly used to achieve robust 3D registration. However, some manual interactions should be involved [16]. Henry et al. (2012) proposed an improved registration method denoted as RGB-ICP to incorporate visual information into the ICP algorithm for image registration [17], and this method can improve the alignment accuracy to a certain extent. However, the final models were still broken, and lacked of details in some regions. The authors suggested that it would be helpful to apply a visualization technique such as PMVS (patch-based multi-view stereo) to enrich the indoor model. Engelhard et al. (2011) [18] presented an approach, which is similar to the work of Henry et al. (2012) [17]. Instead of SIFT, this approach applies SURF for feature detection [18]. Khoshelham et al. (2013) presented a Frame alignment method by assigning weights of 3D points correspondences based on the theoretical random error of individual points. However, the method completely relied on the visual features, emphasizing the importance of a fine registration step extracted from the depth images to generate accurate point clouds from RGB-D data [19]. Based on the method of weighting the 3-D points, Santos et al. (2016) introduced a refined mapping method, robust coarse-to-fine registration method. The loop-closure detection and a global adjustment of the frames sequences are used to improve the consistency of the frames sequences [20]. Endres et al. (2014) applied a similar approach, using the RANSAC (RANdom Sample Consensus) method to estimate the transformations between associated key points, and then generated a volumetric 3D map of the environment [21]. In this approach, Endres et al. concentrated mainly on SLAM rather than scene modeling. Stuckler and Behnke (2012) presented an approach for scene modeling and pose tracking that used RGB-D cameras [22]. They conducted two experiments in the small range to evaluate the performance of the registration. Their experiments showed that although the improvement of depth alignment could enlarge the modeling range of the sensor significantly, the absolute distance limitation may still cause trouble when modeling a large-scale indoor scene with a high, arched roof, like the airport terminal or church. Similar to these methods, a kind of multi feature points matching algorithm is proposed for loop closing detection in RGB-D SLAM by combining appearance and local geometric constraints [23]. Chow et al. (2014) [24] introduced a mapping system that integrated a 3D terrestrial LiDAR system with a MEMS IMU and two Microsoft Kinect sensors to map indoor urban environments. A point-to-plane ICP was used to minimize the reprojection error of the infrared camera and projector pair in an implicit iterative extended Kalman filter (IEKF). However, this system is not handheld and its cost would be much higher than single RGB-D sensors.
In the previous developments, only feature points extracted from RGB image were used as a constraint to improve the pose accuracy of depth frames. Although it can merge the entire depth frame well, the final model is completely generated from the depth frame. As presented by Khoshelham and Elberink (2012), only the data obtained within the distance of 1 to 3 m from the depth sensor can be used for mapping [11]. Therefore, the problem associated with measured range limitation still cannot be solved when modeling a scene with a high, arched roof. In addition, the depth sensors capture depth information based on the concept of structured light pattern and time-of-flight, and the measurement is highly related to the material and structure of objects. It would cause "details lost" when modeling objects with smooth surfaces or low reflection certain materials or scene structures which do not reflect infra-red (IR) light, very thin structures or surfaces at glancing incidence angles. Meanwhile, the device will also experience motion blur (like any camera) under fast moving condition, which can also lead to missing data. However, in computer vision, many approaches to Structure from Motion (SfM) are used for 3D scene reconstruction. They allow the production of high quality 3D models by using unordered image collections that depict a scene or an object from different viewpoints [25]. RGB image-based modeling could create 3D models from a collection of images based on visual features instead of material and structure of objects [26][27][28]. The corresponding RGB image sequences generated from RGB-D sensors may not only be used for depth frame registration but also be used to provide extra 3D information for the unmeasured areas including far range scenes and regions with holes.
In this paper, we intended to innovatively integrate the 3D scene generated from image-based modeling method and the 3D scene from depth images for scene modeling. 3D scene from RGB images can not only enlarge the measurement distance of the RGB-D sensors but can also serve as a good supplement to 3D scene from depth images.
Compared with previous works, this paper presents three key novelties. First, a precise calibration for both of IR and RGB cameras is demonstrated. The full set of calibration data for external and internal parameters as well as the relative pose between RGB camera and IR camera can be obtained. Second, a novel false matches rejection method is presented by combining the depth information and the initial pose parameters from the RGB-D sensor. Third, the image-based modeling method is innovatively incorporated to enhance the mapping system of RGB-D scenes. A global optimization model is used to improve the accuracy of the camera pose, decreasing the inconsistencies between the depth frames. In order to eliminate the geometric inconsistencies between 3D scene from RGB images and depth scene, the scale ambiguity problem encountered during the pose estimation with RGB image sequences can be resolved by integrating the depth and visual information. A robust rigid-transformation recovery method is developed to register 3D scene from RGB images to depth scenes.

Overview of the Enhanced RGB-D Mapping System
The RGB-D sensor system used in this research contains two sensors: one RGB camera, and one IR sensor called "Structure sensor". The IR sensor is combined with an IR camera and an IR projector. This sensor system is highly mobile, and can be attached to an iPad, iPhone, or other mobile instrument. The system can capture 640 × 480 registered RGB images and depth images at 30 frames per second. Figure 1 shows its hardware structure. The lower panels of Figure 1 show a sample frame observed with the RGB-D sensor. The white part of the depth image indicates that no depth information is measured due to certain materials or scene structures that do not reflect infrared (IR) light, very thin structures or surfaces at glancing incidence angles. images can not only enlarge the measurement distance of the RGB-D sensors but can also serve as a good supplement to 3D scene from depth images. Compared with previous works, this paper presents three key novelties. First, a precise calibration for both of IR and RGB cameras is demonstrated. The full set of calibration data for external and internal parameters as well as the relative pose between RGB camera and IR camera can be obtained. Second, a novel false matches rejection method is presented by combining the depth information and the initial pose parameters from the RGB-D sensor. Third, the image-based modeling method is innovatively incorporated to enhance the mapping system of RGB-D scenes. A global optimization model is used to improve the accuracy of the camera pose, decreasing the inconsistencies between the depth frames. In order to eliminate the geometric inconsistencies between 3D scene from RGB images and depth scene, the scale ambiguity problem encountered during the pose estimation with RGB image sequences can be resolved by integrating the depth and visual information. A robust rigid-transformation recovery method is developed to register 3D scene from RGB images to depth scenes.

Overview of the Enhanced RGB-D Mapping System
The RGB-D sensor system used in this research contains two sensors: one RGB camera, and one IR sensor called "Structure sensor". The IR sensor is combined with an IR camera and an IR projector. This sensor system is highly mobile, and can be attached to an iPad, iPhone, or other mobile instrument. The system can capture 640 × 480 registered RGB images and depth images at 30 frames per second. Figure 1 shows its hardware structure. The lower panels of Figure 1 show a sample frame observed with the RGB-D sensor. The white part of the depth image indicates that no depth information is measured due to certain materials or scene structures that do not reflect infrared (IR) light, very thin structures or surfaces at glancing incidence angles. The proposed enhanced RGB-D mapping system can be divided into three stages: the calibration stage, the image-based 3D modeling stage and the robust geometric registration of RGB and depth model stage, as illustrated in Figure 2. First, a precise calibration for both the RGB camera and the IR camera is conducted, and the results of calibration stage is the full set of calibration data for external and internal parameters as well as the relative pose between RGB camera and IR camera. These parameters could be used in the robust registration process. Second, a refined image-based The proposed enhanced RGB-D mapping system can be divided into three stages: the calibration stage, the image-based 3D modeling stage and the robust geometric registration of RGB and depth model stage, as illustrated in Figure 2. First, a precise calibration for both the RGB camera and the IR camera is conducted, and the results of calibration stage is the full set of calibration data for external and internal parameters as well as the relative pose between RGB camera and IR camera. These parameters could be used in the robust registration process. Second, a refined image-based modeling method is used for 3D scene reconstruction from RGB images. A novel false matches rejection method is used to minimize the false matches during feature matching process. A key frames selection method is used to ensure sufficient overlapping between the candidates and the previous key frame. Third, in the stage of robust geometric registration of RGB and depth model, a global optimization model is used to improve the accuracy of the camera pose, decreasing the inconsistencies between the depth frames. The accurate global scale factor is recovered for RGB sequences combining RANSAC and Pau Ta Norm and the rigid geometric transformation between RGB model and depth model is robust calculated using Besl and RANSAC method. Finally, according to the registration parameters, the 3D scene from RGB images can be registered to the 3D scene from depth images well. modeling method is used for 3D scene reconstruction from RGB images. A novel false matches rejection method is used to minimize the false matches during feature matching process. A key frames selection method is used to ensure sufficient overlapping between the candidates and the previous key frame. Third, in the stage of robust geometric registration of RGB and depth model, a global optimization model is used to improve the accuracy of the camera pose, decreasing the inconsistencies between the depth frames. The accurate global scale factor is recovered for RGB sequences combining RANSAC and Pau Ta Norm and the rigid geometric transformation between RGB model and depth model is robust calculated using Besl and RANSAC method. Finally, according to the registration parameters, the 3D scene from RGB images can be registered to the 3D scene from depth images well.

Precise Calibration for RGB-D Sensors
The main concept of camera calibration is based on the pinhole camera model shown in Equation (1), which illustrates the relationship between the image point and the corresponding ground point as a function of the camera's internal and external parameters. Both IR and RGB cameras can use this model.

Precise Calibration for RGB-D Sensors
The main concept of camera calibration is based on the pinhole camera model shown in Equation (1), which illustrates the relationship between the image point and the corresponding ground point as a function of the camera's internal and external parameters. Both IR and RGB cameras can use this model. where s is the scale factor, u, v are the image points coordinates in pixels, a camera matrix of intrinsic parameters, R is a 3 × 3 rotation matrix, and X, Y, Z are the ground coordinates in mm.
Both IR and RGB cameras suffer from distortion, mostly radial distortion and slight tangential distortion. This effect can be estimated based on Equation (2) [23]. The model illustrated three parameters for radial distortion (k 1 , k 2 , k 3 ) and two parameters for tangential distortion (p 1 , p 2 ). As the RGB camera, the one used in this structure sensor is the smartphone's camera, is expected to produce high distortion in both radial and tangential directions, we illustrate the full model parameters in our model [29].
where r 2 = u 2 + v 2 . By retyping Equation (1) This equation can be applied to RGB and IR cameras. The two sensors collected the same scene for an ordinary checkerboard. Therefore, by knowing the ground coordinates and image coordinates of the checkerboard corners, one can estimates the internal K and external R, t parameters for each camera using sufficient scenes.
For initial parameters estimation for K, R, and t, we use homography transform estimation which transforms the intrinsic and extrinsic matrix to one matrix called homography which can illustrate the relationship between the image point and the corresponding ground point in 3 × 3 matrix. If Z = 0, then Equation (4) is simplified to: For eliminating the scale factor s, we can make a cross product for both sides by p. The resulting equation will be: Based on Equation (6), we can estimate the homography matrix known the image point and the corresponding ground point, by using Singular Value Decomposition (SVD) we can compute the homography matrix. Finally, with estimated homography matrix we can extract the internal and external parameters for each camera based on the characteristic of first and second column of R, which are orthonormal. Therefore, we can rewrite this formulae h t 1 ·K −t ·k −1 ·h 2 = 0 as h t 1 ·B·h 2 = 0, where B = K −t ·K −1 . B is 3 × 3 matrix which contains only the internal parameters for camera. Based on characteristic of this matrix we can reduce the nine parameters to only six parameters. Thus, we can rewrite the last formulae as h t 1 ·B·h 2 = L t 12 ·b = 0, where b is a vector which contains only six parameters, and Subscription of o and s refer to first and second column of H matrix, respectively. From this equation, we can compute the vector b, which summarizes the internal parameters for the camera using SVD. After that, we can reconstruct the intrinsic matrix K for the camera and then compute the rotation matrix and translation vector from homography matrix and intrinsic matrix. The output values for rotations and translations as well as the internal parameters can be used as initial value for refinement stage. The cost function to be minimized is: where m is the point number and n is the scene number. As shown in Figure 3, the difference between the RGB camera and the depth camera lies in their methods of data collection. Due to the specific mechanics of the hardware, the sensor cannot collect the IR images and RGB images at the same time. The RGB camera collects RGB images all the time, but the data collected by the depth sensor depends on the status of the IR projector. When the IR projector is switched on, the IR camera collects the depth data for the scene. When the IR projector is switched off, the IR camera captures an ordinary image, which is similar to the RGB image, but on the IR band. The depth images on the IR band are used for the calibration process.
The result for this method is the full set of calibration data for external and internal parameters as well as the relative pose between RGB camera and IR camera. These parameters used in the robust geometric registration process are shown in Section 3.4. values for rotations and translations as well as the internal parameters can be used as initial value for refinement stage. The cost function to be minimized is: where is the point number and is the scene number. As shown in Figure 3, the difference between the RGB camera and the depth camera lies in their methods of data collection. Due to the specific mechanics of the hardware, the sensor cannot collect the IR images and RGB images at the same time. The RGB camera collects RGB images all the time, but the data collected by the depth sensor depends on the status of the IR projector. When the IR projector is switched on, the IR camera collects the depth data for the scene. When the IR projector is switched off, the IR camera captures an ordinary image, which is similar to the RGB image, but on the IR band. The depth images on the IR band are used for the calibration process.
The result for this method is the full set of calibration data for external and internal parameters as well as the relative pose between RGB camera and IR camera. These parameters used in the robust geometric registration process are shown in Section 3.4.

Refined Relative Motion Estimation for RGB Images Sequence
The task of relative pose estimation, which is done by computing consistent feature matches across multiple images, presents a classic problem. Numerous algorithms have been proposed to solve this issue [27,[30][31][32]. Normally, two steps would be involved in the relative motion estimation: key-point detection and matching, camera pose estimation. In our work, we add a refined outlier rejection method to eliminate the false matches by using the depth information as a reference and the pose derived from the ICP algorithm as a priori information. In the following subsections, we summarize the steps in the motion estimation algorithm.

False Matches Rejection Method
The SiftGPU detector (which is an implementation of SIFT [33] for GPU) is used for image feature detection. SiftGPU processes pixels in parallel to build Gaussian pyramids and to detect DoG key points. Based on the GPU list generation [34], SiftGPU then uses a GPU/CPU mixed method to efficiently build compact key point lists. Finally, the key points are processed in parallel to obtain their orientations and descriptors. Typically, thousands of SIFT key points can be detected in each RGB image from RGB-D sensors with 640 × 480 pixels. Based on the local descriptor of each key point, we can use the approximate nearest neighbors package for feature matching [35].
However, several false matches still exist after the feature detection and feature matching processes. We therefore introduce an effective method to reduce the outliers by combining the depth information and the initial camera poses between frames from the RGB-D sensor plus RANSAC (RANdom Sample Consensus).

Refined Relative Motion Estimation for RGB Images Sequence
The task of relative pose estimation, which is done by computing consistent feature matches across multiple images, presents a classic problem. Numerous algorithms have been proposed to solve this issue [27,[30][31][32]. Normally, two steps would be involved in the relative motion estimation: key-point detection and matching, camera pose estimation. In our work, we add a refined outlier rejection method to eliminate the false matches by using the depth information as a reference and the pose derived from the ICP algorithm as a priori information. In the following subsections, we summarize the steps in the motion estimation algorithm.

False Matches Rejection Method
The SiftGPU detector (which is an implementation of SIFT [33] for GPU) is used for image feature detection. SiftGPU processes pixels in parallel to build Gaussian pyramids and to detect DoG key points. Based on the GPU list generation [34], SiftGPU then uses a GPU/CPU mixed method to efficiently build compact key point lists. Finally, the key points are processed in parallel to obtain their orientations and descriptors. Typically, thousands of SIFT key points can be detected in each RGB image from RGB-D sensors with 640 × 480 pixels. Based on the local descriptor of each key point, we can use the approximate nearest neighbors package for feature matching [35].
However, several false matches still exist after the feature detection and feature matching processes. We therefore introduce an effective method to reduce the outliers by combining the depth information and the initial camera poses between frames from the RGB-D sensor plus RANSAC (RANdom Sample Consensus).
Supposing feature matches dataset is As mentioned by Khoshelham et al. (2013) [19], a correction of the depth image pixels should be used to align the depth image with the color image. In this paper, the discrepancy is calibrated by using markers that can be measured in the depth image as well as in the color image. The corresponding points from the infrared frame and the RGB image can be obtained and the affine transformation for the depth image pixels is calculated with a least-squares process. Therefore, corresponding to the points set on 2D images, 3D coordinates for each feature points can be obtained from depth image according to Equation (1), which can be represented as The basic idea of the method is to acquire feature matches from the RGB images, to compute global 3D coordinates of every feature points based on the initial pose parameters generated by RGB-D sensor. For each feature matches in D, we adopt RANSAC method to iterate and obtain the optimal 3D coordinates of feature matches can be backprojected to the target frames and the corresponding image points set E. uv depends on the accuracy of the initial pose from RGB-D sensors) and a point is recognized as an outlier whenever the residual error is greater than R.E. uv . In Algorithm 1, for each match, if the remaining image points, n2DInlier, is bigger than 3, it is recognized as inlier, otherwise it is outlier.

Algorithm 1 False matches rejection combining depth information + RANSAC
3D points corresponding to feature matches; d XYZ : distance threshold in object space for RANSAC iterations; d uv : distance threshold in image space Output: number of inliers: n2DInlier, Inliers: While Iterations <= MaxIterations do 5. n3DInlier = 0 6.
Randomly select 5 feature points in current feature matches from q i , compute the mean value of 3D coordinates q i 7.
For q i ∈ q i do 8. If End if 12.
End for 13.
Let q i backproject to j-th Frame, obtain the backprojected image point p1 i End for 27.
End if 30. End for 31. Return n2DInlier, F It should be noted that due to the limitation in measurement distance of the RGB-D sensor, it is impossible to find all of the corresponding points from the depth image. Therefore, the outlier rejection method can only be used within a certain range (within 8 m) and the thresholds R.E. uv and R.E. XYZ differ with the increasing of measurement distance.

Camera Pose Estimation for RGB Images Sequence
As frame rate to RGB-D sensors speed ratios are often higher than necessary, not all of the RGB images need to be processed, so choosing the right frames requires careful consideration. Camera baselines and overlap between images are highly important for robust 3D reconstruction. Short baselines usually induce larger measurement errors than those produced by the long baselines [36]. Therefore, the selection criteria must guarantee both enough baseline and sufficient overlap between the candidates and the previous key frame.
In this paper, the initial pose from RGB-D sensor is employed to ensure enough baseline by computing Euclidean distance between two frames. Besides, we use the correspondence ratio R C (the ratio of the number of frame-to-frame point features to the total number of point features considered for correspondence) defined by [37] to ensure sufficient overlap between the candidates and the previous key frame. The image is selected as a key frame whose ratio of feature point to correspondence is less than 90% and the baseline B l between the candidates and the previous key frame is greater than 10 cm. If the ratio is greater than 90% or the baseline is less than 10 cm, we consider the next frame as candidates until find the next key frame.
Finally, we then robustly estimate a fundamental matrix between key frames F n−1 and F n by using the five-point algorithm proposed by Nistér [2] and the RANSAC method [38]. Then, the rotation R c and translation t c are recovered by matrix factorization. This minimization problem is solved with the Levenberg-Marquardt nonlinear optimization [39], and R c and t c are further refined. The corresponding 3D coordinates of feature matches can be calculated with space intersection.

Robust Geometric Registration of RGB and Depth Models
Since the geometry of RGB images (rotation R and translation T of each RGB image, 3D coordinates of feature matches) obtained in Section 3.3 can only be recovered up to a scale factor and the coordinates system is different from that of depth sensor, the robust geometric registration method aims to integrating geometry of RGB images and depth geometry according to a global scale recovery and rigid transformation recovery method. Tie points are obtained on the RGB images based on the image matching algorithm in Section 3.3.1. The 3D coordinates of feature matches can be derived from the space intersection using the recovered RGB image pose. There would be discrepancies between the RGB pose-derived object coordinates and the ground truth obtained from depth image according to the camera model for depth images. First, a global optimization model is employed to improve the accuracy of the camera pose, decreasing the inconsistencies between the depth frames. Then, a global scale for RGB geometry is recovered by computing the distance ratio between the point pairs of RGB pose-derived points and depth-derived points and the rigid transformation between the two sets of corresponding 3D points is calculated to ensure that they are aligned. Ultimately, the inconsistencies between two sets of corresponding 3D points is eliminated with the recovered scale and rigid transformation.

Camera Model for Depth Images
By knowing the internal parameters and distortion of depth camera by camera calibration, we can compute the object coordinates X c , Y c , Z c in the camera coordinate system from the image space as follows: where f xD , f yD are the focal length of the depth camera, c xD , c yD are the image center of the depth image, and u , v are the image coordinate corrected by distortion parameters. A rigid body transformation relates points X ∼ [X Y Z 1] T in the sensor coordinate system of the referenced frame to points X C ∼ [X C Y C Z C 1] T in the camera coordinates of the current frame. This transformation can be written as where R D is the rotation matrix from current frame to the referenced frame, t D is the translation matrix from current frame to the referenced frame, and X, Y, Z are the real object coordinates in the 3D scene. Figure 4 shows the relationship between the camera and the sensor coordinate systems. By knowing the internal parameters and distortion of depth camera by camera calibration, we can compute the object coordinates , , in the camera coordinate system from the image space as follows: are the focal length of the depth camera, , are the image center of the depth image, and , are the image coordinate corrected by distortion parameters. A rigid body transformation relates points ~ 1 in the sensor coordinate system of the referenced frame to points ~ 1 in the camera coordinates of the current frame. This transformation can be written as where is the rotation matrix from current frame to the referenced frame, is the translation matrix from current frame to the referenced frame, and , , are the real object coordinates in the 3D scene. Figure 4 shows the relationship between the camera and the sensor coordinate systems.

Joint Optimization Model for Poses of Depth Camera
The RGB-D camera uses the ICP algorithm for depth alignment. An initial relative camera pose for each frame can thereby be obtained. However, errors in alignment between depth frames and noise in depth information cause the camera pose to drift over time, especially when the camera follows a long trajectory. Therefore, a global optimization model is used for decreasing the inconsistencies between frames in advance. All of the feature matches in Section 3.3.1 and the initial

Joint Optimization Model for Poses of Depth Camera
The RGB-D camera uses the ICP algorithm for depth alignment. An initial relative camera pose for each frame can thereby be obtained. However, errors in alignment between depth frames and noise in depth information cause the camera pose to drift over time, especially when the camera follows a long trajectory. Therefore, a global optimization model is used for decreasing the inconsistencies between frames in advance. All of the feature matches in Section 3.3.1 and the initial camera pose obtained from the ICP alignment are involved in the model. Supposing the total number of the frame pairs is M and for each frame pair a, b, the total number of the point pairs is N. The corresponding features matches dataset {PP} can be represented as: Therefore, the discrepancy between two point pair can be represented as follows: where {R a , t a } and {R b , t b } are the initial rotation and translation matrix of the frame a, b, respectively. For the whole scene, the cost function can be written as Equation (11) and a least square solution is used to minimize the error iteratively. The global optimization model ultimately improves the accuracy of the camera pose, decreasing the inconsistencies between the depth frames.

Global Scale Recovery for RGB Images
Based on recovered RGB images poses, the 3D coordinates for each tie point can be obtained by a space intersection. As a control, we select the registered depth frame that possesses the greatest number of corresponding points between RGB image and depth image. As shown in Figure 5, for each feature match located in the RGB image, the image coordinates can be obtained and the corresponding depth value can be extracted from the registered depth image. The points that have no depth value are discarded. The ground truth of each point can be calculated from Equations (9) and (10). where , and , are the initial rotation and translation matrix of the frame , , respectively. For the whole scene, the cost function can be written as Equation (11) and a least square solution is used to minimize the error iteratively. The global optimization model ultimately improves the accuracy of the camera pose, decreasing the inconsistencies between the depth frames.

Global Scale Recovery for RGB Images
Based on recovered RGB images poses, the 3D coordinates for each tie point can be obtained by a space intersection. As a control, we select the registered depth frame that possesses the greatest number of corresponding points between RGB image and depth image. As shown in Figure 5, for each feature match located in the RGB image, the image coordinates can be obtained and the corresponding depth value can be extracted from the registered depth image. The points that have no depth value are discarded. The ground truth of each point can be calculated from Equations (9) and (10). Two sets of 3D points, = |1 ≤ ≤ , = |1 ≤ ≤ can be obtained from RGB images and depth images, respectively. The PC set is obtained from the space intersection of the RGB images, and the PD set is obtained from the depth images. Then, the relative scale can be determined from the distance ratio between the point pairs of the two points sets , , as follows: For a robustness test, a large number of scale ratios for point pairs are calculated at random, the Two sets of 3D points, P C = {P i |1 ≤ i ≤ N}, P D = {P i |1 ≤ i ≤ N} can be obtained from RGB images and depth images, respectively. The P C set is obtained from the space intersection of the RGB images, and the P D set is obtained from the depth images. Then, the relative scale S can be determined from the distance ratio between the point pairs of the two points sets P C , P D , as follows: For a robustness test, a large number of scale ratios for point pairs are calculated at random, the Pau Ta Norm are used for outlier rejection, as in Equation (15). RANSAC is used to iterate and calculate the optimal scale value.
where S c is mean value of 5 scale values selected at random, S is the median value of the scale set, and σ is the root-mean-square error of the scale set.
The global scale recovery method is presented in Algorithm 2. First, a set of scale values F s is calculated iteratively, and, in each iteration, the point pairs from P C , P D is selected at random. To find the optimal scale value, we iteratively apply Pau Ta Norm to the subset with 5 scale values selected from F s randomly, the scale subset with the biggest number of inliers F s is returned and the proper scale is determined by the mean value of the inliers. The point sets from the space intersection of the RGB images are scaled to a new point set P S , as follows: Randomly select 2 points from P C , P D , compute scale value S 4. F s = F s ∪ S 5. End for 6. Compute the mean value S and the root-mean-square error σ 7. nInlier max = 0, S = 0 8.

Rigid Transformation Recovery
After scale recovery, it is necessary to find the optimal rotation and translation between the two sets of corresponding 3D points to ensure that they are aligned. We compute the rigid transformation matrix with Besl's method [40]. This solution can be used for a dataset of any size, as long as there are at least three corresponding points. A least square solution is used to minimize the error as in Equation (17).
The method based on a Besl's rigid transformation estimator plus RANSAC is presented in Algorithm 3. In each iteration, we randomly select 5 pairs of corresponding points from {P s } and {P D }, the current rigid transformation R', t' can be calculated with Besl' method. The threshold value used for outlier rejection is determined by the initial pose accuracy obtained from depth sensor. RANSAC method is used to iterate and seek the optimal corresponding points set. An iterator is used to loop through the point pairs in {P s } and {P D }, it is recognized as inlier when the distance between P i and R P i s + t is less than Threshold. The corresponding points set with the most inliers is used to compute the final rigid transformation matrix R, t.
Randomly select 5 pairs of corresponding points from {P s } and {P D }, use Besl's method to compute the rigid transformation R', t' 5. For By knowing the scale factor S and the rigid transformation R, t between the 3D coordinates of RGB scene and that from depth scene, the model generated from RGB images can be registered to the coordinates system of depth model with Equation (18).
Finally, the absolute camera trajectory of RGB images sequence R a , T a can be written as Equation (19), which can be used for dense matching with the CMPMVS tool. CMPMVS tool is a multi-view reconstruction software. The input to this software is a set of perspective images and camera parameters (internal and external camera calibrations). The output is a textured mesh of the rigid scene visible in the images [41]. Then, the dense model generated from RGB images sequence can be matched with the 3D model obtained from the depth images.

Benefit of Joint Optimization Model
We first evaluated our joint optimization method with the publicly available RGB-D benchmark provided by [42]. The public RGB-D benchmark dataset is used to assess the accuracy of the camera trajectory and the results is compared with the state-of-the-art methods. They contain ground truth information for camera poses in terms of time-series. Absolute trajectory error is used for trajectory estimation and comparative estimation.
Three sets of publicly available datasets are used for accuracy evaluation. Figure 6 shows the estimated camera trajectories compared against the ground truth trajectories. As shown in Table 1, for the datasets with structure, like fr1_desk and fr2_xyz, our method can achieve median and maximum absolute trajectory accuracy within 3 cm and 10 cm, respectively. Difficult scenes contain only little geometric structure but with fine texture like fr3_nostruct.tex.far sequences, the proposed joint optimization method can also yield only moderate trajectory drift, about 3.2 cm in median and 7 cm in maximum. information for camera poses in terms of time-series. Absolute trajectory error is used for trajectory estimation and comparative estimation. Three sets of publicly available datasets are used for accuracy evaluation. Figure 6 shows the estimated camera trajectories compared against the ground truth trajectories. As shown in Table 1, for the datasets with structure, like fr1_desk and fr2_xyz, our method can achieve median and maximum absolute trajectory accuracy within 3 cm and 10 cm, respectively. Difficult scenes contain only little geometric structure but with fine texture like fr3_nostruct.tex.far sequences, the proposed joint optimization method can also yield only moderate trajectory drift, about 3.2 cm in median and 7 cm in maximum.   Table 1 also shows the comparison of median (maximum) absolute trajectory error for joint optimization between our method and several state-of-art registration methods including 3D-NDT method [43], Warp from OpenCV [44] and Fovis method [45]. The best results are marked in bold. Except for the maximum error in fr2_xyz sequences, our approach outperforms the other methods both in the median error and the maximum error. In the second case, all methods yield similar accuracy because of the rich texture information, and our method achieves the best median result because of the robust false matches rejection method in Section 3.3.1. Especially for the scene with no geometric information, our method performs much better than three others. Table 1. Comparison of median (maximum) absolute trajectory error in mm for joint optimization on  Table 1 also shows the comparison of median (maximum) absolute trajectory error for joint optimization between our method and several state-of-art registration methods including 3D-NDT method [43], Warp from OpenCV [44] and Fovis method [45]. The best results are marked in bold. Except for the maximum error in fr2_xyz sequences, our approach outperforms the other methods both in the median error and the maximum error. In the second case, all methods yield similar accuracy because of the rich texture information, and our method achieves the best median result because of the robust false matches rejection method in Section 3.3.1. Especially for the scene with no geometric information, our method performs much better than three others.

Datasets
In this section, we carry out the field tests to validate the feasibility and effectiveness of the proposed enhanced RGB-D mapping method. Two sets of data were collected, using the structure sensor attached to an iPad Air. We conducted a precise camera calibration for this device and the camera calibration results including the internal parameters and distortion parameters are shown in Table 2. 3.9233 ± 2.3220 The first dataset is used to deal with the sequence captured along a corridor. The two images in Figure 7a (left) shows a sample RGB frame. The 3D model generated from depth images based on the ICP + Global optimization sequential alignment, the corresponding camera trajectory marked with red points and a top view of the 3D model overlaid on a laser scan point cloud are shown in Figure 7a (right). The whole length of the camera trajectory was about 26.5 m, and it contained 305 registered frames. To further investigate the performance of the proposed methodology in an outside environment, as shown in the RGB image in Figure 7b (left), one chair was placed in front of the wall and the dataset was collected by walking around the chair. A total of 196 registered frames were obtained. The corresponding 3D scene generated from the depth images shown in Figure 7b (right). Figure 7a (right). The whole length of the camera trajectory was about 26.5 m, and it contained 305 registered frames. To further investigate the performance of the proposed methodology in an outside environment, as shown in the RGB image in Figure 7b (left), one chair was placed in front of the wall and the dataset was collected by walking around the chair. A total of 196 registered frames were obtained. The corresponding 3D scene generated from the depth images shown in Figure 7b

Experimental Results and Analysis
To further thoroughly evaluate the benefits of global optimization model, the accuracy of the camera poses is determined by computing the discrepancies in the contiguous frames. Instead of placing targets on the ground, the exact truth poses are obtained through frame alignment manually. To reduce the time complexity, only the truth rotation and translation between the adjacent key frames are obtained as referenced, the translational error and the angular error of the sequential alignment can be obtained by comparing with the ground-truth poses. As can be seen in Table 3, by combining ICP and global optimization, it achieves accuracy which is superior to using the ICP algorithm only. In the ICP algorithm, the alignment accuracy highly depended on the geometric information in the adjacent frames. However, in the corridor experiment, it provides little geometric information, and the frames mainly contain several single flat walls. It is reasonable that global optimization model can improve the alignment accuracy due to involving additional RGB information.
In addition, the corridor model generated from the structure sensor is compared with a laser scan point cloud. As shown in Figure 6a (right), these two models can match well in both horizontal and vertical direction. To evaluate the absolute accuracy of the coordinator model, some key point pairs are selected from the sensor model and the laser scanner and the distance between two point pairs selected at random is calculated. The average distance errors are shown in Table 3. Similar with two others, ICP + Global Optimization can achieve the absolute accuracy to centimeter level, which is higher than that of the ICP algorithm. After applying global optimization for the pose of depth camera, we implement the robust geometric registration m to register the 3D model based on image-based modeling method to the model generated from depth images, and then the results is compared with the model totally generated from depth images. Check points are selected from the results of feature matching. For each check point, two sets of object coordinates can be obtained from the image-based model and the model from depth respectively. Then, we achieved a relative accuracy assessment of the obtained result through the root mean square error (RMSE) of the discrepancies of each check points in the object space. It should be noted that only the depth within 3 m of the depth frame is used for accuracy assessment.
In the corridor experiment, 172 frames are selected as key frames and then are used for 3D modeling. The feature matches in the key frames are first checked with the false matches rejection method, the corresponding R.E. uv and R.E. XYZ are set at 10 pixels and 0.2 m, respectively, according to the initial accuracy of the camera pose. Figure 8 shows the comparison of feature matches in the corridor images. The original 3980 feature matches are obtained after using a traditional RANSAC false matches rejection method. In RANSAC, the threshold for estimating F matrix is 2, and the threshold for estimating H matrix is 4. The maximum iterations in RANSAC is 1000. In this experiment, 42 more false matches can be rejected by using the refined false matches rejection method in this paper. Then, 432 feature matches identified from the first frame are used for geometric registration. Due to the measurement distance limitation of depth sensor, 1302 feature points with depth value within 3 m are used to check the performance of geometric registration.
The performance of geometric registration approach is evaluated in object space. The 1302 check points are compared based on the object coordinates from depth information and the transformed coordinates from RGB sequences. Table 4 lists registration results including the recovered scale, rigid transformation and the statistics of discrepancies between two models after geometric integration. As Table 4 shows, the discrepancies between the scene from depth images and the scene from RGB images can accurate to centimeter-level (within 3 cm) in all the three directions. This indicates that the geometric inconsistencies between the geometry of RGB images and depth images are nearly eliminated. recovered scale, rigid transformation and the statistics of discrepancies between two models after geometric integration. As Table 4 shows, the discrepancies between the scene from depth images and the scene from RGB images can accurate to centimeter-level (within 3 cm) in all the three directions. This indicates that the geometric inconsistencies between the geometry of RGB images and depth images are nearly eliminated.   In Figure 9a, the 3D scene from RGB images is first transformed to the coordinate system of depth scene based on the recovered scale and rigid transformation parameters. Figure 9b shows the original 3D scene totally generated from depth image. Although all of the depth frames were used for scene modeling, significant details are lost, especially on the ceiling and the floor. Figure 9c shows the enhanced 3D scene combining 3D scene from RGB images and from depth images after geometric registration. The vertices have significantly increased from about two million to three million. In Figure 9b, the broken regions are marked with red dotted borders. As expected, the scene detail in the corresponding regions is enriched significantly after geometric registration shown in  In Figure 9a, the 3D scene from RGB images is first transformed to the coordinate system of depth scene based on the recovered scale and rigid transformation parameters. Figure 9b shows the original 3D scene totally generated from depth image. Although all of the depth frames were used for scene modeling, significant details are lost, especially on the ceiling and the floor. Figure 9c shows the enhanced 3D scene combining 3D scene from RGB images and from depth images after geometric registration. The vertices have significantly increased from about two million to three million. In Figure 9b, the broken regions are marked with red dotted borders. As expected, the scene detail in the corresponding regions is enriched significantly after geometric registration shown in Figure 9c. It means that the model generated from the corresponding RGB images can be a good supplement to the model from depth images.
For the chair model collected outside, 86 frames are selected as key frames. The corresponding R.E. uv and R.E. XYZ parameters for false matches rejection are set at 3 pixels and 0.05 m, respectively, due to high accuracy of the camera pose. The 6293 feature matches were obtained and 38 more false matches are rejected. The 246 feature points are used for geometric registration. The performance of the geometric registration is examined with 1278 check points.
The performance of geometric registration approach was evaluated in object space. The 1278 check points were compared based on the object coordinates from depth information and the transformed coordinates from RGB images. Table 4 lists registration results including the recovered scale, rigid transformation and the statistics of discrepancies between two models after geometric integration. As Table 4 shows, the geometric registration accuracy can obtain an accuracy of less than 2 cm in all three directions. Since the model from depth images is used as reference for geometric registration accuracy evaluation and the check points are selected from different frame, the consistency between depth frames can directly influence the performance of the registration method. The inconsistency between frames grows with the distance of the trajectory due to error propagation during frames alignment. In the corridor experiment, the length of the camera trajectory is much higher than that of the outdoor experiment, the global consistency of the scene is worse than that of the scene of the outdoor. The better consistency results in higher accuracy of the initial pose parameters. Therefore, the geometric registration accuracy should be higher in the chair scene than that in the corridor scene. than 2 cm in all three directions. Since the model from depth images is used as reference for geometric registration accuracy evaluation and the check points are selected from different frame, the consistency between depth frames can directly influence the performance of the registration method. The inconsistency between frames grows with the distance of the trajectory due to error propagation during frames alignment. In the corridor experiment, the length of the camera trajectory is much higher than that of the outdoor experiment, the global consistency of the scene is worse than that of the scene of the outdoor. The better consistency results in higher accuracy of the initial pose parameters. Therefore, the geometric registration accuracy should be higher in the chair scene than that in the corridor scene. Figure 9. Results of geometric registration for corridor model. Figure 10a,b shows the original 3D scene generated from depth image and the enhanced 3D scene combining 3D information from RGB images and from depth images after geometric registration, respectively. Only a close-range scene with about 4.2 m maximum length can be obtained from the depth images. As the far-range model generated from the RGB images is added to the original 3D scene from depth image, the vertices number have a significant increase from 754,316 to 933,454 and the measurement distance can be extended to about 9 m. In this case, the information Figure 9. Results of geometric registration for corridor model. Figure 10a,b shows the original 3D scene generated from depth image and the enhanced 3D scene combining 3D information from RGB images and from depth images after geometric registration, respectively. Only a close-range scene with about 4.2 m maximum length can be obtained from the depth images. As the far-range model generated from the RGB images is added to the original 3D scene from depth image, the vertices number have a significant increase from 754,316 to 933,454 and the measurement distance can be extended to about 9 m. In this case, the information from the RGB image sequences both enriched the details for the close-range model from the depth images and greatly broadened the modeling range of the RGB-D camera.

Summary and Conclusions
The key issues that we encountered when using RGB-D sensors to produce 3D models are the limited measurement distance and the limited field of view. Other key insights of this investigation are that existing ICP frame matching techniques are not sufficient to provide robust visual odometry with these cameras; and a tight integration of depth and color information can yield robust frame matching and global optimization. We first presented a global optimization model for camera poses improvement that takes advantage of the richness of information contained in RGB images. Then we have presented a novel approach for the geometric integration of depth scene and RGB scene to enhance the mapping system of RGB-D sensors for detailed 3D modeling of large indoor environments. The 3D scene produced from the RGB images is innovatively used as supplement to the 3D scene produced by the depth sensors, which can not only enhance scene details where lack of depth information, but can also broaden the modeling range of RGB-D sensors. At the calibration stage, we employ a precise calibration method to obtain the full set of external and internal parameters as well as the relative pose between RGB camera and IR camera. In order to avoid false matches as much as possible, features extracted from RGB-D image are checked with a refined false matches rejection method. Based on the robust geometric registration method, the global scale of RGB camera motion and the rigid transformation between the RGB scene and depth scene is automatically recovered.
The benefit of the proposed global optimization method is firstly evaluated with the publicly available benchmark datasets collected with Kinect. Absolute trajectory error is used for trajectory estimation and comparative estimation. Then, we demonstrate the performance of the proposed robust geometric registration approach with results obtained when dealing with the dataset collected in inside and outside environments. The performance of the proposed enhanced mapping method is evaluated from two perspectives, the absolute accuracy of the sensor model and the relative registration accuracy between model from depth and RGB images.
Despite these encouraging results, our system has several shortcomings. The current implementation of the enhanced mapping system is not real-time. The global optimization model can handle up to about 200 frames, but we believe the model can be improved through proper algorithm optimization. The next step of this research is to concentrate on larger and more complicated environment and extend the system to implement a full modeling approach including real-time processing and mesh reconstruction.

Summary and Conclusions
The key issues that we encountered when using RGB-D sensors to produce 3D models are the limited measurement distance and the limited field of view. Other key insights of this investigation are that existing ICP frame matching techniques are not sufficient to provide robust visual odometry with these cameras; and a tight integration of depth and color information can yield robust frame matching and global optimization. We first presented a global optimization model for camera poses improvement that takes advantage of the richness of information contained in RGB images. Then we have presented a novel approach for the geometric integration of depth scene and RGB scene to enhance the mapping system of RGB-D sensors for detailed 3D modeling of large indoor environments. The 3D scene produced from the RGB images is innovatively used as supplement to the 3D scene produced by the depth sensors, which can not only enhance scene details where lack of depth information, but can also broaden the modeling range of RGB-D sensors. At the calibration stage, we employ a precise calibration method to obtain the full set of external and internal parameters as well as the relative pose between RGB camera and IR camera. In order to avoid false matches as much as possible, features extracted from RGB-D image are checked with a refined false matches rejection method. Based on the robust geometric registration method, the global scale of RGB camera motion and the rigid transformation between the RGB scene and depth scene is automatically recovered.
The benefit of the proposed global optimization method is firstly evaluated with the publicly available benchmark datasets collected with Kinect. Absolute trajectory error is used for trajectory estimation and comparative estimation. Then, we demonstrate the performance of the proposed robust geometric registration approach with results obtained when dealing with the dataset collected in inside and outside environments. The performance of the proposed enhanced mapping method is evaluated from two perspectives, the absolute accuracy of the sensor model and the relative registration accuracy between model from depth and RGB images.
Despite these encouraging results, our system has several shortcomings. The current implementation of the enhanced mapping system is not real-time. The global optimization model can handle up to about 200 frames, but we believe the model can be improved through proper algorithm optimization. The next step of this research is to concentrate on larger and more complicated environment and extend the system to implement a full modeling approach including real-time processing and mesh reconstruction.