Real-Time RGB-D Simultaneous Localization and Mapping Guided by Terrestrial LiDAR Point Cloud for Indoor 3-D Reconstruction and Camera Pose Estimation

In recent years, low-cost and lightweight RGB and depth (RGB-D) sensors, such as the Microsoft Kinect, have made rich image and depth data available, making them very popular in simultaneous localization and mapping (SLAM), which is increasingly used in robotics, self-driving vehicles, and augmented reality. RGB-D SLAM constructs 3-D environmental models of natural landscapes while simultaneously estimating camera poses. However, under highly variable illumination and motion blur, long-distance tracking can produce large cumulative errors and scale drift. To address this problem in practical applications, we propose a novel multithreaded RGB-D SLAM framework that incorporates a highly accurate prior terrestrial Light Detection and Ranging (LiDAR) point cloud, which mitigates cumulative errors and improves the system's robustness in large-scale and challenging scenarios. First, we employ deep learning for automatic system initialization and motion recovery when tracking is lost. Next, we use the terrestrial LiDAR point cloud as prior data of the landscape, and apply the point-to-plane iterative closest point (ICP) algorithm to obtain accurate camera pose control from the LiDAR point cloud, finally expanding its control range during local map construction. Furthermore, an innovative double window segment-based map optimization method is proposed to ensure consistency, real-time performance, and high accuracy of map construction. The proposed method was tested in long-distance tracking and closed-loop experiments in two different large indoor scenarios. The experimental results indicate that the standard deviation of the 3-D map construction is 10 cm over a mapping distance of 100 m, compared with the LiDAR ground truth.
Further, the relative cumulative error of the camera in the closed-loop experiments is 0.09%, far lower than that of a typical SLAM algorithm (3.4%). The proposed method is therefore demonstrated to be more robust than the ORB-SLAM2 algorithm in complex indoor environments.


Introduction
Simultaneous localization and mapping (SLAM) has been developing rapidly and is widely used for simultaneously estimating the pose of a moving platform and building a map of the surrounding environment. Sun et al. match plane features between two successive RGB-D scans, and the scans are used to estimate the camera pose [31] only if they provide sufficient constraints. Zhou et al. propose Canny-VO [32], which efficiently tracks all Canny edge features extracted from the images to estimate the camera pose. A real-time RGB-D SLAM with points and lines [33] has also been proposed. To reduce cumulative errors, a vertex-to-edge weighted closed-form algorithm [34] reduces the camera drift error for dense RGB-D SLAM, yielding significantly lower trajectory errors than several state-of-the-art methods; improving real-time performance has likewise received great attention. The back-end optimization problem has been decoupled into linear components (feature positions) and nonlinear components (camera poses) [35], significantly reducing complexity without compromising accuracy; this algorithm achieves globally consistent pose estimation in real time on a CPU. Thus, after various improvements, the above-mentioned RGB-D SLAM algorithms generally achieve good results in small indoor scenes. In large scenes, however, without prior data, large cumulative errors are often caused by complex conditions such as changes in illumination, motion blur, and feature mismatches. Although global drift can be reduced by loop closure, no loop is available in many scenarios, and computing large loops is time-consuming.
To address this problem, we utilize prior LiDAR data as control information to guide the RGB-D SLAM operation in large indoor environments; with prior data, the cumulative error problem can be avoided irrespective of the tracking distance. Recently, multisource data fusion SLAM, which combines LiDAR data and vision data, has been developed for complex indoor environments; this technology mainly involves real-time fusion and postprocessing fusion. Considerable research has addressed real-time data acquisition fusion, in which the data acquired by the laser sensor and the camera are fused directly online for positioning and mapping. A real-time method, V-LOAM [36], combines high-frequency visual odometry with low-frequency LiDAR odometry to provide localization and mapping without global map optimization. Sarvrood et al. combine stereo visual odometry, LiDAR odometry, and a reduced inertial measurement unit (IMU) to construct a reliable and relatively accurate odometry-only system [37]. Although both of these methods function well, because the LiDAR data is usually obtained in real time without preprocessing, large cumulative errors may still remain after long-distance tracking. For postprocessed fusion, the LiDAR data can be used to build a prior map that enhances visual localization; however, little research exists on this technique. Fast and accurate feature-landmark-based camera parameter estimation is achieved by adopting tentative camera parameter estimation and considering the local 3-D structure of a landscape using the dense depth information obtained by a laser range sensor [38]; however, this method provides only positioning, not map construction.
A mobile platform tracks the pose of a monocular camera with respect to a given 3-D LiDAR map, achieving excellent results under changing photometric appearance of the environment [39] because it relies only on matching geometry; however, its initialization time is long. The normalized mutual information between real camera measurements and synthetic views is maximized to localize a single monocular camera against a 3-D prior ground map generated by a survey vehicle equipped with 3-D LiDAR scanners [40]. This method does not fully utilize the scene information in the LiDAR map, only its ground portion; therefore, it is less robust in special environments. A UAV platform is used to obtain prior LiDAR data and a stereo camera is used to map large-scale outdoor scenes, but without automatic initialization or real-time performance [41]. To address the real-time map optimization problem, the method proposed in this paper adopts a multithreaded segment-based map double window optimization strategy to ensure consistency and accuracy of map construction, as well as real-time performance. The initial registration algorithms in the above-mentioned methods are time consuming or require manual operation, which greatly limits their application, whereas the proposed method uses a deep neural network to perform automatic initialization and motion recovery, improving the flexibility and robustness of the algorithm in difficult scenarios. In recent years, deep learning has advanced greatly and achieved very high performance in various fields of computer research because of its human-like cognitive characteristics. The combination of deep learning and geometric SLAM has been receiving increasing attention from researchers. Deep learning has been used for closed-loop detection [42,43], and the results indicate that it is more suitable than conventional methods based on low-level visual features.
Research has focused not only on the conventional SLAM problems of geometric mapping and localization, but also on semantic SLAM, which has gradually become a trend in the field. An object detection module realized by deep learning is integrated seamlessly with the localization module in an RGB-D SLAM framework to build semantic maps with object-level entities [44]. A robust semantic visual SLAM named DS-SLAM [45], aimed at dynamic environments, has been proposed; its absolute trajectory accuracy can be improved by one order of magnitude compared with that of ORB-SLAM2.
A review of the literature shows that visual and RGB-D SLAM generally use the natural landscape to construct a 3-D environment and simultaneously estimate the camera pose; however, under highly variable illumination and in low-textured environments, without prior data, errors accumulate during long-distance tracking and cause tracking failures, which may result in the kidnapped-robot problem. This limitation often hinders the practicality of SLAM technology in real-life applications. Moreover, many SLAM applications, such as the Google self-driving car, which relies on a high-definition map, and indoor navigation using a prior map, do not require a moving platform to navigate an unknown environment. Therefore, this research proposes a novel RGB-D SLAM approach that, together with deep learning, utilizes highly accurate prior LiDAR point cloud data as guidance for indoor 3-D scene construction and camera pose estimation. The contributions of this research can be summarized as follows.
(1) Using an effective combination of prior LiDAR data and deep learning, an approach for automatic SLAM system initialization and motion recovery in complex indoor environments is provided. (2) Prior LiDAR data of the indoor scene, together with a multithreaded segment-based map double window optimization strategy, is used to eliminate cumulative error in camera poses and map construction, as well as to improve real-time performance. (3) A feasible solution is provided for automatic matching of 3-D LiDAR data and 2-D image data from the RGB-D sensor in large indoor scenes, and for fusing the 2-D and 3-D data into an accurate environmental model of the scene, thereby compensating for the deficiency of a single data source. (4) To ensure accuracy, real-time performance, and robustness, prior LiDAR data and deep learning are integrated into ORB-SLAM2, creating a complete and novel SLAM framework.

Materials and Methods
The proposed method, built on the ORB-SLAM2 framework, uses highly accurate terrestrial LiDAR point cloud data to guide the RGB-D SLAM in real time within a multithreaded framework, constructing large-scale, accurate 3-D textured models and obtaining accurate camera poses, so that the accumulated error of the RGB-D SLAM during long-distance tracking is eliminated. The proposed algorithm has six key characteristics: automatic system initialization and motion recovery in difficult scenarios, real-time performance, accurate prior LiDAR data, drift-free operation, dynamic object culling, and an accurate and consistent global map. The approach primarily and innovatively integrates existing methods so that the system operates in real time with high precision and remains robust in challenging environments. Figure 1 illustrates the proposed method in a detailed flowchart composed of five parts: (1) acquisition and preprocessing of LiDAR data from the experimental site, (2) accurate automatic initialization of the RGB-D SLAM algorithm in the LiDAR data coordinate framework, (3) accurate camera pose determination using the ICP algorithm with the prior LiDAR data as ground truth control, (4) accurate construction of the 3-D indoor map by multithreaded segment-based map optimization using the LiDAR-corrected camera poses from the previous step, and (5) improvement of the algorithm's robustness in challenging environments using the prior LiDAR data and a combination of deep learning and geometric SLAM. The proposed algorithm has four main threads: pose tracking, map construction, loop closing, and LiDAR data optimization. The LiDAR data optimization thread includes two key functions: camera pose estimation with LiDAR ground control information and multithreaded segment-based map optimization.

LiDAR Data Acquisition and Preprocessing
We used a 3-D ground-based LiDAR scanner RIEGL VZ-1000 (RIEGL Laser Measurement Systems GmbH, Horn, Austria, shown in Figure 2), capable of providing a range of 1400 m, measurement accuracy of 5 mm, measurement rate of 300,000 points/second, and field of view of 100° × 360°. Single scans are registered together to form a single model, as shown in Figure 3, using the RIEGL LiDAR processing software. Table 1 lists the specifications of the LiDAR scanner RIEGL VZ-1000.


Because the volume of data output by the 3-D LiDAR scanner is huge, the collected data must be preprocessed to ensure efficient real-time operation of the proposed algorithm. For this purpose, downsampling and a KD-tree data structure are applied to the LiDAR point cloud. This study uses a voxel resolution of 5 cm to downsample the point cloud, retaining the geometrical information of the point cloud data with minimum data volume, as shown in Figure 4. Further, the KD-tree, used as the storage and management structure for the point cloud, greatly improves the retrieval rate of the cloud data in large-scale indoor scenes.
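The 5 cm voxel-grid downsampling described above can be sketched in a few lines of numpy. This is a minimal, centroid-per-voxel illustration (a production pipeline would more likely use a point-cloud library's voxel filter; the function name here is illustrative):

```python
import numpy as np

def voxel_downsample(points, voxel=0.05):
    """Keep one representative point (the centroid) per occupied voxel.

    points: (N, 3) float array; voxel: edge length in metres (5 cm here).
    """
    keys = np.floor(points / voxel).astype(np.int64)
    # Group points by voxel index and average each group.
    _, inverse, counts = np.unique(keys, axis=0,
                                   return_inverse=True, return_counts=True)
    sums = np.zeros((counts.size, 3))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]
```

Averaging within each voxel, rather than keeping an arbitrary member point, preserves the local surface position while cutting the data volume roughly in proportion to point density.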

Automatic and Accurate System Initialization
System initialization is a prerequisite for the proposed algorithm: if the system cannot initially match the LiDAR data accurately under its coordinate framework, the LiDAR data cannot be used as ground truth for subsequent optimization. In this study, deep learning (PoseNet) is used to automatically align the prior LiDAR point cloud with the RGB-D point cloud in a single coordinate framework. The PoseNet network is derived from GoogLeNet, a 22-layer convolutional network with six "inception modules" and two additional intermediate classifiers. PoseNet replaces all three softmax classifiers in GoogLeNet with affine regressors: the softmax layers are removed and each final fully connected layer is modified to output a 7-D pose vector representing position (3) and orientation (4). Based on our previous work [46], we first manually aligned the prior LiDAR data with the RGB-D point cloud, obtained consecutive video frames labeled with camera poses produced by the relocalization function of the RGB-D SLAM method for the corresponding scene, and trained the CNN on these frames, which produced an average positional error of about 1 m and an average angular error of 4°. The initial coarse camera pose obtained from deep learning therefore still contains errors (i.e., misalignment between the RGB-D point cloud and the corresponding LiDAR point cloud), as indicated clearly in Figure 5b. A more accurate and refined initial camera pose was then obtained with the nearest-neighbor iterative algorithm, initialized by the pose from deep learning, as shown in Figure 5c. Figure 6 demonstrates the final initialization result with accurately matched RGB-D and LiDAR scenes.
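The regressed 7-D pose vector must be converted to a rigid transform before it can seed the subsequent nearest-neighbor refinement. A sketch, assuming the common position-then-quaternion layout with a w-first quaternion (the exact ordering in a given PoseNet implementation is an assumption here):

```python
import numpy as np

def pose7_to_matrix(p):
    """Convert a 7-D pose [x, y, z, qw, qx, qy, qz] to a 4x4 SE(3) matrix."""
    x, y, z, qw, qx, qy, qz = p
    # Regressed quaternions are generally not unit length, so normalise first.
    n = np.sqrt(qw*qw + qx*qx + qy*qy + qz*qz)
    qw, qx, qy, qz = qw/n, qx/n, qy/n, qz/n
    R = np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qw*qz),     2*(qx*qz + qw*qy)],
        [2*(qx*qy + qw*qz),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qw*qx)],
        [2*(qx*qz - qw*qy),     2*(qy*qz + qw*qx),     1 - 2*(qx*qx + qy*qy)],
    ])
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = [x, y, z]
    return T
```

The resulting matrix places the RGB-D frame coarsely in the LiDAR coordinate system; the iterative refinement then removes the residual misalignment seen in Figure 5b.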

Accurate Camera Pose Determination Using ICP with LiDAR and RGB-D Point Cloud Data
After the proposed algorithm is successfully initialized, the prior LiDAR point cloud is aligned to the RGB-D SLAM coordinate frame. The objective now is to obtain accurate camera pose ground control information from the prior LiDAR data, and to control the operation and optimization of the RGB-D SLAM in the LiDAR data optimization thread. Figure 7 illustrates the framework of the LiDAR-guided camera pose estimation algorithm, which comprises three steps: (1) selection of the keyframes to be corrected and generation of the corresponding RGB-D point cloud scene, with dynamic object culling and denoising of the depth image data, (2) acquisition of the local LiDAR point cloud based on the initial pose produced by the visual tracking thread, and (3) use of the point-to-plane ICP algorithm to obtain and refine accurate keyframe poses and transfer them to the local map through the keyframe covisibility graph. The following subsections describe each step in detail.


Selection of Keyframes to Be Corrected and the Generation of the Corresponding RGB-D Point Cloud Scene
In a typical visual SLAM method, the error in camera pose estimation accumulates over time. Although it is unnecessary to correct the cumulative error of every frame with the LiDAR data, the accumulated error must be kept low enough to remain within the initial range required for the ICP to converge to the global optimum. To ensure both drift-free and real-time performance, LiDAR data are used to eliminate the accumulated error every 40 keyframes. A 40-keyframe interval was chosen experimentally as a balance between successful ICP convergence and real-time performance. For instance, with an interval below 20 keyframes, the LiDAR data optimization thread falls behind the tracking thread, delaying the camera pose correction of the current frame; with an interval above 60 keyframes, the accumulated error grows large enough that the ICP fails to converge to the global optimum, because the camera pose provided by the tracking thread is used as its initial value.
Furthermore, because of the complexity of the test environment, the keyframes selected for correction are further filtered according to the field operating conditions of the proposed algorithm. For instance, when the camera rotates quickly, the cumulative error grows accordingly, because the tracked image becomes blurred and the feature mismatch rate increases. Therefore, the selection criterion is whether the change in the camera's forward (positive viewing) direction between two adjacent keyframes exceeds a certain threshold. If the threshold is not exceeded, the camera is moving forward smoothly; if it is exceeded, the camera is rotating too fast and the keyframe must be corrected. The threshold was set to 45° in our research.
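The 45° criterion can be evaluated directly from two keyframe rotation matrices. A sketch, assuming the camera's viewing direction is the third column of the rotation (the common camera-frame convention; the function name is illustrative):

```python
import numpy as np

def needs_correction(R_prev, R_curr, thresh_deg=45.0):
    """Flag a keyframe whose camera forward axis turned more than
    thresh_deg relative to the previous keyframe."""
    z_prev = R_prev[:, 2]              # camera viewing (positive-z) direction
    z_curr = R_curr[:, 2]
    cos_a = np.clip(z_prev @ z_curr, -1.0, 1.0)
    return np.degrees(np.arccos(cos_a)) > thresh_deg
```

Note that comparing viewing directions deliberately ignores roll about the optical axis, which matches the "change in positive direction" wording above.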
On the other hand, dynamic objects in the scene may cause error in pose tracking, because they interfere with the acquisition of accurate pose control information from LiDAR data. To address this problem, the optical flow method is used to detect and remove dynamic objects in RGB and depth images in real time, as shown in Figure 8. Because variable illumination and infrared reflection characteristics of the surface material of the object may produce noise in the depth data, a bilateral filter denoising method, which combines the spatial proximity and similarity of the pixel values, is adopted to preserve detail and remove noise from the depth image data.
After selecting the keyframe, culling the dynamic objects, and removing noise, the RGB-D point cloud is generated from the depth information of the RGB-D camera and the rough pose of the tracking thread. Because the volume of the RGB-D point cloud is huge, which reduces the efficiency of the ICP iterations, a voxel grid downsamples the point cloud to a resolution of 5 cm, yielding a sparse RGB-D point cloud for matching with the LiDAR point cloud.
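Generating the RGB-D point cloud reduces to back-projecting each valid depth pixel through the pinhole intrinsics. A minimal sketch, assuming a Kinect-style sensor that stores depth in millimetres (the depth scale is an assumption, not stated in the paper):

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy, depth_scale=1000.0):
    """Back-project a depth image into an (N, 3) camera-frame point cloud.

    depth: (H, W) array with 0 marking invalid pixels;
    fx, fy, cx, cy: pinhole intrinsics in pixels.
    """
    v, u = np.nonzero(depth)              # pixel rows (v) / columns (u) with valid depth
    z = depth[v, u] / depth_scale         # metres
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.column_stack((x, y, z))
```

The resulting cloud, transformed by the rough tracking pose, is what the voxel-grid downsampling above reduces to 5 cm resolution before ICP matching.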
According to the rough pose provided by the visual tracking of the keyframe, the LiDAR point cloud covering the complete experimental site is filtered to obtain local LiDAR point cloud data, ensuring a high degree of overlap with the RGB-D point cloud of the current keyframe for the ICP iteration. The degree of overlap between the acquired local LiDAR data and the RGB-D point cloud depends on the cumulative error of the keyframe camera pose.
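A minimal sketch of the local LiDAR cropping, assuming a simple radius criterion around the rough camera position (the paper does not state the exact filter shape or radius):

```python
import numpy as np

def crop_local_lidar(lidar_points, cam_position, radius=10.0):
    """Keep only LiDAR points within `radius` meters of the rough camera
    position, so the local cloud overlaps the current RGB-D keyframe cloud."""
    d2 = np.sum((lidar_points - cam_position)**2, axis=1)
    return lidar_points[d2 <= radius**2]
```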


Accurate Pose Determination
Owing to the complexity of the indoor scene, obtaining a precise camera pose from the LiDAR data is challenging. Generally, an indoor environment satisfies the Manhattan world hypothesis, being mainly composed of cuboid objects, which provide abundant structured geometric information in the LiDAR data. Therefore, the point-to-plane ICP iterative algorithm, initialized by the pose of the visual tracking thread, achieves fast and robust convergence to the global optimum and can obtain precise camera pose control information, although the point-to-plane ICP iterative algorithm itself is not novel. Equation (1) gives the point-to-plane ICP distance error function:

E(α, β, γ, t_x, t_y, t_z)_k = Σ_{p_i ∈ P} w_ij ((A_j x'_i + B_j y'_i + C_j z'_i + D_j) / √(A_j² + B_j² + C_j²))², (1)

where (x'_i, y'_i, z'_i) are the coordinates of p_i transformed by T(α, β, γ, t_x, t_y, t_z); P is the RGB-D point cloud data generated by the corrected keyframe (Figure 9); Q is the local LiDAR point cloud data; A_j, B_j, C_j, and D_j are the plane parameters of the tangent plane S_j^k; w_ij is the weight of the distance from the point p_i to the plane S_j^k (usually set to one); α, β, and γ are rotation angles; and t_x, t_y, and t_z are translation parameters. The registration algorithm finds the final transformation T that minimizes the error function E(α, β, γ, t_x, t_y, t_z)_k by iteratively applying the least squares method; the iteration terminates when the number of iterations exceeds the maximum number of iterations, or when the registration error defined in Equation (2) is less than a given threshold.
E(α, β, γ, t_x, t_y, t_z)_k / N < e, (2)

where e is the error threshold and N is the number of used points p_i.
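One linearized point-to-plane step can be sketched as follows, assuming the usual small-angle approximation of the rotation; `src` plays the role of the RGB-D cloud P, while `dst` and `normals` represent the sampled LiDAR tangent planes S_j^k (a sketch of the standard method, not the authors' code):

```python
import numpy as np

def point_to_plane_step(src, dst, normals, weights=None):
    """One linearized point-to-plane ICP step (small-angle approximation).
    Solves for x = (alpha, beta, gamma, tx, ty, tz) minimizing
    sum_i w_i * (n_i . (src_i + omega x src_i + t - dst_i))^2."""
    if weights is None:
        weights = np.ones(len(src))
    A = np.hstack([np.cross(src, normals), normals])   # N x 6 Jacobian
    b = -np.einsum('ij,ij->i', normals, src - dst)     # negated residuals
    x, *_ = np.linalg.lstsq(weights[:, None] * A, weights * b, rcond=None)
    return x  # (alpha, beta, gamma, tx, ty, tz)

def registration_rms(src, dst, normals):
    """RMS point-to-plane error used as the iteration stopping criterion."""
    r = np.einsum('ij,ij->i', normals, src - dst)
    return np.sqrt(np.mean(r**2))
```

In a full ICP loop, this step would be repeated with re-established point-to-plane correspondences until the maximum iteration count is reached or the registration error drops below the threshold.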
Using the above-mentioned ICP algorithm, the camera pose is corrected and an accurate pose is obtained. However, because the effect of a single LiDAR-corrected keyframe camera pose on the map is small, the accurate pose must be propagated to the local map through the keyframe covisibility graph to expand the control range of the accurate pose. The pose transfer to the local map is described in Equation (3):

T_{c_i w} = T_{c_i c} T_{cw}, p_j^{iw} = T_{c_i w}^{-1} p_j^{ic}, c_i ∈ K_f, (3)

where T_{cw} denotes the LiDAR-corrected pose of the keyframe k_f.
In Equation (3), K_f is the set of keyframes that share observations with the corrected keyframe k_f, T_{c_i c} is the relative pose between a common-view keyframe and the corrected keyframe, T_{c_i w} is the corrected pose of the common-view keyframe, p_j^{iw} is the corrected 3D coordinate of the j-th map point observed by the common-view keyframe, and p_j^{ic} is the coordinate of that map point in the keyframe camera coordinate system. After the above steps are performed, the accuracy of the local map around the corrected keyframe is improved, and the control range of the accurate pose information in the map is expanded.
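The pose propagation of Equation (3) can be sketched with 4x4 homogeneous transforms, assuming the world-to-camera convention (the covisibility-graph bookkeeping of the actual system is omitted):

```python
import numpy as np

def propagate_pose(T_cw_corrected, covisible):
    """Propagate the LiDAR-corrected pose T_cw of keyframe kf to its
    covisible keyframes via their stored relative poses T_ci_c, and
    re-project their map points. Poses are 4x4 world-to-camera transforms;
    `covisible` is a list of (T_ci_c, points_in_camera_frame) pairs."""
    corrected = []
    for T_ci_c, points_cam in covisible:
        T_ci_w = T_ci_c @ T_cw_corrected                  # Eq. (3): corrected pose
        T_w_ci = np.linalg.inv(T_ci_w)
        pts_h = np.hstack([points_cam, np.ones((len(points_cam), 1))])
        points_world = (T_w_ci @ pts_h.T).T[:, :3]        # corrected map points
        corrected.append((T_ci_w, points_world))
    return corrected
```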

Sparse Maps and Semi-Dense Map Construction
Although a map model of the scene is available as the LiDAR data, this data often contains only intensity information, not texture information. In addition, data for some detailed parts of the scene are missing because of occlusion during scanning. Therefore, the proposed algorithm constructs high-precision maps with texture information under the constraint of the prior LiDAR data, enriches the content and usability of the LiDAR map, and patches data holes.

The proposed method is based on the ORB-SLAM2 algorithm framework, which has a sparse map construction thread. The sparse map consists of keyframes and the corresponding 3-D ORB feature points. The 3-D ORB feature points are created by projection based on the keyframe camera pose and the depth of the ORB feature points. The final keyframes and 3-D ORB feature points in the local map are then created by applying special selection strategies to candidates in the local mapping thread. Local and global optimization of the sparse map in ORB-SLAM2 are applied without external constraints; thus, errors in both the camera pose and the sparse map accumulate after long distance tracking. The proposed method further optimizes the sparse map constructed by ORB-SLAM2 using the precise camera pose constraints provided by the LiDAR data to eliminate the cumulative error. Because the optimization operation is performed once every 40 keyframes, the optimization frequency is high. Therefore, to achieve good real-time performance as well as high accuracy and consistency of the sparse map optimization, a multithreaded segment-based map double window optimization strategy is adopted.
After optimization, precise camera poses for all keyframes were obtained by combining the depth image data of keyframes to recover the dense point cloud model of the scene. Then we constructed a semi-dense map based on keyframes, which fuses the 2-D rich texture information from image data and the high-precision 3-D geometrical information from the terrestrial LiDAR point cloud data.

Segment-Based Sparse Map Optimization Using Double Window Approach
In the previous sections, though locally optimized poses were obtained, the map did not reach the global optimum. Therefore, to achieve good real-time performance as well as a globally optimal and consistent map, we adopted a multithreaded segment-based map double window optimization strategy that takes the first and last keyframes of the current map segment, corrected by the ICP algorithm with the LiDAR data, as true values, and then applies the double window approach to optimize this map segment.
In multithreaded segment-based map optimization, as shown in Figure 10, the visual tracking thread operates on the current map segment, whereas the LiDAR data optimization thread operates on the previous map segment. When the visual tracking thread starts tracking the next map segment from t_1 to t_2, the LiDAR data optimization thread simultaneously completes optimizing the previous map segment and starts optimizing the next segment; therefore, real-time optimization and tracking performance are ensured.

In the double window approach, as shown in Figure 11, the optimization window is divided into a front window and a back window. The front window constraint ensures the accuracy of the new map, and the back window constraint ensures consistency between the front and rear map segments. The front window comprises 40 keyframes of the current segment and uses the pose-point constraint to control the precise construction of this local map segment. The back window consists of an optimized set of keyframes that observe the map points in the front window. The back window constraints include the pose-pose constraints between keyframes with commonly visible map points in the back and front windows, as well as the pose-point constraints between the keyframes in the back window and the map points in the front window. These two constraints help achieve global consistency of the map.
To improve the optimization efficiency and reduce memory consumption, map points that have already been optimized multiple times and back window keyframes that are far from the front window are removed from the current optimization. This research considers map points that have been optimized more than 6 times to be sufficiently accurate and removes them from the current optimization, yielding the set of front window map point optimization variables V_{p-front}. The set of front window keyframe optimization variables is V_{kf-front}. Keyframes at a distance of more than 4 m from the front window were removed; the set of back window keyframe optimization variables is V_{kf-back}.
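The two culling rules can be sketched as simple filters (the dict-based structures and field names here are illustrative, not the paper's data types):

```python
def select_optimization_variables(map_points, back_keyframes,
                                  max_opt_count=6, max_dist=4.0):
    """Culling used before double-window optimization: map points already
    optimized more than `max_opt_count` times are considered accurate and
    fixed, and back-window keyframes farther than `max_dist` meters from
    the front window are dropped from the optimization."""
    v_p_front = [p for p in map_points if p['opt_count'] <= max_opt_count]
    v_kf_back = [k for k in back_keyframes if k['dist_to_front'] <= max_dist]
    return v_p_front, v_kf_back
```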
Figure 11. Front window and back window.


Error Equation Formation
We constructed the total error equation in the current segment, which is given by the sum of the pose-point constraint errors of the front window and the pose-pose and pose-point constraint errors of the back window. The Jacobian matrix (J), Hessian matrix (H), and gradient vector (g) of this error function can be obtained as H = J^T J and g = J^T e, and each iteration solves (H + λI) Δx = -g. By the above steps, the optimization model for the double window is established, where λ is the positive definite damping factor in the LM iterative optimization algorithm.
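A single damped update of this kind can be sketched as follows, assuming a generic residual Jacobian J and residual vector r (this is the textbook LM step, not the paper's specific solver):

```python
import numpy as np

def lm_step(J, residuals, lam):
    """One Levenberg-Marquardt update: solve (J^T J + lam*I) dx = -J^T r.
    H = J^T J is the Gauss-Newton Hessian approximation and g = J^T r
    the gradient of the total error 0.5 * ||r||^2."""
    H = J.T @ J
    g = J.T @ residuals
    return np.linalg.solve(H + lam * np.eye(H.shape[0]), -g)
```

For lam = 0 the step reduces to Gauss-Newton; increasing lam shortens the step toward gradient descent when the linearization is unreliable.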

Combination of Deep Learning CNN with Geometric SLAM
The deep learning CNN positioning system (PoseNet) consists of the GoogLeNet classification deep convnet and a pose regression convnet, which operate on high-level semantic features and output a pose vector P given by the camera position x and the orientation represented by a quaternion q. The deep learning positioning system is resistant to motion blur, variable illumination, and low-texture scenes. Therefore, the proposed algorithm, which uses the ICP iterative algorithm initialized by the coarse pose from the deep learning network to obtain a precise camera pose, can recover the camera motion after tracking failure and greatly improves the algorithm's robustness in difficult environments.

Results
An indoor scene generally consists of two types of geometric structure: rectangular block structures and curved surface structures. To evaluate the effectiveness of the proposed algorithm in different indoor scenarios, we used an RGB-D camera (Kinect v1) to capture image and depth data in two typical indoor scenes located in the New Technology Park of the Aerospace Information Research Institute, Academy of Sciences: the third floor of Block A, consisting of planar rectangular block structures, and a closed round hall, consisting of curved surface structures. Because the published data sets do not include terrestrial LiDAR data of the scenes, we conducted the experiments with two data sets collected at each experimental site: a long-distance tracking data set and a closed-loop data set. A RIEGL VZ-1000 terrestrial LiDAR scanner was used to collect the LiDAR data. The point cloud data had millimeter-level accuracy.
Before performing the experiment, we used the OpenCV camera calibration module to conduct the camera calibration process. The calibration parameters include lens distortion coefficients and the internal parameters, as seen in Table 2.

Long Distance Tracking
We designed this experiment to verify whether the proposed method can eliminate the cumulative error of RGB-D SLAM during long distance tracking in a corridor scene with poor texture information. The experiments were performed over a travel distance of 100 m using a laptop (Dell Precision 7530 GPU workstation) with an Intel Core i7-8750H CPU (12 threads, 2.20 GHz), 16 GB RAM, and a 6 GB NVIDIA GP104GLM GPU. Figure 12a,b shows the sparse maps constructed by the two methods. To quantitatively assess the accuracy of the model, Table 3 shows the matching error between the ground truth LiDAR data and the sparse maps generated by the two methods. As seen in Table 3, the mean error of our method is 0.08 m, whereas that of ORB-SLAM2 is 1.18 m. We then conducted a loop closure experiment over a travel distance of 100 m. Figure 13a,b shows the maps and camera trajectories generated by the proposed method and ORB-SLAM2. It is observed that, in Figure 13a, the map has been wrongly divided into two.
As seen from the statistical results in Table 4 of the matching error between the models generated by the two methods in the loop closure experiment and the ground truth LiDAR data, the mean error of our method is 0.1722 m, whereas that of ORB-SLAM2 is 0.7337 m. Figure 14 compares the camera trajectories produced by the two methods in the loop closure experiment; the trajectory in red was generated by the proposed method, and that in white was produced by ORB-SLAM2. When the map segment optimization approach was applied, the continuity of the trajectory was affected, as shown in Figure 13. Poor continuity occurs when the camera's cumulative error is corrected by the LiDAR data. The quantitative statistical results are provided in Table 5. It can be observed from Table 5 that the total lengths of the trajectories of the same data set produced by the two methods differ because of the difference in the cumulative error in the camera trajectory. The proposed method has a closure error of 0.0977 m and an error percentage of 0.09%, whereas ORB-SLAM2 has a much larger closure error of 3.4247 m and an error percentage of 3.46%.

Experiment in Round Hall
Another long-distance tracking experiment was conducted in a closed round hall with two floors and a circumference of 85 m, as seen in Figure 15. Compared with the corridor scene, this scene was more complex: the distance from the ceiling to the ground was about 11 m, the interior structure consisted of curved surfaces, and a lot of empty space existed in the middle of the hall. Therefore, obtaining accurate camera pose control information from the prior LiDAR data is more difficult. The hall supported controlling the lighting conditions, which made it possible to test the robustness of the approach under variable illumination. Figure 16a,b shows the sparse maps constructed by the two methods, and Figure 16c,d shows the camera trajectories. Clearly, the camera trajectory of the proposed method (shown in red) is generally consistent with the actual scene, whereas that of ORB-SLAM2 (shown in white) has a large misalignment. Table 6 shows the matching error between the models and the LiDAR data. The mean error of the proposed method is 0.2142 m, whereas that of ORB-SLAM2 is 0.6406 m. The results in the two typical indoor scenes illustrate that the proposed method can eliminate the cumulative error in long distance tracking.

Table 6. Statistical results of matching error between the sparse maps generated by the two methods and the ground truth LiDAR data in closed round hall.

Deep Learning CNN Training and System Robustness Test
By applying the proposed algorithm with manually selected point registration between the LiDAR data and the RGB images, an accurate sparse map combining high-precision 3-D geometrical information with rich 2-D texture information can be constructed. The precise camera pose of a multiview image, with centimeter-level accuracy, can then be obtained by the relocalization function based on the bag-of-words model. Therefore, the proposed method can easily produce a high-quality training dataset for the deep learning positioning system, which reduces the cost of labor. We trained the deep learning positioning network with 7813 RGB images with pose labels. The training dataset covered essentially all corners of the closed round hall, an area of 796.1216 square meters.
As seen from Figures 17 and 18 and Table 7, the deep learning positioning network trained with a high-quality and complete dataset produces less error in a large indoor space; the average position error was 0.77 m and the angular error was ~4°. With these error levels, the acquired camera pose can be used as an initial value of the ICP iterative algorithm for obtaining an accurate camera pose from the LiDAR data.
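The reported position and angular errors can be computed as sketched below, assuming unit quaternions for orientation (standard error metrics, not code from the paper):

```python
import numpy as np

def position_error(t_est, t_gt):
    """Euclidean position error in meters."""
    return float(np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt)))

def angular_error_deg(q_est, q_gt):
    """Angle (deg) between two unit quaternions; the absolute value of
    the dot product makes the metric invariant to the q / -q ambiguity."""
    d = abs(float(np.dot(q_est, q_gt)))
    return float(np.degrees(2.0 * np.arccos(np.clip(d, -1.0, 1.0))))
```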

In the closed round hall scene, we changed the illumination suddenly and moved the camera quickly to induce tracking failure. The relocalization based on the bag-of-words model in ORB-SLAM2 was unusable in this condition, whereas the deep learning approach based on high-level semantic features recovered the motion successfully. The experimental results show that the constructed map overlaps well with the LiDAR data after restoration of system motion, as shown in Figure 19a,b and Table 8.

Definition of Keyframe
Before providing a detailed discussion, we would like to elaborate on the definition of a keyframe. The image insertion frequency is very high during SLAM operation, which leads to a rapid increase in information redundancy. The redundant information is of low accuracy and does not improve the result, yet it requires more computing resources. Therefore, the SLAM algorithm uses a specific selection strategy to choose a subset of image frames as keyframes for localization and mapping. The purpose of this strategy is to obtain the best balance between information redundancy and computing resource consumption without loss of accuracy. Generally, as shown in Table 9, the visual tracking thread takes ~550 ms to select a new keyframe, and ~30 ms to receive a frame from the camera sensor.

Keyframe-Based LiDAR Optimization in a Multithreaded Framework
In terms of accuracy, the ICP iterative algorithm can reach the global optimum even when the geometric information of the scene is not rich, for instance, when the scene contains only one or several planes or cylindrical geometry, and the pose from the ICP iterative calculation provides a good constraint on the direction of the camera. Therefore, the geometric information of the scene can be used to constrain the visual SLAM. The local LiDAR-corrected pose is propagated into the local map construction, which can then provide an appropriate initial value for the map segment iterative optimization algorithm and make the map optimization quickly reach the global optimum.
After conducting a large number of experiments under normal tracking in different scenes, we obtained the correlation between the number of keyframes per map segment and the absolute cumulative error obtained by registration with the LiDAR data and the ICP iterations, as well as the correlation between the ICP iterations and the absolute cumulative error. As seen from Figure 20a,b, as the number of keyframes increases, the cumulative errors increase. When the number of keyframes per map segment is within 80 frames, the position error is generally below 2.5 m, the angular error is below 10°, and the ICP iterative algorithm can quickly converge to the global optimal value within at most 35 iterations, as shown in Figure 20e. However, as the cumulative error increases, the locally acquired LiDAR data may not overlap with the point cloud generated by the depth image, resulting in the ICP algorithm acquiring an inaccurate camera pose.
In terms of real-time performance, voxel downsampling of the LiDAR data of a large scene at a certain resolution greatly reduces the amount of data used. Furthermore, LiDAR data retrieval for large scenes requires high computation time. In this study, the pose obtained by camera tracking is used as the prerequisite for LiDAR data retrieval, and the LiDAR data is organized by the KD-tree method, which greatly reduces the computation time. In addition, the point cloud data collected by the RGB-D camera is downsampled at a certain resolution to reduce the number of points participating in the ICP iteration.
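A minimal KD-tree for nearest-neighbor retrieval can be sketched as follows (an illustrative pure-Python version; a production system would use an optimized library such as FLANN or nanoflann):

```python
def build_kdtree(points, depth=0):
    """Minimal 3-D KD-tree: median split on cycling axes."""
    if len(points) == 0:
        return None
    axis = depth % 3
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {'point': pts[mid], 'axis': axis,
            'left': build_kdtree(pts[:mid], depth + 1),
            'right': build_kdtree(pts[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Recursive nearest-neighbor search with branch pruning.
    Returns (point, squared_distance)."""
    if node is None:
        return best
    p, axis = node['point'], node['axis']
    d2 = sum((a - b) ** 2 for a, b in zip(p, query))
    if best is None or d2 < best[1]:
        best = (p, d2)
    diff = query[axis] - p[axis]
    near, far = (node['left'], node['right']) if diff < 0 else (node['right'], node['left'])
    best = nearest(near, query, best)
    if diff ** 2 < best[1]:          # search sphere crosses the splitting plane
        best = nearest(far, query, best)
    return best
```

Building the tree once over the (downsampled) LiDAR cloud makes each correspondence query logarithmic instead of linear in the number of points.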
The adopted map point culling strategy of the front window and the keyframe culling strategy of the back window eliminate redundant optimization variables, thereby reducing the computation time of the Jacobian and Hessian matrices. Multithreaded segment optimization avoids global optimization over the entire map, keeping the optimization time within a fixed threshold while sharing the optimized results across threads in real time. As seen from Figure 21a,b, the average optimization time per keyframe decreases as the number of keyframes per map segment increases; when the number of keyframes is less than 20, the optimization time exceeds 75 ms. Therefore, for real-time operation, the number of keyframes per map segment should be at least 20. Balancing real-time performance against accuracy, we set the number of keyframes per map segment to 40. Table 10 shows the time spent by the LiDAR data optimization thread for every 40 keyframes; each operation in the table was performed once per 40 keyframes, so the average time spent on each keyframe is very small. As seen from Table 11, each keyframe spends only ~40.20 ms in the optimization operation. It must be noted that the tracking thread took about 550 ms to create a new keyframe, as shown in Table 9. These results clearly demonstrate the effective real-time performance of the proposed method.
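A keyframe culling rule of the kind described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: the thresholds (a keyframe is redundant when 90% of its map points are observed by at least three other keyframes) are assumed values borrowed from common visual-SLAM practice, and `cull_redundant_keyframes` is a placeholder name.

```python
def cull_redundant_keyframes(observations, min_views=3, redundancy=0.9):
    """observations: dict mapping keyframe id -> set of map-point ids it sees.
    A keyframe is redundant if at least `redundancy` of its map points are
    also observed by `min_views` or more other keyframes."""
    redundant = []
    for kf, pts in observations.items():
        if not pts:
            continue
        covered = sum(
            1 for p in pts
            if sum(p in other for k, other in observations.items() if k != kf)
            >= min_views
        )
        if covered / len(pts) >= redundancy:
            redundant.append(kf)
    return redundant
```

Removing such keyframes (and, symmetrically, weakly observed map points in the front window) shrinks the number of residual blocks, which is what reduces the Jacobian and Hessian computation time noted in the text.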
In terms of consistency, each optimized map segment had a 25% overlap with the previous segment, and the keyframe poses at both ends of the segment remained fixed as ground-truth constraints. With the double window optimization strategy, the front window uses map point-pose constraints to construct an accurate map, while the back window adopts map point-position and pose-pose constraints to ensure consistency between the front and rear map segments.
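The overlapping segmentation scheme above can be sketched as a simple partition of the keyframe index sequence: windows of 40 keyframes advance by 75% of their length, so consecutive windows share 25% of their frames, and the shared boundary keyframes are the ones held fixed during optimization. This is an illustrative sketch under those assumptions, not the authors' code.

```python
def segment_keyframes(n_keyframes, segment_len=40, overlap=0.25):
    """Split keyframe indices 0..n_keyframes-1 into optimization windows of
    `segment_len` frames, each overlapping the previous window by `overlap`
    of its length."""
    step = int(segment_len * (1 - overlap))  # 40 * 0.75 = advance by 30
    segments, start = [], 0
    while start < n_keyframes:
        end = min(start + segment_len, n_keyframes)
        segments.append(list(range(start, end)))
        if end == n_keyframes:
            break
        start += step
    return segments
```

For 100 keyframes with the paper's settings this yields windows [0..39], [30..69], and [60..99]; the 10 shared frames at each junction anchor the new segment to the previously optimized one.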
In terms of practicability, the above experiments and analyses of real-time operation, accuracy, and consistency demonstrate that the proposed algorithm can accurately map and localize in most indoor scenes in real time, which is beneficial in large indoor environments such as airports and shopping malls. With advancements in sensor technology, cost-effective and portable depth cameras will become widespread. Beyond real-time positioning and mapping of the surrounding environment, users carrying depth cameras can access additional services and experiences, such as augmented reality online games, path planning, and intelligent interaction with the environment, based on the precise map and camera pose. Although high-precision LiDAR data for large indoor scenes are difficult to obtain, once acquired and processed, the data can be used permanently, and all users in the environment can share them for high-precision positioning and mapping. Even if the scene changes, the LiDAR data can be automatically updated using the high-precision model produced by the proposed algorithm, eliminating the need for rescanning.

Conclusions
A novel RGB-D SLAM approach that uses a priori LiDAR point cloud data as guidance for indoor 3-D scene construction and navigation was presented in this paper. The method contains three important innovations: first, the high-precision geometric control information from the LiDAR data of the indoor scene was used to eliminate the cumulative error of the RGB-D SLAM system in large-scale indoor operation; second, the combination of deep learning and a geometric SLAM system enabled automatic initialization and motion recovery, which improved the robustness of the system in challenging environments; and third, a multithreaded map segment optimization method using a double window was adopted to ensure that the algorithm runs in real time with high precision under the guidance of the LiDAR data, constructing a large-scale, accurate, and consistent dense map that can fill holes in the LiDAR data. The standard deviation of the 3D map construction was approximately 10 cm over a travel distance of 100 m compared with the LiDAR ground truth, and the relative cumulative error of the camera in closed-loop experiments was 0.09%, over an order of magnitude lower than that of a typical SLAM algorithm (3.4%). Thus, the positioning accuracy and map construction accuracy of the proposed system are significantly higher than those of state-of-the-art SLAM systems that do not utilize a priori data. Furthermore, the fusion of the 2-D rich-texture image data from the RGB-D camera and the high-precision 3-D terrestrial LiDAR point cloud data was performed automatically, compensating for the LiDAR data's lack of texture information despite its high precision.
Moreover, the proposed method can effectively construct dense maps of large indoor scenes under a unified coordinate system of multiple data sources, supporting scene recognition and understanding, robot path planning and navigation, artificial intelligence (e.g., robot and 3D environment interaction), augmented reality, and virtual game environment development.
In future work, we will build semantic maps of large scenes based on the constraints of terrestrial LiDAR data. Accurate poses obtained from terrestrial LiDAR data can assist accurate semantic segmentation of single-frame images and enhance the consistency of semantic segmentation across multiview image frames, so that robots can interact with the environment better.