A LiDAR/Visual SLAM Backend with Loop Closure Detection and Graph Optimization

: LiDAR (light detection and ranging), as an active sensor, is investigated in the simultaneous localization and mapping (SLAM) system. Typically, a LiDAR SLAM system consists of front-end odometry and back-end optimization modules. Loop closure detection and pose graph optimization are the key factors determining the performance of the LiDAR SLAM system. However, the LiDAR works at a single wavelength (905 nm), and few textures or visual features are extracted, which restricts the performance of point clouds matching based loop closure detection and graph optimization. With the aim of improving LiDAR SLAM performance, in this paper, we proposed a LiDAR and visual SLAM backend, which utilizes LiDAR geometry features and visual features to accomplish loop closure detection. Firstly, the bag of word (BoW) model, describing the visual similarities, was constructed to assist in the loop closure detection and, secondly, point clouds re-matching was conducted to verify the loop closure detection and accomplish graph optimization. Experiments with different datasets were carried out for assessing the proposed method, and the results demonstrated that the inclusion of the visual features effectively helped with the loop closure detection and improved LiDAR SLAM performance. In addition, the source code, which is open source, is available for download once you contact the corresponding author.


Introduction
The concept of simultaneous localization and mapping (SLAM) was first proposed in 1986 by Cheeseman [1,2]. Estimation theory was introduced into robots mapping and position. After more than 30 years of development, SLAM technology is no longer limited to theoretical research in the field of robotics and automation; it is now promoted in many applications, i.e., intelligent robots, autonomous driving, mobile surveying, and mapping [3].
The core of SLAM is to utilize sensors, i.e., a camera and LiDAR, to perceive the environment and estimate states, i.e., position and attitude [4]. Generally, a typical SLAM includes two parts: a front-end odometer and a back-end optimization [5]. The front-end odometer estimates the state and maps the environment. The back-end optimization corrects the cumulative errors of the front-end odometer and improves the state estimation accuracy. In the back-end optimization, loop closure detection is also included for improving the performance of the back-end optimization. Traditionally, SLAM technology is divided according to the employed sensors, i.e., visual SLAM employs cameras as the Kalman filter (EKF) SLAM method, particle filter (PF) SLAM method, and information filter (IF) SLAM method. Representatives of Bayes-based LiDAR SLAM include Hector SLAM, and Gmapping, etc. Hector SLAM solves the two-dimensional plane translation and yaw angle matched by the single-line LiDAR scan using the Gauss Newton method [12]. Multi-resolution raster maps are utilized to avoid the state estimation falling into the local optimum. EKF is employed to fuse the information from the IMU and the LiDAR. Gmapping is an algorithm based on particle filtering [11]. It can achieve better results when there are more particles, but it also consumes higher computing resources. It lacks loop closure detection and optimization; therefore, the accumulated errors cannot be effectively eliminated.
The graph-based SLAM [19][20][21] method is also known as full SLAM. It usually takes observations as the constraints to model the graph structure and perform global optimization to achieve state estimation. Representative open-source algorithms include Karto SLAM [7], and Cartographer [9], etc. Karto SLAM is an algorithm based on graph optimization, which utilizes highly optimized and non-iterative Cholesky matrix decomposition to decouple sparse systems to solve the problem. The nodes represent a pose state of the robot or sensor observations, and the edges represent the constraints between nodes. When each new node is added, the geometric structure of the entire graph will be calculated and updated. Cartographer includes local matching in the front terminal graph and global back-end loop closure detection and subgraph optimization [22]. Cartographer has strong real-time performance and high accuracy, and it utilizes loop closure detection to optimize and correct the cumulative errors. In recent years, Zhang, from Carnegie Mellon University, proposed the LiDAR SLAM algorithm LOAM (LiDAR odometry and mapping) [10]. The core idea of LOAM is to run both high-frequency odometry pose estimation and low-frequency point clouds mapping in parallel. Two threads to achieve a balance between positioning accuracy and real-time performance; a generalized ICP algorithm is employed for adjacent frame point clouds matching method for high-frequency odometer pose estimation. The accuracy of geometric constraints is difficult to guarantee, and the results are easy to diverge when solving nonlinear optimization.
As mentioned previously, both visual SLAM and LiDAR SLAM have their own advantages and disadvantages. The visual camera outputs 2D image information, which can be divided into grayscale images and RGB color images. LiDAR outputs 3D discrete point clouds, and the radiation intensity is unreliable. The 3D here is strictly 2.5 D, because the real physical world is a 3D physical space, and the LiDAR point clouds are just a layer of surface models with depth differences, and the texture information behind the surface cannot be perceived by the single-wavelength LiDAR. Abundant investigations have revealed that the positioning accuracy of LiDAR SLAM is slightly higher than that of visual SLAM. In terms of robustness, the LiDAR point clouds are noisy at the corner points, and the radiation value of the visual image will change under different lighting conditions. It can be seen that if the laser LiDAR and vision camera can be externally calibrated with high precision, the point clouds and image data registration can make up for each other's shortcomings and promote the overall performance of the SLAM.
At present, research conducted on the SLAM technology of LiDAR/visual fusion is considerably less than the above two single-sensor SLAMs. Zhang proposed a depthenhanced monocular visual odometry (DEMO) in 2014 [23], which solves the problem of the loss of many pixels with large depth values during the state estimation of the visual odometry front-end. Graeter proposed the LiDAR-monocular visual odometry (LiDAR-LIMO) [24], which extracts depth information from LIDAR point clouds for camera feature point tracking, which makes up for the shortcomings of monocular visual scale. The core framework of the above two methods are still based on visual SLAM and the LiDAR works only as a supplemental role, and similar schemes include binocular visual inertial navigation LiDAR SLAM (stereo visual inertial LiDAR SLAM, VIL-SLAM) proposed by Shao [25,26]. In addition, Li implemented a SLAM system combining RGBD depth camera and 3D LiDAR. Since the depth camera itself can generate deep point clouds, the advantages of 3D LiDAR are not obvious. On the basis of EMO and LOAM, a visual/LiDAR SLAM (visual-LiDAR odometry and mapping, VLOAM) was developed [27,28]. The visual odometer provides the initial value for LiDAR point clouds matching, and its accuracy and robustness are further improved via the LOAM. VLOAM has achieved relatively high accuracy in the state estimation of the front-end odometer, but the lack of back-end loop closure detection and global graph optimization will inevitably affect the positioning accuracy and the consistency of map construction, and it will continue to degrade over time.
Loop closure detection is of great significance to SLAM systems. Since loop closure detection realizes the association between current data and all historical data, it helps to improve the accuracy, robustness, and consistency of the entire SLAM system [28]. In this paper, a LiDAR/visual SLAM based on loop closure detection and global graph optimization (GGO) is constructed to improves the accuracy of the positioning trajectory and the consistency of the point clouds map. A loop closure detection method, based on visual BoW similarity and point clouds re-matching, was implemented in the LiDAR/Visual SLAM system. KTTI datasets and WHU Kylin backpack datasets were utilized to evaluate the performance of the visual BoW similarity-based loop closure detection, and the position accuracy and point clouds map are presented for analyzing the performance of the proposed method.
The remainder of the paper is organized as follows: Section 2 presents the architecture of the LiDAR/visual SLAM, the flow chart of the loop closure detection; Section 3 presents the graph optimization, including the pose graph construction and the global pose optimization; and Section 4 presents the experiments, including the results and analysis. Section 5 concludes the paper.

System Architecture
The whole LV-SLAM also includes two parts: front-end odometry and back-end optimization. The system architecture is presented in Figure 1. The front-end used in this paper was an improved LOAM method, and the front-end problem was divided into two modules. One module performed odometry at a high-frequency, but at low fidelity, to estimate the velocity of the laser scanner. A second module ran at a frequency of an order of magnitude lower for fine matching and registration of the point clouds. In the original LOAM, feature points located on sharp edges and planar surfaces are extracted, and the feature points to edge line segments and planar surface patches are matched, respectively. The original LOAM belongs to the feature-based methods. Comparatively, our improved LOAM utilized the normal distributions transform (NDT) method instead of extracting feature points to the scans, matching directly and efficiently in the odometry module. In other words, the NDT-based odometry is referred to as direct odometry (DO). The mapping module is similar to the LOAM algorithm. Therefore, the front-end of the improved LOAM is also called DO-LFA.

Loop Closure Detection
There are two solutions to realize loop closure detection: the geometry-based method and the features-based method. The geometry-based method means detecting the loop closure with the known movement information from the front-end odometer, such as, for example, when the platform returns to a certain position that it passed before, to verify whether there is a loop [28,29]. The geometry-based idea is intuitive and simple, but it is difficult to execute when the accumulated error is large [30]. With regards to the featuresbased loop closure detection method, it has nothing to do with the state estimation of the front-end and the back-end. It utilizes the similarity detection of two frames of images to detect the loop closure, which eliminates the influence of accumulated errors. The features-based loop closure detection method has been applied to multiple visual SLAM systems [31,32]. However, while using the features similarity in the detection, the current frame needs to be compared and calculated with all previous frames, which requires a large amount of calculation, and it will be invalid for a feature's repetitive environment, i.e., the decoration of multiple adjacent rooms in a hotel. Due to the explosive development of computer vision and image recognition cognitive technology in recent years, compared to the features' similarity detection with point clouds, visual images containing various features are more mature and robust in loop closure detection. However, the point clouds can provide more accurate quantitative verification for the loop through re-matching based on the image similarity detection.
Based on the above analysis, we comprehensively considered geometry-based and features-based methods, and made full use of visual and point clouds information. A loop closure detection method utilizing visual features is presented in Figure 2. DO-LFA outputs data (pose, point clouds, and image) after the key frame preprocessing, and preliminary candidate frames were obtained by rough detection based on geometric information, such as the distance of the motion trajectory, and then the suspected loop frames were obtained through the BoWs similarity. Finally, the point clouds re-matching was conducted to verify the suspected loop closure. The back-end global map optimization (GGO) was to construct global pose map optimization and point clouds map after loop detection. After the front-end DO-LFA outputs the odometry, we first combined the BoW similarity score of the visual image and the point clouds re-matching judgment to realize the loop detection. Then, the edge constraints between the key frames were calculated to construct the pose map of the global key frames. Finally, the graph optimization theory was utilized to reduce the global cumulative error, improve the global trajectory accuracy and map consistency, and obtain the final global motion trajectory and point clouds map.

Loop Closure Detection
There are two solutions to realize loop closure detection: the geometry-based method and the features-based method. The geometry-based method means detecting the loop closure with the known movement information from the front-end odometer, such as, for example, when the platform returns to a certain position that it passed before, to verify whether there is a loop [28,29]. The geometry-based idea is intuitive and simple, but it is difficult to execute when the accumulated error is large [30]. With regards to the featuresbased loop closure detection method, it has nothing to do with the state estimation of the front-end and the back-end. It utilizes the similarity detection of two frames of images to detect the loop closure, which eliminates the influence of accumulated errors. The features-based loop closure detection method has been applied to multiple visual SLAM systems [31,32]. However, while using the features similarity in the detection, the current frame needs to be compared and calculated with all previous frames, which requires a large amount of calculation, and it will be invalid for a feature's repetitive environment, i.e., the decoration of multiple adjacent rooms in a hotel. Due to the explosive development of computer vision and image recognition cognitive technology in recent years, compared to the features' similarity detection with point clouds, visual images containing various features are more mature and robust in loop closure detection. However, the point clouds can provide more accurate quantitative verification for the loop through re-matching based on the image similarity detection.
Based on the above analysis, we comprehensively considered geometry-based and features-based methods, and made full use of visual and point clouds information. A loop closure detection method utilizing visual features is presented in Figure 2. DO-LFA outputs Remote Sens. 2021, 13, 2720 6 of 29 data (pose, point clouds, and image) after the key frame preprocessing, and preliminary candidate frames were obtained by rough detection based on geometric information, such as the distance of the motion trajectory, and then the suspected loop frames were obtained through the BoWs similarity. Finally, the point clouds re-matching was conducted to verify the suspected loop closure. The geometric rough detection was based on the pose or odometry outpu LFA, wherein we selected the preliminary candidate frames from the historic and matched with the current frame to carry out loop closure detection. Th threshold judgment conditions ( Figure 3): (1) when the search area threshold radius) of key frames was less than the threshold range, it might be a poten sure frame; (2) the interval threshold d2 must be greater than the threshold ring length threshold d3 should be greater than the threshold. The key fram the above conditions were saved as preliminary candidate loop closure fram processing. The geometric rough detection was based on the pose or odometry output by the DO-LFA, wherein we selected the preliminary candidate frames from the historical key frames and matched with the current frame to carry out loop closure detection. There are three threshold judgment conditions ( Figure 3): (1) when the search area threshold d1 (red circle radius) of key frames was less than the threshold range, it might be a potential loop closure frame; (2) the interval threshold d2 must be greater than the threshold; and (3) the ring length threshold d3 should be greater than the threshold. The key frames that meet the above conditions were saved as preliminary candidate loop closure frames for further processing. threshold judgment conditions ( Figure 3): (1) when the search area threshold d1 (r radius) of key frames was less than the threshold range, it might be a potential sure frame; (2) the interval threshold d2 must be greater than the threshold; an ring length threshold d3 should be greater than the threshold. The key frames t the above conditions were saved as preliminary candidate loop closure frames fo processing.

Visual BOW Similarity
The bag of words (BoW) model [32,33] originated from the fields of inform trieval and text classification, and is now widely used in the recognition of visua and loop closure detection in SLAM. The general idea is to use k-means or k-me

Visual BOW Similarity
The bag of words (BoW) model [32,33] originated from the fields of information retrieval and text classification, and is now widely used in the recognition of visual images and loop closure detection in SLAM. The general idea is to use k-means or k-means++ to perform cluster analysis based on image feature points, i.e., SURF or ORB to obtain a "word" vector composed of ID numbers and weights. A K-d tree, with k branch and d depth, is utilized to express the vectors as a dictionary; then the image is described according to the statistical histogram of the word, and finally the similarity is calculated for judgment.
The dictionary training is regarded as an unsupervised classification process. Similar to many models in the field of deep learning, the sample size and scale directly affect the effectiveness of the model. Training a large dictionary may require a machine with large memory and high performance, and it will take a long time. Here, we utilized a large dictionary that is widely used by the open-source community. It is trained from about 2900 images. The size of the dictionary is k = 10 and d = 5, that is, up to 10,000 words. The program uses the open-source library DBOW3 (https://github.com/rmsalinas/DBow3, accessed on 7 June 2021) to assist the implementation.
With the dictionary, the corresponding word w j of the specific features can be retrieved. After obtaining the N features of an image and the corresponding words, it is equivalent to obtaining the histogram of the distribution of the image in the dictionary. However, considering the different importance of different words in distinguishability, they are often weighted, similar to the method in text retrieval, which is called term frequency-inverse document frequency (TF-IDF) [34,35]. TF refers to the frequency of a word in a single image. The higher the frequency of a word in the image, the higher the degree of discrimination.
Assuming an image I, the word w i appears m i times, the total amount of the occurrences of all words is m, where: In addition, when the BoW model is established, assuming that the number of all features in the dictionary is n, and the number of features in a leaf node w i is n i , the IDF of the word is defined as: Then, the weight of w i is the product of TF and IDF For image A, its multiple feature points retrieve multiple words in the dictionary. Considering the weight, the BoW vector constituting the image is written as: The number of words in the dictionary is often very large, and there may be only some features and words in the image, and, thus, it will be a sparse vector with a large number Therefore, assuming two images A and B, their BoWs vectors v A and v B can be obtained. There are multiple representation methods for similarity calculation. Here, we chose the L1 norm to measure their similarity, and the result value falls within the interval (0, 1). If the images are exactly the same, the result is 1, and the similarity calculation is presented as [36]: In the loop closure detection phase, we calculated the BoW similarity between the current frame and all the candidate frames. The frames with similarity less than 0.05 were directly eliminated, and the rest of the suspected loop closure frames were sorted according to the similarity values, from high to low, and processed in the following step for further verification.

Loop Closure Detection Verifying and Its Accuracy
In order to verify the suspected loop closure frames, we matched the current frame and the suspected loop frames one by one in the sorted order, and counted the Euclidean fitness score for each match. The Euclidean consistency score is the mean value of the square of the distance from the source point clouds to the target point clouds. Corresponding points exceeding a certain threshold are not considered in the calculation. If the score is lower than the threshold (0.2 m), then this frame is the final loop closure frame of the current frame, and the matched relative pose is utilized as a constraint for subsequent pose map optimization.
When detecting the loop closure, there are usually four conditions, which are summarized in the Table 1: true positive (TP), false positive (False Positive, FP), true negative (TN), and false negative (FN). True positives mean that the frame is a loop closure frame; while true negatives mean that the frame is not a loop frame. False positives mean the frame is not a loop closure, but the algorithm judged it to be one; while false negatives mean that the frame is a loop closure, but the algorithm judged it not to be. In our results, we hoped that TP and TN would appear as much as possible, whereas we hoped that FP and FN would appear as little as possible or not at all. For a certain loop detection algorithm, the frequency of occurrence of TP, TN, FP, and FN on a certain sample data can be counted, and the accuracy (precision) and recall rate (recall) can be calculated: Recall(%) = TP TP + FN (7)

Global Pose Construction
Compared with the back-end GGO, the motion trajectory obtained by the DO-LFA is, broadly speaking, referred to as a front-end odometer. It mainly utilizes the point clouds matching between the current frame and adjacent or local multi-frames to estimate the Remote Sens. 2021, 13, 2720 9 of 29 current pose. These point clouds and their features may also be referred to as landmarks. While different from the visual SLAM, the direct method and abovementioned point clouds match method both fix the connection relationship and then solve the Euclidean transformation. In other words, DO-LFA does not optimize the landmarks, and only solves for the pose node.
As time accumulates, the trajectory of the platform will become longer, and the scale of the map will continue to grow. Since the scanning space of LiDAR is always limited, the scale of point clouds or road signs cannot grow indefinitely with the map, and the constraint relationship between the current frame and earlier historical data may no longer exist. In addition, there are errors in direct method matching and feature point method optimization. The cumulative errors obtained by DO-LFA will become larger, and the inconsistency of the global map will become more obvious.
In order to improve the pose accuracy of the key frame nodes and ensure the quality of the global point clouds map, we can save the trajectory of the DO-LFA and construct a back-end global map optimization to reduce the cumulative errors. The global pose graph utilizes the pose of the key frame as the node, and the relative motion between the two pose nodes, obtained by the point clouds matching, is employed as the constraint edge. The nonlinear least squares adjustment method is used to solve the problem to obtain better results. Figure 4 visually introduces the process of constructing a pose graph, where the arrow is the pose and the blue dashed line is the motion trajectory. Figure 4a presents the DO-LFA, where the red circle may be understood as overlapping point clouds or road signs for matching. The green line represents the constraint between two adjacent frames of DO. The green and red lines together represent the current frame and constraints between historical, local data. In the key frame screening, the pose nodes are reduced, and the trajectory of the DO-LFA is reserved as the constraint edge between the adjacent key frame pose nodes. Figure 4b presents the prepared frames for loop closure detection and global pose graph optimization. Figure 4c presents the loop detection process, where the green arrow represents the current frame being processed, the red dashed line indicates that the loop closure frame is found and constitutes a loop, and the green dashed line is the loop constraint edge after point clouds re-matching. Figure 4d presents the result of the optimization of the global pose graph. The pose nodes and motion trajectories are optimized, the motion trajectory forms a complete loop, and the point clouds consistency is improved.
Remote Sens. 2021, 13, x FOR PEER REVIEW 9 of 27 As time accumulates, the trajectory of the platform will become longer, and the scale of the map will continue to grow. Since the scanning space of LiDAR is always limited, the scale of point clouds or road signs cannot grow indefinitely with the map, and the constraint relationship between the current frame and earlier historical data may no longer exist. In addition, there are errors in direct method matching and feature point method optimization. The cumulative errors obtained by DO-LFA will become larger, and the inconsistency of the global map will become more obvious.
In order to improve the pose accuracy of the key frame nodes and ensure the quality of the global point clouds map, we can save the trajectory of the DO-LFA and construct a back-end global map optimization to reduce the cumulative errors. The global pose graph utilizes the pose of the key frame as the node, and the relative motion between the two pose nodes, obtained by the point clouds matching, is employed as the constraint edge. The nonlinear least squares adjustment method is used to solve the problem to obtain better results. Figure 4 visually introduces the process of constructing a pose graph, where the arrow is the pose and the blue dashed line is the motion trajectory. Figure 4a presents the DO-LFA, where the red circle may be understood as overlapping point clouds or road signs for matching. The green line represents the constraint between two adjacent frames of DO. The green and red lines together represent the current frame and constraints between historical, local data. In the key frame screening, the pose nodes are reduced, and the trajectory of the DO-LFA is reserved as the constraint edge between the adjacent key frame pose nodes. Figure 4b presents the prepared frames for loop closure detection and global pose graph optimization. Figure 4c presents the loop detection process, where the green arrow represents the current frame being processed, the red dashed line indicates that the loop closure frame is found and constitutes a loop, and the green dashed line is the loop constraint edge after point clouds re-matching. Figure

Globe Pose Graph Optimization
In the graph optimization theory, the node denotes the pose of the key frame, which is represented by 1

Globe Pose Graph Optimization
In the graph optimization theory, the node denotes the pose of the key frame, which is represented by ξ 1 , · · · , ξ n . The edge denotes the relative motion estimation between two pose nodes. The ordinary edge comes from the direct method or point clouds matching in the DO-LFA, and the loop edge comes from the rematch of the point clouds during loop closure detection. The relative motion, ∆ξ ij , between ξ i and ξ j nodes can be expressed as: The corresponding relation of Lie group and Lie algebra is T = exp(ξ ∧ ), and it can be written with Lie group: From the perspective of the construction process of the pose graph, especially after the loop edge is added, the above formula will not be accurately established.
We regarded the edge constraint as the measured value, and the node pose as the estimated value, and, thus, moved the left side of the above formula to the right side, deriving the error equation: where ξ i and ξ j are the variables expected to be estimated. We used Lie algebra to find the derivative of these two variables, adding a disturbance to the ξ i and ξ j , wherein the error equation can be re-written as: In order to derive the linearization of the Taylor series expansion of the above formula, we introduced the adjoint property of SE (3): Equation (12) is re-written as: The Jacobi matrix calculation is written as: Remote Sens. 2021, 13, 2720

of 29
The optimization of the graph is essentially the least squares optimization. Each pose converter is an optimization variable. The perceptual constraint between poses is an edge, and all pose baselines and constraint edges together form a displacement map.
Assuming C denotes the set of all edges in the pose graph, the cost function of the nonlinear least optimization is written as: where Ω denotes the information matrix used to describe the matching errors of the point clouds. When the number of nodes reaches the set value, the above cost function can be solved by the Gauss-Newton method or LM method, etc. Open-source libraries, i.e., Ceres or g2o, also provide some solution methods for graph optimization.

Dataset Description
With the aim of qualitatively evaluating the performance of the proposed method, we employed the open-source KITTI dataset for testing. The KITTI data acquisition platform and the sensors used are shown in Figure 5. The left camera, a Point Grey Flea2 (FL2-14S3C-C), was installed at Cam2. This camera, a classic colorful industrial camera from Point Grey, Canada, has 1.4 million pixels, a global shutter, and an acquisition frequency of 10 HZ. There were a total of seven sequences with loops in the 11 sequences: #00, #05, #06, #07, #02, #08, and #09. We tested all the data from these seven sequences. Some important threshold parameters used in the experiment were set as follows: Assuming C denotes the set of all edges in the pose graph, the cost function o nonlinear least optimization is written as: where Ω denotes the information matrix used to describe the matching errors of point clouds. When the number of nodes reaches the set value, the above cost func can be solved by the Gauss-Newton method or LM method, etc. Open-source libra i.e., Ceres or g2o, also provide some solution methods for graph optimization.

Dataset Description
With the aim of qualitatively evaluating the performance of the proposed met we employed the open-source KITTI dataset for testing. The KITTI data acquisition form and the sensors used are shown in Figure 5. The left camera, a Point Grey Flea2 ( 14S3C-C), was installed at Cam2. This camera, a classic colorful industrial camera f Point Grey, Canada, has 1.4 million pixels, a global shutter, and an acquisition frequ of 10 HZ. There were a total of seven sequences with loops in the 11 sequences: #00, #06, #07, #02, #08, and #09. We tested all the data from these seven sequences. Some portant threshold parameters used in the experiment were set as follows: (1) The distance and angle thresholds for key frame selection were 10 m and respectively; (2) The thresholds d1, d2, and d3, for geometric rough detection, were 20 m, 5 and 100 m, respectively.  (1) The distance and angle thresholds for key frame selection were 10 m and 10 • , respectively; (2) The thresholds d1, d2, and d3, for geometric rough detection, were 20 m, 50 m, and 100 m, respectively.

Results Analysis
In the experiment, all seven sequences with loops were employed in the tests, and all the results with GGO (DO-LFA-GGO) and without GGO (DO-LFA) were saved for analyzing the position accuracy. According to the collection environment, we divided these seven sequences into three categories for analysis and discussion. Sequence #00 and #05 were classified as group A, sequence #06 and #07 were classified as group B, and sequence #02, #09, and #08 were classified as group C; the experimental results are listed in Table 2 and presented in Figures 6-8, respectively. The trajectory, loop position, and cumulative position error (CPE) of each group of data are presented. Group A and group B were both urban environments, including urban roads, many loops, and many regular buildings, and the perceived structure of the environment was better. Group C, on the other hand, was a mixed urban and rural environment with twists and turns in the rural roads. There were few loops and relatively few buildings. Farmland without references existed in this group, and the structure of the perceived environment was poor. The basis for the separation of group A and B was that the DO-LFA accuracy of group A was poor before GGO operation and the accuracy improvement effect was obvious after GGO, whereas the DO-LFA of group B achieved high accuracy and the optimization effect was not obvious.
In the following Figures 6-8, the white line denotes the GPS/INS reference trajectory, the green line denotes the trajectory of DO-LFA without GGO, and the blue line denotes the trajectory of DO-LFA with GGO (DO-LFA-GGO). Parts of the trajectories overlap, therefore, it seems that there is only one color for the overlapped trajectories. The red squares mark the detailed position of the loop in the trajectory. Combining with the trajectory and motion details, the counted number of loops was utilized to confirm whether the loop was correct. Moreover, the cumulative position errors with and without GGO are also presented in these figures.
Trajectory, loop closure location, and the CPE values of group A (#00 and #05) are presented in Figure 6. The environment of sequence #00 and #05 was a well-structured town, the time length of the sequence was comparatively long, and the movement distance was long, in excess of 2 km. In total, 15 loop closures were detected in sequence #00, while seven were detected in sequence #05. These loop closures were evenly distributed on the trajectory at an interval of 50 m, which is almost identical to the inter-loop threshold for loop closure detection. The loop closure detection accuracy showed that the accuracy rate reached 100%. The CPE of the DO-LFA of the two sequences were large, reaching 6.71 m and 14.27 m, respectively, and dropped to 0.12 m and 0.36 m, respectively, after GGO. The trajectory of DO-LFA-GGO was also closer than DO-LFA to the true trajectory, which indicated that GGO achieved the expected effect and the CPE was basically eliminated.  reference trajectory. After GGO operation, the trajectories and the CPE values did not ha any obvious changes, which indicated that the GGO process did not pose any negat influence on the DO-LFA results and that the CPE was still kept small after the GGO o eration. The environment of sequence #06 and #07 was a well-structured town, and th scanning time and movement distances were shorter than group A sequences. Therefo the DO-LFA performed well and the GGO did not effectively reduce the errors.
(a) The environments of group C sequences (#02, #09, and #08) were more complicat Their trajectories, loop closure positions, and cumulative position error calculations a presented in Figure 8. Many of the paths from the sequences #02, #09, and #08 were in t countryside with winding roads, bends, few loop closures, and relatively few buildin There was farmland without reference objects on the ground and the structure of the e vironment were poor.
The collection environment of sequence #02 was a rural road with continuous lar turns. There was no loop closure for a long time in the early stage, and only three lo closures were successfully detected during the second half of the trajectory. The CPE v ues before and after GGO were 8.76 m and 0.13 m, respectively, which showed that t GGO reduced the sequence #02 CPE values. However, we observed that the position Trajectory, loop closure position, and the CPE of group B data (#06 and #07) are presented in Figure 7. In this trajectory, six loop closures were detected in sequence #06, while sequence #07 had only one loop closure, which was located at the end of the trajectory. The most important finding was that the CPE values of the DO-LFA for these two sequences were small and less than 30 cm and their trajectories almost overlapped with the reference trajectory. After GGO operation, the trajectories and the CPE values did not have any obvious changes, which indicated that the GGO process did not pose any negative influence on the DO-LFA results and that the CPE was still kept small after the GGO operation. The environment of sequence #06 and #07 was a well-structured town, and their scanning time and movement distances were shorter than group A sequences. Therefore, the DO-LFA performed well and the GGO did not effectively reduce the errors.
The environments of group C sequences (#02, #09, and #08) were more complicated. Their trajectories, loop closure positions, and cumulative position error calculations are presented in Figure 8. Many of the paths from the sequences #02, #09, and #08 were in the countryside with winding roads, bends, few loop closures, and relatively few buildings. There was farmland without reference objects on the ground and the structure of the environment were poor.
The collection environment of sequence #02 was a rural road with continuous large turns. There was no loop closure for a long time in the early stage, and only three loop closures were successfully detected during the second half of the trajectory. The CPE values before and after GGO were 8.76 m and 0.13 m, respectively, which showed that the GGO reduced the sequence #02 CPE values. However, we observed that the position errors of the first half were larger than those of the second half due the fact that there was no loop closures detected in the first part of the trajectory. the DO-LFA with and without GGO were 4.27 m and 0.09 m, respectively, and the decrease in the CPE indicated that the GGO effectively utilized the detected loop closure and reduced the CPE values. Similar to sequence #02, the trajectory without loop closure performed slightly worse in terms of position errors after the GGO; the long-term continuous country curve and limited looping might account for this phenomenon. The sequence #08 environment included towns and villages, the roads were regular, there were 2-3 actual loop closure areas, but no loop closure was detected and the recall rate was 0%. The main reason for the loop closure detection failure was that the second time, the travel direction of the vehicle during the loop closure was opposite to that of the first time. Thus, the viewing angle of the sensor scan was also completely opposite, which seriously affected the interpretation of similarity during loop closure detection, especially for image similarity. Without the loop closure, the GGO could not be carried out, and the trajectory with or without GGO was the same. The environment of sequence #09 also contained a number of consecutive rural roads with large turns. There was also no loop closure in the early stage of the trajectory, and there was only one detected loop closure at the end of the trajectory. The CPE values of the DO-LFA with and without GGO were 4.27 m and 0.09 m, respectively, and the decrease in the CPE indicated that the GGO effectively utilized the detected loop closure and reduced the CPE values. Similar to sequence #02, the trajectory without loop closure performed slightly worse in terms of position errors after the GGO; the long-term continuous country curve and limited looping might account for this phenomenon.
The sequence #08 environment included towns and villages, the roads were regular, there were 2-3 actual loop closure areas, but no loop closure was detected and the recall rate was 0%. The main reason for the loop closure detection failure was that the second time, the travel direction of the vehicle during the loop closure was opposite to that of the first time. Thus, the viewing angle of the sensor scan was also completely opposite, which seriously affected the interpretation of similarity during loop closure detection, especially for image similarity. Without the loop closure, the GGO could not be carried out, and the trajectory with or without GGO was the same.

Dataset Description
The major sensors of the WHU Kylin backpack included: Velodyne VLP-16 Lidar, Mynak D1000-IR-120 color binocular camera, Xsens-300 IMU, and a power communication module. An example of the mobile data collection is presented in Figure 9. The average speed of the backpack was 1 m/s, and the data collection and algorithm running speed were both 10 Hz. There were two LiDARs installed on the backpack. In this experiment, only the horizontal LiDAR and the left camera of the binocular camera were used, and the images were 640 × 480 respectively. The backpack was not equipped with a GPS device, so there was no true value of the trajectory. As mentioned, we consciously walk out of a relatively regular matrix area at the beginning and end of the acquisition path for accuracy evaluation. Some important threshold parameters used in the experiment were set as follows: the distance threshold and angle threshold of the key frame selection were 2 m and 10 • , respectively; and the thresholds d1, d2, and d3 of the geometric rough detection were 5 m, 15 m, and 25 m, respectively.

Dataset Description
The major sensors of the WHU Kylin backpack included: Velodyne VLP-16 Lidar, Mynak D1000-IR-120 color binocular camera, Xsens-300 IMU, and a power communication module. An example of the mobile data collection is presented in Figure 9. The average speed of the backpack was 1 m/s, and the data collection and algorithm running speed were both 10 Hz. There were two LiDARs installed on the backpack. In this experiment, only the horizontal LiDAR and the left camera of the binocular camera were used, and the images were 640 × 480 respectively. The backpack was not equipped with a GPS device, so there was no true value of the trajectory. As mentioned, we consciously walk out of a relatively regular matrix area at the beginning and end of the acquisition path for accuracy evaluation. Some important threshold parameters used in the experiment were set as follows: the distance threshold and angle threshold of the key frame selection were 2m and 10°, respectively; and the thresholds d1, d2, and d3 of the geometric rough detection were 5m, 15m, and 25m, respectively.
With the backpack LiDAR scanning system, we collected three datasets, sequences #01, #02, and #03, to assess the proposed method. The trajectories, loop closure position, and CPE values are presented in the Figures 10-12. Similar to the results in the KITTI experiment, the green lines denote the trajectories from the results without DO-LFA, the blue lines denote the trajectories from the DO-LFA-GGA method, and part of the trajectories overlapping led to the corresponding trajectories being presented in one color. The red rectangle denotes the loop closure position in the trajectory, and we counted the loops closure according to the red rectangle and compared it with the motions.  Figure 10 shows the trajectories and point clouds map comparison of sequence #01 with and without GGO. Specifically, Figure 10a presents the trajectory comparison before With the backpack LiDAR scanning system, we collected three datasets, sequences #01, #02, and #03, to assess the proposed method. The trajectories, loop closure position, and CPE values are presented in the Figures 10-12. Similar to the results in the KITTI experiment, the green lines denote the trajectories from the results without DO-LFA, the blue lines denote the trajectories from the DO-LFA-GGA method, and part of the trajectories overlapping led to the corresponding trajectories being presented in one color. The red rectangle denotes the loop closure position in the trajectory, and we counted the loops closure according to the red rectangle and compared it with the motions.

Results and Analysis
were evenly distributed on the path with an interval of about 10 m (10 m was also the threshold set for loop closure detection). It showed that the loop closure detection accuracy reached 100%. The CPE of the DO-LFA of sequence #01 was large, reaching 2.89 m. However, it dropped to 0.12 m after GGO processing. The trajectory of DO-LFA-GGO was also more accurate than DO-LFA in the loop, indicating that the GGO achieved the expected effect. The error was basically eliminated to 0.12 m of the DO-LFA-GGO from 2.89 m of the DO-LFA.
Buildings, trees, and flagpoles can be described in detail with the point clouds. Here, through comparing the point clouds map, it can be seen that the trajectory without GGO drifted greatly, the constructed point cloud map had obvious layering and confusion, and its consistency was poor. After using loop closure detection and global map optimization, the optimal estimation of platform poses and point clouds was obtained. Therefore, the consistency of the point clouds map constructed was better, and the layering and disorder of the point clouds at the loop closure location disappeared.  Figure 11 shows the trajectory and point clouds map of sequence #02 with and without GGO. Sequence #02 was an indoor scene of the Huawei Song Research Institute (Xibeipo Village, Songshan Lake, Dongguan) H5. Specifically, Figure 11a presents the comparison of the trajectories from the DO-LFA and DO-LFA-GGO method, while Figure 11b presents the point clouds map from the DO-LFA-GGO, noting that accuracy before GGO was already relatively high and that the point clouds were no longer compared after GGO.
From comparing the trajectories, we observed that the number of loop closures detected in sequence#02 was two, and that the loop closure detection was all correct. The CPE of the DO-LFA method of this sequence was small, and there was little change after performing the GGO. Specifically, the CPE values of the DO-LFA and DO-LFA-GGO were 0.19 m and 0.17 m, respectively. For the trajectory that the DO-LFA worked well for, the DO-LFA-GGO still worked well and kept the CPE values small. In sequence #02, the point clouds map clearly described the trees, buildings, and other objects. Figure 12 presents the trajectories and point clouds map from the DO-LFA and DO-LFA-GGO methods. Sequence #02 was indoor and outdoor scenes from the Huawei Song Research Institute (Songshan Lake Streamback Slope Village, Dongguan) D4. The terrain was undulating, including up and down stairs. Figure Figure 10 shows the trajectories and point clouds map comparison of sequence #01 with and without GGO. Specifically, Figure 10a presents the trajectory comparison before and after GGO, (b) is the overall point clouds map after GGO, while (c) and (d) are the magnification of the black box area in (b) of the point clouds before and after GGO, respectively.

Results and Analysis
We observed that the number of loop closures detected in #01 was 5, and that they were evenly distributed on the path with an interval of about 10 m (10 m was also the threshold set for loop closure detection). It showed that the loop closure detection accuracy reached 100%. The CPE of the DO-LFA of sequence #01 was large, reaching 2.89 m. However, it dropped to 0.12 m after GGO processing. The trajectory of DO-LFA-GGO was also more accurate than DO-LFA in the loop, indicating that the GGO achieved the expected effect. The error was basically eliminated to 0.12 m of the DO-LFA-GGO from 2.89 m of the DO-LFA.
Buildings, trees, and flagpoles can be described in detail with the point clouds. Here, through comparing the point clouds map, it can be seen that the trajectory without GGO drifted greatly, the constructed point cloud map had obvious layering and confusion, and its consistency was poor. After using loop closure detection and global map optimization, the optimal estimation of platform poses and point clouds was obtained. Therefore, the consistency of the point clouds map constructed was better, and the layering and disorder of the point clouds at the loop closure location disappeared. Figure 11 shows the trajectory and point clouds map of sequence #02 with and without GGO. Sequence #02 was an indoor scene of the Huawei Song Research Institute (Xibeipo Village, Songshan Lake, Dongguan) H5. Specifically, Figure 11a presents the comparison of the trajectories from the DO-LFA and DO-LFA-GGO method, while Figure 11b presents the point clouds map from the DO-LFA-GGO, noting that accuracy before GGO was already relatively high and that the point clouds were no longer compared after GGO.  In aspects of the accuracy assessment, although we deliberately collected the d with loop closures, it was difficult to ensure that the front and back positions on the lo revisited path were exactly the same. An error of more than a dozen centimeters (less th the length of a foot) was normal. The CPE values of #00, #01, and #02 after GGO were 0 m, 0.17 m, and 0.15 m, respectively, which are all below 20 cm. Considering the origi inconsistency of true value motion, the actual error may be lower.
The above three sets of representative datasets included both indoor and outd scenes, which suggests that the DO-LFA-GGO proposed in this paper can be successfu operated. When the original DO-LFA had a cumulative position error, GGO significan eliminated its CPE values. When the original DO-LFA had a higher accuracy, the G still maintained its high accuracy. The GGO ensured the accuracy of positioning and consistency of the point clouds map. The accuracy of global positioning and point clou mapping can reach less than 20 cm, and it had high robustness via the backpack platfo under both indoor and outdoor environments.

Comparisons with Google Cartographer
The Cartographer developed by Google is a 2D/3D LiDAR SLAM utilizing loop c sure detection and graph optimization [33]. We ran the Cartographer with the Kylin and #03 datasets, and the results were compared with that from the DO-LFA-GGO. Figures 13 and 14 present the point clouds results from the Cartographer with Kylin #02 and #03 datasets. The color of the point clouds from the Cartographer were termined by the LiDAR backscatter laser intensity. Confusion and ghosts are all the res of coaxial progressive errors, and the displacement error roughly measured by the po From comparing the trajectories, we observed that the number of loop closures detected in sequence#02 was two, and that the loop closure detection was all correct. The CPE of the DO-LFA method of this sequence was small, and there was little change after performing the GGO. Specifically, the CPE values of the DO-LFA and DO-LFA-GGO were 0.19 m and 0.17 m, respectively. For the trajectory that the DO-LFA worked well for, the DO-LFA-GGO still worked well and kept the CPE values small. In sequence #02, the point clouds map clearly described the trees, buildings, and other objects. Figure 12 presents the trajectories and point clouds map from the DO-LFA and DO-LFA-GGO methods. Sequence #02 was indoor and outdoor scenes from the Huawei Song Research Institute (Songshan Lake Streamback Slope Village, Dongguan) D4. The terrain was undulating, including up and down stairs. Figure 12a  With the plotted trajectories, we observed that six loop closures were detected in sequence #03. The CPE of the DO-LFA in this sequence was small, and there was little change in the CPE while operating the GGO, specifically, the CPE reduced from 0.16 m to 0.15 m. For the high-precision trajectory, the DO-LFA-GGO still maintained its original high precision. The point clouds map had clear and accurate building outlines and targets. The accuracy of the up and down stairs point clouds illustrated the accuracy and robustness of the algorithm to three-dimensional position.
In aspects of the accuracy assessment, although we deliberately collected the data with loop closures, it was difficult to ensure that the front and back positions on the loop revisited path were exactly the same. An error of more than a dozen centimeters (less than the length of a foot) was normal. The CPE values of #00, #01, and #02 after GGO were 0.12 m, 0.17 m, and 0.15 m, respectively, which are all below 20 cm. Considering the original inconsistency of true value motion, the actual error may be lower.
The above three sets of representative datasets included both indoor and outdoor scenes, which suggests that the DO-LFA-GGO proposed in this paper can be successfully operated. When the original DO-LFA had a cumulative position error, GGO significantly eliminated its CPE values. When the original DO-LFA had a higher accuracy, the GGO still maintained its high accuracy. The GGO ensured the accuracy of positioning and the consistency of the point clouds map. The accuracy of global positioning and point clouds mapping can reach less than 20 cm, and it had high robustness via the backpack platform under both indoor and outdoor environments.

Comparisons with Google Cartographer
The Cartographer developed by Google is a 2D/3D LiDAR SLAM utilizing loop closure detection and graph optimization [33]. We ran the Cartographer with the Kylin #02 and #03 datasets, and the results were compared with that from the DO-LFA-GGO. Figures 13 and 14 present the point clouds results from the Cartographer with the Kylin #02 and #03 datasets. The color of the point clouds from the Cartographer were determined by the LiDAR backscatter laser intensity. Confusion and ghosts are all the result of coaxial progressive errors, and the displacement error roughly measured by the point clouds was approximately 5-10 m. The DLO-LFA-GGO in this paper detected the loop closure based on visual BoW similarity and conducted the point clouds re-matching to achieve fusion of the two data. However, the Cartographer detects the loop closure with the point clouds similarity. As a result, the point clouds from the Cartographer have severe layering and ghosting at the loop repeats, with a displacement error up to 5-10 m; while the point clouds maps from the DO-LFA-GGO have clear and accurate targets with high consistency, and the geometric CPE is only approximately 20 cm.

Conclusions
In this paper, we investigated a LiDAR SLAM back-end graph optimization methods using visual features to improve loop closure detection and graph optimization performance. With both experiments in open-source KITTI datasets and a self-developed backpack lasering scanning system, we can conclude that: (1) Visual features can efficiently improve loop closure detection accuracy;

Conclusions
In this paper, we investigated a LiDAR SLAM back-end graph optimization methods using visual features to improve loop closure detection and graph optimization performance. With both experiments in open-source KITTI datasets and a self-developed back-pack lasering scanning system, we can conclude that: (1) Visual features can efficiently improve loop closure detection accuracy; (2) With the detection loop closure, the graph optimization reduced the CPE values of the LiDAR SLAM through the point clouds re-matching; (3) Compared with Cartographer, LiDAR based SLAM, our LiDAR/visual SLAM with loop closure detection and global graph optimization achieved much better performance, including better point clouds map and CPE values.
The source code of this paper was uploaded to the GitHub website and is open source for readers. We expect that the work in this paper will inspire some other interesting investigations into visual/LiDAR SLAM. Although a satisfying performance was obtained, the following work is of great significance for further investigation.
(1) In this paper, to guarantee the accuracy of loop closure detection using visual features, we set strict parameters and rules, which led to some loop closures being missed. Thus a more robust loop closure detection strategy is of great significance for improving the use of visual/LiDAR SLAM in complex environments; (2) As presented in the experiment, different trajectories had different numbers and positions of detected loop closures. It is therefore interesting to explore the influence of the loop closure number and distributions on the GGO performance.
(3) In this paper, we utilized the visual features in back-end graph optimization, and so, it would be interesting to explore visual/LiDAR based front-end odometry.
(4) Visual/LiDAR are the most popular sensors in environmental perception. Based on the code used in this paper, it is prospective to integrate other sensors, i.e., GNSS and IMU, to the current visual/LiDAR SLAM in the graph optimization framework.

Data Availability Statement:
The source code of the paper is uploaded to GitHub and its link is https://github.com/BurryChen/lv_slam (accessed on 7 June 2021).