A Robust Framework for Simultaneous Localization and Mapping with Multiple Non-Repetitive Scanning Lidars

With the ability to provide long range, highly accurate 3D surrounding measurements, while lowering the device cost, non-repetitive scanning Livox lidars have attracted considerable interest in the last few years. They have seen a huge growth in use in the fields of robotics and autonomous vehicles. In virtue of their restricted FoV, they are prone to degeneration in featurepoor scenes and have difficulty detecting the loop. In this paper, we present a robust multi-lidar fusion framework for self-localization and mapping problems, allowing different numbers of Livox lidars and suitable for various platforms. First, an automatic calibration procedure is introduced for multiple lidars. Based on the assumption of rigidity of geometric structure, the transformation between two lidars can be configured through map alignment. Second, the raw data from different lidars are time-synchronized and sent to respective feature extraction processes. Instead of sending all the feature candidates for estimating lidar odometry, only the most informative features are selected to perform scan registration. The dynamic objects are removed in the meantime, and a novel place descriptor is integrated for enhanced loop detection. The results show that our proposed system achieved better results than single Livox lidar methods. In addition, our method outperformed novel mechanical lidar methods in challenging scenarios. Moreover, the performance in feature-less and large motion scenarios has also been verified, both with approvable accuracy.


Introduction
The past decade has seen a rapid rise in the use of high definition (HD) maps for high level autonomous driving; with mobile mapping systems (MMS) and simultaneous localization and mapping (SLAM) being two approaches in this field [1]. As a direct georeferencing method, the fundamental requirement for MMS is high-precision localization. Although an accurate position can be obtained by applying real time kinematic (RTK) measurements in open sky areas, various sensors need to be utilized to compensate for global navigation satellite system (GNSS) blackouts or multipath problems in downtown areas, under urban viaducts, or in the tunnels. With the framework of estimating both the odometry and mapping at the same time, SLAM gives a solution to this challenging conundrum [2]. Though great efforts have been devoted to the study of visual SLAM [3][4][5][6][7], light detection and ranging (LiDAR) SLAM is still of great importance [8]. Compared with cameras, lidars are not sensitive to illumination variations, robust for diverse lighting conditions.
The focus of recent research has been on lowering lidar cost, while increasing its detection range [9,10]; with solid state lidar (SSL) having the greatest interest and potential. Inspired by the Risley prism design, Livox lidar is attracting considerable interest with its desirable price and accuracy. Unlike traditional mechanical lidars, the Livox lidars have limited FoVs and an irregular scan pattern, bringing new challenges for lidar mapping and

•
An accurate and automatic calibration of multiple non-repetitive scanning lidars with small or no overlapping districts. • A novel feature selection method for multiple lidar fusion, which not only increases computation efficiency, but also raises the awareness against degeneracy. • A self-adaptive feature extraction method for various lidars; both the close-to-rectangle and circular scanning pattern of Livox lidar can be satisfied. • A scan context [28] for the place description of lidars with irregular scan patterns. Experimental results show that this method is robust in challenging areas.
The remainder of the paper is organized as follows: Section 2 presents the calibration, feature selection, and optimization methods applied in this research; Section 3 describes the detailed experiments and analyzes the results; Finally, Section 4 draws the conclusions.

Syetem Overview
The structure of the system is shown in Figure 1, mainly comprised of two parts: the front-end fusion and back-end optimization. In the front-end process, the algorithm first automatically detects the number and type of lidar inputs regarding different lidar IDs, with a separate synchronization status measured against IMU. Then the point clouds are sent to parallel threads for feature extraction. Subsequently, the good feature sets are picked using pre-defined thresholds. Lastly, a scan registration procedure iteratively optimizes the lidar pose due to the non-repetitive scanning patterns. In the back-end module, the system fuses the results from the front-end and outputs the final pose estimation. The scan context integrated loop closure is checked through maintained keyframes.

Automatic Calibration of the System
The extrinsic features of a multi lidar system need to be carefully configured before an experiment. This is generally done by detecting coincident feature points and solving the relative pose between different lidars by utilizing a target-based calibration method [29,30]. As this approach requires coincident FoVs among lidars, this is not applicable for nonoverlapping systems, and thus a map-based online calibration method is often used [31][32][33][34]. As shown in Figure 2a, although the five lidars on the UGV platform do have overlapped areas, we still adopt the automatic calibration method for fast deployment.
Given two lidar coordinates, P lidarA and P lidarB , the goal of calibration is to determine the tranlation and rotational relation between them, denoted as T A B and R A B . With the assumption of an isomorphic constraint model for the environment, these two matrices can be iteratively solved using continuously constructed submaps. The Fast-lio [16] is applied for creating a precise local map, and the generalized-ICP [35] is used for map alignment. We obtain the first T A B and R A B from a coarse manual physical measurement, and they act as the initial guess for estimation of the exact calibration matrix. As the generalized-ICP becomes unstable and encounters mismatching errors, where obvious geometrical features are insufficient, both the calibration processes were carried out in an underground parking garage. With large registration outliers removed by random sample consensus (RANSAC) [36], the final calibration matrices could be determined after a one-minute slow driving around the garage. As can be seen in Figure 3, the geometrical structures were maintained well with the computed calibration matrix, and the best calibration accuracy was at millimeter-level after several trials.

Feature Extraction
As shown in Figure 4, two different types of Livox series lidars are applied, the Livox Tele-15 and Livox Horizon. The first one is a long-range, high-density lidar, with structured circular scan pattern, the FoV is 14.5 • × 16.2 • . While the Horizon is totally different from Tele-15 with six laser scan diodes 'brushing' arbitrarily in the 81.7 • × 25.1 • FoV. Various lidars can be connected, either with a Livox Hub or a switch, and we used the time synchronizer in robot operation system (ROS) [37] to fuse and synchronize the lidars. With the aim of allowing an arbitrary number and type of lidars to qualify for our system, we designed a parallel feature extraction process with computational efficiency. Each incoming raw lidar frame is sent to a respective thread for feature extraction, according to its ID, and the extraction methods differ between Tele-15 and Horizon.

•
Tele-15: The feature points can be extracted through counting the local smoothness. Moreover, in view of the limited feature points in the tiny FoV, point reflectivity is also employed as an extra determinant. If the reflectivity of a point is different from the neighboring one for a threshold, it is also treated as an edge point.
where S contains the points in the recent district, and X L (k,j) and X L (k,i) represent the specific coordinates of points.

•
Horizon: We deployed a purely time domain feature extraction method for Horizon. All the raw point cloud data in a single frame are divided into patches with a 6 × 7 point, and an eigendecomposition is performed for the covariance of the 3D coordinates. All the 42 points are extracted as surface features if the second largest eigenvalue is 0.4 times larger than the smallest one. Then the points with the largest curvature on each scan line are found for non-surface patches, and an eigendecomposition is performed. If the largest eigenvalue is 0.3 times larger than the second largest one, the six points are extracted as edge features. Although highly accurate for feature extraction, this method can only solve low speed occasions due to the limited patch size. Therefore, a time-domain-based method is selected for the UGV platform, while a traditional approach is adopted for the passenger vehicle platform.

Feature Selection
Multiple-lidar fusion maximizes the awareness of surroundings, but inevitably extends the computation time, and data association is highly time-consuming if all the features are exploited. Furthermore, the numerous measurements from side lidars are prone to degeneration, such as the walls with planar environment and feature-poor areas in Figure 5. It would be better if only the informative features are selected and sent for optimization.
According to [38], well-conditioned constraints should distribute in different directions, constraining the pose from different angles. For instance, parallel constraints are more susceptible to degeneracy than their orthogonal counterparts. We therefore adopted a feature selection algorithm following [39]. Features that are most valuable for estimation are selected as good features, and both the data association and state optimization should utilize only them. Denoting F K as the number of extracted features in one sweep, G K as the amount of good features set, and M K as the maximum number of selected features, the feature selection problem can be expressed as: where Λ(G K ) is the information matrix of the good feature set and logDet represents the log determinant. As a NP-hard problem, low-latency data association is guaranteed through lazier-greedy algorithm [40]. The idea is simple: at each round, the current best feature is identified only from evaluation of a random subset, while all the n candidate set is searched for the simple greedy approach. With a predefined decay factor , the size s of a random subset can be expressed as: In this way, the time complexity is reduced from where ρ is the constant rejection ratio. As reported in [41], the approximation ratio of a greedy approach was proven to be 1 − 1/e, thus the lazier greedy reaches a (1 − 1/e − ) approximation guarantee in expectation of the optimal solution. Let f (·) be the non-negative monotone and submodular function, and set the size of the random set as s. Differently from the adaptive adjustment of M K with online degeneracy evaluation in [42], we set a constant ratio of all features M K = 0.1F K for time efficiency. Let G * K be the optimal set and G sub K the result of the lazier-greedy algorithm; G sub K achieves the approximation guarantee in expectation: The detailed process of feature selection is summarized as follows: 1. Down sample the current fused point cloud P k with voxel filter [43] and extract all the feature points F K . Set G sub K as empty at the beginning of each frame.

2.
For each lidar input, obtain a random subset R containing s elements from F K . For each feature point p i in R, search the correspondence and compute the information matrix Λ i from residuals calculated from (6) and (7). Add any p i to G sub K that leads to maximum enhancement of the objective, p i ← argmax

3.
For each lidar input, stop step 2 until M K good features are found.

4.
Send all the good feature sets to scan registration after every thread finishes step 3.

Scan Registration
The extracted edge and surface features from different lidars are synchronized and treated as a whole. Point-to-edge and point-to-plane metrics are calculated with respect to lidar IDs and optimized together.
The scan registration starts with projecting the current point cloud P k into the global map P W , with the predefined rotational and translational relationship (R k , T k ): Suppose that E k+1 denotes the set of all edge features in the time k + 1, for each edge feature E i in E k+1 , the five nearest edge points in the P W are found, with their mean value p 5 . To make sure that they indeed form a line, the covariance matrix is computed and an eigendecomposition is performed based upon it. If the biggest eigenvalue is four times larger than the second biggest one, the five points are on a line where E i should lie. The point-to-edge metric is defined as follows: here p 1 is a point among the five points, and similarly to the point-to-edge residual calculation, the five nearest surface points are searched for each surface feature S i in S k+1 . The eigendecompositions of the covariance matrix are also computed; if the smallest eigenvalue is four-times less than the second smallest one, the five points form a plane, and the point-to-plane distance can be calculated by: where p 1 and p 3 are the two points in the five points. The two residuals are also introduced as a measurement for dynamic object filtering. Residuals computed from (6) and (7) are sorted with one quarter of the largest residuals being removed. The pose estimation is finally performed after this process.
In the back-end, keyframes are established for global optimization, and we introduce two criteria for keyframe selection. The first is the time variation, a new keyframe is selected when the current time difference to the last keyframe exceeds 5 s, and the second is the odometry displacement, where a new keyframe is added if the distance to the last keyframe surpasses 10 m.

Scan Context Integrated Global Optimization
Place recognition provides candidates for loop closure optimization, which is critical for correcting long-term accumulating drift over a path. This is a challenging task on point cloud, as loss of textures and colors means only the geometric information is available. The well-known histogram-based methods only provide a stochastic descriptor of the scene [12], making it less discernible for place recognition problems. Thus, we follow a novel place descriptor scan context [28,44], targeting 3D lidar scan data.
With the idea of encoding the geometrical shape of the point cloud into an image, the scan context first divides a laser scan into azimuthal and radial bins in the lidar coordinate. As shown in Figure 6a, N s and N r are the number of sectors and rings. Suppose that the maximum detection range of a lidar is L max , then the central angle of a sector in the polar coordinate is 2π N s and the radial distance between two rings is L max N r . Therefore, for each point P k = [x k , y k , z k ], we have (8) where θ k = arctan y k x k , and a scan context Ω(i, j) can be encoded by assigning the maximum height of the 3D points within each bin through: L max is set as 130 m (detection range of Livox Horizon at 20% reflectivity), N s = 60 and N r = 40. As seen in Figure 6b, the scan context partitions whole points into equally distributed intervals. Thanks to the regular encoding pattern, far points and nearby dynamic objects can be treated as sparse noise and discarded for estimation.
As can be seen in Figure 7, the place re-identification can then be simplified as columnwise similarity matching. Given a scan context pair I q and I c , the geometric similarity function can be defined as: where c q j and c c j are the object and candidate column vectors at the same index. Since the column vectors describe changes in the azimuthal direction, they may be shifted even in the same place for a change of view-point. On the other hand, the row vector is dependent on the sensor location, and always consistent for the identical location. In this respect, the performance of scan context is limited in indoor environments, where variations in the vertical direction are insignificant. The change of view-point can be determined by column shifts: Φ g (I q , I c ) = min k∈{1,2,...,N s } ϕ g (I q , I c k ), (11) k * = argmin k∈{1,2,...,N s } ϕ g (I q , I c k ), where I c k is I c shifted by kth column. The pair I q and I c can be filtered out by an empirically defined threshold, Φ g (I q , I c ) describes the best alignment of them, and k * is the optimal distance. ∈ , ,…, , ∈ , ,…, where , and a scan context Ω i, j can be encoded by assigning the maximum height of the 3D points within each bin through: (a) Camera view and bin division of the same place. The yellow and cyan area in the right image indicate sectors and rings, respectively. The intersection of sectors and rings is a bin, in red.
(b) Scan context, with yellow and cyan column representing the same sector and ring in the above, while the bin is in black.   As a scan context is calculated for each keyframe, the candidate query will increase rapidly as vehicle moves around, and we set up a KD-tree for the fast-search of possible candidates. A local place re-identification score is defined by incorporating N temporal frames into the verification: where P m and P n are the query scan and candidate scan. This temporal score may further serve as an initial value for scan-to-scan ICP refinement.

Results
The performance of our approach was validated by extensive experiments on different platforms. All the datasets were processed by an onboard computer, DJI Manifold 2C, with i7-8550U, 8 GB RAM, with no GPU-accelerated computing. The proposed system can reach real-time performance with up to three lidars, and 90% and 70% real-time efficiencies are satisfied for five and eight lidars, respectively. GNSS outputs were set as ground truth for further comparison throughout all the experiments. Lever-arms were measured with a tape and stored in the hardware settings for online estimation. Results were also compared with the state-of-the-art (SOTA) following [45], with detailed description as follows.

System Setup and Scenario Overview
Presented in Figures 2b and 8, the platform consists of six Livox Horizon lidars, two Livox Tele-15 lidars, and a Livox Hub in charge of lidar connection. All the lidars are synchronized through gps-pps signals with a u-blox M8T the GNRMC format. With radio technical commissions for maritime services (RTCM) corrections from a Qianxun D100, the MTi-680G can provide RTK localizations.
The up-down placement of the two front-view Livox Horizon lidars has a significant advantage in the suburbs. When heavy traffic congestion is encountered, only the upper one is introduced for mapping and map-based re-localization, ensuring accurate mapping and odometry results in highly dynamic environments. As shown in Figure 9, two experiments were carried out on this platform: the first one was in a low-speed, low-kinematic urban scene, with the overall length and time cost of 1.5 km and 342 s. The second one tried to assess the mapping consistency in tunnels, and a long acoustic barrier on the overpass was chosen. The path is around 2 km long, with a time consumption of 141.5 s.

Results of Experiment #1
As the GNSS signals were occasionally blocked along the path, we only evaluated the horizontal accuracy of the proposed system with Livox Horizon Loam (official release), Lili-om, and Fast-lio. All the eight lidars were fused in our system, while only a front lidar was used for the other three methods. The following can be observed from Figure 10.

•
The advantage of integrating point clouds from multiple lidars is manifest; much more environmental information can be obtained and presented. While the front lidar mapping merely depicts the road shape, multi-lidar fusion generates generous supplementary geometrical features. Thanks to the long-range Tele-15 lidar, the playground is completely preserved on the map shown in Figure 10e. As feature quality has a great impact on lidar odometry, multi-lidar fusion should be a better approach for degeneracy problems.

•
The multi-lidar fusion system was more robust than the single lidar system in dynamic scenes. The system encountered heavy congestion at the beginning, with vehicles, bicycles, and people moving irregularly. Hence the scan-to-scan correlation was difficult to determine in a severely limited FoV. We can see from Figure 10c and the upper right corner of Figure 10d that the scan registration died with front lidar only, which was the reason for the loss of other three trajectories. This is an increasingly common situation in the suburbs; with limited observable features, single lidar mapping is vulnerable to dynamic scenes. From our experience, it could be easily misled by a truck or a bus passing by, especially at crossroads, while waiting for traffic lights. With the auxiliary features from side lidars and feature filtering for front lidars, our method had the highest accuracy in these scenarios. It can be noticed from Figure 10c that the incorrect matching of dynamic objects at the beginning ruins the entire odometry. For other three single-lidar-based approaches, the initial translational error had already surpassed 10 m. While for our system, the position errors always stayed at a low level along the trajectory.
For quantitative analysis, the absolute position and orientation errors are plotted in Figure 10d for detailed reference, showing that multi-lidar fusion had a great impact on improving the positioning accuracy. The mean and the root mean square error (RMSE) of position error were 2.452 m and 3.618 m. In addition, the mean and RMSE of attitude error were 11.967 • and 5.675 • .

Results of Experiment #2
Tunnels are a great challenge to lidar odometry due to their repetitive features; with the help of an initial guess from IMU and additional feature points from side lidars, our approach was able to generate a map of tunnels with a medium length. We choose the noise barrier fence on the overhead viaducts for quantitative analysis. The selected acoustic barrier wall was blank on the top, allowing GNSS signal penetration. The laser beam was unable to go through the surrounding walls, making it share a similar pattern with traditional tunnels. The mapping and odometry result are presented in Figure 11. The RMSEs of position (3D) and attitude (RPY) error were 6.520 m and 14.567 • .

System Setup and Scenario Overview
The small platform is shown in Figures 2a and 12a, comprised of five Livox Horizon lidars, a MTi-680G integrated navigation unit, and a Livox Hub for lidar connection. The small platform ran at up to 5 m/s in the campus, and we used a laptop to collect all the data for fast implementation.
Two loops in Wuhan university were performed to validate our approach, shown in Figure 12b, the first one is a small loop around the main building, with a time consumption of 283 s and overall length 600 m, while the second one is a big loop, 852 s in all, and around 1.8 km. As there are trees and buildings blocking the GNSS signals along the path, we only evaluated the horizontal position and roll-pitch attitude accuracy in this case.

Effects of Lidar Number
Since our scheme supports an arbitrary number of lidars for optimization, we first investigate the performance variation with respect to lidar number. The small loop is selected for illustration here, and Figure 13 describes the results from various numbers of lidars. With each of them carrying out pure odometry, no global optimization was enabled. As a result, the following was discovered.

•
A five-lidar fusion could eliminate the exaggerated height ramp of a single lidar approach. There were six speed bumpers on the path, causing large vertical vibrations to the vehicle. Some of these perpendicular motions are fatal to lidar odometry, as vertical displacement will be enlarged by incorrect registrations. The front lidar trajectory in Figure 13a encountered a vertical dump in the left corner, which was the result of two consecutive speed bump. With limited FoV from only the front view, the single lidar odometry was unable to correct these errors; it was further affected by error propagation in Figure 13c. Extending extra constraint features from side lidars, our approach was able to correct these errors and maintain a flat terrain.

•
The geometry of multiple lidar placement made a significant improvement to the system performance. We can notice from Figure 13a that the results from one front view lidar and two back view lidars had the most proximal accuracy with the five-lidar fusion. On the other hand, the positioning error of two back lidars was worse than the single front view lidar. Therefore, pentagon-like and triangle-like multi-lidar setups should deliver better mapping and positioning results in real cases.

Effects of Loop Closure Optimization
As shown in Figure 14, the big loop was taken into consideration here, and two other place representations were selected for comparison: a pose-information-based loop detector [46] and a histogram-based loop detector. The pose-information-based loop detection encountered the most exaggerated errors, in the top left corner. With the assumption of small pose deviations along the path, the pose-information-based approach took keyframes of similar poses as a detected loop. Therefore, this method is unable to handle large-scale mapping, where pose deviation is inevitable. Histogram-based loop detection also failed to close the loop, and the part in the green circle distorted seriously. As a feature-based method, it is hard to avoid the influence of dynamic objects and view angle variations with this approach. Since only a N r × N s scan context was encapsulated for each scan, this descriptor had the best accuracy for challenging scenes, with the 2D translational and roll-pitch rotational RMSEs of 1.997 m and 7.198 • .

Comparison with Mechanical Lidar
Each Livox Horizon lidar has a horizontal view of 81.7 • , with five of them completing a full coverage of 360 • , which is similar to traditional mechanical lidars. Here, two widely known lidars, Ouster OS1-64 and Velodyne VLP-16, were chosen for evaluation with our multi-lidar method. We selected three popular algorithms for demonstration, a tightly coupled Lio-sam [47], an advanced implementation of LOAM (A-LOAM), and LeGO-LOAM. Both the Lio-sam and LeGO-LOAM were also integrated with scan context as a global descriptor. The experiments on two loops are illustrated here, with the mapping results presented in Figure 15, and the following information can be observed.

•
Multi-lidar fusion had a commendable mapping accuracy and retained abundant features in the meantime. In the current market, the five Livox Horizon kit is (5 Livox Horizons and a Livox Hub) almost the same price as a VLP-16, and half the cost of an OS1-64. Moreover, the horizontal odometry results in Figure 15b and Table 1 also indicate that our approach had an accuracy comparable with Lio-sam.
• Multi-lidar fusion had a better performance in challenging scenarios. Shown in Figure 15c,d we carried out an experiment of small loop at class breaktime, with students and vehicles blocking the paths. Therefore, our vehicle had to avoid collisions with irregular motions, such as moving back and forth, sharp turnings, and fast accelerations. Lio-sam had the most significant failure in such areas, because of loss of landmark constraints. Since the IMU pre-integration process of the system depends heavily on lidar odometry [48], the system was liable to fail once the landmark constraint information was insufficient. The maximum error of A-LOAM came from back-and-forth motions, causing a more than 90 • deviation to the trajectory. The starting area was a crowded pathway with vehicles on both sides; as our platform was much lower than common sedans, most of the surface points were cast on vehicles.
Once the front view is blocked by pedestrians, the loops cannot be detected or are mismatched, hence LeGO-LOAM failed to close the loop. A crucial benefit of multilidar fusion is the capability of manipulating each lidar input freely. With a pre-set empirical threshold for the amount of edge points, all the edge and surface features from the individual lidars after selection were further down sampled to 10% of their original size once the threshold was reached. This is similar to shutting down some typical view angles of a mechanical lidar, and thus alleviating bad impacts in certain areas. Moreover, the major part of dynamic objects in the front view can be removed by point-to-edge and point-to-plane residuals. With the help of the two mentioned improvements, our multi-lidar fusion solution had a strong capability in severe situations, and the end-to-end errors were small, as presented in Table 2.

Conclusions
In this work, we proposed a robust mapping and odometry algorithm for multiple non-repetitive scanning lidars. The robustness is ensured mainly through two approaches, the first is good feature selection against massive data and degeneracy, and the latter is the novel place descriptor against dynamic objects and view angle variations. Extensive experiments on two platforms were conducted. The results show that the proposed algorithm delivers superior accuracy over SOTA single lidar methods.
There are several directions for future research. Adding additional GNSS measurements into our system is conceivable, which would further compensate for accumulated drifts and maintain the global map. Another research direction concerns map reuse. With a great variety of lidars on the market, challenges still remain for map-based re localiza-tion. It would be desirable if we could use low-cost lidars or even cameras to register high-definition maps.