LiDAR Odometry and Mapping Based on Semantic Information for Outdoor Environment

Abstract: Simultaneous Localization and Mapping (SLAM) in an unknown environment is a crucial capability for intelligent mobile robots performing high-level navigation and interaction tasks. As one of the typical LiDAR-based SLAM algorithms, the LiDAR Odometry and Mapping in Real-time (LOAM) algorithm has shown impressive results. However, LOAM only uses low-level geometric features without considering semantic information. Moreover, the lack of a dynamic object removal strategy prevents the algorithm from obtaining higher accuracy. To this end, this paper extends the LOAM pipeline by integrating semantic information into the original framework. Specifically, we first propose a two-step dynamic object filtering strategy. Point-wise semantic labels are then used to improve feature extraction and the search for corresponding points. We evaluate the performance of the proposed method in many challenging scenarios from the KITTI dataset, including highway, country and urban scenes. The results demonstrate that the proposed SLAM system outperforms state-of-the-art SLAM methods in terms of accuracy and robustness.


Introduction
Simultaneous Localization and Mapping (SLAM) technology is one of the key technologies for autonomous vehicles to perform navigation and interaction tasks. A typical SLAM framework consists of a front-end and a back-end [1]. The goal of the front-end is to estimate the transformation between adjacent frames, which includes a preprocessing step, data association and pose estimation. Once the front-end detects a loop closure, a global optimization framework, i.e., the back-end, is adopted, which aims to obtain global consistency by reducing the accumulated historical error [2].
According to the sensor type, SLAM technology can be divided into three categories: vision-based [3,4], LiDAR-based [5] and combined methods [6]. Visual sensors can capture rich environmental texture information. However, some inherent shortcomings of visual sensors ultimately lead to errors in pose estimation: for example, strong sensitivity to illumination limits their applications, and depth information from monocular cameras suffers from scale uncertainty [7]. Although scaled depth can be estimated from single images by current AI techniques, this must be solved in software. Moreover, SLAM suffers from errors due to the inaccurate depth information provided directly by stereo vision and RGB-Depth (RGB-D) cameras [8], since the accuracy of 3D distance information is inversely proportional to the measuring distance [9]. In contrast, LiDAR directly outputs more accurate depth information and can work even at night [10]. Further advantages of LiDAR sensors are their centimeter-level measurement accuracy and wide detection range. For example, a typical 3D LiDAR, the Velodyne HDL-64E, has a measurement accuracy of ±2 cm and a range of 120 m [11]. These advantages allow LiDAR to collect detailed information about the surrounding environment at long range [12]. Hence, LiDAR-based SLAM methods have been extensively studied in the autonomous robot community.

LOAM and Its Variants
To reduce computational complexity while obtaining accurate results, Zhang et al. [29] presented a typical SLAM solution called LiDAR Odometry and Mapping in Real-time (LOAM), which achieves low-drift, real-time pose and map estimation by performing point-to-line and point-to-plane matching. The system includes two individual nodes, i.e., the odometry and mapping nodes, where the former runs at high frequency and the latter outputs precise pose estimates at a lower frequency by matching the current point cloud with the map. LOAM shows excellent performance in autonomous robot localization and mapping. Inspired by this, multiple extensions of LOAM have been presented to further improve performance. In [30], the intensity information of the LiDAR is incorporated into LOAM; although this improves accuracy, the computational cost roughly doubles. Rufus et al. [31] proposed a two-layer LiDAR odometry system: a Phase Only Correlation method is first applied to calculate an approximate pose estimate, which is then used to initialize a point-to-plane ICP that refines the matching. Their experimental results show promising performance in high-speed environments. Zhou et al. [32] proposed a real-time LiDAR-based SLAM called S4-SLAM for complicated multi-scene environments, following a framework similar to LOAM. In the odometry module, the initial iteration value provided by the Super4PCS algorithm is fed into standard ICP. To speed up the matching process in the mapping module, NDT is used to compute the spatial transformation between key-frames and a dynamic voxel-grid-based local map. Unlike LOAM, location-based loop detection and pose-graph optimization are also included in this framework.
However, LOAM and the above variants have drawbacks that limit their accuracy. First, they rely solely on low-level geometric features, which may fail in complex environments. Second, dynamic objects such as pedestrians and moving vehicles, which frequently occur in urban and highway environments, are not removed. Third, using the smoothness of the local surface as the only criterion for feature point extraction inevitably leads to errors. Finally, they do not integrate semantic information into the system to improve performance. Semantic information is an enabling factor for autonomous vehicles to perform high-level tasks and provides a fine-grained understanding of the scene [33]. In general, semantic information can assist SLAM systems as a prior in three aspects:
• Moving objects removal. Dynamic objects, such as pedestrians and vehicles, lead to false corresponding points, which can cause large errors. Semantic information can be used to remove such objects [34].
• Data association. Semantic information can serve as an additional constraint when searching for corresponding points [34,35] and detecting loops [36].
• Semantic mapping. The constructed semantic map supports further path planning and obstacle avoidance tasks [37].

Semantic-Assisted LiDAR SLAM Method
Due to the application potential of semantic information in the SLAM field, integrating semantic information into SLAM has gained more and more popularity over the years. Some researchers first exploit low-level geometric features to obtain semantic information, which is further used as a prior for the SLAM algorithm. In [38], semantic labels are introduced to improve point-to-point ICP. This method not only improves accuracy but also speeds up convergence, which is attributed to constraining the corresponding points to the same semantic category. However, it only uses the gradient relationship between points to segment the point cloud, which is designed for indoor scenes and cannot handle complex outdoor environments. An extension of LOAM (LeGO-LOAM) is presented in [10]. This algorithm first segments the point cloud into ground and non-ground points based on the scanning principle of the LiDAR. Subsequently, edge points and planar patches are extracted from the non-ground points and the ground, respectively. After that, a two-step L-M optimization is employed to obtain an accurate trajectory. However, LeGO-LOAM simply divides the point cloud into ground and non-ground points without considering point-wise semantic labels. Similar methods are presented in [14,39]. Liu et al. [40] proposed a segmentation-based LiDAR SLAM method. Compared with LeGO-LOAM, they presented a more refined point cloud segmentation, which includes ground, road-curb, edge and surface classes. However, they still use low-level geometric information to segment the point cloud, and the algorithm requires a prior map for matching, which limits its application. All the above algorithms use low-level segmentation methods that cannot obtain point-wise semantic labels; hence, semantic information has not been fully integrated into the SLAM system.
In recent years, deep learning has shown great potential in point cloud semantic segmentation. Due to the irregular format of 3D point clouds, researchers initially relied mainly on indirect methods, such as multi-view, 3D voxelization and projection, to perform point cloud semantic segmentation [41]. In 2016, Qi et al. [42] pioneered a deep semantic segmentation network (PointNet) that directly consumes point clouds. Since then, many excellent deep learning frameworks for semantic segmentation have emerged, such as Pointwise [43] and RandLA-Net [44]. On this basis, researchers take the semantic information from these networks as a prior to improve the accuracy and robustness of SLAM algorithms. Zaganidis et al. [28] used per-point semantic labels generated by PointNet to partition the point cloud into disjoint segments; NDT or multichannel Generalized ICP (GICP) is then constructed for each segment separately. However, this algorithm works poorly in highway environments. In [45], a surfel-based semantic mapping SLAM approach called SuMa++ is presented. In their algorithm, the point-wise semantic labels provided by the projection-based semantic segmentation network RangeNet++ [46] are used to constrain point cloud matching and to remove dynamic objects. Chen et al. [47] presented a semantic-based LiDAR SLAM for the estimation of tree diameters. The pipeline utilizes a semantic segmentation network to separate forest environments into tree and ground points. After that, point-to-cylinder and point-to-plane distance functions based on the tree and ground features are employed to estimate pose transforms in the odometry and mapping modules, respectively. However, this algorithm is designed mainly for forest inventory and therefore cannot be applied to urban environments that contain rich semantic information. Zhao et al.
[34] constrained the corresponding points to the same semantic label, which is similar to our work; however, they still use curvature, rather than point-wise semantic labels, to extract edge and plane points. Wang et al. [35] weighted the contribution of feature correspondences by their semantic similarity distribution, but this algorithm likewise uses curvature instead of semantic information to extract feature points and does not consider point-wise semantic labels.
Motivated by the discussions above, this paper proposes a novel semantic-assisted LiDAR SLAM method (denoted "S-ALOAM"). Our goal is to develop a real-time LiDAR SLAM method with high robustness and low drift. The key idea is to combine point-wise semantic labels with LiDAR odometry and mapping (LOAM). The proposed system includes four modules, namely, the scan pre-processing, feature extraction, LiDAR odometry and LiDAR mapping modules. It should be noted that this paper does not propose a specific point cloud semantic segmentation network. As mentioned above, many excellent deep learning networks for point cloud semantic segmentation are emerging, and they can output high-precision point-wise semantic labels. Therefore, any network that outputs point-wise semantic labels, such as the PointNet and RangeNet++ mentioned above, is seamlessly compatible with our algorithm. Meanwhile, the original LOAM algorithm uses an IMU to assist the LiDAR. However, this paper focuses on SLAM based on the LiDAR alone, and no additional sensors are considered. Therefore, the following experiments all adopt the LOAM algorithm without IMU, i.e., A-LOAM, an open-source implementation of LOAM [48]. It differs from LOAM in that it removes the IMU and uses Eigen and a non-linear optimization library to simplify the code structure. Due to these changes, its accuracy is slightly inferior to the original LOAM; however, this does not affect the effectiveness of the algorithm proposed in this paper. The primary contributions are as follows:
• A point cloud with point-wise semantic labels is used to coarsely remove dynamic objects and outliers. The proposed filtering method largely preserves the static parts of all movable classes while removing dynamic objects and outliers.
• We use point-wise semantic labels instead of the smoothness of the local surface to extract edge and plane features. Semantic labels are first used to establish candidate feature points; down-sampling and culling strategies then select the final feature points from these candidates.
• In the LiDAR odometry and mapping modules, we constrain the corresponding points of frame-to-frame or frame-to-map matching to the same semantic label. Besides, a second dynamic objects filtering strategy is presented in the mapping module.
• To verify the proposed solution, extensive experiments have been carried out in several scenarios, including urban, country and highway, based on the SemanticKITTI dataset [33]. Experimental results show that the proposed method achieves high-precision positioning and mapping compared with state-of-the-art SLAM methods.
The remainder of the paper is structured as follows: In Section 2, the proposed methodology is described in detail. Experimental results are then shown in Section 3, which is followed by a discussion in Section 4. Finally, the paper ends with a conclusion in Section 5.

Materials and Methods
An overview of our S-ALOAM is shown in Figure 1. The system takes 3D LiDAR data as input and outputs a 6-DOF pose estimate and a feature map. The overall system contains four sequential modules. The first, scan pre-processing, utilizes point-wise semantic labels generated by a semantic segmentation network to coarsely remove dynamic objects and outliers. The remaining segmented point cloud is then input to the feature extraction module, where edge and planar features are extracted by semantic labels instead of the smoothness of the local surface. Next, LiDAR odometry estimates the transformation between consecutive point clouds by combining semantic constraints with geometric features; this part also compensates for point cloud deformation caused by LiDAR rotation and vehicle movement. Finally, these features are fed into LiDAR mapping, which obtains accurate pose and feature map estimates through a frame-to-map framework; a second dynamic objects filtering strategy is also applied in this step. The details of each module are introduced in the following sections.

Scan Pre-Processing
The scan pre-processing module consists of point-wise semantic label acquisition and dynamic objects removal. Since we aim to integrate semantic information into the LiDAR SLAM framework, any semantic segmentation method that outputs point-wise semantic labels can be applied to our system; for example, all semantic segmentation algorithms ranked on the SemanticKITTI benchmark [33] can be integrated. Therefore, in this section, we do not propose a special semantic segmentation method but use a pre-labeled dataset, i.e., SemanticKITTI, to verify the proposed approach. The details of the SemanticKITTI dataset are given in Section 3. Using point-wise semantic labels, the point cloud can be divided into non-overlapping sub-point clouds with rich semantic attributes, such as pedestrians and cars.
Dynamic objects, such as pedestrians and moving vehicles, cause wrong associations. A simple solution is to filter out all movable semantic classes, including all vehicles and persons. However, this also removes many static objects, such as parked vehicles, which are valuable features for data association and pose estimation. To this end, we propose a two-step dynamic object removal strategy. In this part, only the first step, the coarse filtering method, is presented; the second step is described in the mapping module. Specifically, the coarse filter removes the objects with a larger probability of movement in the environment, namely persons, bicyclists, motorcyclists and railcars. Meanwhile, this step also filters out points labeled as outliers or unlabeled. Overall, the proposed filtering method largely preserves the static part of all movable classes while removing dynamic objects and noise points.
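As a minimal sketch (not the authors' implementation), the first-step coarse filter reduces to a label lookup over the annotated point cloud; the class-name strings below are illustrative placeholders rather than the dataset's exact label IDs:

```python
# Classes assumed to have a high probability of movement (hypothetical names).
MOVABLE_DYNAMIC = {"person", "bicyclist", "motorcyclist", "railcar"}
# Points without a usable annotation are dropped as well.
INVALID = {"unlabeled", "outlier"}

def coarse_filter(labeled_points):
    """First-step dynamic object removal: drop high-motion-probability
    classes plus unlabeled/outlier points, while keeping static movable
    classes (e.g., parked cars) as features for data association."""
    return [(xyz, label) for xyz, label in labeled_points
            if label not in MOVABLE_DYNAMIC and label not in INVALID]

scan = [((1.0, 2.0, 0.1), "car"),        # parked car: kept
        ((4.0, 0.5, 0.0), "person"),     # pedestrian: removed
        ((7.0, 1.0, 0.2), "road"),       # static: kept
        ((2.0, 3.0, 0.0), "unlabeled")]  # no annotation: removed
filtered = coarse_filter(scan)
```

Note that only classes with a high movement probability are dropped outright; a parked car survives this step and is only culled later, in the mapping module, if its matching residual is large.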

Feature Extraction
LOAM and its variants extract edge and planar features by calculating the smoothness of the local surface. To extract features evenly in the environment, they divide a scan line into six equal subregions, and two edge features and four planar features are extracted from each subregion. A method that relies only on local smoothness cannot guarantee accurate feature extraction. Moreover, one drawback of this fixed per-subregion quota is that subregions without obvious features are forcibly assigned the same number of feature points as other subregions. In this part, we completely redesign the feature extraction of LOAM by considering semantic information.
Different from LOAM, the proposed method works on point clusters with the same semantic label, line by line in chronological order. Let $P^{l}_{(k,i)}$ be the $i$th point of the $k$th point cloud $P_k$ acquired during sweep $k$, $P^{l}_{(k,i)} \in P_k$, where the superscript $l$ indicates the semantic label of point $P_{(k,i)}$. Point $P^{l}_{(k,i)}$ is selected as an edge point only if one of the following three cases is satisfied.
1. The edge feature can be selected from the same semantic category by comparing the ranges of consecutive points:

$$\Delta_i = \left\| P^{l}_{(k,i+1)} \right\| - \left\| P^{l}_{(k,i)} \right\| \tag{1}$$

Then, $P^{l}_{(k,i)}$ is an edge point if (e.g., $P^{l}_{(k,i)}$ in Figure 2a):

$$\Delta_i > th \tag{2}$$

Otherwise, if:

$$\Delta_i < -th \tag{3}$$

then $P^{l}_{(k,i+1)}$ is selected as the edge point. Here, $P^{l}_{(k,i+1)}$ is the nearest measurement point of $P^{l}_{(k,i)}$ in scanning order, and they have the same semantic label $l$, which indicates that these two points belong to the same category in the real environment; $th$ denotes a threshold, and $\|\cdot\|$ represents the Euclidean distance from the LiDAR center. As shown in Figure 2a, we select the convex points with respect to the center of the LiDAR in the same semantic category as edge points. These points often occur in the real environment (e.g., the intersection of two walls).
2. The edge feature can be selected from the boundary of different semantic categories. Figure 2b intuitively describes this case. $P^{l}_{(k,i)}$ and $P^{m}_{(k,i+1)}$ are two consecutive points that belong to two different semantic categories $l$ and $m$, respectively. Point $P^{l}_{(k,i)}$ can be treated as the edge point only if:

$$\left\| P^{l}_{(k,i)} \right\| < \left\| P^{m}_{(k,i+1)} \right\| \tag{4}$$

Otherwise, $P^{m}_{(k,i+1)}$ is labeled as the edge point. In this case, those boundary points between different semantic categories that are closer to the center of the LiDAR are selected as edge points, such as point $P^{l}_{(k,i)}$ in Figure 2b.

3. The edge feature extraction in cylinder-like structures. Edge points are generated when the LiDAR scans cylinder-like structures (see Figure 2c). To maintain real-time performance and the invariance of feature points, we extract edge points by down-sampling the points of the cylinder-like structure. Specifically, if the number of points on the scan line where the cylinder-like object is located is less than six, the midpoint of these points is selected as the final edge point. Otherwise, we use these points to fit a circle by least squares, and the center of the fitted circle is considered as the edge point on the current scan line. This is based on the recognition that the points of a cylinder-like structure on a scan line form a circle. Next, we introduce how to use least squares to fit a circle. As shown in Figure 2c, let $(P^{l}_{(k,a)}, \cdots, P^{l}_{(k,b)})$ be the points of the cylinder-like structure $l$ on the current scan line ($(b - a) > 6$). The equation of the circle is often defined as follows:

$$(x - x_c)^2 + (y - y_c)^2 = R^2 \tag{5}$$

where $(x_c, y_c)$ represents the center of the circle, and $R$ denotes its radius. The goal of the least-squares-based fitting method is to minimize the following equation:

$$\min_{x_c,\, y_c,\, R} \; \sum_{i=a}^{b} \left( \sqrt{(x^{l}_{i} - x_c)^2 + (y^{l}_{i} - y_c)^2} - R \right)^2 \tag{6}$$

where $(x^{l}_{i}, y^{l}_{i})$, $i = a, \cdots, b$, are the coordinates of $P^{l}_{(k,i)}$ from the cylinder-like structure (labeled semantic class $l$) on the current scan line.
However, Equation (6) only has a numerical solution that must be calculated iteratively. To obtain an analytical solution that also meets the real-time requirements, we simplify it slightly as follows:

$$\min_{x_c,\, y_c,\, R} \; \sum_{i=a}^{b} \left( (x^{l}_{i} - x_c)^2 + (y^{l}_{i} - y_c)^2 - R^2 \right)^2 \tag{7}$$

Let $g(x, y) = (x - x_c)^2 + (y - y_c)^2 - R^2$. Equation (7) can be expressed as:

$$\min_{x_c,\, y_c,\, R} \; \sum_{i=a}^{b} g\!\left(x^{l}_{i}, y^{l}_{i}\right)^2 \tag{8}$$

Equation (8) can now be solved by setting partial derivatives to zero; the specific calculation details are not given in this paper. In summary, edge features are extracted by the above strategies. In contrast, it is easier to extract planar features: in our system, all points of semantic classes with plane attributes, such as road, sidewalk, traffic-sign and cars, are considered as planar features, excluding those edge points that satisfy the above conditions.
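To illustrate the closed-form solution of Equation (8), the sketch below implements an algebraic (Kåsa-style) circle fit in Python: the circle is rewritten in a form linear in its unknowns, so the partial-derivative conditions become a 3×3 linear system. This is an illustrative reconstruction under that assumption, not the authors' code:

```python
import math

def fit_circle_kasa(points):
    """Closed-form algebraic circle fit in the spirit of Equation (8):
    write the circle as x^2 + y^2 + B*x + C*y + D = 0, which is linear
    in (B, C, D), and solve the 3x3 normal equations."""
    n = len(points)
    sxx = sxy = syy = sx = sy = 0.0
    bx = by = b1 = 0.0
    for x, y in points:
        z = -(x * x + y * y)          # right-hand side of B*x + C*y + D = z
        sxx += x * x; sxy += x * y; syy += y * y
        sx += x; sy += y
        bx += x * z; by += y * z; b1 += z
    A = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, float(n)]]
    rhs = [bx, by, b1]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        rhs[col], rhs[piv] = rhs[piv], rhs[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 3):
                A[r][c] -= f * A[col][c]
            rhs[r] -= f * rhs[col]
    sol = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):               # back substitution
        sol[r] = (rhs[r] - sum(A[r][c] * sol[c] for c in range(r + 1, 3))) / A[r][r]
    B, C, D = sol
    xc, yc = -B / 2.0, -C / 2.0       # recover center and radius
    R = math.sqrt(xc * xc + yc * yc - D)
    return xc, yc, R

# Noiseless points on a circle centered at (2, -1) with radius 0.5.
pts = [(2 + 0.5 * math.cos(0.7 * j), -1 + 0.5 * math.sin(0.7 * j))
       for j in range(9)]
xc, yc, R = fit_circle_kasa(pts)
```

For noiseless arc points the algebraic fit recovers the center exactly, which is the quantity used as the edge point of the cylinder-like structure.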
After feature extraction, the sets of edge features $E_k$ and plane features $H_k$ at the current time $k$ are obtained. We then obtain reduced sets $\hat{E}_k$ and $\hat{H}_k$ by down-sampling $E_k$ and $H_k$, respectively, so that $\hat{E}_k \subset E_k$ and $\hat{H}_k \subset H_k$. Specifically, the edge points $\hat{E}_k$ are selected by uniformly sampling $E_k$, while the plane features $\hat{H}_k$ are selected by taking the center point of each point cluster with plane attributes on every scan line.
The proposed method follows a framework similar to LOAM's, i.e., line-by-line and piecewise extraction in scanning order. However, we use semantic information instead of local smoothness as the feature selection criterion. In addition, LOAM forcibly divides each scan line into six uniform segments, which may break the coherence of feature extraction; by contrast, our piecewise extraction is based entirely on semantic categories, which is more consistent with the distribution of feature points in the real environment.

LiDAR Odometry
This module includes deformation correction and LiDAR odometry. For clarity, we first describe the deformation correction and then the LiDAR odometry itself.

Deformation Correction
Mobile vehicles equipped with LiDAR acquire the point cloud by rotating the laser beam while moving forward. Vehicle movement during point cloud acquisition therefore deforms the point cloud, and this deformation must be compensated for. In this part, we adopt the same method as LOAM, namely linear interpolation, which models the motion within a point cloud as constant angular and linear velocity motion.
Denote by $t_k$ the end time of the current point cloud $P_k$, and by $t_{k-1}$ its starting time, i.e., the end time of the previous point cloud $P_{k-1}$. Then $T^{k}_{k-1}$ represents the relative motion transform of the vehicle over $[t_{k-1}, t_k]$. Under the constant-velocity assumption, given a laser point $P_{(k,i)}$ received at time $t_{(k,i)}$, its interpolated transform $T_{(k,i)}$ can be computed as:

$$T_{(k,i)} = \frac{t_{(k,i)} - t_{k-1}}{t_{scan}} \, T^{k}_{k-1} \tag{9}$$

where $t_{scan}$ represents the scanning period of a point cloud. Thus, the non-deformable point clouds $\bar{E}_k$ and $\bar{H}_k$ are obtained by projecting $E_k$ and $H_k$ to the end of the sweep $k$, i.e., $t_k$, by:

$$P^{s}_{(k,i)} = R_{(k,i)} P_{(k,i)} + D_{(k,i)} \tag{10}$$

$$\bar{P}_{(k,i)} = \left( R^{k}_{k-1} \right)^{-1} \left( P^{s}_{(k,i)} - D^{k}_{k-1} \right) \tag{11}$$

where $P_{(k,i)}$ is a feature point of $E_k$ or $H_k$; $R_{(k,i)}$ and $D_{(k,i)}$ are the rotation matrix and translation vector of $T_{(k,i)}$, respectively; $R^{k}_{k-1}$ and $D^{k}_{k-1}$ denote those of $T^{k}_{k-1}$; and $\bar{P}_{(k,i)}$ is the corresponding non-deformable feature in $\bar{E}_k$ or $\bar{H}_k$. In other words, $P_{(k,i)}$ is first projected to the initial moment $t_{k-1}$ of the current point cloud, yielding $P^{s}_{(k,i)}$, which is then transformed to the end of the sweep $k$ by $T^{k}_{k-1}$.
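The two-step projection can be sketched in Python with a planar (yaw-only) motion model; this simplification to 2D, and the sign convention for the relative motion, are assumptions for illustration and not the paper's implementation:

```python
import math

def undistort_point_2d(p, t_rel, t_scan, yaw, trans):
    """Linear-interpolation deformation correction, reduced to a planar
    (yaw-only) motion for illustration.  Assuming constant angular and
    linear velocity over the sweep, a point captured at fraction
    s = t_rel / t_scan of the scanning period is first expressed in the
    frame at the sweep start and then moved into the frame at the sweep
    end with the full relative motion (yaw, trans)."""
    s = t_rel / t_scan                       # interpolation ratio in [0, 1]
    a = s * yaw                              # partial rotation accrued
    dx, dy = s * trans[0], s * trans[1]      # partial translation accrued
    # Project the point to the initial moment t_{k-1} (yields P^s).
    xs = math.cos(a) * p[0] - math.sin(a) * p[1] + dx
    ys = math.sin(a) * p[0] + math.cos(a) * p[1] + dy
    # Transform P^s to the end of the sweep with the inverse full motion.
    ux, uy = xs - trans[0], ys - trans[1]
    xe = math.cos(-yaw) * ux - math.sin(-yaw) * uy
    ye = math.sin(-yaw) * ux + math.cos(-yaw) * uy
    return xe, ye

# A point measured exactly at the sweep end needs no correction.
xe, ye = undistort_point_2d((1.0, 2.0), 0.1, 0.1, 0.3, (0.5, -0.2))
```

A useful sanity check on any such convention is that a point measured at $t_{(k,i)} = t_k$ (i.e., $s = 1$) must come back unchanged, since the partial and full motions then cancel.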

LiDAR Odometry
The basic algorithm of this part is consistent with LOAM, but semantic constraints are introduced when searching for corresponding points; therefore, we briefly restate it to show how semantic information assists the search. The purpose of the LiDAR odometry module is to estimate the LiDAR transformation between two consecutive point clouds by minimizing point-to-edge and point-to-plane distance functions. To find corresponding points between the current point cloud $P_k$ and the previous point cloud $P_{k-1}$, $E_k$ and $H_k$ are projected to the beginning of the sweep $k$, yielding $\tilde{E}_k$ and $\tilde{H}_k$:

$$\tilde{P}_{(k,i)} = R_{(k,i)} P_{(k,i)} + D_{(k,i)} \tag{12}$$

where $R_{(k,i)}$ and $D_{(k,i)}$ are the rotation matrix and translation vector of $T_{(k,i)}$, respectively, $P_{(k,i)}$ is a feature point of $E_k$ or $H_k$, and $\tilde{P}_{(k,i)}$ is the corresponding feature in $\tilde{E}_k$ or $\tilde{H}_k$. For each feature in $\tilde{E}_k$ and $\tilde{H}_k$, we search for the closest neighbor point in $\bar{E}_{k-1}$ and $\bar{H}_{k-1}$. The point-to-edge scan matching is performed by minimizing the distance from each edge point in $\tilde{E}_k$ to the corresponding edge line in $\bar{E}_{k-1}$, where the edge line is represented by two edge points on adjacent scan lines. The point-to-edge-line distance is defined as:

$$d_E = \frac{\left| \left( \tilde{E}^{l}_{(k,i)} - \bar{E}^{l}_{(k-1,j)} \right) \times \left( \tilde{E}^{l}_{(k,i)} - \bar{E}^{l}_{(k-1,q)} \right) \right|}{\left| \bar{E}^{l}_{(k-1,j)} - \bar{E}^{l}_{(k-1,q)} \right|} \tag{13}$$

where $\bar{E}^{l}_{(k-1,j)}$ and $\bar{E}^{l}_{(k-1,q)}$ are points of $\bar{E}_{k-1}$ from the previous point cloud, $\tilde{E}^{l}_{(k,i)}$ is an edge feature in $\tilde{E}_k$ from the current point cloud, and the superscript $l$ indicates the semantic label of the point.
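Equation (13) has a direct geometric reading: the area of the parallelogram spanned by the two difference vectors, divided by the length of the edge line. A small illustrative sketch (not the paper's code):

```python
import math

def point_to_edge_distance(p, a, b):
    """Point-to-edge-line distance in the spirit of Equation (13): the
    distance from feature point p to the line through correspondence
    points a and b, i.e., |(p - a) x (p - b)| / |a - b|.  In S-ALOAM the
    correspondences a, b must additionally share p's semantic label."""
    pa = [p[i] - a[i] for i in range(3)]
    pb = [p[i] - b[i] for i in range(3)]
    # Cross-product magnitude = area of the spanned parallelogram.
    cross = [pa[1] * pb[2] - pa[2] * pb[1],
             pa[2] * pb[0] - pa[0] * pb[2],
             pa[0] * pb[1] - pa[1] * pb[0]]
    area = math.sqrt(sum(c * c for c in cross))
    base = math.sqrt(sum((a[i] - b[i]) ** 2 for i in range(3)))
    return area / base

# Point at height 1 above the line through the origin along the x-axis.
d = point_to_edge_distance((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
```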
Furthermore, the point-to-plane scan matching is performed by minimizing the distance from each planar point in $\tilde{H}_k$ to the corresponding planar patch in $\bar{H}_{k-1}$, where the planar patch is formed by three planar features on adjacent scan lines. The point-to-plane distance is then computed as:

$$d_H = \frac{\left| \left( \tilde{H}^{m}_{(k,i)} - \bar{H}^{m}_{(k-1,j)} \right) \cdot \left( \left( \bar{H}^{m}_{(k-1,j)} - \bar{H}^{m}_{(k-1,q)} \right) \times \left( \bar{H}^{m}_{(k-1,j)} - \bar{H}^{m}_{(k-1,r)} \right) \right) \right|}{\left| \left( \bar{H}^{m}_{(k-1,j)} - \bar{H}^{m}_{(k-1,q)} \right) \times \left( \bar{H}^{m}_{(k-1,j)} - \bar{H}^{m}_{(k-1,r)} \right) \right|} \tag{14}$$

where points $\bar{H}^{m}_{(k-1,j)}$, $\bar{H}^{m}_{(k-1,q)}$ and $\bar{H}^{m}_{(k-1,r)}$ form the corresponding planar patch, and the superscript $m$ indicates the semantic label of the point.
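Analogously, Equation (14) projects the difference vector onto the patch normal obtained from the cross product of two in-patch edges. An illustrative sketch (not the paper's code):

```python
import math

def point_to_plane_distance(p, a, b, c):
    """Point-to-plane distance in the spirit of Equation (14): distance
    from feature point p to the planar patch through a, b, c, i.e.,
    |(p - a) . n| / |n| with normal n = (b - a) x (c - a).  In S-ALOAM
    the three patch points must share p's semantic label."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    n = [ab[1] * ac[2] - ab[2] * ac[1],       # plane normal (b-a) x (c-a)
         ab[2] * ac[0] - ab[0] * ac[2],
         ab[0] * ac[1] - ab[1] * ac[0]]
    norm = math.sqrt(sum(v * v for v in n))
    return abs(sum((p[i] - a[i]) * n[i] for i in range(3))) / norm

# Point at height 2 above the z = 0 plane spanned by three patch points.
d = point_to_plane_distance((0.0, 0.0, 2.0),
                            (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```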
Combining Equations (9)-(14), a nonlinear function based on the point-to-edge-line distance is obtained:

$$f_E\!\left( \tilde{E}^{l}_{(k,i)}, T^{k}_{k-1} \right) = d_E \tag{15}$$

and a nonlinear function based on the point-to-plane distance:

$$f_H\!\left( \tilde{H}^{m}_{(k,i)}, T^{k}_{k-1} \right) = d_H \tag{16}$$

Substituting each feature point of $E_k$ and $H_k$ into Equations (15) and (16), the final nonlinear function to be optimized is defined as:

$$f\!\left( T^{k}_{k-1} \right) = \mathbf{d}, \qquad \mathbf{d} = \left[ \mathbf{d}_E ;\, \mathbf{d}_H \right] \tag{17}$$

where $\mathbf{d}_E$ and $\mathbf{d}_H$ consist of the corresponding distance functions of each feature point in $E_k$ and $H_k$, respectively. The goal of LiDAR odometry is to calculate $T^{k}_{k-1}$ by iteratively minimizing $\mathbf{d}$ towards zero with the update:

$$T^{k}_{k-1} \leftarrow T^{k}_{k-1} - \left( J^{T} J + \lambda\, \mathrm{diag}\!\left( J^{T} J \right) \right)^{-1} J^{T} \mathbf{d} \tag{18}$$

where $J = \partial f / \partial T^{k}_{k-1}$, and $\lambda$ is a factor of the nonlinear optimization algorithm, such as the Levenberg-Marquardt method.
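The shape of this damped update can be illustrated on a toy scalar problem; the residual, the damping schedule, and the scalar reduction below are illustrative assumptions, not the system's actual optimizer:

```python
import math

def lm_minimize_scalar(f, jac, x0, lam=1e-2, iters=50):
    """Levenberg-Marquardt-style iteration in the shape of Equation (18),
    reduced to a single scalar unknown for illustration:
    x <- x - (J^T J + lam)^(-1) J^T f(x), with the damping factor lam
    halved on an accepted step and doubled on a rejected one."""
    x = x0
    cost = f(x) ** 2
    for _ in range(iters):
        J = jac(x)
        step = (J * f(x)) / (J * J + lam)
        x_new = x - step
        new_cost = f(x_new) ** 2
        if new_cost < cost:        # accept: behave more like Gauss-Newton
            x, cost, lam = x_new, new_cost, lam * 0.5
        else:                      # reject: damp harder (gradient-like step)
            lam *= 2.0
    return x

# Toy residual f(x) = x^2 - 2, driven to zero at x = sqrt(2); jac = df/dx.
root = lm_minimize_scalar(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

The damping term interpolates between Gauss-Newton (small λ) and gradient descent (large λ), which is what makes the iteration robust far from the optimum.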
Assume the LiDAR coordinate system {L} coincides with the world coordinate system {W} at the initial time $t_0$. The global pose $T^{W}_{k}$ can then be calculated by accumulating the relative motions $T^{k}_{k-1}$ of all previous adjacent frames up to the current frame $k$.
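The accumulation is simply repeated pose composition; it can be sketched with planar SE(2) poses (a simplification of the full SE(3) case used in the system):

```python
import math

def compose(T1, T2):
    """Compose two planar poses T = (x, y, yaw): returns T1 * T2,
    i.e., the relative motion T2 applied on top of T1 (the SE(2)
    group operation)."""
    x1, y1, a1 = T1
    x2, y2, a2 = T2
    return (x1 + math.cos(a1) * x2 - math.sin(a1) * y2,
            y1 + math.sin(a1) * x2 + math.cos(a1) * y2,
            a1 + a2)

# {L} coincides with {W} at t0, so the initial pose is the identity.
T_wk = (0.0, 0.0, 0.0)
# Accumulate the relative motions T_k^{k-1}: two steps of 1 m forward,
# each followed by a 90-degree left turn.
for rel in [(1.0, 0.0, math.pi / 2), (1.0, 0.0, math.pi / 2)]:
    T_wk = compose(T_wk, rel)
```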
Compared to LOAM, we introduce semantic constraints in Equations (13) and (14): the correspondences are only searched among feature points from $\bar{E}_{k-1}$ and $\bar{H}_{k-1}$ with the same semantic label. For example, in Equation (13), $\tilde{E}^{l}_{(k,i)}$ has the same semantic label $l$ as points $\bar{E}^{l}_{(k-1,j)}$ and $\bar{E}^{l}_{(k-1,q)}$. Searching for correspondences within the same semantic class is more likely to find the correct correspondences and greatly improves the efficiency and accuracy of scan matching.

LiDAR Mapping
After the LiDAR odometry algorithm, the non-deformable point clouds $\bar{E}_k$ and $\bar{H}_k$ are obtained, and the LiDAR odometry simultaneously outputs $T^{W}_{k}$, the pose of the current point cloud $P_k$ in the world coordinate system {W}. Let $Q^{E}_{k-1}$ and $Q^{H}_{k-1}$ be the edge features and planar features on the map, respectively, accumulated until time $k-1$. The LiDAR mapping module matches the features in $\{\bar{E}_k, \bar{H}_k\}$ to the global map to refine the pose estimate.
The pseudocode of the LiDAR mapping is presented in Algorithm 1. The pose $T^{W}_{k}$ generated by the LiDAR odometry module is set as the initial guess (line 1). The parameter $iter_{max}$ indicates the maximum number of iterations (line 2). To find corresponding points between the current point cloud $P_k$ and the global map accumulated until time $k-1$, $\bar{E}_k$ and $\bar{H}_k$ are projected to the world coordinate system {W} (lines 4 and 12):

$$P^{w}_{(k,i)} = R^{w}_{k} \bar{P}_{(k,i)} + D^{w}_{k} \tag{19}$$

where $R^{w}_{k}$ and $D^{w}_{k}$ are the rotation matrix and translation vector of $T^{W}_{k}$, respectively, and $\bar{P}_{(k,i)}$ is a feature point of $\bar{E}_k$ or $\bar{H}_k$. Lines 6-7 and lines 14-15 are implemented by computing the covariance matrix, eigenvalues and eigenvectors; readers can refer to [29] for a detailed description. After this process, we can compute the point-to-line distance (lines 7-8) and the point-to-plane distance (lines 15-16) by Equations (13) and (14), respectively. Compared to the LiDAR mapping module in LOAM, we introduce semantic constraints: the five nearest points found in $Q^{E}_{k-1}$ or $Q^{H}_{k-1}$ must have the same semantic label as $\bar{P}_{(k,i)}$ in $\bar{E}_k$ or $\bar{H}_k$. Then, $\hat{T}^{W}_{k}$ is updated (lines 22-27) based on Equation (18), as in Section 2.3.2.
As mentioned earlier, the second-step dynamic objects filtering strategy is applied in this module. The method was first proposed in [49]. Specifically, after all residual blocks are added, we first calculate $\hat{T}^{W}_{k}$ with one iteration (line 19). $\hat{T}^{W}_{k}$ is then used to recompute all residuals in Equation (17) (line 20), and the feature points with the largest 10% of residuals are removed (line 21). After this step, the full iterative update is performed. This dynamic objects filtering method rests on the assumption that dynamic points lead to large distance errors in point cloud matching.
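The residual-culling step is a simple order statistic over the matching residuals; a minimal sketch (illustrative, not the authors' code):

```python
def drop_largest_residuals(features, residuals, frac=0.10):
    """Second-step dynamic filtering: remove the fraction of feature
    points with the largest matching residuals, assuming dynamic points
    produce large point-to-line / point-to-plane distances."""
    n_drop = int(len(features) * frac)
    if n_drop == 0:
        return list(features)
    # Indices sorted by residual, smallest first; keep all but the tail.
    order = sorted(range(len(features)), key=lambda i: residuals[i])
    keep = sorted(order[:len(features) - n_drop])
    return [features[i] for i in keep]

feats = list("abcdefghij")            # ten placeholder feature points
res = [0.1, 0.2, 0.1, 5.0, 0.3, 0.2, 0.1, 0.4, 0.2, 0.3]
kept = drop_largest_residuals(feats, res)   # drops 'd' (residual 5.0)
```

Because the cull happens after one optimization iteration, the residuals it ranks already reflect a roughly aligned pose, so static points with moderate residuals are unlikely to be discarded.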

Algorithm 1 LiDAR Mapping.
Input: E̅_k and H̅_k from P_k; the point cloud maps Q^E_{k-1} and Q^H_{k-1} accumulated until the last point cloud; T^W_k from the LiDAR odometry.
Output: the optimized pose T̂^W_k.
 1: Initialization: T̂^W_k ← T^W_k
 2: for i = 0; i < iter_max; i++ do
 3:   for P̅_(k,i) ∈ E̅_k do
 4:     compute P^w_(k,i) via Equation (19)
 5:     find the five nearest points of P^w_(k,i) in Q^E_{k-1} with the same semantic label
 6:     if these five nearest points are indeed in a line then
 7:       compute the point-to-line distance, similar to Equation (13),
 8:       and add the residual d_E, similar to Equation (17)
 9:     end if
10:   end for
11:   for P̅_(k,i) ∈ H̅_k do
12:     compute P^w_(k,i) via Equation (19)
13:     find the five nearest points of P^w_(k,i) in Q^H_{k-1} with the same semantic label
14:     if these five nearest points indeed form a plane then
15:       compute the point-to-plane distance, similar to Equation (14),
16:       and add the residual d_H, similar to Equation (17)
17:     end if
18:   end for
19:   perform pose optimization with one iteration based on Equation (18)
20:   recompute d_E and d_H in Equation (17),
21:   then remove the 10% largest residuals
22:   for a maximal number of iterations do
23:     perform pose optimization based on Equation (18)
24:     if the nonlinear optimization converges then
25:       break
26:     end if
27:   end for
28: end for

Experimental Platform and Evaluation Method
In this paper, we do not propose a semantic segmentation algorithm but evaluate the proposed method on SemanticKITTI, a point-wise annotated point cloud dataset based on the KITTI Vision Benchmark. Since semantic segmentation is not the focus of this paper, we aim to explore the application of prior semantic information in the LiDAR SLAM framework; hence, the proposed method is not limited to a specific semantic segmentation method, and any method that outputs point-wise semantic labels can be applied to our system. SemanticKITTI provides accurate point-wise annotations with 34 semantic classes for the 22 sequences of LiDAR data in the KITTI Vision Benchmark [50]. Point clouds from sequences 00 to 10 form the training set with available labels, and the remaining sequences are used as the test set. Figure 3 shows a single scan with semantic annotations from SemanticKITTI. Objects with the same semantic attribute are displayed in the same color, e.g., pink represents the ground and yellow indicates vehicles.
The LiDAR data in the KITTI Vision Benchmark are recorded with a Velodyne HDL-64E laser scanner at a rate of 10 Hz. The benchmark provides 22 sequences of point cloud data: 11 training sequences with ground truth and 11 test sequences without. These sequences cover multiple environments ranging from busy city streets to highway traffic. The ground truth is provided by a high-accuracy GPS/INS navigation system. We test our approach on a desktop computer with an i7-7700 3.60 GHz CPU. All algorithms are implemented in C++ and executed on Ubuntu 16.04.

Table 1 shows the KITTI sequences used in our experiments. To verify the robustness of the algorithm, we selected six typical sequences out of the 11, chosen because they cover the situations that our algorithm might encounter in reality. First, in terms of scenarios, these sequences cover a variety of environments, including urban (00, 07), highway (01), country (04) and mixtures of these (02, 08). In terms of trajectory length, the longest path reaches 5067 m (02), while the shortest is only 394 m (04). Moreover, the vehicle speed in sequence 01 reaches 85 km/h, and some of the sequences contain loops (00, 02, 07, 08), while others do not (01, 04). In short, the six selected sequences can fully verify the performance of the proposed algorithm. The proposed algorithm is evaluated by computing the absolute errors as in [51] and the relative errors as in [50]. The absolute metric comprises the absolute root-mean-square error (RMSE) with respect to the translation and rotation over all point clouds, respectively. To compare our approach against the state-of-the-art SLAM methods, the relative errors are computed by averaging the relative translation and rotation errors over different trajectory distances.
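The absolute translational RMSE used above can be sketched as the root-mean-square of the per-scan Euclidean distances between estimated and ground-truth positions (trajectory alignment is omitted here for brevity). Vec3 and translationalRMSE are illustrative names, not from the paper or the evaluation tools.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal position type for a trajectory sample.
struct Vec3 { double x, y, z; };

// Root-mean-square of the per-scan position error between an estimated
// trajectory and the ground truth (both assumed already aligned and
// sampled at the same scan indices).
double translationalRMSE(const std::vector<Vec3>& est,
                         const std::vector<Vec3>& gt) {
    assert(est.size() == gt.size() && !est.empty());
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < est.size(); ++i) {
        const double dx = est[i].x - gt[i].x;
        const double dy = est[i].y - gt[i].y;
        const double dz = est[i].z - gt[i].z;
        sum_sq += dx * dx + dy * dy + dz * dz;  // squared position error
    }
    return std::sqrt(sum_sq / static_cast<double>(est.size()));
}
```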
Next, we will evaluate the proposed algorithm from three aspects, i.e., dynamic object removal, feature extraction and pose estimation.

Dynamic Object Removal
Moving objects in the point cloud cause large errors during matching. Here, we propose a simple but effective method to coarsely remove dynamic objects and outliers. According to the semantic labels provided by SemanticKITTI, we remove points labeled as person, bicyclist, motorcyclist, railcars, outliers and unlabeled. Figure 4a shows a point cloud in which dynamic objects and outliers are displayed in pink (marked by the red rectangle and arrows). After applying our method, these points are completely removed (see Figure 4b). Compared with the naive method, which filters out all movable semantic classes, including all vehicles and persons, our method removes only objects with a high probability of movement, such as persons, bicyclists, motorcyclists and railcars, while keeping features from static objects, e.g., parked vehicles. These static objects are valuable features for data association and pose estimation. Meanwhile, this step also removes points labeled as outliers (red arrow in Figure 4a). However, moving vehicles are not handled in this step; they are dealt with in the LiDAR mapping module.
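The coarse filtering step above amounts to a label-set membership test per point. A minimal sketch follows; the struct and the numeric label IDs used in the test are illustrative placeholders, not the actual SemanticKITTI IDs.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// A point carrying its point-wise semantic label.
struct LabeledPoint {
    float x, y, z;
    std::uint32_t label;
};

// Keep only points whose label is NOT in the set of classes with a high
// probability of movement (plus outliers and unlabeled points); static
// classes such as parked vehicles and buildings survive the filter.
std::vector<LabeledPoint> removeDynamicPoints(
        const std::vector<LabeledPoint>& cloud,
        const std::unordered_set<std::uint32_t>& dynamic_labels) {
    std::vector<LabeledPoint> kept;
    kept.reserve(cloud.size());
    for (const LabeledPoint& p : cloud)
        if (dynamic_labels.count(p.label) == 0)  // keep static classes only
            kept.push_back(p);
    return kept;
}
```

Because the filter is purely label-driven, swapping the naive "remove all movable classes" policy for our coarser one only changes the contents of dynamic_labels, not the code.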

Feature Extraction Results
To evaluate the accuracy of the semantic label-based feature extraction algorithm, we select a representative urban environment: an intersection with parked cars, street lights, trees and buildings. The result is shown in Figure 5, where edge and planar features are colored red and blue, respectively. Solid yellow rectangles mark the features extracted on parked cars. As shown in Figure 5a, only a few edge features are extracted from the parked cars in A-LOAM. This is because the spacing between cars is very small, and the smoothness-based feature extraction method cannot extract effective edge features from objects next to each other. In contrast, our extraction algorithm shows excellent results on parked cars (see the solid yellow rectangles in Figure 5b). This corresponds to the extraction cases in Figure 2a,b.
Solid yellow ellipses mark edge features on cylindrical objects. Cylindrical objects, e.g., poles and trunks, contain remarkable edge points. However, these features are mistakenly extracted as planar features in A-LOAM (see the red points in the solid yellow ellipses of Figure 5a). As the solid yellow ellipses in Figure 5b show, our semantic label-based extraction algorithm obtains accurate edge features on poles and trunks. This corresponds to the extraction method in Figure 2c. Dotted yellow ellipses in Figure 5 mark edge features at the junction of the sidewalk and the road. Figure 5b shows that our method completely extracts these edge features, which corresponds to the extraction method in Figure 2b.
In addition to the above, A-LOAM also mistakenly extracts some points as edge points, for example, the red points marked by the dotted yellow rectangles in Figure 5a. These points come from the junction of the starting and ending positions of a LiDAR scan circle. Due to motion deformation, there is a large distance between points at the starting position and the ending position, and the extraction method that uses this distance to compute the smoothness mistakes them for edge points. As a result, these points are always extracted as edge points while the LiDAR moves forward. Furthermore, some outliers are extracted as edge features in A-LOAM, as shown in Figure 5a (marked by yellow arrows). By contrast, our semantic label-based method completely avoids these false edge extractions. As for planar features, our method and A-LOAM show similar performance in Figure 5. However, we only extract planar features from objects with plane attributes, such as roads, sidewalks, buildings, traffic signs and cars, which greatly increases the accuracy of planar feature extraction.
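The semantic label-based feature assignment described above can be sketched as a simple per-class mapping: cylinder-like classes yield edge-feature candidates, while plane-attribute classes yield planar-feature candidates. String class names stand in for the numeric SemanticKITTI label IDs; in the full system, other classes (e.g., vegetation in structure-poor scenes) may also contribute edge candidates.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>

enum class FeatureType { Edge, Planar, None };

// Map a point's semantic class to the feature type it may contribute.
FeatureType classifyBySemantics(const std::string& label) {
    // Cylinder-like objects carry remarkable edge points.
    static const std::unordered_set<std::string> edge_classes = {
        "pole", "trunk"};
    // Only objects with plane attributes contribute planar candidates.
    static const std::unordered_set<std::string> planar_classes = {
        "road", "sidewalk", "building", "traffic-sign", "car"};
    if (edge_classes.count(label)) return FeatureType::Edge;
    if (planar_classes.count(label)) return FeatureType::Planar;
    return FeatureType::None;  // class contributes no feature candidates
}
```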
Figure 5. Comparison of feature extraction at an intersection: (a) A-LOAM; (b) ours. Edge and planar features are colored red and blue, respectively. The yellow rectangles and ellipses are used to compare the accuracy of the different feature extraction methods.

Pose Estimation Comparison
In this part, we compare the accuracy and robustness of the proposed method, S-ALOAM, with the current state-of-the-art algorithms, namely A-LOAM, LeGO-LOAM and FLOAM [52], in various challenging outdoor environments. The framework and basic principles of FLOAM are similar to those of A-LOAM, but it optimizes the A-LOAM algorithm, improving the computational efficiency threefold. We use the six-sequence dataset with semantic labels described in Table 1 to evaluate the proposed algorithm. We first conduct a detailed analysis of the trajectory error comparison graphs of three sequences covering urban, highway and complex mixed scenes; then, the quantitative absolute and relative errors of all six sequences are given.

Figure 6 shows the trajectory comparison graph of sequence 00, which is collected in an urban environment. As Figure 6a shows, the proposed approach, S-ALOAM, generates more consistent trajectories than the other methods, which shows that semantic label-based feature extraction does improve the performance. However, compared with LeGO-LOAM, the other three methods suffer from accumulated errors in loop closures. As Figure 6a(1),a(2) show, there are large gaps when the robot revisits the same place. By contrast, LeGO-LOAM performs better in loop closures, because its loop closure constraints correct the accumulated drift. Meanwhile, Figure 6b,c show similar error curves, which demonstrates that the position error is mainly caused by height errors in LOAM and its variants. The position errors of all algorithms reach their first peak at scan 943 (see Figure 6b,c). Figure 7a is the real image corresponding to scan 943; we can see that the car is turning left and going down a slope. This shows that LOAM and its variants cannot correctly estimate height in environments with turns and slopes. By contrast, A-LOAM has a smaller error here, while FLOAM and S-ALOAM show similar accuracy; LeGO-LOAM performs worst.
The reason is that LeGO-LOAM cannot accurately extract the ground in such an environment, which leads to large height errors. After scan 943, S-ALOAM achieves, for the most part, lower translational and height errors. Table 2 shows the absolute root-mean-square translation and rotation errors. As shown in Table 2 (00), S-ALOAM is superior to the other methods in the translation estimation, and it achieves a performance similar to A-LOAM in the rotation estimation. Figure 8 compares the trajectory errors in a highway scene. As Figure 8a shows, S-ALOAM performs better than A-LOAM, LeGO-LOAM and FLOAM. The unsatisfactory performance of A-LOAM, LeGO-LOAM and FLOAM is caused by the sparse geometric structure and the many fast-moving dynamic objects in the highway scene. Figure 7b presents a photo corresponding to sequence 01. We can see that the environment contains many vehicles moving at high speed and very few geometric structures. Most of the point cloud obtained in such a challenging environment consists of ground points, and only a few points come from traffic signs and sparse trees. In this challenging environment, a purely geometric approach cannot obtain the correct data association, which eventually leads to a large trajectory error.
By comparing Figure 8a with Figure 8c, we can see that the errors of S-ALOAM mainly come from height errors, which shows that S-ALOAM achieves more accurate pose estimates in [x, y, θ_yaw]. This is because our semantic label-based feature extraction method can obtain accurate edge features, which contribute three degrees of freedom of the pose, i.e., [x, y, θ_yaw]. Several factors cause the height errors. First, the few distinct structures and the incomplete removal of dynamic vehicles lead to wrong data associations. Second, the height is mainly estimated by matching planar features, which we extract by simply downsampling points with plane attributes; uneven planes, for example, roads that include many slopes, limit the correct estimation of the height. Nevertheless, as Table 2 (01) shows, S-ALOAM achieves more accurate estimates of the height as well as of the other translation components, which is attributed to the semantic constraints and the two-step dynamic object removal strategy. Note that Figure 8c suggests that LeGO-LOAM achieves the most accurate height estimation; however, Figure 8a shows that LeGO-LOAM fails in this challenging environment.

We also evaluate the proposed method in a mixed scene including an urban area and the country. Figure 9a shows that S-ALOAM is slightly worse than A-LOAM in terms of the 2D trajectory, mainly because the three degrees of freedom [x, y, θ_yaw] are primarily estimated from the edge features. We can see from Figure 10b that the environment contains plenty of vegetation, which is extracted as edge features in S-ALOAM. This may lead to wrong data associations when computing the point-to-edge distance. However, we have to keep this vegetation for edge point extraction in the country and on the highway, where few geometric structures exist (see Figure 10b). Figure 9c shows that S-ALOAM achieves more accurate height estimates.
Overall, our approach is only slightly inferior to A-LOAM and outperforms LeGO-LOAM and FLOAM (see Table 2 (08)).
The absolute errors of the remaining sequences (02, 04, 07) are also given in Table 2. Table 3 shows the relative errors of all sequences, evaluated using evo [53]. The relative error evaluates the local accuracy (drift) of the trajectory. We can see from Tables 2 and 3 that in most sequences, S-ALOAM is superior to A-LOAM and FLOAM in the translation estimation. As for the rotation, S-ALOAM is only slightly inferior to A-LOAM. Next, some necessary discussions are presented based on the above results.

Discussion
Overall, S-ALOAM often achieves better local accuracy (Table 3) and global consistency (Table 2) in terms of translational errors, while being slightly worse than A-LOAM in rotational errors. The overall performance of A-LOAM is higher than that of FLOAM and LeGO-LOAM. The accuracy of LeGO-LOAM is worse than that of the other three methods, which is caused by its inaccurate ground extraction.
In addition, we found that height estimation is one of the main error sources in the LOAM algorithm and its variants. This is especially obvious in sequences 00 and 01. For example, we can see from Figure 8a that the 2D trajectory of S-ALOAM coincides well with the ground truth, while Figure 8b,c show similar error curves. These demonstrate that the height error accounts for a large proportion of the total absolute translation error. The main factor causing a large height error is uneven ground: the planar features provide the necessary constraints for height estimation, and slopes in the environment may cause height estimation errors. If these small local errors are not corrected, they continue to accumulate and eventually cause large height errors.
In terms of environment type, all methods perform poorly on the highway (01). Even for S-ALOAM, which performs better than the other methods, the absolute translation error reaches up to 106.2324 m. This is because there are few geometric structures and many fast-moving dynamic objects in this challenging environment. Meanwhile, the high dynamics of the carrier itself is also one of the reasons for the larger errors. As the previous analysis shows, the error in this scenario is mainly caused by the height. Even then, S-ALOAM still performs very well in the 2D translation estimates (cf. Figure 8a). Moreover, S-ALOAM also corrects the height error to a certain extent, which benefits from our semantic label-based edge feature extraction method and dynamic object removal strategy. In general, compared with A-LOAM, S-ALOAM reduces the absolute translation error from 114.8519 to 106.2324 m, and the relative translation error from 1.0312 to 0.9621 m (cf. Tables 2 and 3). As for the rotation error, S-ALOAM is similar to A-LOAM; for example, the absolute rotation errors of A-LOAM and S-ALOAM are 7.0807 and 7.0901, respectively.
Another challenging scene is the urban/rural hybrid scene (sequences 02 and 08). The performance of A-LOAM is slightly better than that of S-ALOAM in sequence 08; for example, its absolute position error is 17.1512 m, while that of S-ALOAM is 17.9965 m (see 08 in Table 2). However, its error reaches up to 157.2780 m in sequence 02 (see 02 in Table 2), which shows that A-LOAM is less robust in mixed scenarios. On the other hand, in terms of the absolute error, FLOAM is better than the other methods in sequence 02 (02 in Table 2), but its performance is the worst in sequence 08 (see 08 in Table 2). In contrast, S-ALOAM maintains stable performance, although its absolute position error still reaches tens of meters, which is second only to the highway scene (01).
S-ALOAM works well in urban scenes. Especially for sequence 07, the absolute translation error is decreased from 1.0154 to 0.8988 m (see sequence 07 in Table 2), and the relative translation error is only 0.5080 m (see sequence 07 in Table 3). The reason is that these environments contain rich structural information. By contrast, the absolute translation error of sequence 00 is 6.2357 m, which is larger than that of sequence 07 (see sequences 00 and 07 in Table 2). The errors in sequence 00 mainly come from the long path and the many loop closures (see Figure 6a(1),a(2)), which can cause accumulated errors; however, neither A-LOAM nor S-ALOAM has loop closure constraints. Even then, S-ALOAM outperforms A-LOAM and FLOAM (cf. Figure 6a-c).
For country scenes (sequence 04), the performance of S-ALOAM is slightly worse than that of the other methods (see Tables 2 and 3). The reason could be that the scene contains a lot of vegetation, such as grass and rushes, which can lead to false feature extraction. In addition, we also show two feature maps, from sequences 00 and 07, in Figure 11, where the color changes with elevation.

Conclusions
In this paper, we presented a novel semantic-assisted LiDAR odometry and mapping system for accurate pose estimation and mapping in large-scale outdoor environments. The system is composed of a point cloud pre-processing module, a feature extraction module, a LiDAR odometry module and a LiDAR mapping module. A two-step dynamic object filtering method is presented in the pre-processing and LiDAR mapping modules, respectively. In the feature extraction module, we use point-wise semantic labels instead of the smoothness of the local surface to extract edge and planar features.
Another key feature we proposed is the semantic constraint added to the LiDAR odometry and mapping modules, which ensures that the algorithm can accurately and efficiently search for corresponding points.
Qualitative and quantitative experiments in many challenging environments demonstrate that the proposed method achieves similar or better accuracy compared to state-of-the-art methods. Specifically, the proposed algorithm consistently outperforms LeGO-LOAM and is inferior to FLOAM only in sequence 02. Compared with A-LOAM, our approach has higher accuracy in the translation estimation, while in the rotation estimation, S-ALOAM is only slightly inferior to A-LOAM. This shows that integrating semantic information into a LiDAR odometry and mapping system can achieve more accurate translation estimates than purely geometric methods.
Despite these encouraging results, some limitations remain. Our method achieves better accuracy than the other methods in the highway scene; however, the magnitude of the error is still large. This is caused by the challenging environment: uneven ground, few structural features, many dynamic vehicles and the high dynamics of the carrier itself. Therefore, the error has not been significantly corrected. These issues can be addressed in future work. First, we will consider adding loop-closure constraints to correct the drift. Furthermore, since our method suffers from large height errors, we plan to add a ground constraint or use an IMU to correct them.