Study of the Effect of Exploiting 3D Semantic Segmentation in LiDAR Odometry

Abstract: This paper presents a study of how the performance of LiDAR odometry is affected by the preprocessing of the point cloud through the use of 3D semantic segmentation. The study analyzed the estimated trajectories when the semantic information is exploited to filter the original raw data. Different filtering configurations were tested: raw (original point cloud), dynamic (dynamic obstacles are removed from the point cloud), dynamic vehicles (vehicles are removed), far (distant points are removed), ground (the points belonging to the ground are removed) and structure (only structures and objects are kept in the point cloud). The experiments were performed using the KITTI and SemanticKITTI datasets, which feature different scenarios that allowed identifying the implications and relevance of each element of the environment in LiDAR odometry algorithms. The conclusions obtained from this work are of special relevance for improving the efficiency of LiDAR odometry algorithms in all kinds of scenarios.


Introduction
Localization is a critical function in self-driving vehicles. The materialization of autonomous navigation necessarily entails determining the position and orientation of the ego-car with accuracy requirements that reach the sub-lane-level and, therefore, go well beyond the capabilities of wheel encoders or even GNSS systems. In this context, modern odometry techniques based on exteroceptive data, such as images or range measures, are widely used in applications of this kind.
Different modalities are employed in automotive sensor setups to retrieve information about the surroundings of the vehicle. LiDAR sensors have been growing in popularity lately due to the compelling set of features inherent to this technology, which includes a broad field of view (usually spanning 360°), robustness against some weather conditions, and an excellent distance measurement accuracy. It is precisely this last point, together with an acceptable resolution, that makes LiDAR an ideal source of data not only for environment perception [1], but also for ego-motion estimation.
LiDAR readings consist of a set of 3D points, i.e., a point cloud, each representing the measurement from an individual laser beam reflection. Provided that the point density is high enough, as is the case with modern high-end devices, these data present a wealth of features that can be used to perform registration between consecutive time steps, which is the procedure that forms the basis of most odometry approaches. These characteristics are also helpful for the broader goal of Simultaneous Localization and Mapping (SLAM), in which ego-motion is frequently embedded.
However, not all of these features are equally desirable for the ego-motion estimation goal. Traffic environments are characterized, on the one hand, by an abundance of dynamic objects that can introduce spurious correspondences into the procedure and, on the other hand, by a lack of structure that makes it difficult to foresee the presence and amount of the predictably helpful foreground static objects.
The main objective of this paper is to provide a systematic analysis of the sensitivity of LiDAR-based SLAM methods to the information conveyed by input data. Semantic knowledge about the points in the LiDAR cloud is introduced to filter the data before feeding them to the method, as shown in Figure 1. To that end, we use semantic labels from a point-wise annotated dataset, SemanticKITTI [2], thus decoupling the effect of the segmentation accuracy. We believe that this kind of analysis will provide a deeper understanding of how the points from the environment affect the LiDAR point cloud matching, and thus be useful in the design and development of new algorithms. Consequently, our main contribution is a thorough insight into how the performance of LiDAR odometry is affected by the preprocessing of raw 3D data. The analysis deliberately focuses on a state-of-the-art method, LiDAR Odometry and Mapping (LOAM) [3], which is a top-performing approach that uses only LiDAR information and has paved the way for many other approaches in the literature. Since most LiDAR odometry methods rely on the same universal principle, the conclusions of this study are widely applicable.

Related Work
LiDAR odometry aims to estimate the position and orientation of a movable platform by tracking the 3D points obtained by an onboard scanner. Usually, the odometry procedure is carried out by means of an incremental approach that computes the relative transform between the sensor poses at two different time steps; then, the absolute pose can be retrieved as the accumulation of all these transforms [4].
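The incremental scheme described above can be illustrated with a minimal sketch. For brevity it assumes planar poses (x, y, yaw) rather than the full 3D poses used in LiDAR odometry, and the helper names `compose` and `accumulate` are hypothetical, introduced only for this example.

```python
import math

def compose(pose, delta):
    """Apply a relative motion, expressed in the frame of `pose`, to `pose`."""
    x, y, yaw = pose
    dx, dy, dyaw = delta
    return (
        x + dx * math.cos(yaw) - dy * math.sin(yaw),
        y + dx * math.sin(yaw) + dy * math.cos(yaw),
        yaw + dyaw,
    )

def accumulate(deltas, start=(0.0, 0.0, 0.0)):
    """Chain relative transforms into a list of absolute poses."""
    poses = [start]
    for delta in deltas:
        poses.append(compose(poses[-1], delta))
    return poses

# Four steps of "1 m forward, then turn 90 degrees left" trace a closed square.
deltas = [(1.0, 0.0, math.pi / 2)] * 4
trajectory = accumulate(deltas)
for pose in trajectory:
    print(tuple(round(v, 3) for v in pose))
```

Because every absolute pose is built from all previous relative estimates, any error in a single `delta` propagates to every later pose, which is the origin of the drift discussed in this section.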
This paradigm necessarily requires matching consecutive point clouds. 3D data from modern multi-layer devices offer a multitude of features that can be used to establish correspondences. In turn, this also involves increased computational requirements to perform the matching. Some approaches have been proposed to mitigate the data processing burden with strategies that include projecting the LiDAR scan to a ground plane [5] or using a siamese network able to learn dimension-reduced representations [6].
The incremental nature of odometry implies that the method is prone to drift, caused by the accumulation of errors during the complete trajectory. For this reason, odometry is often included within a complete SLAM procedure able to estimate both the location and the map of the environment jointly. In that way, the algorithm is able to detect when the current location corresponds to a previously visited area, reducing the drift in the trajectory. One of the most popular LiDAR-based SLAM approaches nowadays is LOAM [3], which splits the SLAM problem into two concurrent procedures: high-frequency odometry and low-frequency mapping. LOAM, which makes use of edge and planar features selected according to a roughness estimate, has proven extremely effective in terms of localization, as shown by some public benchmarks such as KITTI [7]. More recently, IMLS-SLAM [8] reached similar levels of accuracy by using the Implicit Moving Least Squares surface representation.
The inclusion of additional data, such as images [9] or IMU [10] measures, has been often exploited to increase the performance of LiDAR SLAM. However, one of the most promising lines of research in this topic is the thoughtful preprocessing of data fed to the algorithms. For instance, LeGO-LOAM [11] is a lightweight version of LOAM that filters out noise and takes advantage of the ground plane. Another group of methods focuses on leveraging semantic information about the 3D points, such as SuMa++ [12], which embeds these labels into a surfel-based map representation, or SLOAM [13], oriented towards forest environments.
With the advent of deep learning, an increasing number of approaches have made use of convolutional neural networks to obtain features using learned extractors [14]. Recurrent neural networks, such as Long Short-Term Memory (LSTM), have also been employed to model connections among consecutive scans [15]. Despite this, the potential of deep learning approaches to significantly outperform geometrical alternatives is still to be proven.

Methodology
This section presents the evaluation procedure, including the dataset selection, the proposed input point cloud filtering configurations, the LiDAR odometry algorithm selected for the study and the metrics considered for the evaluation.

Data
Since it is one of the most popular datasets in autonomous vehicles research, the KITTI dataset [7] is the data source selected for this study. More specifically, the odometry benchmark was used, providing input LiDAR point clouds from a Velodyne HDL-64E sensor and the ego-vehicle ground-truth poses. In addition to its popularity, another reason to select the KITTI dataset is the recently published SemanticKITTI dataset [2]. This dataset is based on the original odometry sequences from KITTI and provides point-wise 3D semantic information directly annotated in the point clouds. The annotations include label categories such as vegetation, ground and traffic signs, among many others. Additionally, the semantic labels distinguish between static and dynamic road agents, i.e., vehicles, pedestrians and cyclists.
In this study, the semantic information provided by the SemanticKITTI dataset was exploited to filter the original point clouds from the KITTI odometry benchmark.

Input Configurations
To study the behavior of LiDAR odometry, six different input configurations are proposed. In each configuration, a new set of filtered data is generated, which is later fed to the odometry algorithm for evaluation. The different experimental configurations and their motivation are presented below:
(A) Raw: Raw original point cloud from the KITTI dataset. This configuration serves as the baseline for the comparison.
(B) Dynamic: Remove all dynamic objects from the point cloud, i.e., vehicles and pedestrians. As stated in other works (e.g., [8]), dynamic objects are a source of spurious correspondences in the matching, leading to erroneous estimations.
(C) Dynamic Vehicles: Remove only dynamic vehicles from the point cloud. The only difference between this configuration and the previous one is that pedestrians are not removed from the point cloud. The hypothesis that motivates this configuration is that, due to their smaller size and lower velocity, the contribution of pedestrians to the error may be negligible.
(D) Far: Remove all points that are far from the vehicle. All points p with a distance |p| > 30 m are removed from the point cloud. This configuration is motivated by the fact that LiDAR point clouds are dense and rich in detail at close range, but become sparser with distance; thus, far points may introduce more noise into the system.
(E) Ground: Remove the ground points. Removing the ground from the point cloud may have some advantages, but also some disadvantages. On the one hand, the rings of the LiDAR sensor that lie on the ground (assuming it is planar) will always have the same appearance. This should lead to a misestimation of the translation component, because all those points will have the same coordinates in different frames, even if the vehicle is moving. On the other hand, ground points may be helpful for the rotation component, particularly for the pitch and roll angles.
(F) Structures: Leave only points from structures, such as buildings, and objects, such as poles and traffic signs. This test case is proposed to analyze what would happen if only the truly static and structured objects from the scene are kept. Everything else is removed from the point cloud, including static vehicles, vegetation and the ground.
These new sequences generated after filtering the original point cloud are used as the input of the LiDAR odometry algorithm. As a result of this filtering, the total size of the point cloud is reduced. The total percentage of downsizing from the raw point cloud is visible in Figure 2, where values are averaged over all the test sequences.
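A minimal sketch of how such filters could be applied is given below. It assumes each point carries a class name in the style of the SemanticKITTI categories; the grouping of classes into the sets `MOVING_VEHICLES`, `MOVING_PEOPLE`, `GROUND` and `STRUCTURES`, as well as the function names, are assumptions made for this illustration and are not the exact class lists used in the study.

```python
import math

# Assumed class groupings (illustrative only).
MOVING_VEHICLES = {"moving-car", "moving-truck", "moving-bus", "moving-other-vehicle"}
MOVING_PEOPLE = {"moving-person"}
GROUND = {"road", "parking", "sidewalk", "other-ground"}
STRUCTURES = {"building", "fence", "other-structure", "pole", "traffic-sign"}
FAR_THRESHOLD_M = 30.0  # distance threshold of the Far configuration

def keep_point(point, label, config):
    """Decide whether a labeled point survives the given configuration."""
    if config == "raw":
        return True
    if config == "dynamic":
        return label not in MOVING_VEHICLES | MOVING_PEOPLE
    if config == "dynamic_vehicles":
        return label not in MOVING_VEHICLES
    if config == "far":
        x, y, z = point
        return math.sqrt(x * x + y * y + z * z) <= FAR_THRESHOLD_M
    if config == "ground":
        return label not in GROUND
    if config == "structures":
        return label in STRUCTURES
    raise ValueError(f"unknown configuration: {config}")

def filter_cloud(points, labels, config):
    """Return the points and labels kept by the configuration."""
    kept = [(p, l) for p, l in zip(points, labels) if keep_point(p, l, config)]
    return [p for p, _ in kept], [l for _, l in kept]

# Tiny hand-made cloud in the sensor frame (coordinates in meters).
points = [(5.0, 0.0, -1.7), (10.0, 2.0, 0.0), (3.0, -1.0, 0.0),
          (40.0, 5.0, 2.0), (8.0, 4.0, 1.0), (12.0, -3.0, 1.0)]
labels = ["road", "moving-car", "moving-person",
          "building", "vegetation", "traffic-sign"]

for config in ("raw", "dynamic", "dynamic_vehicles", "far", "ground", "structures"):
    kept_points, _ = filter_cloud(points, labels, config)
    print(config, len(kept_points))
```

Keeping the decision in a single per-point predicate makes it easy to add or combine configurations without touching the loading or odometry code.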

Odometry Algorithm
Regarding the LiDAR odometry algorithm used in this study, we chose to evaluate LOAM [3]. The reasons to use LOAM are threefold. Firstly, it is a well-known method that has been constantly used and cited since its publication. Secondly, it ranks as the top method in the KITTI odometry leaderboard in both translation and rotation. Finally, there is an open-source implementation available [16].
The LOAM algorithm divides the SLAM problem into odometry generation and map registration. These two processes run in parallel at different rates. The odometry generation algorithm runs at 10 Hz and computes a low-fidelity motion estimate. This estimate is passed to the registration algorithm, which runs at 1 Hz and refines the obtained odometry while building the map. To perform these computations with the point clouds, the authors proposed to extract two sets of features based on the smoothness of the points: so-called edge and planar features. Afterwards, these sets of features are processed separately, and a matching algorithm is applied to find the relative transformation between the extracted features and the map of the environment. This principle is applied by most SLAM algorithms; therefore, the collected results can be extrapolated to other LiDAR-based methods.
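The smoothness-based feature selection can be sketched as follows. The roughness value follows the definition commonly attributed to LOAM: the norm of the summed differences between a point and its neighbours on the same scan line, normalized by the number of neighbours and the range of the point. The window size, the toy scan line and the function name `smoothness` are assumptions made for this example.

```python
import math

def norm(v):
    """Euclidean norm of a 3D vector."""
    return math.sqrt(sum(c * c for c in v))

def smoothness(scan_line, i, half_window=2):
    """Roughness of point i computed from its neighbours on the same scan line."""
    xi = scan_line[i]
    neighbours = [scan_line[j]
                  for j in range(i - half_window, i + half_window + 1) if j != i]
    diff = [sum(xi[k] - xj[k] for xj in neighbours) for k in range(3)]
    return norm(diff) / (len(neighbours) * norm(xi))

# Toy scan line: a wall at x = 5 m that turns a corner at (5, 0, 0).
scan_line = [(5.0, -4.0, 0.0), (5.0, -3.0, 0.0), (5.0, -2.0, 0.0),
             (5.0, -1.0, 0.0), (5.0, 0.0, 0.0), (4.0, 0.0, 0.0),
             (3.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

half_window = 2
values = {i: smoothness(scan_line, i, half_window)
          for i in range(half_window, len(scan_line) - half_window)}

# The roughest point is an edge candidate; near-zero values indicate planar points.
edge_index = max(values, key=values.get)
print(edge_index, round(values[edge_index], 3))
```

In this toy example the corner point obtains the largest value, while points in the middle of a straight wall segment obtain a value of zero, which matches the intuition behind edge and planar features.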

Evaluation Metrics
After obtaining the output odometry data for each input configuration, these estimated poses are compared with the ground-truth poses. To simplify the evaluation process, an open-source tool, evo [17], was used. This tool allows performing an insightful analysis of the generated odometry data while also providing several metrics and plotting functions.
Choosing a suitable evaluation metric has always been a subject of study. Since most LiDAR odometry algorithms are actually SLAM algorithms, we adopted two common metrics used to evaluate the estimated poses of SLAM methods. On the one hand, the Absolute Pose Error (APE) measures the absolute pose differences directly. This is computed in global coordinates and provides meaningful information about the consistency of the SLAM method because the error increases as the estimation gets farther from the reference trajectory. On the other hand, the APE metric is sensitive to errors at the beginning of the trajectory, because those early errors affect the whole estimation. For that reason, in [18], the Relative Pose Error (RPE) is proposed. This metric does not compare the poses directly; instead, it compares the measurement deltas between each pose. Therefore, RPE is more suitable for evaluating the drift of the algorithm.
Let $x_i$ and $\hat{x}_i$ be the reference and estimated poses of the vehicle trajectory, respectively, where $i \in \{1, \dots, T\}$ is the time step of each pose. Considering the inverse of the compositional operator [19], $\delta_{i,j}$ is the relative transformation from pose $x_i$ to $x_j$. Accordingly, $\hat{\delta}_{i,j}$ is defined as the estimated transformation from $\hat{x}_i$ to $\hat{x}_j$. Given this, it is possible to formulate the APE and RPE metrics with (1) and (2).
Note that, although all poses in the trajectory are considered for computing the APE metric, this is not the case for RPE. In this case, the set of pairs {i, j} is obtained from all contiguous sub-sequences of a specific length inside the trajectory, analogously to the KITTI evaluation.
Afterwards, each of the APE and RPE errors is decomposed into its translation and rotation components using (3) and (4), where $\angle[\cdot]$ is the rotation angle [7].
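The expressions referenced as (1)-(4) are not reproduced in the text above. The block below gives one common formulation that is consistent with the definitions in this section; it should be read as a sketch rather than as the exact expressions of the original equations. Here $\ominus$ denotes the inverse of the compositional operator, $E$ stands for either an APE or an RPE error transformation, and $\operatorname{trans}(\cdot)$ and $\operatorname{rot}(\cdot)$ are names introduced here for the translation vector and the rotation matrix of a transformation.

```latex
\begin{align}
\mathrm{APE}_i &= \hat{x}_i \ominus x_i, \\
\mathrm{RPE}_{i,j} &= \hat{\delta}_{i,j} \ominus \delta_{i,j},
  \qquad \delta_{i,j} = x_j \ominus x_i,
  \quad \hat{\delta}_{i,j} = \hat{x}_j \ominus \hat{x}_i, \\
E_{\mathrm{trans}} &= \left\lVert \operatorname{trans}(E) \right\rVert_2, \\
E_{\mathrm{rot}} &= \left\lvert \angle\!\left[\operatorname{rot}(E)\right] \right\rvert,
  \qquad \angle[R] = \arccos\!\left(\frac{\operatorname{tr}(R) - 1}{2}\right).
\end{align}
```

Under this formulation, a perfect estimate yields the identity transformation for both errors, so the translation and rotation components are zero.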

Results and Discussion
In consideration of the extensive amount of data collected from the multiple experiments carried out, this section aims to synthesize the obtained results to provide substantial conclusions. Furthermore, the key points extracted from this analysis are presented and discussed.

Results
Tables 1 and 2 show the results obtained for the APE and RPE metrics, respectively, for each odometry sequence in the KITTI dataset when applying the proposed configurations (A-F) defined in Section 3.2. Both APE and RPE metrics are separated into translation (in meters) and rotation (in degrees). The RPE values are obtained from all sub-sequences of length 100 m in the trajectory. In the case of APE, the metric was computed for each pose, while in RPE it was computed for each pair of poses in each sub-sequence. For this reason, both tables show only the mean and standard deviation values, with the latter in parentheses. Finally, the best-scoring configuration is highlighted in bold text for each sequence and metric.
Additionally, Table 3 presents the best performing configurations, for each sequence and metric.
Because of the nature of the APE metric, the values in Table 1 depend on the length of the sequence: a longer sequence will have a larger error, due to the drift associated with local odometry algorithms. Thus, it is not straightforward to combine the results obtained across all KITTI sequences. Nonetheless, since the RPE metric computes the error for each pair of poses in each sub-sequence of 100 m, it allows a better analysis of the combined results of each configuration. As shown in Table 2, the RPE values are smaller, as this metric measures the error along each sub-sequence of 100 m. Consequently, Figure 3 contains a box plot of the data obtained with the RPE metric. To generate this chart, large outliers were removed so that the bulk of the data can be visualized clearly. In the figure, it can be observed that while Configurations A-C provide similar results (as shown in Tables 1 and 2), the differences among the rest are more significant.

[Figure 3 panels: Translation (m); Rotation (deg)]
In addition to the odometry evaluation metrics, it is also of great relevance to analyze the processing time of each configuration, especially because the size of the input point clouds is reduced, which should contribute to a decrease in the processing time. Since the LOAM algorithm is implemented as several processes running in different threads, we aggregated the average time of each individual process. For this reason, please note that the actual processing time is much lower, because most of the processes that were aggregated are, in fact, executed in parallel. Figure 4 shows the obtained processing times, which closely resemble the point cloud sizes presented in Figure 2.

Discussion
Several observations can be drawn from a detailed study of the experimental results.
First, the number of points removed in the filtering step is an important factor in the odometry. As presented in Figure 2, the downsizing in configurations where dynamic vehicles are removed is very small, and the resulting size is around 99% of the original size. This is the reason the errors obtained from these configurations are very similar to the baseline. On the other hand, configurations with a heavy filtering step (Ground and Structures) show more variation in the results. This is the case when only structures are kept and the point cloud is reduced to as little as 10% of its original size. These configurations are very sensitive to the environment; therefore, environments with no structures will have several consecutive frames where the odometry is completely lost. Nonetheless, these configurations also show some cases where their performance is far superior to the baseline. A possible explanation for this is that only the most significant points are kept, while most of the noise sources are removed from the input. As a final remark, smaller point clouds mean fewer points to be processed, which reduces the computational time and allows a higher output rate, as shown by the results presented in Figure 4.
Although the disparity with the baseline is small in both downsizing and odometry error, the obtained results show a slight improvement for configurations where dynamic objects are removed. Additionally, the configuration where all dynamic objects are filtered also has better results than Configuration C, where only the dynamic vehicles are removed. This suggests that pedestrians are also a source of noise in the odometry algorithm. As explained above, this result was expected and is consistent with our initial hypothesis. Nevertheless, because of the almost imperceptible change from the raw point cloud, the slight improvement observed may not be substantial enough to support a solid conclusion. A different dataset with a more crowded environment and a higher number of dynamic objects would be more suitable to consolidate this assertion.
Regarding Configuration D, which filters out all points with a distance from the sensor greater than 30 m, the obtained results quickly lead to a straightforward verdict: Even if the information in the point cloud is sparse and far from the vehicle, it is still useful for the odometry. This configuration consistently presents worse results than the baseline, which could be explained by the fact that having far references in the matching may prove helpful for the rotation computation.
Another interesting observation in the study is the variability in the results from the last configuration, where only the structures and objects are kept in the point cloud. These results are generally poor, with some noteworthy exceptions. On the one hand, because of the considerable reduction of the cloud size, the odometry algorithm encounters many areas with almost no points. This happens on roads without buildings around and with a limited number of traffic signs that could be used as references. On the other hand, this configuration works remarkably well in the highway sequence (Sequence 01). This case is of special relevance, because highway environments are the most challenging for SLAM algorithms, due to the small number of structures, the high speed of the ego-vehicle and the movement of all other vehicles. After analyzing the original point cloud, we noticed the presence of guardrails, which are labeled as "fence" and are considered structures. With only the help of these structures and traffic signs, the odometry algorithm is capable of outperforming all other configurations. Furthermore, it also appears that the points removed in this test case are prone to introducing noise into the system, as is the case for highly dynamic objects and road points, and therefore they may worsen the estimation in other configurations. Additionally, this configuration also shows adequate results in Sequence 05. This sequence also includes low walls close to the road, much like the guardrails on the highway. We believe that having these close structures benefits the matching. Nevertheless, the good results are obtained only in very specific cases, while most of the time the algorithm gets lost because of the lack of information caused by the heavy downsizing.
Finally, Figure 3 and Tables 1 and 2 clearly show that removing the ground points prior to computing the odometry generally produces better results. Among the 44 test cases (including all metrics and all sequences), this configuration gives the best results in 24 of them. The only sequence that presents a problem for this configuration is Sequence 01, because of the challenging highway environment. The negative effect of ground points on the odometry algorithm is due to the uniformity of the road. When the LiDAR laser beams hit the road, they create rings of points. On a flat road, these rings will always have the same radius, which depends on the height and the characteristics of the sensor. Consequently, the points that are created always have the same coordinates, even if the vehicle is actually moving. This effect is illustrated in simplified form in Figure 5, where it can be observed how the distance to the object reflects the translation of the vehicle, while the distances to the points lying on the ground remain static. This absence of movement in the points may mislead the odometry algorithm into believing that the car is not translating. However, as shown in Figure 3, removing the ground points does affect the rotation error, greatly increasing the variance, while also slightly increasing the median. This leads to the conclusion that the ground points have a negative effect on the translation component, while still being beneficial for the rotation angle. Even so, this configuration is the one that produces the best results overall, as presented in Tables 1 and 2.
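The ring argument above can be checked with a few lines of geometry. The sketch below assumes a perfectly flat road, a sensor mounted 1.73 m above it, three hypothetical downward beam angles and a hypothetical static pole ahead of the vehicle; all names and values are illustrative assumptions.

```python
import math

SENSOR_HEIGHT_M = 1.73               # assumed sensor height above a flat road
BEAM_ANGLES_DEG = (5.0, 10.0, 20.0)  # hypothetical downward beam angles
POLE_X_M = 25.0                      # hypothetical static pole, world frame

def ground_ring_radius(depression_deg, height=SENSOR_HEIGHT_M):
    """Horizontal distance at which a downward-tilted beam hits flat ground."""
    return height / math.tan(math.radians(depression_deg))

def sensor_frame_measurements(vehicle_x):
    """Ring radii and pole range as seen from the sensor at position vehicle_x."""
    # The ring radii depend only on the beam geometry, not on the vehicle position.
    rings = [ground_ring_radius(a) for a in BEAM_ANGLES_DEG]
    pole_range = POLE_X_M - vehicle_x
    return rings, pole_range

before = sensor_frame_measurements(0.0)
after = sensor_frame_measurements(2.0)
print("rings unchanged:", before[0] == after[0])
print("pole range change (m):", before[1] - after[1])
```

The ground rings are identical in both frames, whereas the range to the pole shrinks by exactly the distance travelled, which is why the static object carries translation information and the flat ground does not.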

Conclusions
An in-depth analysis of the use of 3D semantic knowledge to improve the performance of LiDAR odometry was carried out. The experimental results confirm the effectiveness of preprocessing the raw data before feeding it to the localization algorithm. In particular, our study showed the benefit of filtering out points belonging to dynamic objects (including pedestrians) and to the ground, whereas far-off points proved useful to establish meaningful correspondences. Furthermore, it was also shown that the type of driving environment has a decisive impact on the suitability of each filtering approach. Hence, in some specific scenarios (e.g., highways), the results show that it is possible to drastically downsample the input data, thus lowering the computational requirements while improving the performance of the algorithm. This improvement is produced by reducing the occurrence of spurious correspondences in the matching process.
The analysis was performed on a very representative LiDAR SLAM approach to avoid loss of generality; nonetheless, future work will focus on the analysis of different alternatives to further confirm the generality of the conclusions drawn from the study. Likewise, although the KITTI dataset presents a decent variety of environments, the use of additional datasets complementing its properties (e.g., featuring heavily crowded environments) is also being considered in order to strengthen some of the outcomes from the analysis. However, beyond the particularities of the study, this work serves to demonstrate that the benefits of exploiting 3D semantic information to preprocess raw LiDAR data well justify further research on this topic.

Conflicts of Interest:
The authors declare no conflict of interest.