A SLAM System with Direct Velocity Estimation for Mechanical and Solid-State LiDARs

: Simultaneous localization and mapping (SLAM) is essential for intelligent robots operating in unknown environments. However, existing algorithms are typically developed for speciﬁc types of solid-state LiDARs, leading to weak feature representation abilities for new sensors. Moreover, LiDAR-based SLAM methods are limited by distortions caused by LiDAR ego motion. To address the above issues, this paper presents a versatile and velocity-aware LiDAR-based odometry and mapping (VLOM) system. A spherical projection-based feature extraction module is utilized to process the raw point cloud generated by various LiDARs, hence avoiding the time-consuming adaptation of various irregular scan patterns. The extracted features are grouped into higher-level clusters to ﬁlter out smaller objects and reduce false matching during feature association. Furthermore, bundle adjustment is adopted to jointly estimate the poses and velocities for multiple scans, effectively improving the velocity estimation accuracy and compensating for point cloud distortions. Experiments on publicly available datasets demonstrate the superiority of VLOM over other state-of-the-art LiDAR-based SLAM systems in terms of accuracy and robustness. Additionally, the satisfactory performance of VLOM on RS-LiDAR-M1, a newly released solid-state LiDAR, shows its applicability to a wide range of LiDARs.


Introduction
Simultaneous localization and mapping (SLAM) is important for a wide range of robotic and industrial applications. SLAM is critical for various artificial intelligence applications, such as autonomous driving, AR/VR, and UAVs. SLAM technology is typically implemented by vision sensors [1,2], LiDAR [3], or both [4]. However, vision sensors have issues with scale ambiguity and illumination variations, and they require an environment with a rich texture, resulting in unstable performance in many cases. In contrast, LiDAR is not affected by the above issues and can provide consistently accurate measurements across a wide range of situations since it is an active sensor. Therefore, the LiDAR-based SLAM approach has become increasingly popular in recent years.
The core of a LiDAR-based SLAM system is scan matching. The goal of scan matching is to determine the relative transformation between two 3D point clouds. The iterative closest point (ICP) algorithm [5] is likely the most well-known scan matching method. The ICP algorithm determines the relative transformation by iteratively minimizing the point-topoint distance between the source and the target point clouds. However, the ICP approach works best when there is a unique one-to-one correspondence between the source and the target point clouds, which is rarely the case for real-world datasets. As a result, hundreds of variants have been studied [6]. In [7], the point-to-plane distance was used as an error metric instead of the point-to-point distance, which only assumes that the point clouds to be aligned are sampled from the same surface. This method generally provides a better result than the traditional point-to-point ICP approach. However, it is computationally intensive to determine the intersection point on the corresponding surface. Thus, a faster version of point-to-plane ICP was introduced [8], which used a point-to-projection technique to implement a faster point search. In [9], a generalized ICP (GICP) framework that considers both point-to-point and point-to-plane distances was proposed. In addition, in [10], a scanto-model framework known as IMLS-SLAM was proposed, which includes an environment model with previously localized LiDAR sweeps and a developed implicit moving least squares (IMLS) surface representation. IMLS-SLAM achieved impressive results on the KITTI vision benchmark [11]. However, IMLS-SLAM cannot operate in real time due to the considerable overhead of the plane normal computations. Since real-time performance is crucial for most ICP approaches, many ICP-based works [12][13][14] have employed parallel computations to improve system efficiency. In these approaches, explicit correspondences need to be established several times during the iterative process, which takes a considerable amount of time. In [15], the normal distributions transform (NDT), which does not require that explicit correspondences be found, was introduced and used for scan matching.
However, ICP variants do not attempt to remove outliers and dynamic objects, which are common in practice, reducing the accuracy. Therefore, feature-based methods are often used. In [16], a feature-based SLAM algorithm with low deviation and computational complexity known as LOAM was developed. In this algorithm, edge and planar features are extracted from raw point clouds according to the smoothness of the points in different laser beams. During feature extraction, a large number of redundant points are removed, which greatly improves the computational efficiency. The optimization process of LOAM is divided into two modules. On the front end, a scan-to-scan optimization process provides a high-frequency odometry output. On the back end, a scan-to-map optimization process is used to improve global consistency. The LOAM framework is effective and accurate, and it has served as the foundation for various works. For example, ref. Shan and Englot [17] proposed a ground-optimized version of LOAM that treats the ground as a separate feature [18]. They also proposed a two-stage pose optimization method to accelerate the computational process and better use the ground segmentation results. Furthermore, in [19], a truncated least squares method was used to reduce the impact of noise and mismatches.
Traditional methods use only low-level geometric features. Recently, deep learning [20] methods have been introduced, allowing outliers to be rejected and dynamic objects to be filtered out. Deep learning is a powerful tool that can be used to extract more information from point clouds, resulting in more useful optimization features. In [21], the LOAM pipeline was extended by integrating semantic information into the original framework, effectively removing dynamic objects and greatly improving the accuracy of the system. In [22], the Surfel-based mapping (SuMa) [23] pipeline was extended by integrating semantic segmentation information using the FCN RangeNet++ [24]. Similar methods are presented in [25]. However, deep learning methods usually require a significant amount of computational resources, limiting their applicability.
Most LOAM variants use an incremental mapping method to compute the relative pose of the current scan, which inevitably leads to drift. Inspired by the vision-based SLAM method, π-LOAM [26] uses an indoor SLAM framework based on plane adjustment (PA) technology to jointly optimize planes and keyframe poses. In addition, BALM [27] integrates LiDAR bundle adjustment into the LOAM framework as a back-end for optimizing all scan poses simultaneously. Compared to π-LOAM, BALM adds edge features to the global optimization process, increasing the stability of the algorithm and allowing it to be used outdoors.
Motion distortion is a major issue in LiDAR-based SLAM approaches. LiDAR-based SLAM algorithms usually assume that all points in a scan are acquired at the same time, so they have the same position during the acquisition. However, in reality, each LiDAR point is acquired individually. Sampling is done while the LiDAR is moving, which results in each point being sampled from a different position. This problem can be addressed by utilizing point clouds undistorted by linear interpolation between the poses of adjacent scans [28]. Similar approaches have been used in LOAM and its variants [16,17,19]. However, since the poses are estimated based on the raw point clouds, which are distorted, the result may not be sufficiently accurate. The process can be repeated several times to address this issue, which drastically increases the computational cost [29]. Several recent studies have directly optimized the in-scan motion to address this problem. In [29], the computationally inefficient iterative method for distortion compensation was replaced by a noniterative twostep method for distortion compensation. In [30], a first-order Taylor expansion was used to approximate the linear interpolation process and iteratively solve the optimization problem. However, this approach compares the current scan to previous scans; thus, it is insufficient, since it is affected by the estimated motion and may lead to a suboptimal solution.
The development of hardware, which has introduced a new type of LiDAR, solidstate LiDARs, to the consumer market, has introduced another issue. Compared with conventional LiDARs based on mechanical rotors, solid-state LiDARs are less expensive and have significant advantages in terms of weight and performance [31]. Although solid-state LiDARs have several advantages, their novel properties result in challenges for existing LiDAR-based SLAM methods. (1) Solid-state LiDARs typically have a small field of view (FoV) and can capture only a portion of the scene's structure, reducing the number of features that can be extracted. (2) Solid-state LiDARs often have irregular scan patterns, and the scan line itself can be highly curved. Furthermore, the scan patterns of various solid-state LiDARs differ; thus, it is difficult to develop an algorithm that is suitable for all cases. In feature extraction, the neighboring points of each point are needed to determine the type of feature point. However, searching 3D point clouds directly is very time consuming, so researchers usually find the nearest neighbors according to the scan line structure of a LiDAR. As shown in Figure 1a,b, mechanical LiDARs (Velodyne HDL-32E, Ouster OS1-64) have multiple laser-receiver pairs arranged in a vertical array, with all pairs rotating to form a collection of parallel rings. This design greatly simplifies the feature extraction process [32]. However, the scan patterns of solid-state LiDARs such as Livox Horizon and RS-LiDAR-M1 differ, as shown in Figure 1c,d. Therefore, it is not trivial to adapt algorithms developed for rotor-based mechanical LiDARs to solid-state LiDARs. Several new algorithms have been proposed to address these issues [32][33][34]. However, these methods can only be applied to specific types of LiDARs. Thus, this is still an issue for newly developed LiDARs with different scan patterns.
Thus, in this work, a new LiDAR-based SLAM framework based on a novel feature extraction method is proposed to solve these two problems. (1) To process various LiDAR inputs, we convert them into spherical images by spherical projection. Based on the natural neighborhood relationship between the points on the spherical image, we develop a new feature extraction module. The new module uses point-to-plane distance and point-toline distance to accurately determine feature points. (2) For motion distortion, we model the motion distortion as a constant motion process and then compute the pose of each point based on the acquisition time of each point; thus, our model is more realistic, which should have a positive impact on performance. These poses are jointly optimized in the optimization phase. Our method can be applied with a novel sensor, RS-LiDAR-M1, and has the potential to be used with newer solid-state LiDARs in the future, allowing this technique to be rapidly applied to new products. The main contributions of this work are as follows: 1.
The point cloud generated by the various LiDARs is represented as a spherical image by spherical projection for further processing. In contrast to most scan line-based feature extraction methods, our method classifies feature points based on a more precise point-to-plane or point-to-line distance. In addition, the features are grouped into higher-level clusters to prevent inaccurate data associations.

2.
We estimate the velocities while estimating the poses during the odometry stage. With a small bundle adjustment, we can perform a 12-degree-of-freedom estimation for multiple scans simultaneously, allowing us to directly obtain accurate velocities.

3.
We present an efficient method for computing the covariance matrix of the feature points, which significantly reduces the computational complexity of bundle adjustment during the mapping stage.

4.
We test the performance of our algorithm on four types of LiDARs, including two rotor-based mechanical LiDARs and two solid-state LiDARs. The experimental results show that the proposed method can be applied successfully to these LiDARs and that it achieves higher precision than existing state-of-the-art SLAM methods. The remainder of this paper is organized as follows. Section 2 describes the proposed method in detail, including the feature extraction, odometry, and mapping processes. Section 3 presents the experiments on the different datasets. Section 4 discusses the results. Finally, Section 5 presents the conclusions and directions for future work. Figure 2 depicts the workflow of the proposed SLAM system. The system processes raw point clouds from a 3D LiDAR and outputs scan poses and feature maps. The system includes three main components: a versatile feature extraction module, a LiDAR odometry module, and a LiDAR mapping module. The feature extraction module projects the point cloud onto a spherical image and extracts planar and edge features using a patch-based method. A graph-based approach [35] is used to group feature points into meaningful clusters to filter out smaller objects. Then, the remaining feature points are used in the LiDAR odometry module to build a feature map and estimate the pose and velocity. Finally, after the keyframe decision step, the feature point is undistorted based on the pose and velocity estimated during the odometry stage. These undistorted features and estimated poses are fed into the LiDAR mapping module, which uses bundle adjustment with a larger window to produce the final result. In addition, the odometry and mapping modules use the same logic for map maintenance, with only a few parameters differing. Moreover, we developed a novel design in which the feature extraction and odometry modules are implemented in the same thread and work iteratively; thus, we can use the odometry estimation results to improve the feature extraction accuracy. This is particularly useful for scenarios in which a car is traveling at a high speed. The details of each module are presented in the following sections.  Figure 2. Overview of our SLAM system, which has three main parts: a versatile feature extraction module, a LiDAR odometry module, and a LiDAR mapping module.

System Overview
In this work, the point cloud of the k-th scan is represented as P k . The world frame and local frame of P k are denoted as {W} and {L k }, respectively. For convenience, {W} coincides with {L 0 }. Since P k is acquired sequentially with the LiDAR motion, each point in P k is sampled at a different position, leading to ambiguity in the definition of {L k }. Therefore, it is important to note that we refer to the local frame {L k } as the frame defined by the first point in the k-th scan. Moreover, the transformation matrix from {L k } to {W} is denoted as T k ∈ Special Euclidean Group(3) (SE(3)): where R k ∈ Special Orthogonal Group(3) (SO (3)) is a rotation matrix and t k ∈ R 3 is a translation vector. The velocity, which has 6 degrees of freedom, is represented as are the angular velocity and linear velocity, respectively. The angular and linear velocities are defined in the world frame {W}, which differs from most SLAM methods, as there is evidence that globally defined angular errors [36] have better properties. For simplicity, we assume that the angular and linear velocities are not coupled. Therefore, the transformation matrix T j k from any point j ∈ P k to {W} can be formulated as follows: where s j k is the timestamp of point j, s k is the start time of P k , which is equal to s 0 k , and exp(w) is the exponential mapping so(3) → SO (3): where [w] × is the skew matrix of w. In addition, we define log (·) as the inverse map. For simplicity and clarity, we define a new function g that maps point j to its transformation matrix T j k shown in Equation (2):

Versatile Feature Extraction
One of our most important contributions is the proposal of a versatile and efficient feature extraction algorithm based on spherical projection. Spherical projection is a widely used method in the community [21,23,37]. The spherical image is capable of representing point clouds from various LiDAR devices. As far as we know, this is the first attempt to extract planar and edge points based on spherical projection for LiDAR-based SLAM. A detailed analysis of each point based on the point-to-plane or point-to-line metric is performed on the spherical image to find more accurate feature points. Then, we use a customized connected component labeling [35] technique to group the point cloud and filter out more outliers. Finally, the selected planar and edge features are passed to the odometry and mapping module to obtain an estimate of the pose of the current scan.
Conventional methods distinguish feature points using a smoothness metric [16] . The metric simply compares the distance between a point and its surrounding points, which cannot describe a plane or edge well and easily leads to misclassification. To improve the quality of the extracted features, we develop a new method based on the spherical image, in which the feature point is selected based on accurate point-to-line and point-toplane distances.

Motion Prediction and Point Cloud Compensation
Current methods typically use raw point clouds for feature extraction, which implicitly assumes that all points in a scan are acquired in the static state. However, the ego motion of LiDAR violates this assumption; thus, a reduced feature extraction performance is expected. In contrast, we use a coarsely undistorted point cloud for feature extraction, allowing us to better distinguish different features. After the odometry process is completed for the current scan, as discussed in Section 2.3, the accurate motion compensation procedure is repeated. Therefore, our algorithm performs the distortion compensation process twice.
Let s k be the start time of a new scan. The initial estimations of the transformation matrixT k and velocityV k can be predicted immediately from the previous scan with the constant motion model: where T k−1 and V k−1 are the transformation matrix and velocity obtained from the previous scan (see Section 2.3) and s k−1 is the timestamp of the previous scan. For the first scan, T 1 is set to the identify matrix, andV 1 is set to zero. For a point j in the current point cloud P k , let s j k be its timestamp; then, its transformation matrixT j k can be obtained using Equation (2). Then, the transformation matrix that projects point j into the local frame of Assuming that p j k is the position of point j, the new position p j k after compensation can be calculated as follows: are the rotation and translation components ofT j P k , respectively.

Classification of Planar Features
The planar features are classified in three steps: spherical projection, planar point selection, and planar point grouping. In the spherical projection step, a spherical image is generated based on the coarse undistorted point cloud. In the planar point selection stage, a smoothness value is assigned to each point based on a fixed window and used to select planar points. In the planar point grouping stage, connected component labeling is employed to divide the point cloud into different planar groups. The use of an explicit point group allows us to filter out unreliable smaller objects, thus preventing false matches during feature tracking. Each step is explained in detail below.
Spherical Projection: For a point cloud P k after compensation, the corresponding spherical projection coordinate [u, v] T for a point p = [x, y, z] T ∈ P can be calculated as follows: where r = x 2 + y 2 + z 2 is the distance of the point to the origin; are the vertical and horizontal fields of view of the LiDAR in terms of the radius, respectively; and M and N are the width and height of the projected image, respectively. M and N should be selected according to the LiDAR resolution. Planar Point Selection: A patch-based method is used to select the planar points. Each patch is a 5 × 5 square region in the image. We approximate the local plane of a patch π using a fast method that requires only the four vertices {p 11 , p 15 , p 51 , p 55 }: where c π is a planar point in π and n π is the plane normal of π. For a point p ∈ π, the point-to-plane distance can be calculated as d p = n π T (p − c π ) 2 . The points with values of d p less than the threshold α 1 are output as π S . Finally, the smoothness τ π of patch π is the mean value of π S . Each point j in the image is included in several patches. Let π * be the patch with the smallest smoothness value; then, the smoothness and point normal of j are given by τ π * and n π * , respectively. Finally, the points with smoothness values less than the threshold α 2 are selected as planar points. By choosing the least smooth patch to determine the planar points, we can prevent misclassifying points at the plane intersection.
Planar Point Grouping: We group the planar points using an adapted connected component labeling algorithm. Connected component labeling is a technique that assigns the same label to all connected points in an image. We consider two adjacent points p 1 and p 2 to be connected if they satisfy the following conditions: where n 1 and n 2 are normal vector of p 1 and p 2 , respectively. Each connected component represents a plane region with points that all belong to the same object. As a result, we can filter out objects with few points, since smaller objects are most likely noise. The label of each point serves as its feature ID. Finally, the planar features are downsampled for later use. The feature ID is used for tracking and to determine the map features to prevent inaccurate data associations.

Classification of Edge Features
The edge features are detected based on the extracted planar features. We use the same selection-then-grouping scheme as in planar feature detection to extract the edge features.
Edge Point Selection: A two-step edge feature detection procedure is presented, in which the coarse edge features are first selected based on the extracted planar features and then used to identify more reliable edge features. Let p i and n i be the position and normal vector of a point i in an image. As shown in Figure 3, a point is selected as a coarse edge feature if any of the following three scenarios are applicable:

1.
A point i is selected as an intersection edge feature if it and its neighboring point j are planar features with different planar feature IDs. Moreover, the position and normal of i and j must satisfy the conditions:

2.
A point i is selected as a jump edge feature if it is a planar feature with a significantly different depth than its neighboring point j:

3.
A point i is selected as a thin edge feature if its two adjacent points j and k in the same row or column satisfy the conditions: Then, a new point cloud S is obtained by downsampling the coarse edge features. For each point p 1 ∈ S, we identify two neighboring points p 2 ∈ S and p 3 ∈ S using a KD tree. This tuple is represented as L. The direction vector n L and smoothness τ L of the tuple L are given as: τ L also represents the distance from the center point p 1 to the line formed between its two neighboring points p 2 and p 3 . Similar to the case in planar feature selection, a point in S can belong to several tuples. Let L * be the tuple with the smallest smoothness value. Then, the normal and smoothness of the point are given by n L * and τ L * . Finally, the points with smoothness values greater than the threshold β are filtered out. The remaining points in S are processed further and grouped.
Edge Grouping: Adapted connected component labeling is used to label the edge features. Two edge features are determined to be connected according to the above KD-tree search result and an additional metric, n 1 T n 2 > 0.8, where n 1 and n 2 are the normals of the two edge points. Then, smaller clusters with less points are removed. The label of each point serves as its feature ID, which is used for tracking and identifying the map features.

Thin Edge
Jump Edge Intersection Edge 90° Figure 3. Our edge feature selection scheme. An intersection edge is a planar point with a smaller depth than its neighboring points and a larger angle, e.g., 90 • , between the corresponding normal vectors. A jump edge is a plane point with a much smaller depth than its neighboring points. A thin edge is a point with two neighbors in the same row or column that have much larger depths. The depth is infinite if the yellow plane with the dashed boundary does not exist.

LiDAR Odometry
The odometry module consists of two parts: Map Maintenance and Optimization. Map maintenance uses forward tracking as its core idea, which is inspired by π-LOAM [26]. We adapted this idea based on the proposed feature extraction method and developed a completely new feature tracking method. The developed method is more suitable for the pose optimization used in this work. The optimization is based on the bundle adjustment. The formulation for LiDAR bundle adjustment was first proposed by BALM [27]. However, BALM does not provide compensation for motion distortion, which may reduce the accuracy. Motion distortion is caused by the ego motion of the LiDAR, which results in each point in the point cloud being sampled at a different location. Based on the constant motion model, a unique pose can be assigned to each point to compensate for the distortion. Our main contribution is to jointly optimize the parameters of the constant motion model and the pose. The constant motion model ensures that our model better matches the real point cloud acquisition process, which can positively affect the accuracy of the estimation.

Map Maintenance
The map is a collection of map features, including planes and edges. A map feature is a spherical region fixed in 3D space that accumulates feature points from new scans that occur in its spherical region and satisfy certain conditions. Figure 4 shows an example of our map. We use a sliding window mechanism to ensure that the map size is bounded. After a new scan is inserted into the sliding window, the existing map features identify new feature points from the extracted features, and new map features are created from the remaining feature points. When the number of scans in the sliding window exceeds a certain size, three actions are performed to reduce the size of the map: (1) the oldest scan is removed, (2) the associated feature points are removed from the corresponding map feature, and (3)   Feature Tracking: The planar and edge features extracted from the current scan are denoted as H and E , respectively. We start by projecting map features that are neither edge nor planar features onto the current frame using the predicted transformation matrix from Section 2.2.1. Let c f be the center of a map feature f , and c f be the closest point to c f in the corresponding H or E . Then, we identify all feature points Q with the same feature ID as c f that belong to the spherical region of map feature f in the corresponding H or E . An eigenvalue decomposition of the covariance matrix of the 3D coordinates of Q is performed. Let λ 1 , λ 2 , λ 3 (λ 1 ≤ λ 2 ≤ λ 3 ) be the eigenvalues of Q and n 1 , n 2 , n 3 be the corresponding eigenvectors. For a planar feature, we use the following conditions to determine whether Q should be added to f : For an edge feature, we use the following conditions: where n is the normal vector or direction vector of the map feature f . Feature Creation: New map features are created based on the remaining untracked features. For each untracked feature point j, we locate the K nearest neighbors in H or E , depending on the feature type. Then, we filter out points that have a feature ID different from j and denote the remaining points as D. We perform an eigenvalue decomposition on the covariance matrix of D. For the planar features, if the second largest eigenvalue is much larger than the smallest eigenvalue ( λ 1 λ 2 < 0.1), all points in D are extracted as new planar map features. For the edge features, if the largest eigenvalue is much larger than the second largest eigenvalue ( λ 2 λ 3 < 0.2), all points in D are extracted as new edge map features. Note that once a point is selected as a map feature, it is never selected again to ensure that there are no redundant map features. Finally, the center and radius of the map feature are calculated using D, and the normal vector or direction vector is saved.

Bundle Adjustment with Direct Velocity Estimation
The k-th scan in the current sliding window is denoted as F k . The full state vector X to be optimized in the sliding window can be expressed as follows: where K is the width of the current sliding window, and x k is the state of F k , where x k includes the rotation, position, angular velocity, and linear velocity of F k in the world frame. Each map feature that is neither planar nor angular corresponds to a residual in the final optimization formula. Suppose f is a map feature, and p To construct the residual of f , we must first project its observation points into the global coordinate system using the transformation matrix obtained with Equation (2). The projection can be performed using the following equation: where p j k is the position of p j k in the world frame, s k is the start time of F k , and s j k is the timestamp of p j k . Equation (17) includes the velocity as a parameter, which is the main difference between our algorithm and BALM in the odometry stage. The velocity parameter, similar to the transformation matrix, is optimized during the final optimization process. The mean p and covariance A f of f can be calculated as follows: where N f = ∑ K k=1 N f k is the total number of feature points in f . For a planar map feature f , the loss metric ε f π is the sum of the squared distances of all tracked points to the best-fitting plane, which can be calculated as: For the edge map feature, the loss metric ε f L is the sum of the squared distances of all tracked points to the best-fitting line, which can be calculated as: Λ k (·) is a function that maps a matrix (·) to its k-th largest eigenvalue. The specific plane (or line) parameters are eliminated from ε f π (or ε f L ); the detailed procedure is the same as that of BALM [27]. Since the function Λ k (·) does not have a closed form, it is impossible to calculate the Jacobian matrix directly. Fortunately, BALM provides the first-and secondorder Jacobians in closed form for the function Λ k (·). Moreover, the results of ε π and ε L are quadratic, but we use the Ceres solver library [38] for optimization, which requires non-squared residuals. Therefore, we determine the final residual by directly computing the square root of ε f π and ε f L : Since we use a design with 12 degrees of freedom, it is difficult for the odometry module to guarantee perfect convergence, severely limiting the applications of the algorithm. This limitation can be addressed by introducing a continuous factor. This solution is very effective and allows our algorithm to produce valid results even in degenerate situations. The continuous factor is similar to the IMU pre-integration factor in VIO [39] and imposes a constraint on the motion of adjacent scans. The residual term of the continuous factor in the constant motion model can be defined as follows: We use M π and M L to represent the planar and edge map features in the current map, respectively. The final optimization problem can be expressed as a combination of Equations (21) and (22): arg min where the Huber loss ρ is defined as follows: We use the pose and velocity of the oldest scan in the current sliding window as a prior in the optimization process. The odometry factor graph is shown in Figure 5.

LiDAR Mapping
LiDAR mapping is based on odometry with some modifications to optimize the poses. Our process includes three enhancements: first, a keyframe selection step is added to reduce redundancy between scans. Second, the compensated point cloud is used as the mapping input; thus, the velocity does not need to be estimated. Third, a new formula for computing the covariance matrix is developed, which is faster than computing the covariance matrix directly from the raw points. The overall LiDAR mapping pipeline is similar to BALM [27] except for two improvements. One is the use of the feature map proposed in Section 2.3.1 and the other is the improved calculation method of the covariance matrix used in the optimization procedure.
The first step in the LiDAR mapping algorithm is to use a selection process to convert the new scans into keyframes, optimizing only the poses of the keyframes. To maximize sparsity while maintaining some degree of overlap between adjacent scans, the following criteria are used to determine if a new keyframe is needed: 1.
The number of scans within the sliding window has not reached a preset value.

2.
The ratio of the number of successfully tracked planar map features to the number of newly created features is less than 2.

3.
The ratio of the number of successfully tracked edge map features to the number of newly created features is less than 1.5.

4.
The time difference between the most recent frame and the last keyframe is greater than a specified value.
Each time a new keyframe is created, we use the odometry estimation result to remove distortions, as described in Section 2.2.1. Then, the undistorted features are added to the current map, and the pose of the new keyframe is optimized. The map maintenance and optimization processes are essentially identical to those in the odometry stage, and hence, we do not describe them here. Once the sliding window is full, a marginalization process based on the recursive form of the covariance matrix is performed [27,40], retaining the information of all points outside the sliding window using only a few parameters. Although the marginalization strategy is similar to those of most vision approaches, it is much simpler. The marginalization information is retained as a fixed component of the map feature and is added to the final covariance matrix. Moreover, the velocity can be eliminated from Equation (17), since we use an undistorted point cloud as the LiDAR mapping input. Therefore, Equation (17) can be simplified to: The notations in the above formula are the same as those in Equation (17), except that p j k is the undistorted point. Then, we calculate the covariance of each map feature and form the residual using Equation (21). In addition, the removal of the velocity leads to the removal of the continuous factor from the optimization objective because the dimension of the optimization is reduced. Thus, the entire optimization objective can be reduced to: arg min In the following section, we discuss a technique for accelerating the computation of the covariance matrix A f in residuals.

Fast Computing for A f
The computation of the LiDAR residuals r f L and r f π can be decomposed into two steps: the computation of the covariance matrix A f and the eigenvalue decomposition. The formal component must be computed from all feature points of a given map feature. We demonstrate a technique for computing A f by storing only a small matrix and vector for each scan and eliminating the dependencies of the raw feature points. Equation (18) can be rewritten as follows: where the notations are the same as those in Equation (18). (25) can be substituted into C and S; after the equation is simplified, we obtain: Since C k and S k are irrelevant to the pose of the current frame, C k and S k can be calculated in advance. With Equations (27) and (28), we can perform a fast computations of A f and reduce the large number of calculations required during the optimization process. Since the number of scans is often small, this method for computing A f can significantly reduce the memory requirements and computational complexity.

Experimental Results
We implemented the proposed algorithms in C++ and ran them using a robot operating system (ROS) [41] on a desktop with an Intel i7-9700 CPU and 16.0 GB of RAM. To the best of our knowledge, this is the first time that spherical projection has been used to unify inputs from different LiDARs. We achieved results comparable to other state-of-the-art approaches by simply adjusting some parameters for each type of LiDAR. Four types of LiDARs were used in our evaluation: Velodyne HDL-32E, Ouster OS1-64, Livox Horizon, and RS-LiDAR-M1. The detailed results are explained below.

Evaluation Metrics
Two metrics were used to evaluate the results of the different algorithms. The first was the relative pose error (RPE), and the second was the end-to-end position error. The RPE corresponds to the local drift in the trajectory. The EVO library [42] was used to compute the RPE. All estimated trajectories were aligned with the ground truth using Sim(3) Umeyama alignment over the EVO library. We assume that T k and T k+δ are the estimated poses of the ground truth G k and its successor G k+δ , respectively; then, the RPE at time k can be defined as: The RPE has two parts: the rotation component, which is known as the relative rotation error (RRE), and the translation component, which is known as the relative translation error (RTE). The mean and root mean square metrics were used to combine the results of the entire trajectory.
Moreover, the end-to-end position error was used for the Livox Horizon dataset, because no ground truth trajectory was available.

Feature Extraction Results
Feature extraction is an important component in the SLAM algorithm and has a considerable impact on subsequent pose estimation tasks. The Lego-LOAM method uses the segmentation information to extract features. We compare VLOM with Lego-LOAM using some representative scenes from the UrbanLoco dataset [43]. The results are shown in Figure 6. The planar and edge features are colored green and red, respectively. VLOM detects more correct edge features than Lego-LOAM (the dashed circle in Figure 6). Lego-LOAM misclassifies these edge features as planar features. The feature selection module of Lego-LOAM depends on the smoothness of the scan line, which is not suitable for accurately classifying points. The correct identification of these edge features is crucial for stable tracking. Moreover, because VLOM uses accurate point-to-line and point-to-plane distances to determine the feature points, VLOM does not classify planar points as edge points (the rectangle with a solid line in Figure 6). In addition, Lego-LOAM classifies a large number of points from nearby dynamic cars as edge features (the ellipse with a solid line in Figure 6). However, such features are detrimental to the performance of the algorithm. In contrast, VLOM almost completely filters out these points. Figure 7 shows the segmentation results for the same scenes. The various colors indicate different planes. Lego-LOAM does not segment planes; instead, it segments connected objects. Figure 7 shows that the ground and wall points are grouped correctly. Lego-LOAM is excellent at classifying ground points. Although VLOM divides the ground points into multiple clusters, this has no effect on our feature selection result because each cluster is large enough that it is not filtered out and can thus provide plane information. This good segmentation result is critical since it allows us to use points at the intersections of the planes as edge features. In addition, VLOM performs better near the wall than Lego-LOAM (the rectangle in Figure 7). The excellent feature classification performance of VLOM results in stable tracking. The segmentation results may also provide additional information for subsequent identification and obstacle avoidance tasks.

Evaluation on Velodyne HDL-32E
The sequence, namely HK-Data20190117, in this section was taken from the UrbanLoco dataset [43]. HK-Data20190117 were acquired with a Velodyne HDL-32E laser scanner at a rate of 10 Hz. The ground truth was acquired with Novatel SPAN-CPT, which is an integrated GNSS/INS navigation system with a position error of less than 2 cm. Since GNSS signals are easily affected by occultation and weather conditions, the selected sequence was acquired in a light urban area with good satellite visibility (as indicated by the author). HK-Data20190117 is a challenging dataset for LiDAR-based SLAM tasks because it includes a large number of dynamic objects, which can result in incorrect data associations and lower localization accuracy. In addition, the sequence provides a ground-truth linear velocity, which we used to evaluate our direct velocity estimation method.
For comparison, the most advanced pipelines were selected, including BALM, A-LOAM [44], and Lego-LOAM; the results are shown in Table 1. A-LOAM is an open-source implementation of the original LOAM algorithm. BALM replaces the mapping module of LOAM with a bundle adjustment approach. Based on Table 1, we can conclude that our method achieves higher accuracy than the other methods.  Figure 8a shows the estimated trajectories of each method versus the ground truth. As seen in Figure 8a, VLOM agrees with the ground truth across the entire trajectory and achieves a higher accuracy than the other methods. BALM and A-LOAM have significant tracking losses at various positions. Figure 8b shows a representative scene in which A-LOAM was unable to complete the tracking. This scene included many vehicles (indicated by the rectangle in Figure 8b) that were traveling parallel to the data collection vehicle.
As a result, the A-LOAM algorithm could not detect any movement, and the trajectory was stationary for some time. Although Lego-LOAM performed better on this scene than A-LOAM, Lego-LOAM was also affected to some extent, with a significant error in the end position. BALM uses a voxel grid to capture feature associations; this method results in a lower accuracy and thus may not be flexible enough for complex real-world scenarios. As a result of our precise feature extraction method, VLOM is less affected by dynamic objects and can achieve a stable tracking performance. To investigate the effectiveness of our feature selection scheme, we replace the feature extraction module of BALM and A-LOAM with our method. Since Lego-LOAM must clearly distinguish between ground features and non-ground features, it is not compatible with our method and therefore is excluded from this evaluation. The updated algorithms with our feature module are denoted as BALM-F and A-LOAM-F. Figure 9 shows a comparison of the new algorithms. The considerable tracking loss of the original methods disappeared, and consistent trajectories were obtained. Moreover, the RMSEs of the RTEs reached 0.286 and 0.267 for BALM-F and A-LOAM-F, respectively. This result proves that our feature extraction module is more accurate and robust than the other methods.

Evaluation of the Velocity Estimate
The most important function of the odometer is the velocity estimation [45]. An accurate velocity estimation is crucial for recovering a time-aligned point cloud that has been distorted by the ego motion of the LiDAR. Let T k and s k be the odometry output and timestamp of the k-th scan, and let the linear velocity v GT k and angular velocity w GT k be the corresponding ground truth. For the BALM, A-LOAM, and Lego-LOAM methods, the velocity can be determined according to the relative transformation of the neighboring scans. For these methods, we define the following error metrics: where Trans(·) and Rot(·) are the translation and rotation components of (·), respectively, and e v k and e w k are the errors of the linear and angular velocities of the k-th scan, respectively. Since the velocity is calculated explicitly in VLOM, the error can be expressed as follows: where v k and w k are the odometry velocity output of VLOM. The above definitions of e w k and e v k are based on the L2 norm, which eliminates the difficult task of aligning the estimated trajectory with the ground truth. Since the UrbanLoco dataset does not include the angular velocity, we evaluated only the linear velocity. The results are shown in Figure 10. As seen from the magnified area in Figure 10, the car remained in the start position for an extended period of time. Lego-LOAM had a low stability and a large error at this stage. However, while driving, BALM and A-LOAM performed worse than Lego-LOAM. Although Lego-LOAM showed excellent accuracy, the region marked by the vertical dashed line in Figure 10 shows that Lego-LOAM had substantial errors. These regions correspond to the four corners of the overall sequence, indicating that the iterative velocity estimation algorithm cannot handle rotation distortions. In contrast, VLOM performed well on the entire sequence. Thus, the combination of bundle adjustment with direct velocity estimation allows for a tremendous improvement in terms of velocity accuracy.

Evaluation on Livox Horizon
Livox Horizon is a solid-state LiDAR with a circular FoV of 81.7 • (horizontal) × 25.1 • (vertical) and a non-repeating scan pattern. Since solid-state LiDAR is a new technology, there are few publicly available datasets. We selected the KA-URBAN-Schloss-2 sequence (abbreviated as Schloss-2), which was provided in the work on LiLi-OM [34]. Schloss-2 was recorded at a typical sampling rate of 10 Hz while traveling with a sensor suite in Karlsruhe, Germany. The total length of this sequence is approximately 1.10 km. The walking speed is approximately 1.49 m/s. The author does not provide the ground-truth trajectory. Therefore, the RPE is not available. For quantitative comparison, because the sequence ends at its starting position, the end-to-end position error can be used as an evaluation metric. Lego-LOAM is more accurate than the other methods; however, it has a poor performance in the areas marked by the gray dotted line.
A-LOAM and Lego-LOAM were developed for conventional LiDAR and thus are incompatible with Livox Horizon. Therefore, Livox-Horizon-LOAM (abbreviated LiHo) [46] and BALM were used for comparison. LiHo is an LOAM variant that was adapted to Livox Horizon with a specially designed feature extraction module. In this study, we supported LiHo with IMU. The result is shown in Table 2, and the entry after "/" shows the result with IMU. In addition, we directly report the LiLi-OM result from their article. LiLi-OM is a state-of-the-art tightly coupled LiDAR inertial odometry. VLOM significantly outperforms the LiDAR-only systems. It is worth noting that after extensive testing, we found that this sequence has several frame losses (with a maximum time interval of 0.43 s), which negatively impacts our system because relatively long time intervals violate the constant motion assumption. This shows that our system is highly robust. Moreover, the results show that VLOM outperforms LiLi-OM, which has a special configuration for Schloss-2. This indicates some similarity in the roles of the direct velocity estimation and IMU, prompting us to reconsider the value of in-scan motion for LiDARs. Table 2. End-to-end position error (in meters) on Schloss-2. We also show the results of the IMUaided LiHo and LiLi-OM (denoted by the second entry after "/"). The bold indicates the best result. Our method outperforms the IMU-aided LiHo and LiLi-OM (without loop closure), which is an exciting result.  Figure 11a shows a top view of the final map of Schloss-2 produced by our method, and Figure 11b shows a side view of a representative structure place. The color is based on the intensity value. The map preserves fine structural details of the environment due to our accurate estimation of the scan velocity.

Evaluation on Our Campus Dataset
To test the performance of VLOM on real-world scenarios, we built a sensor suite using a conventional LiDAR, Ouster OS1-64, and a newly released solid-state LiDAR, RS-LiDAR-M1. RS-LiDAR-M1 has a resolution of 0.2 • in the horizontal and vertical directions and an FoV of 120 • (horizontal) × 25 • (vertical). The datasets were acquired at a frequency of 10 Hz using an RS-LiDAR-M1 sensor installed horizontally on the front of a vehicle and an Ouster OS1-64 sensor installed horizontally on the roof. We used an integrated high-end Newton-M2 navigation device to obtain the ground truth. Newton-M2, when used in conjunction with a differential GNSS, can achieve a position accuracy of 2 cm and an attitude error of approximately 0.1 • . The complete experimental platform is shown in Figure 12. Two sequences were collected by manually driving a car through the eastern campus of Sun Yat-sen University. The sequence corresponding to RS-LiDAR-M1 has a length of 5.94 km and is denoted as RS. The sequence corresponding to Ouster OS1-64 has a length of 1.37 km and is denoted as OS. The average velocity of both sequences is approximately 6 m/s.  Table 3. Since A-LOAM does not support Ouster OS1-64, we performed some minor changes. The results show that BALM has large translation errors and that our method performs better than the other methods. Since the OS dataset was collected on a narrow street surrounded by buildings, there are no large terrains in the dataset, resulting in a larger error with Lego-LOAM. Since previous works have not included RS-LiDAR-M1, only VLOM was assessed on the RS dataset. The ground truth aligned to Google Earth is shown in Figure 13a, and the estimated trajectory of VLOM is shown in Figure 13b. Since there is no module that can close the loop, drift is inevitable, yet our results rarely overlap with the ground truth. Moreover, Figure 14 shows the deviations in the position and orientation. The error is mainly caused by [z, roll, pitch], implying that the constraints in the vertical direction are insufficient. Since these constraints are mainly imposed by the ground features and the main direction of the LiDAR sensor is parallel to the ground, it is difficult to effectively extract features. Evaluation of the Velocity Estimate The velocities are compared in Figure 15. Since BALM has large errors on the OS dataset, the corresponding data are not shown here for clarity. A-LOAM and VLOM have overlapping curves and show similar performance; however, the magnified area at the top of Figure 15 shows that VLOM performs slightly better than these methods. Lego-LOAM appears to be unstable on the OS dataset and has more jitters.

Evaluation of Runtime
Furthermore, we calculate the average runtime of each module for all sequences used in the experiments; the results are shown in Table 4. Since the feature extraction and LiDAR odometry modules in VLOM operate in the same thread, we consider the total time spent by both and report it as Feature + Odometry. To evaluate the effect of the fast covariance calculation method proposed in Section 2.4.1, we carried out an experiment with the same conditions as Equations (27) and (28) and report the results as Old and New, respectively. VLOM demonstrates real-time performance capability (less than 100 ms at the LiDAR frame rate). By comparing the Old and New results, we can conclude that the new method effectively reduces the runtime.

Discussion
Feature extraction: Feature extraction is an important component of many LiDARbased SLAM algorithms. The feature extraction module often determines whether the algorithm can be used with a particular LiDAR type. The experiments in Section 3 show that only VLOM can complete all sequences (the other methods are incompatible with RS). In addition, the BALM algorithm supports both conventional LiDARs and solid-state LiDARs (Livox Horizon). However, BALM uses different feature extraction modules for conventional LiDARs and solid-state LiDARs. Depending on the LiDAR used, different startup programs need to be executed. Without adaptation, there may be the new LiDAR type to which BALM cannot be applied (e.g., RS -LiDAR-M1). The adaptability of VLOM is improved by our spherical projection-based feature extraction module. In addition, the spherical projections allow us to perform more detailed analyses of the planes and edges, improving the quality of the extracted features. This improvement was confirmed by the experiments with A-LOAM-F and BALM-F shown in Figure 9. Both algorithms with good results were developed by replacing the feature extraction modules of the original A-LOAM and BALM algorithms.
Velocity estimation: As described in Section 3.3.1, the velocity can be estimated either directly or based on the poses of two frames, followed by interpolation. Most open-source algorithms are based on the latter scheme. However, this can lead to closed loops between the pose and velocity estimates: the pose estimate requires that the velocity estimate produces an accurate and undistorted point cloud, while the velocity estimate, in turn, relies on the pose estimate. This type of iterative process cannot be repeated infinitely. In fact, most algorithms perform very few iterations; thus, it is difficult to achieve an accurate result in complex scenarios. This trend can be observed in Figures 10 and 15. Since the HK-Data20190117 sequences were collected on a street surrounded by a complex environment, the velocity accuracies achieved by A-LOAM, BALM, and Lego-LOAM were lower than that achieved by VLOM; however, because the OS dataset was collected on a campus surrounded by several buildings and few people, A-LOAM and VLOM had similar velocity accuracies.
We assume that the LiDAR is moving at a constant speed. However, since SLAM is a probability estimation problem, some error is acceptable, so this algorithm can work in most real scenarios with non-constant velocity. In the experiments of this paper, a walking sequence (Schloss-2), two driving sequences on campus (OS and RS), and a driving sequence on a city street (HK-Data20190117) were studied. In particular, the walking sequence, which has a high degree of randomness, is very challenging. The good performance on these sequences proves that the algorithm in this work can handle the real-world scenarios with non-constant motion well.
Bundle adjustment: We use the scheme proposed by BALM for bundle adjustment. Bundle adjustment is a strategy that was first used in vision-based SLAM. In bundle adjustment, the previous frames and current frame are optimized together, achieving consistency over a wider range. However, when used with LiDARs, two problems arose: first, a single point could not be used to represent a complete landmark (i.e., planes and edges); thus, multiple points were required to determine a landmark in each frame. However, it is not easy to determine the appropriate sets of points. Second, the distortion of the point cloud caused by LiDAR motion led to inaccurate landmark observations. To address the first problem, BALM uses an adaptive octree to determine the observed point set and considers all points belonging to the same voxel as observations of the same landmark. False associations are generated for voxels located at the intersection of two landmarks. To address the second problem, BALM uses a LOAM variant in its front end to estimate the velocity and compensate for the point cloud. However, in our tests, the LOAM variant used by BALM showed poor real-time performance, which significantly reduced the speed of the algorithm. Thus, these solutions are inadequate. The results on the HK-Data20190117, Schloss-2, and OS sequences show that the bundle adjustment module of BALM is not robust enough, resulting in large deviations in these sequences. We address the first problem by using the strategy described in Section 2.3.1 to adaptively capture the set of observations of a landmark. By incorporating the intraframe velocity estimate into the optimization process, we can evaluate the motion compensation of the point cloud during the bundle adjustment process, which addresses the second problem quite well. VLOM significantly outperformed BALM on all sequences except the RS sequence, which is not supported by BALM.

Conclusions
In this work, we developed a SLAM system with 12 degrees of freedom that allows for real-time state estimation. Since the feature extraction module is designed for versatility, VLOM is suitable for both spinning and solid-state LiDARs. In addition, by directly estimating the velocity, we can produce a more accurate point cloud, improving the accuracy of LiDAR mapping. We also improved the efficiency of computing the covariance matrix, increasing the speed of bundle adjustment during the LiDAR mapping. With extensive tests on various LiDARs, we show that our algorithm performs better performance than various state-of-the-art methods.
We use spherical projection to unify all inputs, primarily because the spherical image is small and provides a fast method for finding nearest neighbors, which is important for real-time performance. However, there are many other representation options, such as bird's eye view. In the future, other suitable representations can be explored to improve the performance of the feature extraction module. In addition, the proposed algorithm does not include a loop closure module, which is useful for reducing drift and is already used in many SLAM algorithms. In the future, a loop closure module will be added to VLOM to ensure long-term global consistency. In addition, IMU, GPS, and other sensors could be combined to achieve a more stable performance. Since the linear and angular velocities are explicitly included in the state estimation, our system may perform better with the addition of IMU.

Conflicts of Interest:
The authors declare no conflict of interest regarding the publication of this manuscript.