LiDAR Odometry by Deep Learning-Based Feature Points with Two-Step Pose Estimation

An accurate ego-motion estimation solution is vital for autonomous vehicles. LiDAR is widely adopted in self-driving systems to obtain depth information directly and to eliminate the influence of changing illumination in the environment. In LiDAR odometry, the lack of descriptions for feature points, as well as the failure of the uniform motion assumption, may cause mismatches or dilution of precision in navigation. In this study, a method to perform LiDAR odometry utilizing a bird’s eye view of LiDAR data combined with deep learning-based feature points is proposed. Orthographic projection is applied to generate a bird’s eye view image of the 3D point cloud. Thereafter, an R2D2 neural network is employed to extract keypoints and compute their descriptors. Based on those keypoints and descriptors, a two-step matching and pose estimation scheme is designed to keep the feature points tracked over long distances with a lower mismatch ratio than the conventional strategy. In the experiment, the evaluation of the proposed algorithm on the KITTI training dataset demonstrates that the proposed LiDAR odometry can provide more accurate trajectories than a handcrafted feature-based SLAM (Simultaneous Localization and Mapping) algorithm. In detail, a comparison with handcrafted descriptors is demonstrated, and the difference between the RANSAC (Random Sample Consensus) algorithm and the two-step pose estimation is shown experimentally. In addition, data collected by a Velodyne VLP-16 is also evaluated by the proposed solution. The low-drift positioning RMSE (Root Mean Square Error) of 4.70 m over approximately 5 km of mileage indicates that the proposed algorithm generalizes to low-resolution LiDAR.


I. INTRODUCTION
AUTONOMOUS driving is the trend of intelligent transportation, and high-level self-driving cars place higher requirements on vehicle positioning accuracy. To obtain robust positioning results, multi-sensor fusion in navigation and localization has been widely researched [1]-[4]. In localization by multi-sensor fusion, making the most of each single sensor helps improve the robustness and accuracy of the entire system. GNSS, as a main kind of global positioning method, can provide high-accuracy position information under ideal conditions. However, it is always vulnerable [5], especially in environments such as urban canyons, where shadowing by buildings and vegetation causes the signal to be interrupted or seriously degraded. As an active ranging sensor, LiDAR can work under more severe conditions and provides accurate depth information directly. Additionally, when working in an area with a pre-built map, LiDAR can provide reliable global positioning results. However, high-quality pre-built maps are hard to access due to their high cost and limited coverage. Hence, research on LiDAR odometry is of great significance to autonomous driving.
LiDAR odometry is a fundamental method for relative pose estimation. Data association and long-distance feature tracking are still obstacles to improving its accuracy. In conventional feature-based LiDAR odometry, feature points are usually associated with the closest line or plane based on an initial guess of the pose [6], [7]. In data association with handcrafted descriptors, Random Sample Consensus (RANSAC) is implemented to determine the correspondences, but a single distance threshold in feature space prevents these algorithms [8], [9] from improving the quality of correspondences and the length of tracking. As a powerful tool for feature extraction, convolutional neural networks have made significant progress in recent years. These networks [10]-[13] are able to extract feature points with descriptors on images and have already outperformed handcrafted features [14]-[16]. To alleviate these problems, a deep learning-based approach is employed for feature extraction and description. After that, the feature points are culled jointly with pose estimation by a two-step strategy to perform LiDAR odometry. The main contributions of this paper are as follows: 1) An accurate LiDAR odometry algorithm by deep learning-based feature point detection and description.
The feature points are extracted from the BEV (bird's eye view) image of 3D LiDAR data, providing more accurate and robust keypoint association than handcrafted feature descriptors.
2) A two-step feature matching and pose estimation strategy is proposed to improve the accuracy of keypoint association and the length of feature tracking. The first step ensures the accuracy of data association, and the second step adds more reliable feature points for long-range tracking.
3) The proposed method is evaluated on a commonly used benchmark, the KITTI dataset [17], and compared with a SLAM algorithm based on handcrafted feature points. The contributions of deep learning-based feature extraction and two-step pose estimation are verified by separate experiments. In addition, the generalization of the proposed algorithm is demonstrated in a field test using a low-resolution LiDAR, the Velodyne VLP-16.

II. RELATED WORK
The ability of positioning and dead reckoning for autonomous vehicles is the fundamental guarantee of autonomous driving capability. LiDAR odometry is an important means of providing positioning results when absolute positioning information is blocked. Currently, LiDAR odometry includes three main types: geometry-based methods, deep learning-based methods, and hybrid methods.

A. Geometry-based methods
Geometry-based methods take the geometric information in a neighborhood into account, utilizing curvature to obtain feature points. LOAM [6], F-LOAM [7], and LeGO-LOAM [18] extract feature points without descriptors by computing curvature in a neighborhood. A uniform motion model is usually adopted to predict the initial pose of the current LiDAR frame. Then each feature point is associated with the closest line or plane by the initial pose. Finally, the relative pose of the LiDAR is estimated by minimizing the point-to-line and point-to-plane distances. The information from adjacent scan lines is not fully utilized. In [19], [20], common landmarks extracted from infrastructure in urban areas, such as poles and traffic boards, are also employed for LiDAR scan matching. These methods are constrained to certain areas and might be ineffective in the country or on highways. Since there are no descriptors for the feature points or landmarks, mismatches can damage the performance of LiDAR odometry. In DLO [21] and DL-SLAM [22], a height map of the 3D LiDAR data is rasterized on the XY-plane to generate an image, and direct grayscale matching is utilized to compute translation and rotation. Ali [9] proposed a feature-based LiDAR SLAM using rasterized images: ORB feature points are extracted on the BEV of the LiDAR data along with their descriptors, and the RANSAC algorithm is then applied to associate these feature points. However, these approaches to feature extraction and feature point correspondence can be further optimized for more accurate association and longer tracking.

B. Deep Learning-based methods
This type of method generally preprocesses the point cloud by spherical projection to generate a multi-channel image.
Then the relative pose between LiDAR frames is estimated by end-to-end inference. LO-Net [23] takes the spherical projection image to perform end-to-end LiDAR odometry directly. DeepVCP [24] was trained on dense point clouds from a Velodyne HDL-64, accumulated over 3 seconds, and extracted 3D feature points for relative pose estimation. At the same time, some researchers are exploring unsupervised methods for learning-based LiDAR odometry, such as [25], [26], which effectively reduce the data required for training, but whose accuracy is difficult to raise to the level of supervised learning. Point clouds preprocessed in the form of spherical projection [23]-[28] make it difficult to extract local features accurately for a low-cost scanning LiDAR such as the Velodyne VLP-16, due to its sparsity. To make the algorithm able to process the data of low-resolution LiDAR, the BEV image is employed in the proposed algorithm. Meanwhile, deep learning-based methods have not yet been investigated for processing the BEV image of LiDAR data in LiDAR odometry, and this is worth exploring.

C. Hybrid methods
In hybrid methods of LiDAR odometry, one type uses traditional methods to extract feature points and then regresses the relative pose of LiDAR frames with a neural network. LodoNet [29] employs SIFT [14] for feature extraction on LiDAR spherical projection images, and then rearranges the feature point correspondences into an n × 6 vector (each LiDAR point contains 3-dimensional coordinates). Finally, convolutional neural networks are trained for feature point culling and regression of relative poses. The other stream of hybrid methods, represented by DMLO [27] and 3D3L [30], extracts and matches feature points by means of deep learning on the spherical projection images of LiDAR frames, and then performs ICP [31] on the correspondences to obtain the solution by Singular Value Decomposition (SVD) or nonlinear optimization. However, they keep the same flaw as the end-to-end methods by taking the spherical projection image as input, and it is not necessary to compute the transformation by network modules.
In this paper, we combine feature extraction by deep learning neural network and nonlinear optimization to solve the rotation and translation in LiDAR odometry.

III. SYSTEM OVERVIEW
The scheme of the proposed algorithm is shown in Fig. 1. The entire pipeline includes a pre-processing module and a LiDAR odometry module. The red modules in the flowchart correspond to the main contributions of this paper. In pre-processing, the LiDAR point cloud is first segmented to retain the non-ground points used to generate the rasterized image. Then, in the feature extraction stage, R2D2-Net is utilized for keypoint extraction and description. After that, the feature points are stored in the map database for further pose estimation and backend optimization.


IV. PRE-PROCESSING
This part corresponds to the pre-processing module in Fig. 1. A LiDAR mounted on a moving vehicle receives reflections from the ground all the time. These reflections harm pose estimation on the XY-plane when point-to-point matching is applied. Consequently, the LiDAR data is first segmented, keeping the non-ground points in the environment for further processing. Since only a few points are reflected from distant targets, in this paper we crop the LiDAR BEV image along the x and y axes of the LiDAR, leaving a square area of [-37.5 m, 37.5 m]. Then the retained LiDAR data is rasterized by orthographic projection with a resolution of 0.1 m, so the processed image is 750×750 pixels. During the processing, the highest z value in each grid is stored as the grayscale of the pixel, and its x, y values are also stored to provide information for subsequent pose optimization. This process is equivalent to a linear transformation of the x and y coordinates of the LiDAR frame. The projection transformation is indicated in (1).
For areas without LiDAR reflections, the gray level is uniformly set to 255. After the whole image is rasterized, the grayscale is linearly stretched: the height value, originally a floating-point number, is linearly transformed into the range [0, 255]. In (2), the function clamp limits its input to the range [h_min, h_max]. The output is the grayscale of the pixel on the image, in the range [0, 255], while the input is the maximum height value in each grid; a and b are the coefficient and bias of the linear transformation.
After converting the LiDAR BEV into an image, it is further processed by Gaussian blur to fill the holes in each neighborhood of the image.
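The pre-processing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the clamp limits h_min/h_max and the blur sigma are assumed values, and ground segmentation is presumed to have been done beforehand.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def rasterize_bev(points, extent=37.5, resolution=0.1,
                  h_min=-2.0, h_max=5.0):
    """Rasterize non-ground LiDAR points into a BEV grayscale image.

    points: (N, 3) array of x, y, z in the LiDAR frame.
    Each grid cell keeps the maximum z value, which is clamped to
    [h_min, h_max] and linearly stretched to [0, 255]; cells without
    reflections are set to 255, as in the paper.  h_min and h_max are
    assumed values, not taken from the paper.
    """
    size = int(2 * extent / resolution)          # 750 px for 0.1 m cells
    height = np.full((size, size), -np.inf)

    # Crop to the square area [-extent, extent] on both axes.
    mask = (np.abs(points[:, 0]) < extent) & (np.abs(points[:, 1]) < extent)
    pts = points[mask]

    # Orthographic projection: a linear transform of x, y to pixel indices.
    u = ((pts[:, 0] + extent) / resolution).astype(int)
    v = ((pts[:, 1] + extent) / resolution).astype(int)
    np.maximum.at(height, (v, u), pts[:, 2])     # keep max z per grid cell

    # Linear stretch of the clamped height: gray = a * clamp(z) + b, as in (2).
    a = 255.0 / (h_max - h_min)
    b = -a * h_min
    gray = a * np.clip(height, h_min, h_max) + b
    gray[np.isneginf(height)] = 255.0            # empty cells -> 255
    return gaussian_filter(gray, sigma=1.0)      # Gaussian blur fills holes
```

A Gaussian filter keeps values inside the input range, so the output stays within [0, 255].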

V. LIDAR ODOMETRY

A. Feature Extraction and Matching
This part and the following ones correspond to the LiDAR odometry module in Fig. 1. Data association in LiDAR odometry is crucial for accuracy, and the repeatability and reliability of feature points are of vital importance in data association. These attributes of feature points were considered in the design of R2D2-Net. It neglects areas with uniform texture in an optical image, such as the sky, ground, and water surface. Therefore, areas without targets will not be regions of interest to the neural network, and no feature points will be extracted there. The pre-processed LiDAR data is input into the R2D2 network as an image, and feature points are extracted along with descriptors and response values. Similar to the response of a handcrafted feature point, the R2D2 feature point uses the product of the repeatability and reliability from the network as its response; the higher the score, the more reliable the point. After keypoint detection, the points with the top-k strongest responses are retained. Each feature point is described by a 128-dimensional normalized vector, such that the squared elements d_i of the description vector sum to 1. Different from the ORB descriptor, correlation is used to indicate the similarity between two keypoints rather than the Hamming distance. An example of keypoints detected by the R2D2 net is shown in Fig. 2. The background image is obtained by (1) and filtered by a Gaussian kernel. It can be seen from the distribution of the feature points that most points are located near points with large gradients in the grayscale domain, such as "corners" and "edges". In blank areas, only a few feature points are extracted.
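Since the descriptors are L2-normalized, the correlation similarity is simply a dot product. The following sketch shows a possible matching rule on that basis; the mutual-nearest-neighbor check and the similarity threshold are illustrative assumptions, not details from the paper.

```python
import numpy as np

def match_descriptors(desc_a, desc_b, min_similarity=0.9):
    """Match L2-normalized descriptors by correlation (dot product).

    desc_a: (Na, 128), desc_b: (Nb, 128), each row normalized so the
    squared elements sum to 1.  Returns (i, j) pairs that are mutual
    nearest neighbors with similarity above min_similarity (an
    assumed threshold).
    """
    sim = desc_a @ desc_b.T                 # correlation matrix, in [-1, 1]
    nn_ab = sim.argmax(axis=1)              # best b-index for each a
    nn_ba = sim.argmax(axis=0)              # best a-index for each b
    matches = []
    for i, j in enumerate(nn_ab):
        # Keep only mutual nearest neighbors that pass the threshold.
        if nn_ba[j] == i and sim[i, j] >= min_similarity:
            matches.append((i, int(j)))
    return matches
```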

B. Two-step Pose Estimation
In traditional LiDAR odometry or visual odometry, a uniform motion model is often used as the motion assumption [6], [32]. According to the poses of the previous two frames, the initial relative pose of the current frame is obtained by linear extrapolation as in (3).
Here T^b_a is the relative transformation from frame a to frame b, including rotation and translation. The superscript w denotes the world (map) frame, and the subscript k is the timestamp. T̃^{k+1}_k denotes the initial guess of the relative pose between frame k and frame k+1, and T̂^w_k is the posterior estimate of the pose of frame k relative to the map frame w. After extracting the feature points of each frame, in order to avoid brute-force searching and matching over the whole image, a KDTree [33] is used to store the pixel coordinates of the feature points for searching their correspondences. In the first search, since the constant-velocity assumption is not always suitable for real vehicle motion, there is greater uncertainty in the pose extrapolation of the current frame. Therefore, a larger search radius is adopted, and to avoid mismatches, a stricter descriptor distance threshold is set to confirm the correspondences. In the second step, a smaller search radius with a looser descriptor threshold is used to add more reliable feature points for long-range tracking. Then the pixel coordinates of each feature point are reprojected into the local LiDAR frame by the projection matrix K to search for the nearest original point cloud. These coordinates enter the cost function for pose optimization. By minimizing the distance between the reserved map points and the map points predicted by the current LiDAR pose, as in (4), the optimized pose can be computed.
Here R and t are the rotation and translation of the current LiDAR frame with respect to the map frame w, p^w_i is the reserved map point, and p^l_i is the i-th feature point in the current LiDAR frame corresponding to p^w_i.
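The two-step association can be sketched as below. This is an illustration under stated assumptions: the (radius, threshold) pairs are invented values, descriptors are compared by correlation, and the pose re-optimization that the paper performs between the two steps is omitted for brevity.

```python
import numpy as np
from scipy.spatial import cKDTree

def two_step_associate(map_xy, map_desc, cur_xy, cur_desc,
                       steps=((5.0, 0.95), (1.5, 0.85))):
    """Two-step association of current feature points with map points.

    Step 1 uses a large search radius with a strict descriptor
    threshold; step 2 uses a smaller radius with a looser threshold.
    The (radius, threshold) values here are assumptions for
    illustration.  map_xy / cur_xy are (N, 2) coordinates; the
    descriptors are L2-normalized and compared by correlation.
    Returns a dict {current_index: map_index}.
    """
    tree = cKDTree(map_xy)                   # KDTree over map-point positions
    matches = {}
    for radius, thr in steps:
        # In the full algorithm the pose is re-optimized after step 1,
        # which re-centers the step-2 search; omitted here.
        for i in range(len(cur_xy)):
            if i in matches:                 # already associated in step 1
                continue
            best, best_sim = None, thr
            for j in tree.query_ball_point(cur_xy[i], r=radius):
                sim = float(np.dot(cur_desc[i], map_desc[j]))
                if sim >= best_sim:          # keep the most similar candidate
                    best, best_sim = j, sim
            if best is not None:
                matches[i] = best
    return matches
```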

C. Map Management
The position of the first frame is initialized as (0, 0) on the XY-plane, and the heading is 0°. All feature points extracted by the network are recovered to the original LiDAR measurements as stated in Section IV, and their global coordinates are retained as map points. In the subsequent odometry calculation, for feature points that are successfully matched, the observation information, including the index of the LiDAR frame and the global pose of the LiDAR, is associated with the corresponding map points for backend optimization. When a key frame is inserted, the LiDAR points corresponding to all the feature points in the frame are transformed into the global coordinate system through the estimated LiDAR pose and marked as map points for subsequent matching.

D. Key Frame Selection and Backend Optimization
When the number of tracked inliers is less than 100 points, or there are more than 5 frames between the last key frame and the current frame, a key frame is inserted. In the backend optimization, bundle adjustment is used to optimize the 5 reserved active key frames and the associated feature point observations. The positions of the map points and the poses of the LiDAR key frames are optimized in the same problem as several parameter blocks, by minimizing the distances between the feature points in the LiDAR frames and the associated map points. In (5), p^w_{i,j} is the map point of the j-th feature point observed in the i-th frame, and R_i and t_i represent the rotation and translation of the LiDAR in frame i with respect to the global frame.
To bound the scale of the backend optimization, we reserve 5 active key frames; observations and frames outside the sliding window are removed. In general, when a key frame is about to be inserted, we check whether the current frame meets the motion constraint: the distance to the closest key frame exceeds a certain threshold. If the constraint is not satisfied, the current key frame and the corresponding map point observations are not inserted into the map; otherwise, the earliest inserted active key frame and its map point observations are removed. We use the Ceres [34] library for nonlinear optimization and choose Levenberg-Marquardt as the optimizer.
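The cost in (5) can be sketched as a small joint optimization over SE(2) key-frame poses and 2D map points. This is a toy stand-in for the Ceres setup, not the paper's implementation: SciPy's Levenberg-Marquardt-style solver replaces Ceres, and the first key-frame pose is held fixed here to remove the gauge freedom (an assumption about the formulation).

```python
import numpy as np
from scipy.optimize import least_squares

def se2_transform(pose, pts):
    """Apply an SE(2) pose (x, y, theta) to (N, 2) points."""
    x, y, th = pose
    R = np.array([[np.cos(th), -np.sin(th)],
                  [np.sin(th),  np.cos(th)]])
    return pts @ R.T + np.array([x, y])

def bundle_adjust(poses, map_pts, obs):
    """Sliding-window bundle adjustment sketch for (x, y, theta) poses.

    poses: (F, 3) initial key-frame poses; map_pts: (M, 2) map points;
    obs: list of (frame_idx, map_idx, local_xy) observations.
    Jointly refines poses and map points by minimizing the 2D distance
    between each map point and its observation transformed into the
    global frame, as in (5).  The first pose is fixed as the anchor.
    """
    poses = np.asarray(poses, float)
    F, M = len(poses), len(map_pts)

    def unpack(x):
        ps = np.vstack([poses[:1], x[:3 * (F - 1)].reshape(F - 1, 3)])
        mp = x[3 * (F - 1):].reshape(M, 2)
        return ps, mp

    def residuals(x):
        ps, mp = unpack(x)
        return np.concatenate(
            [se2_transform(ps[fi], np.asarray(local)[None])[0] - mp[mi]
             for fi, mi, local in obs])

    x0 = np.concatenate([poses[1:].ravel(),
                         np.asarray(map_pts, float).ravel()])
    sol = least_squares(residuals, x0, method="lm")  # Levenberg-Marquardt
    return unpack(sol.x)
```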

VI. EXPERIMENTS
We have implemented the proposed LiDAR odometry algorithm and evaluated it on the KITTI [17] dataset. The data is collected from three types of environments, i.e., country, urban, and highway, so the feature points cover a variety of scenes. The KITTI dataset provides 22 sequences of LiDAR data; the 11 sequences from 00 to 10 are "training" data, provided with ground truth translation and rotation. In addition, data collected by a low-resolution LiDAR, the VLP-16, is also processed to validate the generalization of the proposed algorithm through the accuracy of relative motion estimation.

A. Evaluation of LiDAR Odometry
In this paper, the 2D translation on the XY-plane and the heading angle are computed by the proposed algorithm. In the first experiment, we used the Absolute Trajectory Error (ATE) to evaluate the performance of the proposed LiDAR odometry. Following the method of EVO [35], the predicted trajectory is first aligned with the ground truth trajectory directly, with the z value of the estimated trajectory constantly regarded as 0. Then we compute the RMSE (Root Mean Square Error) and STD (Standard Deviation). The results are shown in TABLE I, where the best RMSE and STD values for each sequence are marked in bold; the unit of RMSE and STD is "m".
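After alignment, the ATE statistics reduce to simple per-pose error aggregation. The sketch below assumes the trajectories have already been aligned (EVO performs that step); it only computes the two reported metrics.

```python
import numpy as np

def ate_stats(est_xy, gt_xy):
    """ATE translation statistics after trajectory alignment.

    est_xy, gt_xy: (N, 2) aligned trajectories (z is taken as 0 for
    the 2D estimate, as in the paper).  Returns (RMSE, STD) of the
    per-pose Euclidean errors, the metrics reported in TABLE I.
    """
    err = np.linalg.norm(est_xy - gt_xy, axis=1)   # per-pose error in m
    rmse = float(np.sqrt(np.mean(err ** 2)))
    std = float(np.std(err))
    return rmse, std
```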
The proposed algorithm outperforms the baseline method [36] even in scenes with loop closure, such as Seq. 06. As shown in Fig. 3, scale errors appear in the trajectory of the ORB feature-based laser SLAM [36] in Fig. 3b, and loop closure detection and optimization made only a small contribution to reducing the error. Since the raw data of the baseline is not available, the right part of the figure is adopted from the original literature [36], with the legend modified to label "Ali's method". All trajectories shown are aligned with the ground truth.
This part of the experiment proves the advantage of combining deep learning-based feature extraction and two-step pose estimation in an entire LiDAR odometry solution.

B. Performance comparison between ORB and R2D2-Net
To demonstrate the contribution of deep learning-based feature extraction, we compared the multi-frame tracking performance of the two kinds of feature points. In the experiment, the frame interval is set to 10 or 20 frames to demonstrate the robustness of feature extraction against viewing angle changes caused by long movement distances. We define the performance indicator as the average number of matching inliers: the more points are tracked, the better the performance. In this experiment, the number of keypoints extracted by the R2D2 net is no more than the number extracted by ORB, and the raw LiDAR data is preprocessed in the same way. Similar to [12], [13], we set 3 pixels as the radius to search for correspondences.
First, the pixel coordinates of the feature points in the k-th frame are transformed into the target frame (the (k+10)-th or (k+20)-th frame) by the ground truth and the projection matrix. Second, the nearest feature point in the feature domain is regarded as the correspondence; if its distance in the feature domain satisfies the threshold, it is considered an inlier. Finally, the number of inliers is counted in each pair of frames to calculate the average number of inliers for each sequence.
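The inlier-counting procedure above can be sketched as follows. The descriptor similarity threshold is an assumed value, and the ground-truth warp is presumed to have been applied to the frame-k keypoints before the call.

```python
import numpy as np
from scipy.spatial import cKDTree

def count_inliers(kp_a_warped, desc_a, kp_b, desc_b,
                  px_radius=3.0, sim_thr=0.9):
    """Count multi-frame tracking inliers between two frames.

    kp_a_warped: (Na, 2) frame-a keypoints already reprojected into
    frame b using the ground-truth pose and projection matrix.
    Among the frame-b keypoints within px_radius pixels, the most
    similar descriptor (correlation of L2-normalized vectors) is
    taken as the correspondence; it counts as an inlier if the
    similarity passes sim_thr (an assumed threshold).
    """
    tree = cKDTree(kp_b)
    inliers = 0
    for i, p in enumerate(kp_a_warped):
        cand = tree.query_ball_point(p, r=px_radius)   # 3-px search radius
        if not cand:
            continue
        sims = desc_a[i] @ desc_b[cand].T              # correlation scores
        if sims.max() >= sim_thr:
            inliers += 1
    return inliers
```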
The following figure shows the average number of inliers for each sequence using the two kinds of feature points at 10- and 20-frame intervals, respectively. The green triangles show the number of inliers detected by the R2D2 net, and the blue dots indicate the inliers detected by ORB. The results show that R2D2 has a distinct advantage at the 20-frame interval; only in Seq. 06 and Seq. 10 is the performance of the R2D2 feature points slightly worse at the 10-frame interval. This means that more of the feature points extracted by the R2D2 net can be tracked in the long run than those detected by ORB, which helps to improve the accuracy of mapping and bundle adjustment in LiDAR odometry. Going one step further, we analyze why the feature points extracted by R2D2-Net outperform the ORB feature points.
There might be three reasons why the deep learning-based feature extraction method works well on the BEV image of LiDAR data. 1) First, optical images contain richer and more complex patterns than the BEV images of LiDAR; trained on a large number of optical images, the filters in the network can represent the feature space properly. 2) Second, the 128-dimensional floating-point descriptors inferred by the network describe the keypoints more powerfully than the 256-bit descriptors of the ORB feature. 3) Third, the multi-layer convolutional network has a large receptive field that captures global features, making the feature points distinguishable.

C. Performance comparison between RANSAC and two-step pose estimation
To validate the effectiveness of the two-step strategy, a contrast experiment between RANSAC and two-step pose estimation was performed. In the feature correspondence process, 100 iterations are set for the RANSAC pose estimation. Except for the feature correspondence and motion estimation, all other parts of the LiDAR odometry scheme are the same. The results are shown in TABLE II. As mentioned in Section V-D, key frames are inserted when the number of tracked inliers is less than 100. The two-step method outperforms the RANSAC strategy in all sequences, with fewer key frames and more observations for the feature points. As a result, a more accurate odometry can be obtained with more independent observations of the same feature point, according to [37], which proves its effectiveness. The tracking length of the feature points is also analyzed through the number of key frames in each sequence: the fewer key frames inserted, the longer the feature points are tracked. The contrastive results are shown in Fig. 6. With two-step pose estimation, fewer key frames are inserted in each sequence, indicating that more keypoints are tracked for longer. The blue bars show the total number of key frames with RANSAC pose estimation, and the orange bars show the result of two-step pose estimation. This part of the experiment verifies that the two-step pose estimation strategy helps to improve the tracking distance of the feature points as well as the accuracy of the LiDAR odometry.

D. Evaluation on Velodyne VLP-16 dataset
In addition to the KITTI dataset, we also test the generalization of the proposed algorithm on low-resolution LiDAR data. We conducted data collection in the Wuhan Research and Innovation Center, Wuhan, in January 2021. The street view of the park is shown in Fig. 7; the scene is similar to the urban environments in the KITTI dataset. The data collection setup includes a quasi-navigation-level INS (Inertial Navigation System), the LEADOR POS-A15, with a gyroscope bias stability of 0.027°/h and an accelerometer bias stability of 15 mGal. GNSS RTK was also provided in the form of positioning results. This equipment is mounted on a Haval H6 SUV, whose height is about 1.7 m. An Extended Kalman Filter with loose coupling is utilized to obtain a centimeter-level reference trajectory. The setup of the data collection system is shown in Fig. 8.
The data sequence contains 8399 LiDAR frames and lasts 14 minutes. The park is nearly a square with a span of about 400 m in both directions. Through EVO trajectory alignment and evaluation, the RMSE is 4.70 m and the STD is 1.55 m, on the same level as the results from the KITTI dataset. There is no obvious drift even without loop closure optimization, which proves the effectiveness and generalization of the proposed algorithm. The result is shown in Fig. 9. Different from the Velodyne HDL-64, whose vertical field of view is 26.9°, the Velodyne VLP-16 collects 16 scan lines in the elevation range of [-15°, 15°], 30° in total. Targets at different heights will be perceived by these two types of LiDAR even if the sensors are mounted at the same height. With a robust feature extraction method and a proper pose estimation strategy, the performance does not degrade seriously. The quantitative results of the experiment conducted in the Wuhan Research and Innovation Center show that the keypoint detection strategy is practical on low-resolution LiDAR, and the statistics of the LiDAR odometry on this dataset are on the same level as those on the KITTI dataset.

VII. CONCLUSION
In this paper, we propose a 2D LiDAR odometry algorithm based on the BEV image, combining deep learning-based feature extraction and a two-step strategy for pose estimation. The RMSE of the trajectories shows that it outperforms the corresponding ORB-based LiDAR SLAM on the KITTI dataset even without loop closure; the positioning RMSE is reduced by 2% to 80% compared with the baseline method. The feature point tracking experiments show the effectiveness of the deep learning-based feature extraction and the two-step strategy. In addition, low-resolution LiDAR data collected by us with a Velodyne VLP-16 validates the generalization of the proposed method, with an RMSE of 4.70 m and an STD of 1.55 m over about 5 km of mileage. Future research will focus on fusing information from other sensors, such as GNSS, IMU, and wheel encoders, to improve the accuracy and adaptability in more complex environments.

APPENDIX I ODOMETRY RESULTS
In Fig. 10, we show all trajectories and their evaluations by EVO [35] on KITTI training dataset to provide an intuitive result.