POU-SLAM: Scan-to-Model Matching Based on 3D Voxels

Purpose: Localization and mapping with LiDAR data is a fundamental building block for autonomous vehicles. Though LiDAR point clouds can often encode the scene depth more accurate and steadier compared with visual information, laser-based Simultaneous Localization And Mapping (SLAM) remains challenging as the data is usually sparse, density variable and less discriminative. The purpose of this paper is to propose an accurate and reliable laser-based SLAM solution. Design/methodology/approach: The method starts with constructing voxel grids based on the 3D input point cloud. These voxels are then classified into three types to indicate different physical objects according to the spatial distribution of the points contained in each voxel. During the mapping process, a global environment model with Partition of Unity (POU) implicit surface is maintained and the voxels are merged into the model from stage to stage, which is implemented by Levenberg–Marquardt algorithm. Findings: We propose a laser-based SLAM method. The method uses POU implicit surface representation to build the model and is evaluated on the KITTI odometry benchmark without loop closure. Our method achieves around 30% translational estimation precision improvement with acceptable sacrifice of efficiency compared to LOAM. Overall, our method uses a more complex and accurate surface representation than LOAM to increase the mapping accuracy at the expense of computational efficiency. Experimental results indicate that the method achieves accuracy comparable to the state-of-the-art methods. Originality/value: We propose a novel, low-drift SLAM method that falls into a scan-to-model matching paradigm. The method, which operates on point clouds obtained from Velodyne HDL64, is of value to researchers developing SLAM systems for autonomous vehicles.


I. INTRODUCTION
Reliable localization of autonomous vehicles is an attractive research topic since it is a basic requirement for navigation and other tasks [1,2,3,4,5].Fully autonomous vehicles should reliably locate themselves, ideally by using only their own sensors on board without relying on external information sources such as GPS.In scenarios where such external information is absolutely unavailable, Simultaneous Localization And Mapping (SLAM) can always be an alternative option.
SLAM has been intensively studied during the past few decades and various solutions based on different sensors have been proposed for both indoor and outdoor environments.In outdoor scenes, the advantage of LiDAR with respect to cameras is that the noise associated with each measurement is independent of the distance itself and usually more robust to the illumination variation.Therefore, laser-based SLAM is becoming one of the mainstream solutions in outdoor scenes.Most laser-based SLAM methods tend to extract stable features such as planes and edges from points, and then do SLAM in a 'feature space'.They are therefore categorized as featurebased SLAM, whose performance is mainly determined by two factors.The first factor is the way features are designed.Ji Zhang et.al [6,7] propose a planar and edge feature designing method based on curvature.Michael Bosse et.al [8] incorporate shape information which is calculated from geometric statistics of the point cloud into the Iterative Closest Point (ICP) correspondence step.The second factor is the scan matching method.Scan-to-scan and scan-to-model matching are the two main scan matching frameworks in SLAM.Sen Wang et.al [9] propose a scan-to-scan odometry trajectory estimation method by using convolutional neural networks to process LiDAR point clouds.KinectFusion [10] implements scan-tomodel matching by using an implicit surface as model.The state-of-the-art laser-based SLAM method, Lidar Odometry And Mapping (LOAM) [6] extracts distinct features corresponding to physical surfaces and corners.To enable on-line implementation, LOAM switches between scan-to-scan at 10 Hz and scan-to-model operation at 1 Hz update frequency.Although LOAM has achieved good performance, there are still challenges in outdoor applications using 3D laser sensors, i.e., (1) inherent point matching error since the sparsity of the point clouds, which means that there are always existing errors when we directly use the sparse point cloud to fit the surface of environment.Furthermore, it is nearly impossible to obtain rigid point correspondences between scans; (2) it is hard to fix the matching parameter since the density variation of the point clouds.During the feature-based matching process, it is required to extract enough context information of one 3D point to make it discriminative, while the scope of the dominating context is greatly affected by the point density.In this paper, to tackle the above challenges, we introduce POU implicit surface representation to regress the environment surface with sparsity points, which can effectively encode the detailed geometrical information.We discretize the point cloud into 3D voxels.These voxels are classified into planar and edge feature voxels according to the geometrical character of the points contained in the voxel.Finally, we build a feature voxel map for further scan-to-model matching, during which the POU implicit surface representation is proposed to adapt to the voxel map.The key contributions of our work are as follows: (1) We propose a novel feature voxel map which stores voxels with salient planar or edge features.
(2) We propose a scan-to-model matching framework using POU implicit surface representation.The rest of this paper is organized as follows.In Sect.II, we discuss related work and conclude how our work is different from the state of the art.The feature extraction algorithm is presented with details in Sect.III, and the scan-to-model matching algorithm is given in Sect.IV.Experimental results are shown in Sect.V. Finally, Sects.VI concludes the paper.

II. RELATED WORK
SLAM with cameras and LiDAR has attracted wide attention both from robotic and computer vision communities.We acknowledge the large body of work in this field, but concentrate here only on approaches based on LiDAR.Since feature extraction, laser-based mapping and scan-matching framework are the three most important steps for laser-based SLAM, we therefore look into related work on these three aspects and provide general ideas on how they work as following.
Feature extraction: LOAM focuses on edge and planar features in the LiDAR sweep and keep them in a map for edge-line and plane-planar surface matching.Tixiao Shan et.al [11] also use edges and planar features to solve different components of the 6-DOF transformation across consecutive scans.While these works extract features based on curvature, there is still difference like the method from [11] is groundoptimized, as it leverages the presence of a ground plane in its segmentation and optimization steps.Using point cloud obtained from a high resolution and high density Zoller-Fröhlich Z+F laser, J Lalonde and Yungeun Choe et.al [12,13] exploit point cloud geometrical statistics to classify the natural terrain into scatter-ness, linearity, and surface-ness.Using point cloud from a SICK LMS291 laser range sensor, Michael Bosse et.al [8] implement scan-matching based on point cloud geometrical statistics features.They use ICP to align voxel centroids, which can reduce the number of points for scan matching as well as the size of the map.Different from [8], we consider the geometric features of the points more subtly and implement scan matching with original points, which would not lead to information loss.
Laser-based mapping: SLAM systems often build and maintain environment map like feature map [6,7,11,14,15], dense map [16,17], subsampled clouds [18,19] and Normal Distributions Transform (NDT) based map [20,21].In laserbased SLAM systems, feature map is usually a structural collection of surfaces and corners that are extracted from original point clouds.In this paper, we build a feature map, which can be distinguished from literature methods in the way that we maintain more types of features in the form of voxels.
Scan matching framework: Recently, some laser-based SLAM systems use scan-to-model matching framework and have achieved state-of-the-art performance on KITTI odometry benchmark.Jens Behley et.al [16] propose a dense approach to laser-based SLAM that operates on point clouds obtained from Velodyne HDL-64.They construct a surfel-based map and use a point-to-plane ICP [22,23] to estimate the pose transformation of the vehicle.Our method essentially belongs to scan-to-model matching framework.The distinction is that we build a feature voxel map and use POU implicit surface representation to adapt to the scan-to-model matching process based on feature voxel map.
Partition of unity approach: The POU approach has been widely used in surface reconstruction.Ireneusz Tobor et.al [24] show how to reconstruct multi-scale implicit surfaces with attributes, given discrete point sets with attributes.Tung-Ying Lee et.al [25] propose a new 3D non-rigid registration algorithm to register two multi-level partition of unity implicit surfaces with a variational formulation.

III. FEATURE EXTRACTION
Similar to [26], the scan-to-model matching is triggered when there are salient geometrical features in local regions.So, it is important to find these regions first.In contrast to 3D object classification [27,28] and place recognition [29,30], which use complex point cloud descriptor to segment and classify objects, we use a simpler shape parameterization because our task focuses on incremental transformation for which we have a stronger prior on the relative poses.

A. Computing Shape Parameters
A local region that is occupied by the input point cloud is repeated divided until the number of points fall into each voxel is equal to the threshold N p .The threshold N p is related to the horizontal and vertical resolution of Lidar point cloud.In the process of feature extracting, the densities of features in corresponding voxels vary.We consider the density while searching for the correspondences of the features.The lower the density of the features, the lower the probability of finding enough correspondences within a certain radius is.Our experiment results are achieved when N p is set to 25.Each voxel contains points that fall into it and voxel centroid.The points that fall into voxel are defined as {X i } = (x i , y i , z i ) T .The first and second-order moments µ, S describe the parameters for spatial distribution of the points {X i }: Inspired by [8], the matrix S is decomposed into principal components ordered by increasing eigenvalues.− → e 0 , − → e 1 , − → e 2 are eigenvectors corresponding to the eigenvalues λ 0 , λ 1 , λ 2 respectively, where λ 0 ≥ λ 1 ≥ λ 2 .In the case that the structure of points in a voxel is a linear structure, the principal direction will be the tangent at the curve, with λ 0 λ 1 ≈ λ 2 .In the case that the structure of points in a voxel is a solid surface, the principal direction is aligned with the surface normal with λ 0 ≈ λ 1 λ 2 .The two saliency features, named linear feature and surface feature, are linear combinations of the eigenvalues.Figure 2 illustrates the two features used.The quantity: is the linear feature of the points in voxel which ranges from 0 to 1. Similarly the quantity: is the surface feature of the points in voxel which ranges from 0 to 0.5.

B. Voxel Sampling Strategy
Essentially, the voxel based scan matching process is to construct the associations between voxels.It is believed that voxels with high quantity of linearity or planarity tend to be more stable than others.So for each input point cloud, we classify its 3d grids according to the proposed quantities into three types: edge voxel, planar voxel and others.Furthermore, it should be noticed that the ground points usually contain more planar features and non-ground points usually contain both planar and edge features.With this bears in mind, we first divide the whole point cloud into ground and non-ground segments using the method proposed in [31].Figure 3(a) is an example of the ground segmentation result.Then, for nonground points, we extract planar and edge features, while only planar features are extracted for ground points.
In each scan which is now represented as the voxel centroids, we sort the voxels based on their linearity values c and surface-ness values p and get two sorted lists.Two thresholds c th and p th are then employed to distinguish different types of features.We call the voxels with c larger than c th edge features, and the voxels with p larger than p th planar features.Then n Fe edge features with the maximum c are selected from each scan.Non-ground and ground planar features with the maximum p are also selected, and the numbers of the two types of features are n Fp and n Fgp , respectively.Edge and planar features extracted from all the scans are denoted as F e , F p and F gp thereafter.Visualization examples of these features are given in Figure 3(b)-(d).

IV. SCAN-TO-MODEL MATCHING WITH POU IMPLICIT SURFACE REPRESENTATION
Once the currant laser scan is transformed into a set of stable voxels, we implement scan-to-model matching process.In our method, the map consists of the last n located feature sets F e , F p and F gp .Let be the map that contains planar features and edge features at time k.To better fit the observed surface with these points and adapt to the feature voxel map, we take the POU implicit surface representation to construct the model.

A. Finding Feature Point Correspondence
In the curret scan, each point is labeled according to the type of its corresponding voxel.During the matching process, we construct the correspondences between voxels in the map and the current feature points with the guidance of labels and Euclidean distance.Since such correspondences generating mechanism narrows down the potential candidates for correspondences, it can help to improve the matching accuracy and efficiency.The detailed correspondence construction process is given below.
For each current edge point in feature set F e , we search for n e voxels of inside a sphere space whose radius is r e .For each corresponding voxel, we find an edge line as the correspondence of the current edge point.After finding all the correspondence edge lines, we can get the projection points S e of the current edge point on the corresponded edge lines.Then we use linear quantity c to determine whether S e can be represented as an edge line.If S e can be represented as an edge line, it is then regarded as the correspondence of the current edge point.After the edge feature correspondences are obtained, we compute the distance from an edge feature point to its correspondence.The procedure of finding an edge line and the distance computation can be found in [6].
For each current planar point in feature sets F p and F gp , we respectively find n p and n gp voxels of and inside a sphere space whose radius is r p as the corresponding voxels.Then, we use the corresponding voxels to construct the POU implicit surface representation of the model, and compute the distance from a planar feature point to its implicit surface representation of the model.The detail of building the POU implicit surface representation and compute the distance from a planar feature point to its implicit surface representation of the model will be discussed below.

B. POU Implicit Surface Representation For Planar Features
1) POU Approach: The basic idea of the POU approach is to break the data domain into several pieces, approximate the data in each subdomain separately, and then blend the local solutions together using smooth, local weights that sum up to one everywhere in the domain.More specifically, we assign a specific planar patch as the correspondence of a planar feature point in each correspondence voxel.The procedure of finding the specific points to form planar patches P c can be found in [7].We can calculate distances d between the planar feature point and the corresponding planar patches P c .Then, we project planar feature point x on the surface of planar patches defined by p s : p si = x − d i n i , where n i is the normal of the closest point to x in each planar patch and is a good approximation of the surface normal at the projected point p si .Each planar patch can be regarded as a subdomain.Then, we use the weight function to blend the subdomains together.
2) POU Implicit Surface Representation: The implicit surface is defined as the set of zeros of a function I(x), which also behaves as a distance function from x to the implicit surface.In this work, we use the POU implicit surface representation.
Similar to the POU implicit surface framework in [32], we define the POU implicit function I M k (x) as an approximate distance of planar feature point x ∈ R 3 at time k to the POU implicit surface defined by the map M k : where p si is the projected points on the corresponding planar patch s i , − → n si is the normal at point p si .For approximation, we use the quadratic B-spline b(t) to generate weight functions for planar patches.
where c si is the geometric center of planar patch s i and R si = max x∈si x − c si is a spherical support radius of the planar patch s i .The function is proportional to the distance between x and c si and inversely proportional to R si .It means that the closer c si to the x, the higher weight will be allocated.And the more scattered the points which form the planar patch are, the higher the weight.In addition to computing the distance from the feature point to the model surface, the surface normal at the projected point of the feature point x should also be computed.
In this work, we use the normal of the closest point p s closest to x as the approximation of the surface normal at the projected point.Compared with the method using only one plane to represent a surface (could be cylinder or sphere etc.), we adopt multiple local planes with various planarities to represent a surface which can achieve higher accuracy and more accurate distances from feature points to their correspondences.
), where d 1 and d 2 are the distances from the point x to the centers of the cell A and B, respectively.
3) Motion Estimation: With distances from feature points to their correspondences in hand, we assign a bisquare weight [7] for each feature point.The rules are twofold.In general, when the distance between the feature point and its correspondence is below a certain threshold, the weight is assigned inversely proportional to the distance.However, when the distance is greater than the threshold, the feature point is regarded as an outlier.We proceed to recover the LiDAR motion by minimizing the overall distances.Find an POU implicit surface as the correspondence, then compute point to implicit surface distance based on (9) and stack the equation to (12) for each ground planar feature in F k gp do Find an POU implicit surface as the correspondence, then compute point to implicit surface distance based on (9) and stack the equation to (12) for each edge feature in F k e do Find an edge line as the correspondence, then compute point to line distance and stack the equation to (12) Compute a bisquare weight for each row of (12) Update T for a nonlinear iteration based on Levenberg Marquardt method if the nonlinear optimization converges then Break; First, the following equation is used to project feature point x to the map, namely x, in which, R and τ are the rotation matrix and translation vector corresponding to the pose transform T between current scan and the map.By combining the distances and weights from feature points to their correspondences and equation (11), we derive a geometric relationship between feature points and the corresponding model, where each row of f corresponds to a feature point, and d contains the corresponding distances.Finally, we obtain the LiDAR motion with the Levenberg-Marquardt method [33].
We do scan matching between the current scan with the last n scans, and the final result is obtained by aggregating the successfully matched n m scans.Literally, by matching the current scan with the historical n scans, the error propagation problem can be suppressed.As for blunder, we limit the scale of the relative transformation between the current scan and the model.When the blunder occurs, we will abandon the result of the current scan and subsequent scan-matching implementation would be slightly affected.After computing the transformation between the scan and model, we add feature voxels corresponding to current scan to the map and remove the feature voxels corresponding to the oldest scan to always keep n scans for scan matching.

V. EXPERIMENTAL EVALUATION A. Tests On The Public Dataset KITTI
We evaluate our method on the KITTI odometry benchmark [34], where we use point clouds from a vertical Velodyne HDL-64E S2 mounted on the roof of a car, with a recording rate of 10Hz.The dataset is composed of a wide variety of environments (urban city, rural road, highways, roads with a lot of vegetation, low or high traffic, etc.).The LiDAR measurements are de-skewed with an external odometry, so we do not apply ego-motion algorithm to this dataset.Table I shows the results of our method for all sequences in detail.The relative error is evaluated by the development kit provided with the KITTI benchmark.The data sequences are split into subsequences of 100, 200,. . .,800 frames.The error e s of each subsequence is computed as: where E s is the expected position (from ground truth) and C s is the estimated position of the LiDAR where the last frame of subsequence was taken with respect to the initial position (within given subsequence).The difference is divided by the length l s of the followed trajectory.The final error value is the average of errors e s across all the subsequences of all the lengths.We include here for comparison the reported results of LOAM, a laser-based odometry approach that switches between scan-to-scan and scan-to-model framework, and the reported results of SUMA [16] a laser-based odometry approach using scan-to-model framework.We can see that the proposed method yields results generally on par with the state-of-theart in laser-based odometry and often achieves better results in terms of translational error.Overall, we achieve an average translational error of 0.61% compared to 0.84% translsational error of LOAM on KITTI odometry benchmark.We can see that the proposed method yields results generally on par with the state-of-the-art in laser-based odometry and often achieves better results in terms of translational error.

B. Discussion Of The Parameter And Performance
In our experiment, we define the thresholds c th and p th based on the KITTI Odometry sequence 01.Laser scans contained in this dataset are collected from a highway, the scenario where there is only very few distinct structural features could be used for scan matching.To extract as more as possible features to accomplish the SLAM task, the thresholds should be set to a fairly low level.It is because the smaller the thresholds, features on spheres or cylinders are more likely to be extracted.These sphere and cylinder features themselves are not planar, and if represented by planes directly can bring about errors that would further reduce the accuracy of the downstream algorithms.In this paper, we however blend each planar patch extracted with the weighted function under the implicit surface representation mechanism.This explains why we achieve obvious improvement on this sequence.Therefore, if the thresholds can perform well on such structural feature lacking scenarios, they should be easily transferred to and work in scenarios with richer structural features.For more structured environment such as urban case, the thresholds can be further increased to compromise between the feature richness and the planarity, which promises an accuracy improvement to the whole SLAM algorithm.Table IV shows the influence of c th to the result.When c th is set too large, we can not extract enough features and the result gets worse.Figure 5 shows the variation of the number of non-ground features in different frames on sequence 01 when c th is set to 0.85.The average number of non-ground features is about 7200.The quantity c as shown in (3) of ground features are generally large and we do not discuss it here.
Parameters n e , n p and n gp are set to 3 for the KITTI odometry benchmark.Table III shows the influence of parameter n p .We compare our POU model to the plane model which is fitted as in LOAM.The result of our model is better than plane model and the result reaches the best when n p is set to 3. When n p is set too large, the correspondences are far away from the current scan features and do not contribute to local surface reconstruction.The parameter n is set to 40 for the KITTI odometry benchmark.Table II shows the the influence of parameter n.

C. Discussion Of The Processing Time
The operating platform in our experiment is Intel i7-7820@3.60GHz with 16 GB RAM.Our method is evaluated by processing KTTI datasets rather than being deployed in an autonomous vehicle.With regarding to real time implementation, which heavily depends on the hardware configuration, we cannot assert if our method is real time or not.However, we can still provide some of the parameters so people can know the runtime performance of our method.First, we split the point cloud into voxels which contain certain numbers of points and calculate the linear-ness values and surface-ness value for these voxels.It takes 0.5s.Second, we do scan matching for each feature.The time cost is depending on the number of features.For KITTI Odometry sequence 01, the runtime is about 1.5s.Taking both the above factors into consideration, our SLAM runs at 2s per scan with one thread.For comparison, LOAM runs at 1s per scan on the same KITTI dataset.The reason is that compared to LOAM which extract two-dimensional curvature features, we construct voxel grids and extract features according to the spatial distribution of the points contained in each voxel.This results in better accuracy (refer to Table I) with a slight sacrifice of efficiency.We present a new 3D LiDAR SLAM method that is composed of a new feature voxel map and a new scan-to-model matching framework.We build a novel feature voxel map with voxels including salient shape characteristics.In order to adapt to the proposed map, we implement scan-to-model matching using POU implicit surface representation to blend the correspondence voxels in map together.As experimental results illustrate, the proposed method yields accurate results that are on par with the state-of-the-art.Future work will proceed in two directions.From the research perspective, a specific and efficient octree will be designed to get 3D grid.Meanwhile, we will deploy the method to fulfil real time application with the aid of multiple threads or GPU to accelerate data processing.

Fig. 1 .
Fig. 1.Illustration of the surface feature and linear feature.

Fig. 2 .
Fig. 2. (a) is the result of ground segmentation.In (a), the red points are labeled as ground points and the green points are labeled as non-ground points.(b)(c)(d) are the ground planar features, non-ground edge features and non-ground planar features.

Fig. 3 .
Fig. 3. Illustration of POU implicit surface.The Figure shows the POU implicit surface representation.Two cells, A and B, are associated with their support radius R A and R B , respectively.The value of a point x in the slashed region can be evaluated by I M k (x) =

Fig. 4 .
Fig. 4. Sample results using the KITTI odometry benchmark.The datasets are chosen from three types of environments: highway, country and urban from left to right, corresponding to sequences 01, 03, and 07.In (a)-(c), we compare estimated trajectories of the vehicle to the GPS/INS ground truth.The mapping results are shown in (d)-(f).An image is shown from each dataset to illustrate the environment, in (g)-(i).

Fig. 5 .
Fig. 5. Variation Of the Number of Non-ground features In Different Frames On Sequence 01

TABLE IV THE
INFLUENCE OF c th TO THE RESULT Parameter c th Drift on KITTI training dataset