STV-SC: Segmentation and Temporal Verification Enhanced Scan Context for Place Recognition in Unstructured Environment

Place recognition is an essential part of simultaneous localization and mapping (SLAM). LiDAR-based place recognition relies almost exclusively on geometric information. However, geometric information may become unreliable in environments dominated by unstructured objects. In this paper, we explore the role of segmentation for extracting key structured information. We propose STV-SC, a novel segmentation and temporal verification enhanced place recognition method for unstructured environments. It comprises a range image-based 3D point segmentation algorithm and a three-stage loop detection process: a two-stage candidate loop search followed by a one-stage segmentation and temporal verification (STV) process. The STV process exploits the temporal continuity of SLAM to determine whether a candidate match is an occasional mismatch. We quantitatively demonstrate that the STV process can identify false detections caused by unstructured objects and effectively extract structured objects to avoid outliers. Comparison with state-of-the-art algorithms on public datasets shows that STV-SC can run online and achieves improved performance in unstructured environments (at the same precision, the recall rate is 1.4~16% higher than Scan context). Therefore, our algorithm effectively avoids the mismatches produced by the original algorithm in unstructured environments and improves the environmental adaptability of mobile agents.


Introduction
As the first step towards the realization of autonomous intelligent systems, simultaneous localization and mapping (SLAM) has attracted much interest and made astonishing progress over the past 30 years [1]. Place recognition, or loop closure detection, gives SLAM the ability to identify previously observed places, which is critical for back-end pose graph optimization to eliminate accumulated errors and construct globally consistent maps [2,3]. Benefiting from the popularity of cameras and the development of computer vision, vision-based place recognition has been widely studied. However, cameras inevitably struggle to cope with illumination variance, poor light conditions, and viewpoint change [4]. Compared with cameras, LiDAR is robust to such perceptual variance and provides stable loop closures; thus, LiDAR-based recognition has drawn more attention recently. LiDAR-based place recognition is achieved by encoding descriptors directly from geometric information or from segmented objects, after which similarity is assessed by the distance between descriptors. Multi-view 2D projection (M2DP) [5], bag of words (BOW) [6], scan context (SC) [7], PointNetVLAD [8], and OverlapTransformer [9] extract descriptors from local or global geometric information (3D point clouds). Segmatch [10], semantic graph-based place recognition [11], semantic scan context (SSC) [12], and RINet [13] leverage segmented objects to define descriptors.
In this paper, we define relatively large regular objects as structured objects (buildings, ground, trunks, etc.) and the rest as unstructured objects (vegetation, small moving objects, noise points, etc.). In practice, vegetation is the unstructured class most likely to appear at large scale and obscure structured information, so we mainly consider unstructured scenes dominated by vegetation. One key issue faced by the above methods is that outliers occur when two places show similar features due to large-scale vegetation. As shown in Figure 1, large-scale tree leaves significantly increase the similarity of different places and dilute the influence of other critical objects in the scene, producing similar descriptors for different places. This type of unstructured environment often causes perceptual aliasing and limits the recall rate. Ultimately, the SLAM system becomes severely distorted and the mobile agent cannot perceive the environment correctly. Therefore, designing a place recognition algorithm that is robust in unstructured-dominated environments is of great importance for enhancing the environmental adaptability of autonomous intelligent systems (such as self-driving vehicles and mobile robots) and for advancing autonomous driving, field surveying, etc.

Figure 1: The vegetation on the right side makes the two frames difficult to distinguish. Since the ground-truth distance between these two frames is 148.64 m, they should not be considered a loop closure. Middle figures: colormap of the scan context before segmentation. Bottom figures: segment scan context of the corresponding frames, represented as a colormap. The left side of each colormap shows the preserved buildings, and the empty right side shows that the vegetation has been removed. After segmentation, the two frames become distinguishable: using Scan context directly, the distance between them is 0.1488, yielding a false positive, whereas our segment scan context yields a distance of 0.327, avoiding the outlier.
In [14], segmentation was first proposed to deal with specific conditions, such as forests, and demonstrated potential for removing non-critical information. Inspired by this, we enhance scan context with segmentation to make it suitable for unstructured environments. At the same time, considering the temporal continuity of SLAM and the sporadic nature of outliers, we adopt a piecewise strategy: temporal verification is applied to each candidate loop to decide whether to trigger the re-identification module, thus reducing the time consumption of the whole system.
In this paper, we present segmentation and temporal verification enhanced scan context (STV-SC). We first design a range image-based segmentation method. Next, we explain why segmented point clouds can differentiate between structured and unstructured objects. Then a three-stage search process is proposed to effectively avoid false positives. The STV process checks temporal consistency to determine whether to trigger the re-identification module. If triggered, we segment the point clouds of the matching frames and remove their unstructured objects. Finally, outliers are filtered out by the similarity score recomputed from the segmented descriptors.
The main contributions of this paper are as follows:
• We propose a range image-based 3D point cloud segmentation method introducing both geometry and intensity constraints for unstructured object removal.
• An efficient three-stage loop detection algorithm for fast loop candidate search is proposed, leveraging the STV process for perception aliasing rejection.
• Thorough experiments on the KITTI dataset [15] show that our method outperforms scan context and other state-of-the-art approaches. The algorithm is also integrated into a SLAM system to verify online place recognition ability.
This paper is structured as follows. Section 2 reviews related work on both vision-based and LiDAR-based place recognition. Section 3 introduces the proposed 3D point cloud segmentation algorithm, followed by the segment scan context and the three-stage search algorithm. The experiments and their discussion are presented in Section 4. Finally, Section 5 concludes the paper.

Related Works
Depending on the sensing devices used, place recognition methods can be grouped into vision-based and LiDAR-based approaches. Visual place recognition has been well researched and has made significant advances. Generally, visual approaches represent scene features by extracting multiple descriptors, such as Oriented FAST and Rotated BRIEF (ORB) [16] and the Scale-Invariant Feature Transform (SIFT) [17], to construct a dictionary, and then leverage the bag of words (BOW) [6] model to measure the distance between words belonging to different frames. Recently, learning-based approaches have been used for loop detection [18,19]. NetVLAD [18] designed a new generalized VLAD layer and implemented it in a CNN to achieve end-to-end place recognition. DOOR-SLAM [20] has verified this method in a real-world SLAM system. However, image representations usually suffer performance degradation under illumination and viewpoint changes. To overcome such issues, researchers have developed robust visual place recognition methods [21][22][23] to cope with changing light and seasons. In spite of this, these methods can only handle certain scenes.
Unlike a camera, LiDAR is robust to the environmental changes stated above, while being rotation-invariant. LiDAR-based recognition remains an advanced and challenging problem for laser SLAM systems. LiDAR methods can be further categorized into local descriptors, global descriptors, and learning-based descriptors. The fast point feature histogram (FPFH) [24], keypoint voting [25], and the combination of bag of words and point features [6] are state-of-the-art approaches based on local hand-crafted descriptors. FPFH [24] is computed from key points and their neighbors' underlying surface properties, such as normal and curvature. Through reordering the dataset and caching previously computed values, FPFH reduces run time and applies to real-time systems. Wang et al. [25] proposed a new 3D regional descriptor based on gestalt features, in which key points vote over a certain number of neighbors to compute a similarity score. Bastian et al. [6] used Normal-Aligned Radial Features to build a dictionary for a bag of words model and realized robust key point and scene matching.
However, local descriptors rely on the acquisition of key points and the calculation of geometric features around key points, which usually lose a lot of information and lead to false matching. Especially for unstructured outdoor objects (e.g., trees), key points from such objects are unreliable.
In contrast, global descriptors are independent of key points and leverage the global point cloud. Multi-view 2D projection (M2DP) [5] is a novel global descriptor built from multi-view 2D projections of a 3D point cloud; it is formed from the left and right singular vectors of each projection's density signature. Giseop et al. [7] divided the 3D space into 2D bins and encoded each bin by the maximum height of the points within it. The global descriptor is then represented as a two-dimensional matrix called the Scan context. Frames are matched by computing the cosine distance between scan contexts in a column-wise way. Scan context outperforms existing global descriptors and shows remarkable rotation invariance, which allows it to handle reverse loops. Building on scan context, ref. [26] explored the value of intensity. By integrating both geometry and intensity information, they developed the intensity scan context and showed that intensity can reflect information about different objects. They also proposed a binary search process, which reduces computation time significantly.
In recent years, learning-based methods have gradually been proposed. Segmatch [10] first segments different objects from the original point cloud and then extracts multiple features from each object, such as eigenvalues and shape histograms; a learning-based classifier is finally used to match objects across scenes. Kong et al. [11] leveraged semantic segmentation to build a connected graph from the centers of different objects and used a CNN to match scenes by judging the similarity of the graphs. Refs. [12,27] proposed the semantic scan context, which encodes each bin with semantic information. However, learning-based methods are usually computationally expensive to train and cannot adapt to various outdoor environments due to the limitations of training data.
Global descriptors show excellent performance but still cannot handle ambiguous environments caused by unstructured objects, generating outliers. In this paper, inspired by [14], we utilize segmentation to remove unstructured objects from scenes while retaining global information and key structured objects. We then apply the segmented point clouds to scan context to construct the segment scan context, which makes different places more distinguishable and effectively prevents perceptual aliasing.

System Overview
An overview of the proposed framework is shown in Figure 2. First, the system acquires the original 3D point cloud from the LiDAR and encodes it into a scan context. Then, a sub-descriptor is designed and inserted into a KD-Tree, an indexed tree data structure used for nearest neighbor search in large-scale, high-dimensional data spaces. A fast k-nearest neighbor (kNN) search is then performed to find the nearest candidates in the KD-Tree. Next, by computing the minimum distance between the query scan context and the candidate scan contexts, we can tell whether there is a candidate loop closure. If one exists, our STV process is conducted: temporal verification determines whether to trigger the re-identification procedure. If the temporal verification is met, we consider it a true loop. Otherwise, we segment the original point cloud and use the segment scan context to compute a new distance, which the re-identification procedure uses to judge whether a loop is found. The detailed description of these modules is given below.

Segmentation
The segmentation module includes two submodules: ground removal and object segmentation. Scan context encodes each bin by taking the maximum height, so ground points are usually useless and increase the similarity of different scenes in flat areas. On the other hand, the presence of numerous unstructured objects, such as trees, grass, and other vegetation, covers the structured information and generates similar descriptors for different places. Meanwhile, noise generally does not persist at a fixed position over time, so it appears scattered and forms small-scale objects. Here, we use object segmentation to remove unstructured information from the environment and retain key structured information to prevent mismatches. We denote each frame of the point cloud from the LiDAR as P = {p_1, p_2, ..., p_n}. For fast cluster-based segmentation, the 3D point cloud is projected into an M_r × M_c 2D range image R for point cloud ordering, where M_r = N_scans, the number of laser lines of the LiDAR, and M_c = 360°/Res_h, with Res_h the horizontal angular resolution. Each value of the range image is the Euclidean distance from the sensor to the corresponding 3D point. Then, we use a column-wise approach for ground point evaluation on the range image, as in [28], while leveraging intensity for validation. After removing the ground, we perform a range image-based object segmentation to classify the point cloud into distinct clusters, based on [29] but with some improvements according to the characteristics of LiDAR. Specifically, we integrate geometry and intensity constraints for clustering. A previous study [30] showed that different objects exhibit different reflected intensities. Since intensity can be obtained directly from the LiDAR, it can serve as an additional layer of validation for clustering. We can judge whether two points p_a and p_b belong to the same object O_k by the following expression.
Meanwhile, we set (a_1, a_2) and (b_1, b_2) as their coordinates in the range image, respectively:

p_a, p_b ∈ O_k  ⟺  θ > g  and  |κ(d_a) ψ(p_a) − κ(d_b) ψ(p_b)| < i.    (2)

In (2), as shown in Figure 3, d stands for the range value from the LiDAR to the 3D point, and θ is the angle between the line spanned by p_a and p_b and the longer of OA and OB. g and i are predefined thresholds. Additionally, ψ(p) denotes the intensity of point p and κ is a distance-dependent intensity calibration function, which can be obtained empirically.
Note that, as the first-layer judgment, the geometry constraint plays the major role. As the second layer of validation, intensity prevents objects of different types from being clustered together, i.e., under-segmentation. Moreover, owing to the fixed angle between laser beams, points near the sensor are relatively dense, while points far away are sparse. A fixed geometric threshold therefore cannot balance distant and near points: a large threshold over-segments the distant points, and a small threshold under-segments the nearby points. Thus, in the near region a large g prevents different objects from being grouped together, while in the far region a small g avoids splitting the same object into multiple clusters.
To achieve more accurate segmentation at different distances, we design a dynamic adjustment strategy. The threshold is dynamically adjusted as

g(d) = i_g − q · ⌊d / p⌋,    (3)

where p denotes the step size, q is the decay factor, and i_g stands for the initial value of g. Finally, a breadth-first search based on the constraints in (2) is conducted on the range image for object clustering. The idea behind our segmentation comes from the fact that unstructured objects (mainly vegetation) are filled with gaps, such as those between leaves. When laser beams pass through these gaps, the range difference becomes large, causing large-scale vegetation to be separated into small clusters. Meanwhile, noise also forms small objects. Therefore, we can distinguish structured from unstructured objects by cluster size. In this paper, we treat clusters with more than 30 points or occupying more than 5 laser beams as structured objects. As shown in Figure 4, noise, ground, and vegetation are removed, while structured parts, such as buildings and parked cars, are preserved.
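As an illustration, the projection, dynamic threshold, and BFS clustering described above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the sensor field of view, the linear decay form of g(d), the fixed intensity gate i_thresh, and the use of a single angular step alpha for both image directions are all assumptions.

```python
from collections import deque
import numpy as np

def to_range_image(points, n_scans=64, res_h_deg=0.2, fov_up=2.0, fov_down=-24.8):
    """Project an (N, 3) point cloud into an M_r x M_c range image whose cells
    hold the Euclidean range d. The vertical field of view is a placeholder
    for a Velodyne HDL-64E-like sensor."""
    m_r, m_c = n_scans, int(360.0 / res_h_deg)
    d = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])
    pitch = np.arcsin(np.clip(points[:, 2] / np.maximum(d, 1e-9), -1, 1))
    col = ((yaw + np.pi) / (2 * np.pi) * m_c).astype(int) % m_c
    row = ((np.radians(fov_up) - pitch) / np.radians(fov_up - fov_down)
           * (m_r - 1)).astype(int)
    img = np.zeros((m_r, m_c))
    img[np.clip(row, 0, m_r - 1), col] = d
    return img

def dynamic_g(d, i_g=60.0, p=10.0, q=1.0):
    """Assumed form of Eq. (3): the angle threshold starts at i_g degrees
    and decays by q degrees for every p meters of range."""
    return i_g - q * np.floor(d / p)

def segment_range_image(img, intensity, i_thresh=0.1, alpha=np.radians(0.2)):
    """BFS clustering in the spirit of the geometry + intensity criterion
    (Eq. (2)): neighbours merge when the angle theta exceeds the dynamic
    threshold g(d) and their intensities are close. Returns a label image
    (0 = unassigned)."""
    m_r, m_c = img.shape
    labels = np.zeros((m_r, m_c), dtype=int)
    label = 0
    for r0 in range(m_r):
        for c0 in range(m_c):
            if labels[r0, c0] or img[r0, c0] == 0:
                continue
            label += 1
            labels[r0, c0] = label
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rn, cn = r + dr, (c + dc) % m_c   # azimuth wraps around
                    if not (0 <= rn < m_r) or labels[rn, cn] or img[rn, cn] == 0:
                        continue
                    d1 = max(img[r, c], img[rn, cn])  # longer of OA, OB
                    d2 = min(img[r, c], img[rn, cn])
                    theta = np.degrees(np.arctan2(d2 * np.sin(alpha),
                                                  d1 - d2 * np.cos(alpha)))
                    if (theta > dynamic_g(d1)
                            and abs(intensity[r, c] - intensity[rn, cn]) < i_thresh):
                        labels[rn, cn] = label
                        queue.append((rn, cn))
    return labels
```

Because a flat surface yields a near-90° angle while a range jump through a vegetation gap yields a small one, contiguous structured surfaces merge and gappy vegetation shatters into small clusters, as the text describes.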

Segment Scan Context
Scan context [7] encodes the scene with the maximum height and represents it as a 2D image. Figure 5a is the top view of the original point cloud. Taking the LiDAR as the center, N_r rings are divided equidistantly in the radial direction, and N_s sectors are divided by equal angles in the azimuth direction. The areas where rings and sectors intersect are called bins, and each bin is represented by the maximum height of the point cloud within it. Therefore, we can project the 3D point cloud into a 2D matrix of size N_r × N_s, called the scan context. Let L_max represent the maximum sensing range of the LiDAR; then the radial gap of the rings and the angular gap of the sectors are L_max/N_r and 2π/N_s, respectively. By adjusting them, we can set the resolution of the scan context. However, since scan context uses the maximum height as its only encoding, it usually suffers perceptual aliasing when facing large-scale unstructured objects: trees on both sides of a road, for example, usually have the same height. Therefore, when encountering scenes dominated by unstructured objects, we keep only the key structured information obtained via point cloud segmentation. Denoting the point cloud of a segmented LiDAR scan as P^seg, the segment scan context D is expressed by

D(i, j) = φ(P^seg_ij),

where P^seg_ij is the set of points in the bin with ring index i and sector index j, and φ denotes the function that returns the maximum height of all points in the bin. In particular, if there is no point in a bin, its value is set to zero. A visualization of our segment scan context is shown in Figure 5b. After segmentation, the descriptor exhibits discrete blocks representing different structured objects.
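The encoding above can be sketched as follows; the parameter values (N_r = 20, N_s = 60, L_max = 80 m) are illustrative assumptions, not necessarily the paper's exact settings.

```python
import numpy as np

def segment_scan_context(points_seg, n_r=20, n_s=60, l_max=80.0):
    """Encode a segmented (N, 3) point cloud into an N_r x N_s matrix D:
    each bin stores the maximum height (phi) of the points falling in it,
    and 0 when the bin is empty."""
    rho = np.hypot(points_seg[:, 0], points_seg[:, 1])          # radial distance
    theta = np.mod(np.arctan2(points_seg[:, 1], points_seg[:, 0]), 2 * np.pi)
    ring = np.minimum((rho / (l_max / n_r)).astype(int), n_r - 1)
    sector = np.minimum((theta / (2 * np.pi / n_s)).astype(int), n_s - 1)
    d = np.zeros((n_r, n_s))
    keep = rho <= l_max                # discard points beyond the sensing range
    np.maximum.at(d, (ring[keep], sector[keep]), points_seg[keep, 2])
    return d
```

`np.maximum.at` performs an unbuffered in-place maximum, so multiple points landing in the same bin correctly leave only the tallest height.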

Three-Stage Search Algorithm
After projecting the original point cloud into a scan context, the matching process computes the minimum distance between the descriptor D_t obtained at time t and the descriptors D = {D_1, D_2, ..., D_{t−1}} stored previously. This distance then determines whether there is a loop closure. To achieve fast search and effectively prevent mismatches, we design a three-stage search and verification algorithm.
Stage 1: Fast k-nearest neighbor search. Searching the database directly with scan contexts would incur numerous floating-point operations and slow down the search. Here, we perform a fast search by extracting sub-descriptors. First, the scan context is binarized. Let B denote the matrix after binarization:

B_ij = 1 if D_ij ≠ 0, and B_ij = 0 otherwise.

Then, for each row r_i of B, we count the number of non-empty bins by computing its L_0 norm: ν(r_i) = ||r_i||_0.
Finally, we construct a one-dimensional sub-descriptor H = (ν(r_1), ν(r_2), ..., ν(r_{N_r})) that is rotation-invariant. By inserting H into a KD-Tree, we can achieve fast kNN search and provide k candidates for the next stage.
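A minimal sketch of this stage, using scipy's cKDTree as a stand-in for the paper's KD-Tree (helper names are illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree  # stand-in for the paper's KD-Tree

def sub_descriptor(sc):
    """Stage 1 sub-descriptor H: binarize the scan context and count the
    non-empty bins in each ring row (the L0 norm). Column (rotation)
    shifts permute entries within a row, so H is rotation-invariant."""
    return (sc != 0).sum(axis=1).astype(float)

def knn_candidates(query_sc, past_scs, k=50):
    """Return indices of the k past frames whose sub-descriptors are
    nearest to the query's, via a KD-Tree query."""
    keys = np.stack([sub_descriptor(s) for s in past_scs])
    tree = cKDTree(keys)
    _, idx = tree.query(sub_descriptor(query_sc), k=min(k, len(past_scs)))
    return np.atleast_1d(idx)
```

In a running system the tree would be rebuilt or extended incrementally as new frames arrive; rebuilding per query is kept here only for brevity.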
Stage 2: Similarity score with column shift. This step directly uses the corresponding scan contexts to find the nearest frame among the candidates obtained in stage 1. Let D^q denote the scan context of the query scan and D^c one candidate scan context. A column-wise accumulation of cosine distances measures the distance between D^q and D^c:

ϕ(D^q, D^c) = (1/N_s) Σ_{i=1}^{N_s} (1 − (c^q_i · c^c_i) / (||c^q_i|| ||c^c_i||)),    (7)

where c^q_i and c^c_i are the i-th columns of D^q and D^c, respectively. In practice, mobile agents may revisit a place from different viewpoints. To achieve rotation invariance, we conduct a column shift process:

ϕ_min(D^q, D^c) = min_{j ∈ [0, N_s)} ϕ(D^q, D^c_j),    (8)

where D^c_j means D^c shifted by j columns and ϕ_min represents the final smallest value. If ϕ_min is lower than the predefined threshold ε_l, we pass the candidate D^c to the next stage.
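The two formulas above can be sketched as a brute-force search over all N_s shifts (function names are illustrative):

```python
import numpy as np

def sc_distance(dq, dc):
    """Column-wise cosine distance between two scan contexts: the mean of
    1 - cosine similarity over corresponding columns. Pairs where either
    column is all zero are skipped to avoid division by zero."""
    num = (dq * dc).sum(axis=0)
    den = np.linalg.norm(dq, axis=0) * np.linalg.norm(dc, axis=0)
    valid = den > 0
    return float(np.mean(1.0 - num[valid] / den[valid])) if valid.any() else 1.0

def min_shift_distance(dq, dc):
    """Try every column shift of D^c and keep the smallest distance together
    with the best shift j*, which yields rotation invariance."""
    dists = [sc_distance(dq, np.roll(dc, j, axis=1))
             for j in range(dc.shape[1])]
    j_star = int(np.argmin(dists))
    return dists[j_star], j_star
```

The returned j* is reused later by the re-identification step, so the segment scan contexts do not need their own shift search.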
Stage 3: Temporal verification and re-identification (STV process). To effectively prevent false positives, we design a temporal verification module for each candidate loop. Since the detection process of SLAM is continuous in time, the nodes near a true loop also have high similarity. Furthermore, true loops usually persist over consecutive frames, while outliers are sporadic. Therefore, we adopt a piecewise idea and verify the candidate loop pair over its temporal neighborhood:

T = (1/N_t) Σ_{k=1}^{N_t} ϕ_min(D^{q−k}, D^{c−k}),

where N_t is the number of frames involved in temporal verification. If T is less than a threshold ε_t, we treat the candidate as a true loop. Otherwise, we regard the scene as ambiguous and trigger the re-identification module with our segment scan context. Specifically, we segment the original point clouds and compute the distance between the segment scan contexts of the candidate loop pair. Since the shift value has been obtained in the previous stage, we can directly reuse the result of Equation (8) to compute the new distance:

ϕ_seg = ϕ(D^q_seg, D^c_seg,j*),

where j* is the shift value at which ϕ_min(D^q, D^c) is reached. Finally, if ϕ_seg is still less than a threshold ε_s, we accept the loop as an inlier; otherwise, we discard it. Algorithm 1 depicts our search process in detail, where num_diff denotes the minimum index gap between two frames that may form a loop closure and min_dis denotes the minimum distance.
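A sketch of the stage 3 decision logic under our reading of the temporal check (the averaging form of T and the threshold values eps_t and eps_s are assumptions, not the paper's published settings):

```python
import numpy as np

def temporal_score(dist_fn, q, c, n_t=3):
    """Assumed form of the temporal check T: the mean min-shift distance
    over the N_t frame pairs preceding the candidate pair (q, c).
    dist_fn(i, j) returns the scan-context distance between frames i, j."""
    return float(np.mean([dist_fn(q - k, c - k) for k in range(1, n_t + 1)]))

def stv_decision(dist_fn, seg_dist_fn, q, c, n_t=3, eps_t=0.2, eps_s=0.25):
    """Stage 3: accept immediately when the candidate's neighbourhood is
    also consistent (true loops persist in time); otherwise fall back to
    re-identification with the segment scan contexts (seg_dist_fn)."""
    if temporal_score(dist_fn, q, c, n_t) < eps_t:
        return True                      # temporally consistent: true loop
    return seg_dist_fn(q, c) < eps_s     # ambiguous: re-check after segmentation
```

Because segmentation runs only on the fallback path, the expensive re-identification is skipped for the (common) temporally consistent loops, which is where the system's runtime savings come from.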

Algorithm 1 Three-stage search process
Require: Original point cloud P of the current frame at time t.
Require: Scan context D^q of the current frame at time t.
Require: Sub-descriptors of the previous frames stored in the KD-Tree.
Require: Previous scan contexts D stored before t.
1: k ← 50, q ← index of current frame.

Experimental Results and Discussion
In this section, we conduct a series of experiments to verify the effectiveness of our STV process in unstructured scenes, and we discuss each experiment. The performance of our algorithm is compared with other state-of-the-art global descriptors. All experiments are performed on a computer equipped with an Intel Core (TM) i5-10210U CPU. To compare with Scan context [7] and test online capability, our algorithm is implemented in both MATLAB and C++.

Experimental Setup
We select four sequences (00, 05, 06, and 08) from the KITTI dataset [15], all of which contain a large number of typical scenes dominated by unstructured objects (mainly vegetation). As shown in Figure 6, these outdoor scenes provide sufficient experimental material for our algorithm. To demonstrate accuracy and application value, our parameter settings follow Scan context-50 [7]; that is, in the first stage we select the 50 nearest neighbors while ensuring real-time performance. If the ground-truth Euclidean distance of a matched pair is less than 4 m, we consider it an inlier. Since ε_l and ε_t have the same physical meaning, we set them equal in the experiments. The other parameter values used are listed in Table 1.

Statistical Analysis
To illustrate that our STV process can increase distinguishability in scenes with large-scale unstructured objects and effectively avoid the occasional mismatches brought by such scenes, we perform a statistical analysis.
The 4000∼4400th frames of KITTI sequence 00 contain a lot of places dominated by vegetation. Many of these frames are highly susceptible to mismatches, which are discovered through our temporal verification module.
We first analyze the structured and unstructured objects of the selected 400 frames to demonstrate that the segmentation module described in Section 3.2 can indeed separate unstructured objects from structured ones. Figure 7 presents our statistical results. The number of structured clusters after segmentation is much smaller than the number of unstructured ones; the mean values in Figure 7a,b differ by a factor of about 30. We represent the size of a cluster by the number of points it contains. Figure 7c,d show that structured objects tend to form larger clusters, while unstructured ones are small due to gaps in vegetation or noise. Generally, structured clusters are more than 10 times larger than unstructured clusters. Therefore, we naturally use cluster size to remove vegetation and similar objects. In subsequent experiments, we retain clusters with more than 30 points or occupying more than 5 laser beams as structured objects. Second, we compare the similarity scores of these 400 pairs of false positives before and after segmentation. As shown in Figure 8, the scores between different places increase significantly after removing the unstructured objects, since vegetation always yields a high degree of similarity. This means improved distinguishability between false loop closures, allowing our algorithm to directly discard mismatches in such places.
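The cluster-size filter used here can be sketched as follows, assuming the clusters are given as a label image produced by the range image-based segmentation step:

```python
import numpy as np

def keep_structured(labels, min_points=30, min_beams=5):
    """Keep only structured clusters: those with more than min_points
    points or spanning more than min_beams range-image rows (laser
    beams). Returns a boolean mask over the label image."""
    keep = np.zeros(labels.shape, dtype=bool)
    for lbl in np.unique(labels):
        if lbl == 0:                       # 0 = unassigned / removed ground
            continue
        mask = labels == lbl
        n_points = int(mask.sum())
        n_beams = np.unique(np.nonzero(mask)[0]).size
        if n_points > min_points or n_beams > min_beams:
            keep |= mask
    return keep
```

Applying this mask before building the segment scan context is what removes the small vegetation and noise clusters while preserving buildings and other large structures.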

Dynamic Threshold Evaluation
In our segmentation algorithm, as the first-layer judgment, the geometric threshold plays the more critical role in accurate segmentation. According to the characteristics of the laser beams, we design a dynamic adjustment strategy for g, as shown in Equation (3), which prevents under-segmentation of near objects and over-segmentation of far objects compared with a fixed geometric threshold.
Here, we use a controlled-variable method to test the influence of the dynamic threshold on the experimental results, so as to provide a parameter reference for the following experiments. Specifically, we compare the precision and recall rates of fixed and dynamic thresholds with different initial values of g. Experiments are performed on KITTI sequences 00 and 08, which provide convincing references due to their large number of complex and typical unstructured scenes. From the results in Table 2, we can see that under the same initial value, the dynamic threshold tends to achieve higher recall and precision than the fixed one. Moreover, the initial value of g is best set between 50 and 60. Therefore, in the following experiments, we set the dynamic threshold parameters to i_g = 60, p = 10, and q = 1.

Precision Recall Evaluation
We use precision-recall curves to comprehensively evaluate the performance of our STV-SC method in environments with large-scale unstructured objects. The performance of our place recognition algorithm is compared with Scan context [7] and M2DP [5], since both are state-of-the-art global descriptors and neither specifically considers unstructured scenes. In particular, our algorithm is built on Scan context, so the comparison with Scan context in unstructured environments is especially important.
As shown in Figure 9, the experiments are conducted on sequences 00, 05, 06, and 08. Since sequence 08 contains only reverse loops, it can verify that our algorithm preserves the rotation invariance of Scan context.
Our proposed algorithm outperforms the other approaches on all sequences. This is because in suburban areas, where the roads are surrounded by trees, the geometric information available for place recognition is limited. For example, for the frames mentioned in Section 4.2, Scan context produces mismatches due to the vegetation, whereas our method mitigates its impact and avoids many such mismatches. That is, at the same recall rate, STV-SC obtains a higher precision rate. On sequence 08, M2DP performs poorly due to its lack of rotation invariance, while our algorithm achieves improved performance while maintaining rotation invariance. The residual outliers come from jungles with few or no structured objects, or from scenes whose structured parts are still so similar that geometric information can no longer meet the requirements of place recognition.
In applications, we pay more attention to the recall rate under high precision. Table 3 shows the recall of sequences 00, 05, and 06 at 100% precision. Since sequence 08 is more challenging, we report its recall at 90% precision. Our method clearly outperforms the other approaches, which do not consider unstructured objects. Compared with the original Scan context, the recall rate of our STV-SC algorithm increases by 1.4% to 16% across the sequences. In particular, on sequence 08, an environment with abundant vegetation, the other algorithms often perform poorly, while ours improves the recall rate by more than 15%.

Time-Consumption Analysis
Compared to Scan context, our method adds the segmentation and temporal verification (STV) process. Since the re-identification module requires no search or shift actions, the main time consumption of STV is concentrated in the segmentation module. Moreover, subject to temporal verification, the re-identification process is not always triggered, but only used when encountering ambiguous environments. As the main time-consuming module, segmentation uses a range image-based breadth-first search, whose time consumption is fairly small.
Under the same conditions as Scan context-50 [7], we record the place recognition time (cost time of STV-SC) for more than 100 triggered frames in Figure 10. Even at the peak, the time consumption is less than 0.4 s. The average over these 120 frames is 0.316 s (the original scan context takes 0.201 s with 0.2 m³ point cloud downsampling), which is within a reasonable range (2-5 Hz in MATLAB).

Online Loop-Closure Performance
We now show the online performance of our STV-SC algorithm. The algorithm is integrated into the well-known LiDAR odometry framework LOAM [31] as its loop closure detection module; each detected loop is added to the pose graph as an edge, and GTSAM [32] is applied for back-end graph optimization, yielding a drift-free trajectory. The experiments run on the Robot Operating System (ROS Melodic) and are performed on KITTI sequence 00.
The white dots in Figure 11 represent examples of detected loop closures. As shown in the estimated trajectory, our method can effectively detect loop closures and eliminate drift errors in real time, even in unstructured-dominated environments.

Conclusions
In this paper, we have proposed STV-SC, a new Scan context-based place recognition method that integrates a segmentation and temporal verification process, giving the original algorithm the ability to handle unstructured environments and enhancing the stability of mobile agents in special and complex environments. By summarizing the characteristics of unstructured objects, we designed a novel segmentation method that distinguishes unstructured from structured objects according to cluster size. In addition, for more accurate segmentation, we adopt a geometric threshold that varies with range. In the matching part, we designed a three-stage algorithm: based on the temporal continuity of the SLAM system, if temporal verification is not satisfied, the re-identification module is triggered, effectively avoiding mismatches caused by unstructured objects. Comprehensive experiments on the KITTI dataset demonstrate that our segmentation method can effectively distinguish different types of objects. STV-SC achieves higher recall and precision than Scan context and other state-of-the-art global descriptors in vegetation-dominated environments; specifically, at the same precision, our algorithm improves the recall rate by 1.4~16% across the different sequences. Meanwhile, the average time consumption of STV-SC is 0.316 s, which is within a reasonable bound and shows that our algorithm can run online in a SLAM system.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

SLAM  Simultaneous Localization and Mapping
STV   Segmentation and Temporal Verification
SC    Scan Context
LOAM  LiDAR Odometry and Mapping