ClusterMap Building and Relocalization in Urban Environments for Unmanned Vehicles

Map building and map-based relocalization techniques are important for unmanned vehicles operating in urban environments. Existing approaches require expensive high-density laser range finders and suffer from relocalization problems in long-term applications. This study proposes a novel map format called the ClusterMap, on the basis of which an approach to relocalization is developed. The ClusterMap is generated by segmenting the perceived point clouds into different point clusters and filtering out clusters belonging to dynamic objects. A location descriptor associated with each cluster is designed for differentiation. Relocalization in the global map is achieved by matching cluster descriptors between the local and global maps. The solution requires neither high-density point clouds nor high-precision segmentation algorithms. In addition, it is robust to environmental changes in illumination intensity, object appearance, and observation direction. A consistent ClusterMap without scale problems is built by utilizing a 3D visual-LIDAR simultaneous localization and mapping solution that fuses LIDAR and visual information. Experiments on the KITTI dataset and our mobile vehicle illustrate the effectiveness of the proposed approach.


Introduction
Simultaneous localization and mapping (SLAM) supplies pose and map information for autonomous driving in unknown environments [1][2][3][4]. Precise localization ensures that vehicles run along pregenerated trajectories, and a good map provides a priori information about the surrounding environment that supports vehicles' decision systems in predetermining driving commands. When an unmanned vehicle re-enters a familiar scene or loses its own location information, the vehicle should quickly regain its correct position in its operating environment to ensure safe driving. Position loss may be caused by a number of factors, such as temporary failure of sensors, rapid vehicle movement, and environmental changes. Practical unmanned vehicles should be able to relocalize independently by using existing maps and current environmental information.
In complex urban environments, 3D maps that reflect road conditions, obstacle locations, and other environmental information should be built to ensure safe driving. Existing autonomous driving solutions rely on high-precision and dense maps to realize vehicle navigation and even localization.

Related Work
Many SLAM approaches are available in the literature. Various sensors, most commonly cameras and lasers, have been used in map building and localization [7]. However, due to the complexity of outdoor scenes, a single type of sensor cannot be effective in all cases [8]. Mur-Artal et al. proposed an interesting framework called ORB-SLAM [9], which uses three threads to achieve tracking, local mapping, and loop closing. Other monocular SLAM approaches [10,11] still suffer from initialization and inconsistent scale issues. Although ORB-SLAM also supports binocular and RGB-D cameras [12], it is limited by insufficient detection distances and high susceptibility to illumination changes. Among 2D laser SLAM approaches, filter-based [13] and graph-based [14] methods perform well in indoor structured scenes. However, in unstructured outdoor environments, their failure rates increase because 2D laser range finders can perceive only limited information in open and cluttered outdoor scenes.
Meanwhile, 3D SLAM approaches are more suitable for complex urban environments than 2D SLAM approaches. The 3D approaches in the literature [15][16][17] extract features such as lines, corners, and surfaces from point clouds to accelerate motion estimation. Some algorithms [18] are extended variants of 2D SLAM methods that use point cloud segments and leveled range scans to achieve 3D perception. Wang et al. [19] utilized 3D LIDAR to precisely locate an autonomous vehicle, where curb information was detected to assist pose estimation; several experimental results demonstrated the accuracy and robustness of the method. The approaches proposed by Zhang et al. [20] and Zhang and Singh [2] enhance visual features by associating them with the depth information from LIDAR-based point clouds and obtain low-drift motion estimation results. Point clouds are downsized to maintain a constant point density, and the downsized points are stored in a 2D k-d tree. The depth value of a visual feature is obtained by finding and interpolating its three closest points in the k-d tree. The fusion of visual and LIDAR information makes the motion estimation more accurate because the method can use a series of 3D-3D or 3D-2D relations to recover the transformation matrix between two image frames.
Utilizing 3D point clouds for loop closure or relocalization is a challenging problem that has attracted increasing attention in the autonomous driving field. Lenac et al. [21] proposed a loop detection method that uses planar surface segments in point clouds as features in maps. The method achieves accurate and efficient SLAM in structured scenarios but not in unstructured environments. Visual features contain rich information and are distinguishable; as such, some approaches use visual information to assist loop detection. For example, Zhu et al. [22] and Chen et al. [23] used visual appearances to aid loop closure in 3D SLAM. However, they used visual and LIDAR information separately without exploring the complementarity between the sensors. Meanwhile, visual features also suffer from many limitations; for example, when illumination or viewpoint changes are excessive, even robust features such as SIFT or SURF [24] can fail during matching. Other existing approaches [25,26] extract keypoints directly from 3D perceptions and construct 3D Gestalt descriptors to describe each keypoint; voting over the nearest neighbors in a vote matrix is then used to find loops. SegMatch [27,28] provides a real-time algorithm for loop detection and localization on the basis of 3D point cloud segments. This method clusters and segments all received point clouds and then calculates a corresponding descriptor for each segment. A k-d tree and a machine learning approach are used for descriptor matching. Finally, a six-degree-of-freedom transformation matrix is obtained by geometric verification. However, the aforementioned approaches [25][26][27] require high-density point clouds for describing keypoints or segments.
In addition, the feature descriptors are susceptible to environmental changes in long-term applications, in which shrubs or trees may change greatly, thereby causing a large change in the feature descriptors of the same object at different times. Furthermore, objects with high similarities in appearance, such as street lamps, vehicles, or other synthetic facilities, challenge the feature descriptor-based solutions.
Finman et al. presented an interesting work on object-based place recognition [29]. Their approach uses a pregenerated primitive convolution kernel to convolve an entire point cloud and extract objects. It also constructs an object graph that represents the locational relationships among different objects. Through the matching of object graphs, places can be recognized despite appearance changes in a single object. However, the method can only handle small-scale scenarios because the time to extract objects and the complexity of the object graphs increase significantly for large-scale environments. Bogoslavskyi et al. [6] proposed a real-time object extraction solution with small computational requirements. The solution removes the ground from 3D scans and then groups the point clouds into different clusters. However, the process is merely a pre-segmentation; thus, the method cannot distinguish different clusters. Some descriptors, e.g., point feature histograms (PFHs) [30] and fast point feature histograms (FPFHs) [31], have been developed by encoding the neighborhood geometrical properties of points and using the average curvature of multidimensional histograms around points. PFHs and FPFHs can provide informative signatures for the feature representation of 3D points. They are also invariant to 6D movement and cope well even with large noise. However, PFHs and FPFHs are only suitable for high-precision 3D reconstruction and are thus not valid for practical outdoor applications, because high-density point clouds are difficult to obtain in outdoor environments.
In view of the above problems, the present study aims to develop a novel type of map, i.e., the ClusterMap, which is more compact than the OctoMap and the PointCloudMap. The proposed map contains only point cluster information and thus does not require a large storage space. This characteristic favors storage and transfer in many cloud-based applications. A cluster descriptor is also developed to distinguish different clusters in maps; it is used for map matching to realize relocalization. Relative to existing approaches, the ClusterMap-based relocalization method performs well even with low-density laser range finders, e.g., the VLP-16. In addition, the algorithm is highly robust to changes in environmental objects' appearance, illumination, and observation direction. The ClusterMap-based relocalization method can also be used for loop closure, which we will tackle in our future work.

ClusterMap Building
In this section, the process of building the ClusterMap and the details of the descriptor for each cluster in the map are presented. The ClusterMap and the cluster descriptor are used to achieve the relocalization of unmanned vehicles in urban environments.

SLAM for ClusterMap Building
Generally, the basic requirement for map building is precise localization information. To enhance the consistency of the ClusterMap with the real world, we need to obtain accurate location information in urban environments. The SLAM can be treated as a black box that provides consistent pose estimation, and this estimation is very important for building a consistent ClusterMap. Because SLAM research is not the focus of this paper, we implement SLAM by directly integrating the visual-LIDAR odometry [20] into the framework of ORB-SLAM [9,12]. The visual-LIDAR odometry fuses the sensor data received from a monocular camera and a VLP-16 LIDAR to extract depth information for 2D visual features by exploiting the advantages of the different sensors. ORB-SLAM implements loop closures and uses global nonlinear optimization algorithms [32] to adjust the odometry and map features synchronously. The loop closures eliminate the accumulated errors generated during motion estimation. ORB-SLAM integrated with visual-LIDAR odometry provides good pose references for constructing the ClusterMap. To keep the paper self-contained, the SLAM framework is introduced briefly; for details, please refer to [12,20]. Figure 1 shows the pipeline of the mapping framework. The 3D SLAM is indicated by a dashed box in Figure 1. It includes the threads of TRACKING, LOCAL MAPPING, and LOOP CLOSING [12], which are further expanded with two blocks, i.e., PointCloud Registration and Depth Association [20]. The localization results from SLAM are continuously utilized to register point clouds and generate maps such as the PointCloudMap, OctoMap, or ClusterMap.
The transformation between image frames is calculated in the TRACKING thread, as shown in Figure 1, according to the 2D-3D relations between 2D visual features and 3D local map points. With the localization result in the TRACKING thread, several recent frames of LIDAR data are accumulated in the point cloud registration to increase the density of the perceived point clouds. This process enables a low-density LIDAR, e.g., VLP-16, to achieve a performance similar to that of an HDL-64 or other dense LIDARs, where the point cloud is reprojected onto the synchronized image by using the extrinsic calibration result between the camera and the LIDAR [33]. The point cloud is stored in a 2D k-d tree constructed according to the coordinates of each point. Meanwhile, to retain as much structural information as possible, the k-d tree is searched for all 3D points within a range around a visual feature f_i in the camera frame. Let {^iP} = {^iP_1, ^iP_2, ..., ^iP_n} denote the set of neighboring 3D points, where ^iP_j = [x_j, y_j, z_j]^T represents the 3D coordinates of the jth point. Let ^if_i = [x_i, y_i, z_i]^T denote the coordinates of the depth-enhanced visual feature, where z_i is the unknown depth parameter. Then, the three points with the smallest distances are selected from {^iP}. The three points form a local planar patch in 3D space, and the visual feature f_i is treated as a point on the patch. The depth parameter z_i is obtained by solving

(^if_i - ^iP_n1) · ((^iP_n1 - ^iP_n2) × (^iP_n1 - ^iP_n3)) = 0,

where ^iP_n1, ^iP_n2, and ^iP_n3 are the three selected points with known 3D information. When detected visual features are initialized by calculating the depth parameter, they are directly registered in the map with low scale deviation. The initialization of monocular odometry is therefore simplified, which benefits the building of a consistent map.
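The depth association above can be sketched as follows: the three nearest LIDAR points define a planar patch, and the feature's depth is recovered by intersecting the feature's camera ray with that patch. This is a minimal illustration with our own function and variable names, not the implementation of [20].

```python
import numpy as np

def depth_from_planar_patch(feat_dir, p1, p2, p3):
    """Interpolate the depth of a visual feature from its three closest
    LIDAR points, treating them as a local planar patch.

    feat_dir: normalized camera ray [x, y, 1] of the feature.
    p1, p2, p3: 3D coordinates of the three closest points (camera frame).
    Returns the depth z_i, or None if the ray is parallel to the patch.
    """
    n = np.cross(p1 - p2, p1 - p3)   # normal of the local planar patch
    denom = n.dot(feat_dir)
    if abs(denom) < 1e-9:            # ray parallel to the patch: no solution
        return None
    # The feature point z_i * feat_dir must lie on the plane n . (X - p1) = 0.
    return n.dot(p1) / denom
```

For example, three points on the plane z = 5 yield a depth of 5 for any non-degenerate ray.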
Using the above method, a significant portion of the features (~35-75% in our experiments) in a keyframe can be associated with depth information; keyframes are then selected and processed in the LOCAL MAPPING thread [12]. Compared with the triangulation method, which uses only the same visual features observed in multiple image frames to estimate depth values, the depth association developed in [20] exhibits enhanced computational efficiency. It also makes SLAM more accurate and robust because the triangulation method depends on visual odometry initialization, which suffers from scale and data association problems. The above SLAM method is used to provide location information for building the ClusterMap, and the ClusterMap is then used for relocalization. Relocalization is a key technology in many robotic navigation applications of unmanned vehicles. For long-term applications, visual appearance-based approaches are inefficient due to the significant changes in illumination or objects in environments, as discussed in Sections 1 and 2. In addition, unmanned vehicles driving in urban environments often enter the same place from different directions, and the appearances of the same place observed from different directions differ significantly. This characteristic leads to difficulties in relocalization. Furthermore, the existing LIDAR-based relocalization approaches all depend on dense point clouds. In the present study, a novel map named the ClusterMap and corresponding relocalization algorithms are developed for long-term unmanned vehicle applications.

Building ClusterMap
From the SLAM, accurate and scale-consistent trajectories of unmanned vehicles can be obtained. In the Online Clustering block in Figure 1, the method published by Bogoslavskyi and Stachniss [6] is utilized to segment the point cloud received from the current LIDAR frame into different clusters. However, this clustering method [6] only produces clustering results in consecutive frames without providing associations for clusters belonging to the same object. Therefore, in the Cluster Registration block in Figure 1, each cluster is associated with the corresponding odometry information; then, all clusters are registered into the same map frame. Assuming that dynamic objects do not appear frequently in the same place, clusters derived from dynamic objects are removed by checking whether a cluster appears repeatedly in the same location. Furthermore, the method by Bogoslavskyi and Stachniss [6] cannot always guarantee the consistency of clustering; for example, in some cases, only the trunk of a tree is segmented, whereas in other cases, only the canopy is segmented. To address this problem, Algorithm 1 is developed; it can piece together multiple clusters that belong to different parts of the same object.
Let C_x = {O_x, {P_x}, ap_x} denote the cluster to be registered, where O_x is the virtual center point of the cluster, {P_x} is its point set, and ap_x is its appearance count. In Algorithm 1, the function sqrDist() on lines 2 and 5 calculates the squared distance between two clusters according to their virtual center points; the function radiusSearch() on line 10 returns the number of points belonging to {P_i} within a sphere with p_j as the center and rad as the radius.

Algorithm 1 Cluster Registration.
Require: {C}: set of registered clusters
Require: C_x: cluster waiting for registration
1: for all C_i ∈ {C} do
2:   if sqrDist(C_i, C_x) > maxDist then
3:     continue;
4:   end if
5:   if sqrDist(C_i, C_x) < minDist then
6:     merge C_x into C_i; ap_i ← ap_i + 1;
7:   else
8:     count ← 0;
9:     for all p_j ∈ {P_x} do
10:      if radiusSearch(p_j, {P_i}, rad) > minNum then
11:        count ← count + 1;
12:      end if
13:    end for
14:    if count > sizeof({P_x}) / thresholdNum then
15:      merge C_x into C_i; ap_i ← ap_i + 1;
16:    end if
17:  end if
18: end for
The registration algorithm retains the clusters with ap_i greater than a predefined threshold. For example, if ap_i > 10, then only clusters that appear more than 10 times in the same location are considered to be generated by a common static object. Figure 2 illustrates the PointCloudMap in panel (a) and the ClusterMap in panel (b) created from the same dataset. The PointCloudMap is built by continuously attaching the 3D point cloud data along the trajectory poses estimated by SLAM. As shown in Figure 2a, moving pedestrians leave trailing smears, marked by circles in the map. Meanwhile, a large number of redundant points are filtered out in the ClusterMap, and only point clusters belonging to specific objects, e.g., trees, shrubs, street lamps, and wall columns, are retained. As shown in Figure 2b, the moving objects are removed. Compared with the PointCloudMap, the ClusterMap is more compact while reserving almost all dominant static objects; thus, it can be processed much faster.
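The registration logic of Algorithm 1 can be sketched as follows. The threshold names and values are illustrative, the appearance count is incremented on every merge, and the brute-force neighbor search stands in for the k-d tree radiusSearch; this is a simplified sketch, not the paper's implementation.

```python
import numpy as np

def register_cluster(registered, new_pts, merge_dist=3.0, rad=0.3,
                     min_num=1, threshold_num=3):
    """Register a new cluster against a list of existing ones.

    registered: list of dicts {'pts': Nx3 array, 'ap': appearance count}.
    new_pts: Mx3 array of points of the cluster waiting for registration.
    Merges the new cluster into a nearby one when enough of its points
    overlap an existing cluster; otherwise registers it as a new cluster.
    """
    cx = new_pts.mean(axis=0)                       # virtual center of C_x
    for c in registered:
        ci = c['pts'].mean(axis=0)
        if np.sum((ci - cx) ** 2) > merge_dist ** 2:   # sqrDist() gate
            continue
        count = 0
        for pj in new_pts:                          # radiusSearch() per point
            d2 = np.sum((c['pts'] - pj) ** 2, axis=1)
            if np.sum(d2 < rad ** 2) > min_num:
                count += 1
        if count > len(new_pts) / threshold_num:    # enough overlap: merge
            c['pts'] = np.vstack([c['pts'], new_pts])
            c['ap'] += 1
            return c
    registered.append({'pts': new_pts, 'ap': 1})    # new static-object candidate
    return registered[-1]
```

Re-registering the same points merges them into the existing cluster and raises its appearance count, while a distant cluster starts a new entry.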

Cluster Descriptor for Clusters in ClusterMap
Similar to the feature descriptors in the field of computer vision [12], cluster descriptors are developed to distinguish different clusters in the ClusterMap. Cluster matching between the global and local ClusterMaps is then achieved according to the descriptor. The descriptor is defined by describing the mutual spatial relationships among different objects. Generally, environmental objects are roughly distributed on a ground plane in many outdoor applications, and the height values of different objects are not clearly different. The main purpose of relocalization is to determine the location of a vehicle in a global map quickly; thus, considering the height differences of environmental objects is not essential. Therefore, the 3D clusters perceived in the previous subsection are projected onto the 2D plane for simplicity, as shown in Figure 3, and the cluster descriptor is created for the simplified 2D clusters. The cluster descriptor can also be extended to a 3D ClusterMap, which will be studied in our future work. We project the 3D ClusterMap onto the 2D plane by setting the third value of O_i to zero. The 2D ClusterMap is used for relocalization in Section 4. Let L_i denote the descriptor associated with cluster C_i ∈ {C}, and let SR denote a neighboring radius of clusters. Note that L_i is a third-order tensor that stores the mutual spatial relationships among different objects within the range SR. The larger SR is, the more detailed the description of a cluster; however, the time needed to establish the descriptor and match clusters also increases. Let {^iC} denote the set of neighboring clusters of C_i, and let n_nb denote the number of neighboring clusters in {^iC}. Then, as illustrated in Figure 4a, a series of concentric circles is used to divide {^iC} into X parts with equal annulus widths. Each annulus contains an unequal number of clusters belonging to {^iC}. The green dots denote the clusters.
For every ^iC_k ∈ {^iC}, a subdescriptor ^iL_k is built; it is part of the entire descriptor of C_i. To build the subdescriptor of ^iC_k, we first define a reference axis ^i_kRA, where |^i_kθ_n| is the angle from ^k_nMA to ^i_kRA obeying the right-hand rule; the sign of ^i_kθ_n is consistent with the z-value of the corresponding cross product. By stacking the parameter vectors for the members of {^i_kC}, defined in (2), we obtain the subdescriptor of ^iC_k, where n_nbe = n_nb - 1 denotes the number of elements in {^i_kC}; each element of ^iL_k is calculated with (3). Note that each neighboring cluster of a cluster is associated with a subdescriptor. By stacking all these subdescriptors, we obtain the complete cluster descriptor L_i of C_i. Since n_nb members exist in {^iC} and n_nbe members exist in {^i_kC}, the computational complexity of building the entire cluster descriptor L_i is O(n_nbe · n_nb).
To accelerate the descriptor matching process, we divide the subdescriptors into different parts according to their location annuli, as shown in Figure 4. The annuli are labeled A_j, j ∈ {1, ..., n_a}, where n_a denotes the number of annuli. The label A_j marks the subdescriptors in the corresponding part, yielding the entire descriptor of C_i. The structure of L_i, illustrated in Figure 5, is stored as a third-order tensor. The numbers n_Aj, j ∈ {1, ..., n_a}, are generally not the same. As shown in Figure 5, each subdescriptor ^iL_k is represented by a page that contains n rows, where each row represents ^i_kL_n defined in (3), consisting of a distance (d) and an angle (θ) relative to ^iC_k and ^i_kC_n. For example, if n_Aj clusters are included in A_j, then the A_j area in Figure 5 contains n_Aj pages, each representing a subdescriptor ^iL_k.
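The descriptor construction above can be illustrated with a simplified 2D sketch: neighbors within SR are binned into equal-width annuli, and each neighbor's subdescriptor stores the distance and signed angle of every other neighbor relative to a reference axis from C_i to that neighbor. The function name, the exact angle convention, and the data layout are our own simplifications, not the paper's formulation.

```python
import math

def build_descriptor(center, clusters, sr=35.0, n_annuli=5):
    """Build a simplified 2D cluster descriptor.

    center: (x, y) of C_i; clusters: list of (x, y) of the other clusters.
    Returns {annulus_index: [subdescriptor, ...]}, where each subdescriptor
    is a sorted list of (distance, angle) rows, one row per other neighbor.
    """
    cx, cy = center
    # Keep only neighbors within the search radius SR (excluding C_i itself).
    nbrs = [(x, y) for x, y in clusters
            if 0 < math.hypot(x - cx, y - cy) <= sr]
    width = sr / n_annuli
    desc = {}
    for k, (kx, ky) in enumerate(nbrs):
        ref = math.atan2(ky - cy, kx - cx)          # reference axis C_i -> C_k
        rows = []
        for n, (nx, ny) in enumerate(nbrs):
            if n == k:
                continue
            d = math.hypot(nx - kx, ny - ky)        # distance C_k -> C_n
            ang = math.atan2(ny - ky, nx - kx) - ref
            ang = (ang + math.pi) % (2 * math.pi) - math.pi  # wrap to (-pi, pi]
            rows.append((d, ang))
        a = min(int(math.hypot(kx - cx, ky - cy) / width), n_annuli - 1)
        desc.setdefault(a, []).append(sorted(rows))  # one "page" per neighbor
    return desc
```

A neighbor beyond SR is simply dropped, and each remaining neighbor contributes one page to the annulus it falls in, mirroring the page-per-subdescriptor layout of Figure 5.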

Relocalization Algorithm Based on ClusterMap
To relocalize a vehicle, we build a local ClusterMap of the surrounding environment at the vehicle's current location. The descriptor for each cluster in the map is established by using the proposed method in Section 3.3. By performing descriptor matching between clusters in the local ClusterMap and a prebuilt global ClusterMap {C }, a series of cluster correspondences between the two maps is obtained. The correspondences may contain wrong matches. Therefore, three geometric conditions are utilized to remove the outliers. Finally, the relocalization is realized by calculating the transformation between the local and global maps in accordance with the obtained cluster correspondences.

Cluster Descriptor Matching
The variable notations for a locally built ClusterMap are associated with a hat; e.g., the set of all clusters Ĉ_j included in the local map is {Ĉ}, each cluster is associated with a descriptor L̂_j, and so on. The clusters from the local and global ClusterMaps are derived from the environmental objects. As such, two clusters that belong to the same object should have similar descriptors; in this case, the two clusters match each other. As discussed previously, the transformation between maps can be calculated on the basis of the matches. To measure the similarity between two clusters C_i and Ĉ_j, the distance D_ji between descriptors L_i and L̂_j is defined. First, for each subdescriptor ^jL_k1 ∈ L̂_j with the annulus label A_x, the subdescriptors in L_i with the same annulus label are searched for matches; the matching error between two subdescriptors is denoted as ^jiD_k1,k2. D_ji is proportional to the sum of all subdescriptor matching errors; the smaller D_ji is, the smaller the distance between C_i and Ĉ_j.
However, considering the matching error alone can cause problems. For example, in matching Ĉ_1 with C_2 and C_3, suppose L̂_1 has 10 successful subdescriptor matches with L_2, each with a matching error of about 1.0 m, whereas L̂_1 has only one successful match with L_3, with a matching error of 0.1 m. If the matching error were the only judging criterion, then C_3 would become the best match for Ĉ_1. However, C_2 should be the better match, because Ĉ_1 and C_2 are likely derived from the same environmental object given the number of successful subdescriptor matches between them. Therefore, the matching error and the matching number are considered simultaneously. The distance D_ji between L_i and L̂_j is obtained as in (6), where n_suc is the size of the successful match set S_suc, n_nb is the size of {^jC}, and N_l and N_g denote the total numbers of clusters included in {Ĉ} and {C}, respectively. A key point in this study is to determine whether two subdescriptors are successfully matched. As shown in Figure 6, a geometric distance ^ji_k1,2D_n1,n2 between ^j_k1L_n1 and ^i_k2L_n2 is defined in (7), where ^i_kL_n = (^i_kd_n, ^i_kθ_n) is defined in (3). Then, the matching error ^jiD_k1,k2 between two subdescriptors ^jL_k1 and ^iL_k2 is calculated from these geometric distances, where n_suc,k1,2 is the size of the set S_suc,k1,2 of pairs whose geometric distances defined in (7) are less than a predefined threshold, and n_nbe,k1 is the size of {^j_k1C}. If the matching error ^jiD_k1,k2 is less than a predefined threshold T, then ^jL_k1 and ^iL_k2 match successfully and are used to calculate D_ji in (6). If D_ji is less than the threshold T, then C_i is added to the set {^jC} containing all clusters matched to Ĉ_j; let i_ca denote the size of {^jC}. The more distinguishable L̂_j is, the fewer candidates there are, i.e., the smaller i_ca is.
Furthermore, the threshold T is an empirical value roughly set to 1.5-2.0 times the accuracy of the ClusterMap, e.g., T = 0.06 m in our experiments. Although the entire {C} is queried for each member of {Ĉ}, the time consumption of descriptor matching is within an acceptable range owing to the small size of {Ĉ}, usually between 10 and 25 clusters. A computation analysis is further provided in the experimental section.
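The idea of judging a subdescriptor match by both the geometric distances of its rows and the number of rows that match can be illustrated with the following sketch. The greedy row matching, the simple distance metric, and the return convention are our simplifications, not the paper's exact formulation.

```python
def subdesc_error(sub_a, sub_b, row_thresh=0.06):
    """Compare two subdescriptors, each a list of (distance, angle) rows.

    Each row of sub_a is greedily matched to the closest unused row of
    sub_b; a row pair counts as matched when its geometric distance falls
    below row_thresh. Returns (mean error of matched rows, match count);
    the error is None if no row matched at all.
    """
    used = set()
    dists = []
    for (da, ta) in sub_a:
        best, best_j = None, None
        for j, (db, tb) in enumerate(sub_b):
            if j in used:
                continue
            g = abs(da - db) + abs(ta - tb)   # simple geometric distance
            if best is None or g < best:
                best, best_j = g, j
        if best is not None and best < row_thresh:
            used.add(best_j)
            dists.append(best)
    if not dists:
        return None, 0
    return sum(dists) / len(dists), len(dists)
```

Keeping the match count alongside the mean error lets the caller prefer a candidate with many coarse matches over one with a single accidental close match, which is exactly the pitfall discussed above.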

Removing Outliers Based on Geometric Verification
After acquiring the sets {^jC}, j ∈ {1, ..., N_l}, several geometric verification rules are defined to filter out match outliers. If a cluster ^j1C_i1 ∈ {^j1C} and Ĉ_j1 match correctly, then they should be generated by the same object. Thus, ^j1C_i1 should satisfy the following three geometric conditions, as illustrated in Figure 7.
• Length condition: The distances between clusters included in {Ĉ} are used to filter out unsatisfied candidates. In every other set {^jxC}, a cluster ^jxC_ix should be found such that |d(Ĉ_j1, Ĉ_jx) - d(^j1C_i1, ^jxC_ix)| < T, where d(·,·) denotes the distance between two cluster centers and T is the same threshold as mentioned above.
• Inclusion condition: Let ^j1l_max be the maximum distance between Ĉ_j1 and all other clusters in the local ClusterMap {Ĉ}. Therefore, N_l clusters are present in the circle with Ĉ_j1 as the center and ^j1l_max as the radius. Correspondingly, in the global ClusterMap, approximately N_l clusters should be available in the circle with ^j1C_i1 as the center and ^j1l_max as the radius. The cluster ^j1C_i1 is preserved only if enough different groups exist in this circular range.
• Triangular condition: The cluster Ĉ_j1 and every two other clusters in {Ĉ} can form a base triangle (the blue dotted triangle shown in Figure 7c); if clusters in the corresponding groups can form a triangle similar to the base one, then the cluster ^j1C_i1 is retained. By randomly selecting two clusters from {Ĉ} other than Ĉ_j1, denoted Ĉ_jx1 and Ĉ_jx2, clusters ^jx1C_ix1 and ^jx2C_ix2 should be found in {^jx1C} and {^jx2C}, respectively, that form a triangle with ^j1C_i1 similar to the base triangle.
The conditions are applied one by one. Under the first two conditions, more than 80% of the unqualified candidates can be filtered out in our experiments. Under the last condition, at most one candidate is left in each set. Finally, several cluster correspondences between {Ĉ} and {C} are obtained. The transformation T between the local and global ClusterMaps is then calculated by minimizing the sum of squared distances Σ ||T(Ô_j) - O_i||² over the m_suc remaining matching pairs, where O_i and Ô_j are the virtual center points of the matched clusters mentioned in Section 3.2, and m_suc is the number of matching pairs remaining after RANSAC selection. The RANSAC algorithm is used to eliminate the effects of possible mismatches and ultimately obtain more reliable matching results.
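The final least-squares alignment of the matched virtual centers can be computed with the standard SVD (Kabsch) solution; the sketch below covers the 2D case used here, with the RANSAC selection step omitted.

```python
import numpy as np

def rigid_transform_2d(local_pts, global_pts):
    """Least-squares 2D rigid transform (R, t) mapping matched local-map
    cluster centers onto global-map centers, i.e., minimizing
    sum ||R p_local + t - p_global||^2 over the matched pairs.
    """
    P = np.asarray(local_pts, float)
    Q = np.asarray(global_pts, float)
    mp, mq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - mp).T @ (Q - mq)                # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # keep a proper rotation (det = +1)
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mq - R @ mp
    return R, t
```

With noiseless correspondences, the recovered rotation and translation are exact; with RANSAC-filtered matches, the same solver is run on the inlier set.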
All candidates in each {^jC} are traversed. Thus, the geometric verification algorithm may reach a poor worst-case complexity of O(n^N_l), where n is the maximum size of {^jC}. However, because the size of each {^jC} is generally ~2-15, n is of a small order of magnitude. Moreover, by sequentially applying the three geometric conditions, a considerable number of candidate outliers are filtered out early. The results show that little time is spent on geometric verification. The time consumption of the entire relocalization process is evaluated in Section 5.

Experiments
In this section, several experiments are performed to illustrate the effectiveness of the proposed algorithms, i.e., ClusterMap building and ClusterMap-based relocalization. The KITTI dataset [34] for autonomous driving is used as a benchmark in the performance evaluation. All experiments are run on a laptop computer with an Intel i7-4900MQ CPU @ 2.80 GHz and 6 GB of memory. The proposed algorithms are also evaluated on our campus with our unmanned vehicle platform, whose sensor configuration is illustrated in Figure 8. The sensor module is mounted 0.7 m above the ground surface. The LIDAR operates at a 10 Hz measurement rate, and the vehicle moves at a maximum speed of 2 m/s. Only the information from the camera and the LIDAR is used to complete the motion estimation. Figure 9 illustrates some snapshots of our experimental environment. Our campus is part of Shenzhen's urban environment, with moving pedestrians and parked cars. All the algorithms are implemented in C++.

Evaluation on KITTI Data Set
To verify the proposed map format and the relocalization ability, an evaluation experiment is first performed on the KITTI dataset. In this experiment, we utilize ORB-SLAM to estimate the vehicle trajectory, on the basis of which the ClusterMap of KITTI sequence 00 is built, as illustrated in Figure 10.
The figure on the left shows all 3D clusters in the KITTI 00 scene, whereas the right one shows the results of the relocalization tests. Five tests are performed at randomly selected locations. In the right subfigures of Figure 10, red points denote clusters in the global ClusterMap, green points indicate clusters of the local map, and each pink dotted line indicates the vehicle path used to establish the local map. The places in the global map are randomly selected to evaluate the relocalization performance, and the perceived data at these local places are obtained from the original dataset. As shown in Figure 10, the vehicle relocalizes successfully at all places. However, the relocalization on KITTI is an ideal case, because the locally perceived data used for the local map are the same as those used for building the global ClusterMap, i.e., the illumination, objects, and other conditions are exactly the same.

Evaluation with Our Experimental Vehicle
To verify the proposed approach, we perform an experiment on our campus, as shown in Figure 11. We use the experimental platform shown in Figure 8, which is equipped with a low-density VLP-16. As shown in Figure 9, the campus is a typical urban environment with various objects, e.g., trees, shrubs, concrete columns, parked cars, and walking pedestrians. The vehicle shown in Figure 8a is driven manually around the campus. The global ClusterMap is initially built by using the SLAM method given in Section 3.1. Then, the vehicle is driven manually again three months later to verify the relocalization algorithm on the basis of the built ClusterMap. Within these three months, the environmental appearance changes significantly; e.g., parked vehicles disappear, and trees grow. According to the experimental results, the relocalization performs well on our campus. The experimental video can be found in the supplementary materials. Two difficult cases of relocalization are demonstrated in Figure 11. In each part, the figure on the left is a monocular image, whereas the right one shows the relocalization result. In the first case, as shown in Figure 11a,b, the parked vehicles disappear, and the shrubs grow after three months; the visual appearance changes significantly, which makes relocalization difficult. The results show that the relocalization is reliable despite large occlusions in the scenes; however, longer paths are required to collect additional local clusters. In the second case, as shown in Figure 11c,d, the vehicle enters the same place but from a different direction. The perceived visual appearances are totally different due to the different directions and occlusion situations. Therefore, the relocalization would be challenging for existing visual approaches. Nevertheless, the spatial relationships among long-term static objects are stable, even when some objects are occluded or observed from different directions.
These relationships are exploited in the proposed approach. As a result, the relocalization in both kinds of difficult cases is completed successfully.

Parameter Evaluation
As stated in Section 3.3, the search radius parameter SR plays an important role in the performance of the cluster descriptors. Figure 12a shows statistics of the computation time for establishing each descriptor under different values of SR on KITTI sequence 00 and on the dataset from our campus environment. The built ClusterMap contains 965 clusters; thus, 965 location descriptors are established. The cluster number in the local ClusterMap also affects performance; therefore, the number of local clusters is set to 20 when evaluating SR. Figure 12b shows a histogram of the relocalization success rate versus SR. In this experiment, for each value of SR, 100 tests are performed on the ClusterMaps built from KITTI sequence 00 (Figure 10) and from our campus environment (Figure 3), with the testing places selected randomly. For KITTI, the global ClusterMap building and the relocalization experiments are performed on the same dataset. For our own dataset, the situation differs: all the datasets used for relocalization are collected three months after those used for building the global map.
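The exact descriptor construction is defined in Section 3.3; as a rough illustration only, under our own assumptions, a rotation-invariant location descriptor for a cluster center could histogram the distances to neighboring cluster centers within the search radius SR, which explains why larger SR values cost more computation (more neighbors per query) but discriminate better:

```python
import numpy as np

def location_descriptor(center, all_centers, sr=35.0, n_bins=16):
    """Hypothetical location descriptor for one cluster center.

    Histograms the distances to all other cluster centers lying within
    the search radius `sr` (meters). This is an illustrative stand-in,
    not the paper's exact descriptor from Section 3.3.
    """
    dists = np.linalg.norm(all_centers - center, axis=1)
    # exclude the cluster itself and anything beyond the search radius
    neighbors = dists[(dists > 1e-6) & (dists <= sr)]
    hist, _ = np.histogram(neighbors, bins=n_bins, range=(0.0, sr))
    total = hist.sum()
    # normalize so descriptors are comparable across cluster densities
    return hist / total if total > 0 else hist.astype(float)
```

Because the descriptor depends only on inter-cluster distances, it is invariant to the observation direction, which matches the behaviour reported in the difficult cases above.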
As shown in Figure 12a,b, the larger the value of SR, the more time each descriptor takes to compute, but the higher the relocalization success rate. To further evaluate the effect of the cluster number in the local ClusterMap, we fix SR to 35 m and use the same method to obtain the relocalization success rates for different cluster numbers. Figure 12c illustrates the result: larger cluster numbers in local maps increase success rates. When SR is set to 35 m and the number of local clusters is 20, the relocalization success rate exceeds 92% and 97% on the KITTI sequence 00 and our campus datasets, respectively.
With the same parameters, another 100 tests are performed to evaluate the computation time of the entire relocalization, which includes creating all descriptors for the local ClusterMap, completing the descriptor matching, and implementing geometric verification. Figure 13 illustrates the experimental results, where the horizontal axis is the test index. The duration of the entire process is short, and relocalization is achieved quickly. Because GPS information is unstable in urban environments, especially in places with many buildings, ground truth cannot be provided for evaluation on our dataset. Therefore, we define a reprojection error as the evaluation metric. Figure 14 illustrates the reprojection errors of the different tests. The reprojection errors on our experimental platform on our campus are approximately 3–5 times the errors on KITTI sequence 00. Two reasons explain this result. First, the LIDAR on our platform is a Velodyne VLP-16, whereas the one used in KITTI is a Velodyne HDL-64E. Second, our own evaluation datasets are recorded with our experimental platform at times separated by significant intervals; many changes occur in environmental appearance and vehicle trajectories. These environmental changes cause large differences between local and global ClusterMaps. Even for the same object, the virtual center point O_i of its point cluster obtained by the simple clustering algorithm is inconsistent across maps. Nevertheless, the experimental results show that the relocalization success rate remains high, and the ClusterMap-based relocalization method exhibits strong robustness to noise and long-term environmental changes. To further illustrate the performance, experiments are also performed to compare with the recently published SegMap [28], which utilizes 3D LIDAR information only and can achieve both localization and relocalization.
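The two core stages of the relocalization named above, descriptor matching followed by geometric verification, can be sketched as follows. This is our own simplified illustration, not the paper's implementation: matching is plain nearest-neighbour search in descriptor space, and verification keeps only correspondences whose pairwise distances agree between the local and global cluster centers (which survive occlusion and viewpoint changes, as discussed in Section "Evaluation with Our Experimental Vehicle"):

```python
import numpy as np

def match_and_verify(local_desc, global_desc, local_pts, global_pts, tol=2.0):
    """Illustrative sketch of descriptor matching + geometric verification.

    local_desc/global_desc: (n, d) descriptor arrays;
    local_pts/global_pts:   (n, k) cluster center coordinates.
    Returns the list of (local_idx, global_idx) pairs that pass a
    pairwise-distance consistency check with tolerance `tol` (meters).
    """
    # 1) descriptor matching: each local cluster -> closest global descriptor
    matches = []
    for i, d in enumerate(local_desc):
        j = int(np.argmin(np.linalg.norm(global_desc - d, axis=1)))
        matches.append((i, j))
    # 2) geometric verification: a match is kept if its distances to at
    #    least half of the other matches are preserved across the maps
    verified = []
    for a, (ia, ja) in enumerate(matches):
        consistent = 0
        for b, (ib, jb) in enumerate(matches):
            if a == b:
                continue
            d_local = np.linalg.norm(local_pts[ia] - local_pts[ib])
            d_global = np.linalg.norm(global_pts[ja] - global_pts[jb])
            if abs(d_local - d_global) < tol:
                consistent += 1
        if consistent >= len(matches) // 2:
            verified.append((ia, ja))
    return verified
```

The consistency check is what rejects spurious descriptor matches: a wrong correspondence almost never preserves its distances to the other matched clusters, whereas correct ones do under any rigid motion.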
We perform 100 tests by randomly selecting the relocalization positions. Because KITTI provides ground truth for evaluation, both the time consumption and the relocalization error are compared. Figure 15 shows the comparison of computation time between SegMap and the proposed approach. Note that the computation time of our approach in Figure 15 covers the entire process of local map construction and map matching; therefore, it is longer than that in Figure 13. Figure 16 shows the relocalization errors of SegMap and the proposed approach. As Figures 15 and 16 show, although the relocalization accuracy of our approach is similar or slightly worse, its computation time is much lower than that of SegMap. The lower accuracy may derive from the simple cluster extraction in our approach; improving cluster extraction is left for future work.

Discussion
The experiments show that the proposed approach needs to travel a certain distance to collect enough local clusters. This limitation derives from the compact nature of the ClusterMap, which is composed of only the salient clusters detected in the raw point clouds. However, the relocalization itself is achieved quickly, in approximately 3–10 s in our experiments, so this is not a problem in practical applications. Other solutions, such as the SegMatch-based approach [27], require high-density point clouds for describing keypoints or segments. In addition, their feature descriptors are susceptible to environmental changes in long-term applications, in which shrubs or trees may change greatly, thereby causing large changes in the feature descriptors of the same object at different times. In contrast, our approach addresses these problems. The proposed approach can therefore be applied in low-speed domains such as autonomous logistics cars, surveillance robots, and sanitation vehicles, which have strong requirements for low cost and high performance.
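The collect-then-match behaviour described above can be sketched as a simple accumulation loop. This is our own illustration, not the paper's code; `match_fn` stands in for the full descriptor-matching and geometric-verification pipeline against the global ClusterMap:

```python
def relocalize_when_ready(sweeps, match_fn, min_clusters=20):
    """Accumulate per-sweep cluster detections into a local ClusterMap and
    attempt relocalization once enough clusters have been collected.

    sweeps:       iterable yielding the list of clusters detected per
                  LIDAR sweep while the vehicle travels.
    match_fn:     callable(local_map) -> pose estimate, or None if no
                  verified match is found (hypothetical interface).
    min_clusters: minimum local cluster count before matching is tried;
                  20 matches the setting used in the parameter evaluation.
    """
    local_map = []
    for clusters in sweeps:
        local_map.extend(clusters)
        if len(local_map) >= min_clusters:
            result = match_fn(local_map)
            if result is not None:
                return result
    return None  # traveled distance too short, or no verified match
```

The loop makes the trade-off explicit: a compact map means more sweeps are needed before `min_clusters` is reached, but each matching attempt is cheap once it runs.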

Conclusions
This study proposes a novel representation of environmental maps called the ClusterMap. The ClusterMap is composed of point cloud clusters, each of which is associated with a descriptor describing its location relationship with neighboring clusters. This new kind of map directly represents the distribution of long-term static objects in complex environments. To build a ClusterMap that is consistent with the real operating environment and free of scale problems, we utilize a multisensor fusion approach that combines visual features and LIDAR point clouds. This approach compensates for the shortcomings of the individual sensors and yields localization and mapping results with low scale drift. Furthermore, a novel cluster descriptor and its matching algorithm are developed for finding cluster correspondences between different ClusterMaps. On the basis of the ClusterMap building and the cluster descriptor, a robust relocalization algorithm is developed. Finally, several experiments are performed on well-known datasets and on our own experimental vehicle in our campus environment to illustrate the performance of the proposed algorithms.