Electronics
  • Article
  • Open Access

18 January 2024

MixedSCNet: LiDAR-Based Place Recognition Using Multi-Channel Scan Context Neural Network

School of Mechanical Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China
*
Author to whom correspondence should be addressed.

Abstract

In the realm of LiDAR-based place recognition tasks, three predominant methodologies have emerged: manually crafted feature descriptor-based methods, deep learning-based methods, and hybrid methods that combine the former two. Manually crafted feature descriptors often falter in reverse visits and confined indoor environments, while deep learning-based methods exhibit limitations in terms of generalization to distinct data domains. Hybrid methods tend to fix these problems, albeit at the cost of an expensive computational burden. In response to this, this paper introduces MixedSCNet, a novel hybrid approach designed to harness the strengths of manually crafted feature descriptors and deep learning models while keeping a relatively low computing overhead. MixedSCNet starts with constructing a BEV descriptor called MixedSC, which takes height, intensity, and smoothness into consideration simultaneously, thus offering a more comprehensive representation of the point cloud. Subsequently, MixedSC is fed into a compact Convolutional Neural Network (CNN), which further extracts high-level features, ultimately yielding a discriminative global point cloud descriptor. This descriptor is then employed for place retrieval, effectively bridging the gap between manually crafted feature descriptors and deep learning models. To substantiate the efficacy of this amalgamation, we undertake an extensive array of experiments on the KITTI and NCLT datasets. Results show that MixedSCNet stands out as the sole method showcasing state-of-the-art performance across both datasets, outperforming the other five methods while maintaining a relatively short runtime.

1. Introduction

Global localization is an essential problem in autonomous navigation and is used to determine a robot’s current location without any prior pose information [1,2]. Before performing global localization, it is common practice to first build a map of the environment and then match the collected sensor data with the map to determine the robot’s current pose with respect to the map. Generally, global localization is split into two consecutive phases: place recognition and pose estimation [3]. The purpose of place recognition is to provide an initial location estimate, while pose estimation aims to calculate the robot’s precise pose. Specific methods for global localization fall into two major categories. The first category couples place recognition and pose estimation, directly estimating the accurate pose of the robot [4,5,6,7,8]. The second one involves a two-stage approach, where place recognition provides a coarse location estimate and pose estimation refines this estimate to determine the precise pose [9,10,11]. For outdoor scenarios, Global Navigation Satellite Systems (GNSSs) are often used to provide an initial estimate of global location. However, in areas where GNSS signals are weak or unreliable, such as indoors, tunnels, or remote mountainous regions, a place recognition algorithm is a more generic and practical solution for a variety of environments.
Place recognition is essentially a retrieval problem, where the retrieval database is established during the processes of Simultaneous Localization and Mapping (SLAM) or Structure from Motion (SfM). The retrieval items within this database consist of global descriptors of keyframes. These global descriptors can be obtained by aggregating local descriptors [5,12] or directly extracting a single global descriptor from the original point cloud or image [3,13]. Each global descriptor is linked to the pose of its corresponding keyframe. Hence, when the global descriptor of the current frame is queried against the retrieval database, the keyframe exhibiting the highest similarity is identified as the nearest place. The pose associated with this keyframe can then be utilized as the initial pose estimate for the current frame.
In contrast to visual place recognition methods, LiDAR-based approaches offer two prominent advantages. Firstly, they are immune to significant variations in lighting conditions [3]. Secondly, LiDAR sensors boast a wide 360° field of view and the ability to acquire precise depth information. This enables LiDAR sensors to accurately capture the overall topological structure of the environment, rendering them less susceptible to local environmental fluctuations. As a result, LiDAR-based place recognition methods have gained growing interest in recent years.
Two representative methodologies in LiDAR-based place recognition are Bird’s Eye View (BEV)-based models [6,13,14,15,16,17] and deep neural network-based models [3,11,18,19,20,21]. BEV descriptor-based methods initially transform point cloud data into two-dimensional representations. This transformation is accomplished by either projecting the point cloud onto a 2D plane and subsequently extracting 2D features [16] or extracting features from the point cloud first and then projecting them onto a 2D plane [15]. Successively, image-matching algorithms are employed for place retrieval. However, these methods may exhibit limitations in reverse visit situations and narrow environments [13]. Conversely, deep neural network-based methods directly take point cloud data as input and produce a point cloud descriptor as output [3]. Nevertheless, this approach comes with certain disadvantages. Firstly, it requires the downsampling of the point cloud data, as it cannot directly process the complete point cloud. Furthermore, it may demonstrate relatively poor generalization capabilities, particularly when the data distribution in the test set deviates from that in the training set, resulting in a noticeable decrease in performance.
In this paper, we propose MixedSC, which leverages the concept of Scan Context to divide point cloud data into multiple bins along radial and azimuthal directions in the Bird’s Eye View. Within each bin, it records the maximum height, maximum intensity, and maximum smoothness, creating a three-channel tensor representation of the current point cloud, with each channel being represented as a fixed-size matrix. Subsequently, the MixedSC descriptor is fed into a compact deep neural network to generate a global descriptor enriched with high-level features. This descriptor is then employed in KD tree retrieval to identify the nearest keyframe. The primary contributions of our proposed method are as follows:
  • A strongly discriminative point cloud descriptor, MixedSC. The configurations of the height channel and intensity channel exhibit high similarities in their shapes, while the smoothness channel records the local structure, thus making a difference, which enables robust matching even in reverse visit situations.
  • An efficient feature extractor, MixedSCNet. By combining MixedSC and neural networks, it yields strong generalization in a distinct data domain with a relatively low computational overhead.
  • Thorough experimental evaluations on the KITTI [22] and NCLT [23] datasets. Comparative assessments against five representative LiDAR-based place recognition methods, PointNetVLAD [3], Scan Context [13], Intensity Scan Context [17], MinkLoc3Dv2 [24], and BoW3D [5], demonstrate that our proposed approach achieves state-of-the-art performance and comparatively excellent generalization ability.

3. Methodology

3.1. System Framework

In Figure 1, the system framework for our proposed method is illustrated. The overall process consists of two stages: database construction and place recognition. During the first stage, point clouds from the preceding keyframes are transformed into global descriptors through MixedSCNet, which are then organized in a KD tree to build the database. Subsequently, during the place recognition phase, the current frame’s point cloud undergoes the same transformation to generate its corresponding global descriptor, which is later matched against those held by the database, yielding multiple candidate frames with the highest similarity scores.
Figure 1. System framework of the proposed method.
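The two stages described above can be sketched in a few lines. This is only an illustration under stated assumptions: the descriptors are plain lists of floats standing in for MixedSCNet outputs, the helper names `build_database` and `query` are hypothetical, and a brute-force nearest-neighbour scan replaces the KD tree, so it returns the same candidates but without the KD tree's search speed-up.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_database(descriptors, poses):
    """Stage 1: pair each keyframe descriptor with its pose.
    (Brute-force stand-in for the KD tree used in the paper.)"""
    if len(descriptors) != len(poses):
        raise ValueError("one pose per descriptor is required")
    return list(zip(descriptors, poses))

def query(database, descriptor, k=3):
    """Stage 2: return the k closest keyframes as (distance, index, pose)."""
    scored = [(l2(descriptor, d), idx, pose)
              for idx, (d, pose) in enumerate(database)]
    scored.sort(key=lambda entry: entry[0])  # smallest distance first
    return scored[:k]
```

The pose attached to the best candidate would then serve as the initial pose estimate for the current frame.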

3.2. Smoothness Calculation

The smoothness of a given point is defined as the difference in horizontal polar distance between this point and the average value of its neighboring points. It reflects the fluctuation in the depth direction near that point. Higher smoothness suggests more dramatic depth variation, which means the point tends to lie on an object featuring protrusions or depressions in physical space. Therefore, the distribution pattern of smoothness values across a point cloud effectively captures local geometric features, particularly those related to objects exhibiting sharp corners or edges.
Suppose that $P_{src} = \{p_1, \ldots, p_n \mid p_n \in \mathbb{R}^4\}$ is a LiDAR point cloud, where $(x_k, y_k, z_k)$ are the Cartesian coordinates of the point $p_k$. First, express $p_k$ in a form analogous to spherical coordinates:
$$p_k = (r_k, \alpha_k, \omega_k, \eta_k), \tag{1}$$
$$r_k = \sqrt{x_k^2 + y_k^2}, \tag{2}$$
$$\alpha_k = \arctan\frac{y_k}{x_k}, \tag{3}$$
$$\omega_k = \arctan\frac{z_k}{r_k}, \tag{4}$$
where $r_k$, also known as the range, denotes the horizontal polar distance of point $p_k$; $\alpha_k$ is the azimuth angle, and $\omega_k$ is the elevation angle, with both being expressed in degrees; $\eta_k$ represents the reflection intensity. When considering a LiDAR with $l$ laser channels and $\theta$ as its angular resolution, there are $l \times (360/\theta)$ laser beams emitted by the LiDAR during a single revolution. For any point beyond the LiDAR’s maximum or minimum detection range, its corresponding range, $r_k$, is set to zero. Consequently, these range values are assembled into a matrix $R \in \mathbb{R}^{l \times (360/\theta)}$, termed the range image. This mapping represents a complex scene captured by the LiDAR in a simple image format. Given a non-zero value $r_k$, it can be assigned to a specific position $(i, j)$ within matrix $R$, i.e., $r_k = R_{ij}$. The index $(i, j)$ can be calculated as
$$i = \phi(\omega_k), \tag{5}$$
$$j = \left\lfloor \frac{\alpha_k}{\theta} \right\rceil, \tag{6}$$
where variable $i$ denotes that point $p_k$ corresponds to the result obtained from the $i$-th laser channel of the LiDAR sensor, the mapping relationship between $\omega_k$ and the laser channel index is represented by $\phi(\omega_k)$, and the notation $\lfloor \cdot \rceil$ indicates rounding to the nearest integer. The function $\phi$ varies depending on the specific LiDAR model. For instance, for the Velodyne VLP16 LiDAR model, $\phi(\omega_k) = \omega_k / 2 + 8$.
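As a rough illustration of Equations (1) to (6), the sketch below converts Cartesian points to the polar form and fills a range image. Several details are assumptions rather than statements from the paper: `atan2` is used instead of a plain arctangent so that the azimuth covers the full circle, negative azimuths wrap to the last columns, `vlp16_row` applies a floor to $\omega_k/2 + 8$ to obtain an integer row, and Python's built-in `round` resolves exact halves to the nearest even integer. All function names are hypothetical.

```python
import math

def to_polar(x, y, z, eta):
    """Return (range, azimuth_deg, elevation_deg, intensity) for one point."""
    r = math.hypot(x, y)                       # horizontal polar distance
    alpha = math.degrees(math.atan2(y, x))     # azimuth in (-180, 180]
    omega = math.degrees(math.atan2(z, r))     # elevation
    return r, alpha, omega, eta

def column_index(alpha, theta):
    """Round azimuth / resolution to a column, wrapping negative angles."""
    n_cols = round(360.0 / theta)
    return round(alpha / theta) % n_cols

def vlp16_row(omega):
    """Assumed integer row for a VLP16-style sensor: floor(omega / 2 + 8)."""
    return math.floor(omega / 2.0 + 8.0)

def build_range_image(points, n_rows, theta, row_of, r_min, r_max):
    """Fill an n_rows x (360 / theta) matrix; out-of-range points stay at zero."""
    n_cols = round(360.0 / theta)
    image = [[0.0] * n_cols for _ in range(n_rows)]
    for x, y, z, eta in points:
        r, alpha, omega, _ = to_polar(x, y, z, eta)
        if not (r_min <= r <= r_max):
            continue
        i = row_of(omega)
        if 0 <= i < n_rows:
            image[i][column_index(alpha, theta)] = r
    return image
```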
Based on range image $R$, it is possible to compute the smoothness ($s_k$) of point $p_k$. Suppose that the range value of point $p_k$ is denoted by $r_k = R_{ij}$. The left and right neighborhoods of $r_k$ are defined as follows:
$$N_k^- = \{R_{i,j-q} \mid q \in [|1, 5|]\}, \tag{7}$$
$$N_k^+ = \{R_{i,j+q} \mid q \in [|1, 5|]\}, \tag{8}$$
where $[|1, N|]$ denotes the set $\{1, 2, \ldots, N\}$. Let $N_k = N_k^- \cup N_k^+$; then, the expression for calculating $s_k$ can be formulated as follows:
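$$s_k = \begin{cases} \left| \dfrac{\sum_{r \in N_k} r}{\|N_k\|_0} - r_k \right|, & \text{if } \|N_k^-\|_0 \ge 2,\ \|N_k^+\|_0 \ge 2 \\ 0, & \text{otherwise} \end{cases} \tag{9}$$
where the notation $\|\cdot\|_0$ denotes the number of non-zero elements.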
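The smoothness rule can be tried out on a single row of a range image. The sketch below is a minimal reading of the definition above, with assumptions made explicit: the row is treated as circular (columns wrap around at 360°), the difference is taken as an absolute value, each side needs at least two non-zero neighbours, and a pixel whose own range is zero gets smoothness zero. The function name `smoothness` is hypothetical.

```python
def smoothness(range_image, i, j, q_max=5):
    """Smoothness of pixel (i, j): |mean of valid neighbours - own range|."""
    row = range_image[i]
    n_cols = len(row)
    r_k = row[j]
    if r_k == 0:
        return 0.0  # assumption: invalid points carry no smoothness
    # Left and right neighbourhoods, wrapping around the row (assumption).
    left = [row[(j - q) % n_cols] for q in range(1, q_max + 1)]
    right = [row[(j + q) % n_cols] for q in range(1, q_max + 1)]
    n_left = sum(1 for r in left if r != 0)
    n_right = sum(1 for r in right if r != 0)
    if n_left < 2 or n_right < 2:
        return 0.0
    # Zeros add nothing to the sum, so dividing by the non-zero count
    # yields the mean over valid neighbours only.
    return abs(sum(left + right) / (n_left + n_right) - r_k)
```

On a flat wall every pixel scores zero, while a pixel that sticks out by 2 m from otherwise equal neighbours scores 2.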

3.3. Mixed Scan Context

Inspired by Scan Context [13] and Intensity Scan Context [17], this paper introduces a highly discriminative Bird’s Eye View (BEV) point cloud descriptor termed Mixed Scan Context (MixedSC). Scan Context captures the highest height values in the local point cloud, aggregating them into a matrix that serves as the global descriptor for the entire point cloud. This effectively summarizes the point cloud’s contour structure. However, this approach exhibits limitations in narrow environments, where the height variations in LiDAR point clouds are not pronounced. Intensity Scan Context, on the other hand, selects the highest intensity values in the local point cloud. Yet, due to the distance-dependent nature of intensity, variations may occur for the same object under different observation positions and angles. Additionally, environmental factors, such as weather conditions and surface chemical reactions over time, can influence object surface characteristics, thereby impacting intensity values. Hence, a descriptor that relies solely on intensity lacks sufficient robustness. To address these limitations, this paper proposes MixedSC, which combines height, intensity, and smoothness information. By incorporating point cloud contour, object surface characteristics, and local geometric features, MixedSC proves to be a strong global descriptor that effectively accommodates narrow environments, fluctuating weather conditions, and variations in object surface properties.
Initially, the method retains the height, range, azimuth angle, intensity, and smoothness of each point and keeps only the points that fall within the specified range and height intervals:
$$P = \{p_k = (z_k, r_k, \alpha_k, \eta_k, s_k) \in P_{src} \mid r_{min} \le r_k \le r_{max},\ z_{min} \le z_k \le z_{max}\}. \tag{10}$$
Subsequently, as illustrated in Figure 2, akin to Scan Context, MixedSC divides the space into $N_s = 60$ sectors azimuthally and $N_r = 20$ rings radially. The intersecting region between each sector and ring is referred to as a bin. Let the set of points within the intersecting region of the $m$-th ring and $n$-th sector be denoted by $P_{mn}$, assuming that $p_k \in P_{mn}$. Then,
$$m = \left\lfloor \frac{r_k - r_{min}}{r_{max} - r_{min}} \cdot N_r \right\rfloor, \tag{11}$$
$$n = \left\lfloor \left( \frac{\alpha_k}{360} + \frac{1}{2} \right) \cdot N_s \right\rfloor, \tag{12}$$
where $\lfloor \cdot \rfloor$ denotes the floor function. By utilizing Formulas (11) and (12), each point is assigned to its corresponding bin. Afterward, employing the bin encoding function $\psi : P_{mn} \rightarrow \mathbb{R}^3$, the highest height value, the maximum reflectance intensity value, and the highest smoothness value within each bin are recorded:
$$\psi(P_{mn}) = \left( \max_{p_k \in P_{mn}} z_k,\ \max_{p_k \in P_{mn}} \eta_k,\ \max_{p_k \in P_{mn}} s_k \right). \tag{13}$$
Figure 2. Visualization of MixedSC. The light-blue region corresponds to a ring; the pale-yellow area represents a sector; and the light-green area indicates the bin resulting from their intersection. The amalgamation of the three right-side matrices results in a MixedSC descriptor.
After the aforementioned process, a single frame of LiDAR point cloud is ultimately transformed into a tensor (T) called MixedSC, i.e.,
$$T = (t_{mn}) \in \mathbb{R}^{3 \times N_r \times N_s}, \quad t_{mn} = \psi(P_{mn}). \tag{14}$$
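Equations (10) to (14) amount to a filter, a binning step, and a per-bin maximum. The sketch below walks through them on plain Python lists. It assumes that azimuths lie in [-180°, 180°], that points exactly on the outer boundary are clamped into the last ring or sector, and that empty bins are left at zero; none of these boundary conventions is spelled out in the text, and the function name `mixed_sc` is hypothetical.

```python
import math

def mixed_sc(points, r_min, r_max, z_min, z_max, n_r=20, n_s=60):
    """Build a 3 x n_r x n_s tensor (height, intensity, smoothness).

    points: iterable of (z, r, alpha_deg, eta, s) tuples.
    """
    tensor = [[[0.0] * n_s for _ in range(n_r)] for _ in range(3)]
    filled = [[False] * n_s for _ in range(n_r)]
    for z, r, alpha, eta, s in points:
        # Equation (10): keep only points inside the range and height window.
        if not (r_min <= r <= r_max and z_min <= z <= z_max):
            continue
        # Equations (11) and (12), clamped so r = r_max and alpha = 180 stay valid.
        m = min(math.floor((r - r_min) / (r_max - r_min) * n_r), n_r - 1)
        n = min(math.floor((alpha / 360.0 + 0.5) * n_s), n_s - 1)
        # Equation (13): channel-wise maximum inside the bin.
        for c, value in enumerate((z, eta, s)):
            if not filled[m][n] or value > tensor[c][m][n]:
                tensor[c][m][n] = value
        filled[m][n] = True
    return tensor
```

The `filled` flag matters when heights are negative: without it, a bin whose highest point lies below zero would wrongly keep the initial value of zero.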

3.4. MixedSCNet

3.4.1. Network Architecture

The network architecture is illustrated in Figure 3. MixedSCNet is a compact Convolutional Neural Network similar to ResNet18 [36]. Given MixedSC as input, an initial convolution operation with a 5 × 5 kernel is performed, elevating the channel dimension to 64. Subsequently, a 3 × 3 max pooling layer is applied, downsampling the image to 10 × 30 . In the subsequent main network, the convolutional kernel size remains 3 × 3 , with 1 × 1 kernels in skip connections. The final step involves reducing the feature map dimensions to a 1024-dimensional global descriptor through an average pooling layer.
Figure 3. The network architecture of MixedSCNet. Block denotes the number of convolutional operations. 1 × 1, 3 × 3, and 5 × 5 represent kernel sizes.
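The stated feature map sizes can be checked with the standard output-size formula for convolution and pooling layers, $\lfloor (n + 2p - k)/s \rfloor + 1$. The strides and paddings below are assumptions, since the text gives only kernel sizes: a 5 × 5 convolution with stride 1 and padding 2, followed by a 3 × 3 max pooling with stride 2 and padding 1, is one combination that takes the 20 × 60 MixedSC input to the reported 10 × 30.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_shapes(height=20, width=60, channels=64):
    """Shapes after the initial convolution and after max pooling.

    Stride and padding values are assumed, not taken from the paper.
    """
    h1, w1 = conv_out(height, 5, 1, 2), conv_out(width, 5, 1, 2)
    h2, w2 = conv_out(h1, 3, 2, 1), conv_out(w1, 3, 2, 1)
    return (channels, h1, w1), (channels, h2, w2)
```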
Using a Convolutional Neural Network to process MixedSC is motivated by its effectiveness in extracting local features from images. MixedSC can be likened to a low-resolution 360° panoramic snapshot of the real-world scene, with its three channels exhibiting similar overall shapes, reflecting the approximate geometric structure of the scene. However, each channel’s values differ from each other, as they represent three distinct features. Consequently, the data format of MixedSC resembles that of an RGB image. Moreover, the proposed Convolutional Neural Network has a relatively low computation overhead, totaling only $1.76 \times 10^{9}$ FLOPs, which ensures good performance and efficient computation speed at the same time.

3.4.2. Metric Learning

A training sample can be represented as a tuple $T = (P_i, P_{pos}, P_{neg})$. $P_i$ represents the query cloud, and $o_i$ denotes its origin. $P_{pos}$ and $P_{neg}$ denote the set of similar frames of $P_i$ and the set of dissimilar frames of $P_i$, respectively, which are defined as
$$P_{pos} = \{P_j \mid \|o_i - o_j\|_2 \le D_{pos},\ |i - j| > \Delta\}, \tag{15}$$
$$P_{neg} = \{P_j \mid \|o_i - o_j\|_2 > D_{neg}\}, \tag{16}$$
where $D_{pos}$ represents the distance threshold between similar frames, $\Delta$ denotes the frame index difference threshold between similar frames, and $D_{neg}$ signifies the distance threshold between dissimilar frames. The loss function ($L_{trip}$) employed in this study is based on the lazy triplet loss function from PointNetVLAD [3], which can be formulated as
$$L_{trip}(T) = \left[ \alpha + \max_i \left\| f(P_a) - f(P_{pos}^i) \right\|_2 - \min_j \left\| f(P_a) - f(P_{neg}^j) \right\|_2 \right]_+, \tag{17}$$
where $P_a$ is the query (anchor) cloud $P_i$, $f(\cdot)$ represents the mapping function represented by the trained MixedSCNet, $\alpha$ is a constant margin, and $[\cdot]_+$ denotes the hinge loss. This loss function selects the positive cloud with the maximum distance from the query cloud and the negative cloud with the minimum distance from the query cloud. Therefore, in each iteration, the neural network is optimized on the most challenging pair of positive and negative clouds. This process aims to minimize the distance between the query cloud and the positive cloud while maximizing the distance from the negative cloud.
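For intuition, the loss in Equation (17) can be evaluated by hand on small descriptors. The sketch below works on plain lists and computes only the scalar loss value; the function name `lazy_triplet_loss` is hypothetical, and an actual training loop would need a differentiable implementation in whatever deep learning framework is used.

```python
import math

def l2(a, b):
    """Euclidean distance between two descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def lazy_triplet_loss(query, positives, negatives, margin=0.5):
    """Hinge on (margin + farthest positive - closest negative)."""
    hardest_positive = max(l2(query, p) for p in positives)
    hardest_negative = min(l2(query, n) for n in negatives)
    return max(0.0, margin + hardest_positive - hardest_negative)
```

With a farthest positive at distance 2 and a closest negative at distance 3, the loss is zero for a margin of 0.5; moving the closest negative to distance 2.2 gives a loss of about 0.3.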

3.4.3. Data Augmentation

Inspired by [31], this paper similarly employs the column-shift data augmentation technique to enhance the robustness of MixedSCNet against variations in viewpoints. Specifically, the MixedSC tensor is initially partitioned into $N = 4$ sub-modules along the vertical axis, denoted by $T = \{T_1, \ldots, T_N\}$. Subsequently, these sub-modules undergo $M$ cyclic right shifts, where $M$ is randomly determined during the training process within the range $[|1, N|]$. This process yields a new MixedSC tensor, denoted by $T' = \{T'_1, \ldots, T'_N\}$, and the correspondence between the sub-modules of the old and new MixedSC is defined as $T'_j = T_i$, where
$$j = \begin{cases} (i + M) \,\%\, N, & \text{if } i + M \ne N \\ N, & \text{otherwise} \end{cases} \tag{18}$$
with % denoting the modulo operation. Consequently, during the training procedure, the network encounters MixedSC from different perspectives, enabling it to learn invariance to viewpoint changes.
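A column shift of this kind is easy to express on a nested-list tensor. In the sketch below, the sub-modules are assumed to be equal-width groups of adjacent sectors (columns), and the block index is computed as $((i + M - 1) \bmod N) + 1$, a 1-based wraparound that matches the mapping above and also covers the case $i + M = 2N$. The function name `column_shift` is hypothetical.

```python
def column_shift(tensor, m_shifts, n_blocks=4):
    """Rotate the sector axis right by m_shifts blocks of equal width.

    tensor: channels x rings x sectors nested lists (e.g., 3 x 20 x 60).
    """
    n_sectors = len(tensor[0][0])
    if n_sectors % n_blocks != 0:
        raise ValueError("sector count must be divisible by the block count")
    width = n_sectors // n_blocks
    shift = (m_shifts % n_blocks) * width  # shift measured in columns
    if shift == 0:
        return [[list(row) for row in channel] for channel in tensor]
    # Right rotation: the last `shift` columns move to the front.
    return [[row[-shift:] + row[:-shift] for row in channel]
            for channel in tensor]
```

During training, `m_shifts` would be drawn at random for each sample, for example with `random.randint(1, n_blocks)`.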

4. Experiments

4.1. Dataset and Experimental Settings

The KITTI dataset, derived from the Karlsruhe Institute of Technology and Toyota Technological Institute, stands as a widely recognized benchmark in autonomous driving research. It comprises an extensive array of high-resolution sensor data collected with a mobile platform operating in urban environments. The dataset employs the Velodyne HDL-64E LiDAR module. Leveraging its real-world complexity and richness, the KITTI dataset serves as a valuable resource for evaluating the robustness and generalization capabilities of place recognition models within challenging urban settings. The KITTI odometry benchmark consists of 22 sequences, with sequences 0-10 providing ground-truth poses. Notably, sequences with loop closures are 00, 02, 05, 06, 07, and 08. Consequently, this study primarily conducts experiments on these six sequences. Specifically, sequences 00, 02, and 08 are used for training, while all sequences are designated for testing. The training set of the KITTI dataset comprises a total of 2465 samples, while the test set contains 1538 samples.
The North Campus Long-Term (NCLT) dataset consists of data from various sensors collected with a Segway robotic platform. Spanning 34.9 h of recorded logs and covering a robot trajectory of 147.4 km across 27 mapping sessions, this dataset provides a comprehensive representation of diverse environments, including both indoor and outdoor settings. The Velodyne HDL-32E LiDAR module is employed for data acquisition. This dataset offers an extensive and varied collection of sensor data captured over an extended duration. In this study, we randomly select five sequences, namely, 2012-02-02, 2012-02-04, 2012-02-05, 2012-05-26, and 2012-08-20, to conduct experiments. Importantly, these sequences are exclusively utilized to test the cross-domain generalization performance of models trained on the KITTI dataset. Starting from the first frame of each sequence, query clouds are sampled at every 1-m interval. The NCLT dataset’s test set comprises a total of 9216 samples.
The number of frames and test samples for each sequence is detailed in Table 1. In the training set, for a training sample denoted by $T_{train} = (P_i, P_{pos}, P_{neg})$, the distance thresholds in Formulas (15) and (16) are set to $D_{pos} = 5$ m and $D_{neg} = 10$ m, while the frame index difference threshold is $\Delta = 0$. The choice of $\Delta = 0$ is made to maximize the number of training samples, aligning it with the number of frames in each training sequence. In the test set, for a test sample denoted by $T_{test} = (P_i, P_{pos})$, the distance threshold for similar frames is $D_{pos} = 3$ m, and the frame index difference threshold is $\Delta = 300$. Setting $\Delta$ to 300 is intended to filter out several frames before and after the current frame. These frames generally exhibit minor environmental changes, since their timestamps are very close to the current frame. Incorporating such frames in the set of similar frames is deemed less meaningful. This adjustment aligns the place recognition task more closely with real-world scenarios, rendering it more challenging.
Table 1. The number of frames and test samples for each sequence.
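The way these thresholds select similar and dissimilar frames can be sketched directly from Formulas (15) and (16). The example assumes that each frame's origin is a coordinate tuple and that frames are indexed in acquisition order; the function name `mine_sets` is hypothetical.

```python
import math

def mine_sets(origins, i, d_pos, d_neg, delta):
    """Return (positive indices, negative indices) for query frame i."""
    positives, negatives = [], []
    for j, origin in enumerate(origins):
        distance = math.dist(origins[i], origin)
        if distance <= d_pos and abs(i - j) > delta:
            positives.append(j)   # close in space, far enough in frame index
        elif distance > d_neg:
            negatives.append(j)   # clearly a different place
    return positives, negatives
```

With $\Delta = 0$ the index condition only excludes the query frame itself, which is why the training setting yields the largest number of samples.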
Two GeForce GTX 1080 Ti GPUs (Nvidia, Wuhan, China) are deployed to train MixedSCNet with distributed training techniques. The training batch size is configured to 1, comprising a query point cloud, two randomly selected positive point clouds, and eighteen random negative point clouds. Experience suggests that setting the batch size to 1 makes the model less prone to getting trapped in local optima and makes it converge more effectively. Adam is the chosen optimizer, with a base learning rate set to $1 \times 10^{-3}$. When the top 1% recall does not improve for five consecutive epochs, the learning rate is reduced to 0.1 times the rate of the previous epoch, and the minimum learning rate is set to $1 \times 10^{-6}$. In Equation (10), $r_{min} = 3$ m and $r_{max} = 90$ m. Due to variations in the installation height and angle of the LiDAR sensor in the KITTI and NCLT datasets, the $z_{min}$ and $z_{max}$ values differ. For KITTI, $z_{min} = 0.9$ m, and $z_{max} = 3.2$ m; on the other hand, for NCLT, $z_{min} = -20.0$ m, and $z_{max} = 1.0$ m. In Equation (17), the margin $\alpha$ is set to 0.5.
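The learning rate rule described above behaves like a reduce-on-plateau schedule. The sketch below simulates it on a list of per-epoch recall values. Two details are assumptions: an epoch counts as "no improvement" when its recall does not exceed the best value seen so far, and the patience counter restarts after each reduction. The function name `reduce_on_plateau` is hypothetical.

```python
def reduce_on_plateau(recalls, base_lr=1e-3, factor=0.1,
                      patience=5, min_lr=1e-6):
    """Return the learning rate used in each epoch.

    recalls: top 1% recall measured at the end of each epoch.
    """
    lr, best, stale, history = base_lr, float("-inf"), 0, []
    for recall in recalls:
        history.append(lr)          # rate in effect during this epoch
        if recall > best:
            best, stale = recall, 0
        else:
            stale += 1
            if stale >= patience:   # five epochs without improvement
                lr = max(lr * factor, min_lr)
                stale = 0
    return history
```

Starting from $10^{-3}$, three reductions are enough to reach the floor of $10^{-6}$, after which the rate stays constant.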

4.2. Comparison Methods

This study conducts comparative analyses between MixedSCNet and five other methods: PointNetVLAD, Scan Context, Intensity Scan Context, MinkLoc3Dv2, and BoW3D:
(1)
PointNetVLAD [3], proposed by Mikaela Angelina Uy et al. in 2018, is a deep neural network-based place recognition method that combines PointNet and NetVLAD. For the in-house datasets in the original paper, a 25 × 25 bounding box was split, and the points within the box were downsampled to 4096 points and then fed to the neural network. For the KITTI and NCLT datasets used in this study, we similarly define a bounding box centered at the origin of a single-frame point cloud and downsample to 4096 points, aligning the input format with the original paper. Subsequently, we utilize the refined pre-trained model in the official source code (https://github.com/mikacuy/pointnetvlad (accessed on 1 December 2023)) and retrieve candidate frames based on the global descriptor generated by PointNetVLAD.
(2)
Scan Context [13], introduced by Giseop Kim et al. in 2018, is a place recognition method utilizing manually designed feature descriptors. We refer to its open-source implementation integrated with LIO-SAM [37], SC-LIO-SAM (https://github.com/gisbi-kim/SC-LIO-SAM (accessed on 15 November 2023)), whose default parameters are retained. Following the method in [13], we employ the ring key to retrieve several similar frames as candidate frames, accomplishing the place recognition task.
(3)
Intensity Scan Context (ISC) [17] is an improved algorithm of Scan Context proposed by Han Wang et al. in 2020, incorporating both geometric structure and reflection intensity information. We reference the open-source implementation ISCLOAM (https://github.com/wh200720041/iscloam (accessed on 7 January 2024)), maintaining default parameters and adhering to the same criteria for evaluating candidate frame similarity scores, where the total similarity score for candidate frames is computed as the sum of geometric and intensity scores.
(4)
MinkLoc3Dv2 [24], introduced by Jacek Komorowski et al. in 2022, represents a deep neural network-based place recognition method. At present, MinkLoc3Dv2 exhibits the best performance among all open-source implementations on the Oxford RobotCar dataset. The input data requirements for MinkLoc3Dv2 align with those of PointNetVLAD, necessitating the adoption of identical point cloud preprocessing methods during reproduction. We utilize the refined pre-trained models provided by the official source code (https://github.com/jac99/MinkLoc3Dv2) (accessed on 7 January 2024) and perform candidate frame retrieval based on the global descriptors generated by MinkLoc3Dv2.
(5)
BoW3D [5], presented by Yunge Cui et al. in 2023, is a loop closure detection method based on the handcrafted descriptor LinK3D [38]. The original paper only demonstrates loop closure detection results on KITTI, and this study extends the evaluation to place recognition tasks on two datasets. Given that the original paper focused on loop closure detection, the Bag of Words (BoW) is dynamically updated in real time. However, when applied to the place recognition task, we traverse the data twice. The first traversal is conducted to establish a comprehensive vocabulary database. In the second pass, where the database is no longer updated, candidate frames for each frame are identified within the database and subsequently arranged in descending order based on their associated scores. The entire experimental procedure maintains the default parameters as specified in the source code (https://github.com/YungeCui/BoW3D) (accessed on 20 November 2023).

4.3. Place Recognition Performance

As conducted in PointNetVLAD, we assess several methods on the KITTI dataset and the NCLT dataset based on the average recall rate at top 1 and top 1%. Notably, recall rate at top 1% holds greater practical relevance in real-world scenarios, as paths traversed by robots may encompass similar scenes. Strictly adhering to the top candidate in such situations may result in missed detections, thereby compromising the robustness of localization.
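The two metrics can be computed from ranked candidate lists as follows. This is a sketch under assumptions: a query counts as recalled when at least one true match appears among its first k candidates, and the top 1% cutoff is taken as 1% of the database size rounded to an integer, with a minimum of one candidate. The function names are hypothetical.

```python
def recall_at_k(ranked, positives, k):
    """Percentage of queries with a true match among the first k candidates.

    ranked[q]: database indices for query q, best match first.
    positives[q]: set of database indices that are true matches for q.
    """
    hits = sum(1 for candidates, truth in zip(ranked, positives)
               if truth & set(candidates[:k]))
    return 100.0 * hits / len(ranked)

def top1_and_top1pct(ranked, positives, database_size):
    """Return (recall at top 1, recall at top 1%) in percent."""
    k = max(1, round(database_size / 100))  # assumed cutoff for top 1%
    return recall_at_k(ranked, positives, 1), recall_at_k(ranked, positives, k)
```

For a database of 200 keyframes the top 1% cutoff is two candidates, so a query whose true match is ranked second counts toward top 1% recall but not toward top 1 recall.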
Table 2 presents the recall rate performance of MixedSCNet and five comparative methods. Due to environmental variations, an evident performance drop is observed for PointNetVLAD on both the KITTI and NCLT datasets compared with its recall rates on the three test sets in the original paper. Notably, the pronounced decline in recall rate on the NCLT dataset is attributed to its 32-line LiDAR sensor, in contrast to the 64-line LiDAR used in the original paper’s test sets. Moreover, the superior performance of Scan Context with default parameters on KITTI can be attributed to the fine tuning of its parameters specifically for the KITTI dataset. However, a noticeable decline is observed on the NCLT dataset, primarily due to the inclusion of numerous narrow indoor environments, where Scan Context exhibits diminished adaptability. Intensity Scan Context, evaluated initially on the KITTI dataset, demonstrates the highest recall at top 1. The combination of geometric and intensity features, along with feature extraction from raw point clouds, contributes to its superior performance on KITTI. Nevertheless, a similar performance drop is observed on the NCLT dataset, akin to Scan Context. As the state-of-the-art method on the Oxford RobotCar dataset, MinkLoc3Dv2 maintains good generalization on NCLT, displaying the highest recall at top 1%. Its top 1 recall, however, slightly lags behind Intensity Scan Context and MixedSCNet. Notably, when transferred to the KITTI dataset, MinkLoc3Dv2 underperforms obviously compared with two manually designed BEV descriptor-based methods and MixedSCNet. BoW3D exhibits the poorest performance among the six methods. For a place recognition algorithm with high robustness, its performance in loop closure detection and place recognition tasks should ideally be consistent. 
However, in the original paper, BoW3D demonstrated excellent loop closure detection results on KITTI but considerably poor place recognition performance with the same set of hyperparameters under this paper’s task settings. Its performance further deteriorates when transferred to the NCLT dataset.
Table 2. Average recall (%) at top 1 and top 1% of different methods on KITTI and NCLT datasets. The bold number represents the highest recall among all the comparative methods in one dataset.
In contrast, MixedSCNet is the only method that demonstrates state-of-the-art performance on both datasets. On the KITTI dataset, MixedSCNet and Intensity Scan Context stand as the top methods. Notably, MixedSCNet achieves the highest recall at top 1%, with a slightly lower recall at top 1 compared with Intensity Scan Context. On the NCLT dataset, MixedSCNet and MinkLoc3Dv2 emerge as the top performers, significantly surpassing alternative methods. Specifically, MixedSCNet achieves the highest recall at top 1, with recall at top 1% closely trailing MinkLoc3Dv2. Trained on three sequences of the KITTI dataset, MixedSCNet is expected to perform well on KITTI, but its remarkable performance on the NCLT dataset, with almost no decrease in recall rate at top 1% and the highest top 1 recall rate, indicates its exceptional cross-domain generalization capabilities. Additionally, MixedSCNet’s performance on the NCLT dataset underscores its adaptability to narrow environments, as it retains intensity and smoothness information from point clouds, which proves to be effective in such confined spaces.
Figure 4 illustrates the importance of the smoothness channel. Frames 350 and 5333, extracted from sequence 2012-02-02 of the NCLT dataset, constitute a reverse visit. It can be observed that the similarity in the intensity channel of the two point clouds is weaker compared with the smoothness channel. Consequently, the Intensity Scan Context algorithm fails to include frame 5333 in the top 1% candidate frames of frame 350. In contrast, MixedSCNet successfully identifies it. Figure 5 showcases another pair of reverse visit frames from the same sequence. Intensity Scan Context also fails to correctly recognize it. However, the feature maps output by MixedSCNet reveal a minimal dissimilarity between the query frame and the reverse visit frame while maintaining a substantial dissimilarity with other negative frames.
Figure 4. Intensity and smoothness channel visualization for a reverse visit case. (a,b) Reverse visit in NCLT 2012-02-02 sequence (c,d) and corresponding intensity channels produced by Intensity Scan Context, where the latter one is the 180°-shifted version of the source channel. (e,f) The smoothness channels of their corresponding MixedSC.
Figure 5. Feature map visualization for MixedSCNet. The feature vector generated by MixedSCNet is reshaped into a 32 × 32 matrix, which stands for the final feature map.
Figure 6 illustrates the recall–candidate number curves for the six methods across all sequences. For the KITTI dataset, sequence 02 poses a notable challenge, exhibiting a considerable degradation in performance for all methods compared with other sequences. BoW3D, in particular, experiences a pronounced decline, while MixedSCNet displays the smallest performance drop among the six methods. Sequence 08 proves to be the most challenging for PointNetVLAD due to the substantial presence of reverse visits, which PointNetVLAD struggles to identify. Regarding the NCLT dataset, significant variations in scenes and LiDAR channel numbers lead to a marked decline in performance for the other three methods. In contrast, MixedSCNet demonstrates minimal deterioration in performance, indicating its robust cross-domain generalization capabilities.
Figure 6. Recall–candidate number curve for all sequences.
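The recall–candidate number curve can be sketched as follows. This is a hedged illustration that assumes the common convention in which a query counts as correctly recognized if at least one true revisit appears among its top-N retrieved candidates, and in which queries without any true revisit are skipped. The exact evaluation protocol is not restated in this section, and the function names and toy data are hypothetical.

```python
def recall_at_n(ranked_candidates, true_matches, n):
    """Fraction of queries with at least one true match in their top-n list.

    ranked_candidates: dict mapping query id -> list of database ids, best first.
    true_matches: dict mapping query id -> set of database ids that are revisits.
    Queries with an empty set of true matches are skipped (assumption).
    """
    hits = 0
    total = 0
    for query_id, truth in true_matches.items():
        if not truth:
            continue
        total += 1
        top = ranked_candidates.get(query_id, [])[:n]
        if any(candidate in truth for candidate in top):
            hits += 1
    return hits / total if total else 0.0


def recall_curve(ranked_candidates, true_matches, max_n):
    """Recall for every candidate number from 1 to max_n."""
    return [recall_at_n(ranked_candidates, true_matches, n)
            for n in range(1, max_n + 1)]


# Toy example with three queries (hypothetical ids).
ranked = {0: [5, 7, 9], 1: [2, 3, 4], 2: [8, 1, 6]}
truth = {0: {7}, 1: {2}, 2: {0}}
curve = recall_curve(ranked, truth, 3)
```

In this toy case, only query 1 is matched at N = 1, query 0 is additionally matched at N = 2, and query 2 is never matched, so the curve rises and then saturates below 1.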

4.4. System Runtime

Table 3 presents the average runtime of the six methods on the KITTI 07 sequence. All tests were conducted on a computer equipped with an AMD Ryzen 7 5825U CPU (AMD, Wuhan, China). For every method except Intensity Scan Context, most of the processing time is spent generating the global descriptor, while the database retrieval time remains relatively short. Intensity Scan Context must compute both geometric and intensity scores for each candidate frame, so it shows the longest database retrieval time, which grows linearly with database size. To ensure fairness, all methods are tested on a CPU. Because our method, PointNetVLAD, and MinkLoc3Dv2 rely on neural networks, their descriptor generation times are comparatively long under this setting; running these networks on a GPU would be expected to reduce the runtime significantly.
Table 3. Average runtime of all methods on KITTI 07.
Furthermore, the descriptor generation time for MixedSCNet is approximately 1/4 of that for MinkLoc3Dv2. This discrepancy is attributed, firstly, to the more streamlined architecture of MixedSCNet, which reduces computational complexity. Secondly, MinkLoc3Dv2 incorporates ground fitting and downsampling operations during point cloud preprocessing, whereas MixedSCNet processes the point cloud directly, computing only the maximum height, intensity, and smoothness information within each sub-region. Consequently, MixedSCNet has a lower overall computational load and the shortest overall runtime among the three top-performing methods.
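The per-sub-region computation described above can be sketched as a single pass over the point cloud. This is a simplified illustration under several assumptions that are not taken from this section: each channel stores the maximum value observed in a bin, empty bins are zero, all values are non-negative, and a per-point smoothness value has already been computed and is passed in as input. The function name, grid resolution, and `max_range` are hypothetical.

```python
import math


def build_bev_channels(points, num_rings=20, num_sectors=60, max_range=80.0):
    """Bin points into a polar grid and keep per-bin maxima for three channels.

    points: iterable of (x, y, z, intensity, smoothness) tuples.
    Returns (height, intensity, smoothness), each a num_rings x num_sectors grid.
    """
    def empty_grid():
        return [[0.0] * num_sectors for _ in range(num_rings)]

    height, intensity, smoothness = empty_grid(), empty_grid(), empty_grid()
    for x, y, z, i, s in points:
        r = math.hypot(x, y)
        if r >= max_range:
            continue  # ignore points outside the descriptor radius
        ring = min(int(r / max_range * num_rings), num_rings - 1)
        theta = math.atan2(y, x) % (2 * math.pi)  # azimuth in [0, 2*pi)
        sector = min(int(theta / (2 * math.pi) * num_sectors), num_sectors - 1)
        height[ring][sector] = max(height[ring][sector], z)
        intensity[ring][sector] = max(intensity[ring][sector], i)
        smoothness[ring][sector] = max(smoothness[ring][sector], s)
    return height, intensity, smoothness


# Toy cloud: two points in the same bin, one behind the sensor, one out of range.
cloud = [
    (1.0, 0.0, 2.0, 0.5, 0.1),
    (1.0, 0.1, 3.0, 0.2, 0.4),
    (-10.0, 0.0, 1.0, 0.9, 0.3),
    (100.0, 0.0, 5.0, 1.0, 1.0),
]
h, i, s = build_bev_channels(cloud, num_rings=4, num_sectors=4, max_range=20.0)
```

Each point is visited once and touches a single bin, so the cost is linear in the number of points and requires no ground fitting or downsampling.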

5. Conclusions

This paper presents a place recognition method based on Bird’s Eye View (BEV) descriptors and deep learning. By leveraging computationally efficient handcrafted descriptors and a lightweight neural network, the proposed approach achieves rapid place recognition. Additionally, it demonstrates robust adaptability to reverse visits and superior generalization performance in cross-domain scenarios. First, the method transforms point clouds into a three-channel BEV descriptor, termed MixedSC, which encodes height, intensity, and smoothness information. Subsequently, a compact deep neural network, MixedSCNet, is trained to transform MixedSC into a global descriptor enriched with high-level features. Finally, a KD tree is used to retrieve multiple candidate frames from the database based on their similarity to the query frame.
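The final retrieval step can be illustrated with a small KD tree over global descriptors. This is a toy sketch in pure Python, assuming squared Euclidean distance and two-dimensional descriptors for readability. The descriptors produced by MixedSCNet are much higher-dimensional, and the tree implementation used in the paper is not specified in this section. All names below are hypothetical.

```python
import heapq


def build_kdtree(items, depth=0):
    """Build a KD tree from a list of (descriptor, frame_id) pairs."""
    if not items:
        return None
    axis = depth % len(items[0][0])
    items = sorted(items, key=lambda item: item[0][axis])
    mid = len(items) // 2
    return (items[mid], axis,
            build_kdtree(items[:mid], depth + 1),
            build_kdtree(items[mid + 1:], depth + 1))


def _search(node, target, k, heap):
    """Recursive k-nearest search; heap holds (-squared_distance, frame_id)."""
    if node is None:
        return
    (vec, frame_id), axis, left, right = node
    d2 = sum((a - b) ** 2 for a, b in zip(vec, target))
    if len(heap) < k:
        heapq.heappush(heap, (-d2, frame_id))
    elif -d2 > heap[0][0]:
        heapq.heapreplace(heap, (-d2, frame_id))
    diff = target[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    _search(near, target, k, heap)
    # Visit the far side only if it could still contain a closer descriptor.
    if len(heap) < k or diff * diff < -heap[0][0]:
        _search(far, target, k, heap)


def query_kdtree(tree, target, k):
    """Return up to k (squared_distance, frame_id) pairs, nearest first."""
    if k <= 0:
        return []
    heap = []
    _search(tree, target, k, heap)
    return sorted((-neg_d2, frame_id) for neg_d2, frame_id in heap)


# Toy database of 2-D descriptors with integer frame ids (hypothetical values).
database = [((0.0, 0.0), 0), ((1.0, 1.0), 1), ((5.0, 5.0), 2),
            ((0.9, 1.2), 3), ((-3.0, 4.0), 4)]
tree = build_kdtree(database)
candidates = [frame_id for _, frame_id in query_kdtree(tree, (1.0, 1.0), 2)]
```

For the query `(1.0, 1.0)`, the two retrieved candidates are frames 1 and 3, the same result a brute-force scan would give, while the pruning test lets the search skip branches that cannot contain a closer descriptor.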
Experimental results indicate that MixedSCNet achieves commendable performance on both the KITTI and NCLT datasets. Compared with suboptimal methods, such as Intensity Scan Context and MinkLoc3Dv2, MixedSCNet exhibits consistent performance across two different datasets with the lowest computational cost. It is noteworthy that the outstanding performance of MixedSCNet persists even on the NCLT dataset, which exhibits a huge difference from the training domain, underscoring the robustness and generalization capabilities of the proposed method.
Our method has a few limitations, which future work will try to address. Firstly, the combination of MixedSC and the neural network is not end-to-end, so the place recognition task is solved in two separate steps. Secondly, the neural network architecture employed in this study could benefit from further refinement, e.g., adding attention mechanisms. Lastly, compared with methods based on manually crafted descriptors, the speed of the proposed approach still requires improvement. It is worthwhile to explore lighter network architectures to enhance the recognition speed.

Author Contributions

Conceptualization, T.S. and Y.S.; methodology, Y.S.; software, Y.S.; validation, Y.S. and W.H.; formal analysis, Y.S. and W.H.; investigation, Y.S., D.Y. and B.B.; resources, Y.S., J.D., X.Z. and T.S.; data curation, Y.S., D.Y. and B.B.; writing—original draft preparation, Y.S. and W.H.; writing—review and editing, Y.S., W.H. and T.S.; visualization, Y.S. and W.H.; supervision, T.S.; project administration, J.D., X.Z. and T.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key Research and Development Plan Project of Hubei Province under grant 2021BAA196.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LiDAR      Light Detection and Ranging
BEV        Bird’s Eye View
MixedSC    Mixed Scan Context
KITTI      Karlsruhe Institute of Technology and Toyota Technological Institute
NCLT       North Campus Long-Term

References

  1. Thrun, S.; Burgard, W.; Fox, D. Probabilistic Robotics; Intelligent Robotics and Autonomous Agents; MIT Press: Cambridge, MA, USA, 2005.
  2. Yin, H.; Xu, X.; Lu, S.; Chen, X.; Xiong, R.; Shen, S.; Stachniss, C.; Wang, Y. A Survey on Global LiDAR Localization: Challenges, Advances and Open Problems. arXiv 2023, arXiv:2302.07433.
  3. Uy, M.A.; Lee, G.H. PointNetVLAD: Deep Point Cloud Based Retrieval for Large-Scale Place Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4470–4479.
  4. Cattaneo, D.; Vaghi, M.; Valada, A. LCDNet: Deep Loop Closure Detection and Point Cloud Registration for LiDAR SLAM. IEEE Trans. Robot. 2022, 38, 2074–2093.
  5. Cui, Y.; Chen, X.; Zhang, Y.; Dong, J.; Wu, Q.; Zhu, F. BoW3D: Bag of Words for Real-Time Loop Closing in 3D LiDAR SLAM. IEEE Robot. Autom. Lett. 2023, 8, 2828–2835.
  6. Jiang, B.; Shen, S. Contour Context: Abstract Structural Distribution for 3D LiDAR Loop Detection and Metric Pose Estimation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 8386–8392.
  7. Li, Y.; Li, H. LiDAR-Based Initial Global Localization Using Two-Dimensional (2D) Submap Projection Image (SPI). In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 5063–5068.
  8. Lu, S.; Xu, X.; Yin, H.; Chen, Z.; Xiong, R.; Wang, Y. One RING to Rule Them All: Radon Sinogram for Place Recognition, Orientation and Translation Estimation. In Proceedings of the 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Kyoto, Japan, 23–27 October 2022; pp. 2778–2785.
  9. Shi, X.; Chai, Z.; Zhou, Y.; Wu, J.; Xiong, Z. Global Place Recognition Using an Improved Scan Context for LIDAR-Based Localization System. In Proceedings of the 2021 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (AIM), Delft, The Netherlands, 12–16 July 2021; pp. 498–503.
  10. Xu, D.; Liu, J.; Liang, Y.; Lv, X.; Hyyppä, J. A LiDAR-based Single-Shot Global Localization Solution Using a Cross-Section Shape Context Descriptor. ISPRS J. Photogramm. Remote Sens. 2022, 189, 272–288.
  11. Yin, H.; Wang, Y.; Ding, X.; Tang, L.; Huang, S.; Xiong, R. 3D LiDAR-Based Global Localization Using Siamese Neural Network. IEEE Trans. Intell. Transp. Syst. 2020, 21, 1380–1392.
  12. Galvez-López, D.; Tardos, J.D. Bags of Binary Words for Fast Place Recognition in Image Sequences. IEEE Trans. Robot. 2012, 28, 1188–1197.
  13. Kim, G.; Kim, A. Scan Context: Egocentric Spatial Descriptor for Place Recognition Within 3D Point Cloud Map. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 4802–4809.
  14. Kim, G.; Choi, S.; Kim, A. Scan Context++: Structural Place Recognition Robust to Rotation and Lateral Variations in Urban Environments. IEEE Trans. Robot. 2022, 38, 1856–1874.
  15. Li, L.; Kong, X.; Zhao, X.; Huang, T.; Li, W.; Wen, F.; Zhang, H.; Liu, Y. SSC: Semantic Scan Context for Large-Scale Place Recognition. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 2092–2099.
  16. Luo, L.; Cao, S.Y.; Han, B.; Shen, H.L.; Li, J. BVMatch: Lidar-Based Place Recognition Using Bird’s-Eye View Images. IEEE Robot. Autom. Lett. 2021, 6, 6076–6083.
  17. Wang, H.; Wang, C.; Xie, L. Intensity Scan Context: Coding Intensity and Geometry Relations for Loop Closure Detection. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 2095–2101.
  18. Hui, L.; Yang, H.; Cheng, M.; Xie, J.; Yang, J. Pyramid Point Cloud Transformer for Large-Scale Place Recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 6098–6107.
  19. Komorowski, J. MinkLoc3D: Point Cloud Based Large-Scale Place Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 1790–1799.
  20. Vidanapathirana, K.; Moghadam, P.; Harwood, B.; Zhao, M.; Sridharan, S.; Fookes, C. Locus: LiDAR-based Place Recognition Using Spatiotemporal Higher-Order Pooling. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 5075–5081.
  21. Xia, Y.; Xu, Y.; Li, S.; Wang, R.; Du, J.; Cremers, D.; Stilla, U. SOE-Net: A Self-Attention and Orientation Encoding Network for Point Cloud Based Place Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 11348–11357.
  22. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision Meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
  23. Carlevaris-Bianco, N.; Ushani, A.K.; Eustice, R.M. University of Michigan North Campus Long-Term Vision and Lidar Dataset. Int. J. Robot. Res. 2016, 35, 1023–1035.
  24. Komorowski, J. Improving Point Cloud Based Place Recognition with Ranking-based Loss and Large Batch Training. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), Montreal, QC, Canada, 21–25 August 2022; pp. 3699–3705.
  25. Hausler, S.; Garg, S.; Xu, M.; Milford, M.; Fischer, T. Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14141–14152.
  26. Wang, R.; Shen, Y.; Zuo, W.; Zhou, S.; Zheng, N. TransVPR: Transformer-Based Place Recognition with Multi-Level Attention Aggregation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13648–13657.
  27. Ali-bey, A.; Chaib-draa, B.; Giguère, P. MixVPR: Feature Mixing for Visual Place Recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2998–3007.
  28. Liu, Z.; Zhou, S.; Suo, C.; Yin, P.; Chen, W.; Wang, H.; Li, H.; Liu, Y.H. LPD-Net: 3D Point Cloud Learning for Large-Scale Place Recognition and Environment Analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2831–2840.
  29. Zhang, W.; Xiao, C. PCAN: 3D Attention Map Learning Using Contextual Information for Point Cloud Based Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 12436–12445.
  30. Fan, Z.; Song, Z.; Liu, H.; Lu, Z.; He, J.; Du, X. SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition. AAAI 2022, 36, 551–560.
  31. Kim, G.; Park, B.; Kim, A. 1-Day Learning, 1-Year Localization: Long-Term LiDAR Localization Using Scan Context Image. IEEE Robot. Autom. Lett. 2019, 4, 1948–1955.
  32. Xu, X.; Yin, H.; Chen, Z.; Li, Y.; Wang, Y.; Xiong, R. DiSCO: Differentiable Scan Context With Orientation. IEEE Robot. Autom. Lett. 2021, 6, 2791–2798.
  33. Ma, J.; Zhang, J.; Xu, J.; Ai, R.; Gu, W.; Chen, X. OverlapTransformer: An Efficient and Yaw-Angle-Invariant Transformer Network for LiDAR-Based Place Recognition. IEEE Robot. Autom. Lett. 2022, 7, 6958–6965.
  34. Li, L.; Kong, X.; Zhao, X.; Huang, T.; Li, W.; Wen, F.; Zhang, H.; Liu, Y. RINet: Efficient 3D Lidar-Based Place Recognition Using Rotation Invariant Neural Network. IEEE Robot. Autom. Lett. 2022, 7, 4321–4328.
  35. Ma, J.; Xiong, G.; Xu, J.; Chen, X. CVTNet: A Cross-View Transformer Network for LiDAR-Based Place Recognition in Autonomous Driving Environments. IEEE Trans. Ind. Inf. 2023; 1–10, early access.
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  37. Shan, T.; Englot, B.; Meyers, D.; Wang, W.; Ratti, C.; Rus, D. LIO-SAM: Tightly-coupled Lidar Inertial Odometry via Smoothing and Mapping. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 5135–5142.
  38. Cui, Y.; Zhang, Y.; Dong, J.; Sun, H.; Chen, X.; Zhu, F. LinK3D: Linear Keypoints Representation for 3D LiDAR Point Cloud. arXiv 2023, arXiv:2206.05927.
