Large-Scale Place Recognition Based on Camera-LiDAR Fused Descriptor

In the field of autonomous driving, carriers are equipped with a variety of sensors, including cameras and LiDARs. However, the camera suffers from illumination changes and occlusion, and the LiDAR from motion distortion, degenerate environments and limited ranging distance. Therefore, fusing the information from these two sensors deserves to be explored. In this paper, we propose a fusion network which robustly captures both the image and point cloud descriptors to solve the place recognition problem. Our contributions can be summarized as: (1) applying a trimmed strategy in the point cloud global feature aggregation to improve recognition performance, (2) building a compact fusion framework which captures robust representations of both the image and the 3D point cloud, and (3) learning a proper metric to describe the similarity of our fused global feature. Experiments on the KITTI and KAIST datasets show that the proposed fused descriptor is more robust and discriminative than single-sensor descriptors.


Introduction
Place recognition has received a significant amount of attention in various fields, including computer vision [1][2][3][4][5][6], autonomous driving systems [7][8][9][10] and augmented reality [11]. In these tasks, place recognition addresses the question "where am I on a route?", which amounts to recognizing places by their appearance and surrounding structure. This technology therefore plays an important role in Simultaneous Localization And Mapping (SLAM) [12] for robotic and autonomous driving systems. Given a place recognition problem, a common and direct method is to find the best-matching candidate place(s) for the current place among those stored in a database of the environment map [13,14]. The fundamental scientific question is to determine an appropriate representation of a place that distinguishes similar and dissimilar places. Traditionally, visual features such as Oriented FAST and Rotated BRIEF (ORB) [15], Scale-Invariant Feature Transform (SIFT) [16] and Speeded Up Robust Features (SURF) [17] have been proposed to describe strong corner features in each image. Moreover, methods like the Bag-of-Words (BoW) [18,19], Vector of Locally Aggregated Descriptors (VLAD) [20] and Fisher vector [21] have been used to aggregate these local features into a single vector representing the entire image, and thus a certain place. Existing well-performing image retrieval [22][23][24][25] and place recognition [6,26] systems mainly utilize such traditional hand-crafted information. Meanwhile, in some scenes the visual appearance changes considerably while the point clouds remain almost the same, meaning that the LiDAR information is helpful for reaching a proper conclusion. Conversely, as depicted in Figure 2, the LiDAR features B3 and B4 are different (marked by red triangular regions), but their visual data B1 and B2 are sufficient to prove that these two frames were captured in the same place.
Actually, these two cases show that complementarity exists between the image and the point cloud. Therefore, we hope to fuse the information of image and point cloud, since each possesses its own peculiarities. Some pioneers have tried to fuse these two kinds of input by establishing the relationship between rectilinear features in the image and planar intersections in the point cloud [54], or by combining them in the final decision [55]. Unlike the above methods, our method uses networks to learn features that fuse the two kinds of input, which achieves tighter coupling. In this paper we propose a camera-LiDAR sensor fusion method to extract fused global descriptors for place recognition via a deep neural network. The contributions of our work are summarized as: (1) applying the trimmed strategy to non-informative clusters to reduce the impact of unrepresentative information in a 3D point cloud; (2) proposing a fusion neural network concatenating both the visual descriptor and the 3D spatial global descriptor; and (3) using the metric learning ideology to reduce the distance between fused descriptors of similar places and extend it between dissimilar places. With a metric learning loss, the whole network is optimized to obtain an appropriate mapping from the 3D Euclidean space to the descriptor space, in which the descriptor can distinguish different places more easily. This paper is organized as follows. Section 2 introduces some methods directly related to the approach in this paper. In Section 3 we introduce the framework of our neural-network-based method that generates a global fusion descriptor from visual and 3D point information. Section 4 presents the datasets and experimental results. The paper is concluded in Section 5.

Visual Feature Extractor
ResNet [27] is the basis of our proposed approach for extracting image features. It performs well in image recognition and solves the degeneration problem encountered when training deep neural networks. This method presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously. Its principal module, the residual block, is shown in Figure 3. For a stacked-layers structure, X is the input of the first layer of the residual block. Formally, denoting the desired output as H(X), the residual block learns the nonlinear mapping F(X) = H(X) − X to obtain the residual value. This not only solves the degeneration problem of deep learning, but also noticeably reduces the computational complexity. By stacking residual blocks, the ResNet is formed naturally to obtain deeper features of images.
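The identity-shortcut idea above can be illustrated with a minimal NumPy sketch (not the actual ResNet implementation; the two-layer mapping, weight shapes and tiny initialization are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy two-layer residual block: output H(x) = F(x) + x, so the
    stacked layers only need to learn the residual mapping F(x)."""
    f = relu(x @ w1) @ w2   # the learned residual F(x)
    return relu(f + x)      # identity shortcut added before the activation

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
w1 = rng.standard_normal((8, 8)) * 0.01  # tiny weights: F(x) starts near 0
w2 = rng.standard_normal((8, 8)) * 0.01
y = residual_block(x, w1, w2)
```

With near-zero weights the block is close to an identity mapping, which is exactly why very deep stacks of such blocks remain trainable.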

Global Descriptor of 3D Point Cloud via NetVLAD
VLAD [20], as mentioned before, is a classical hand-crafted feature aggregation method which can combine local features into a single vector representing an entire image, and thus a certain place. Fortunately, NetVLAD [34] combines the advantages of VLAD and data-driven methods. Based on PointNet [50] and NetVLAD, PointNetVLAD [36] focuses on place recognition as scene retrieval, and outperforms the max-pool features extracted from the original PointNet. Its framework is shown in Figure 4. The main idea of PointNetVLAD is to take the per-point features of every point cloud learned by the PointNet block as the input of the NetVLAD block, and then utilize the N input D-dimensional feature points to aggregate a single (K · D) × 1 global descriptor vector as

V_k = Σ_{l=1}^{N} [exp(w_k^T p_l + b_k) / Σ_{k'=1}^{K} exp(w_{k'}^T p_l + b_{k'})] (p_l − c_k), k = 1, . . . , K,

where the cluster centers {c_k} and parameters {w_k}, {b_k} are trainable from the network. One branch of the NetVLAD block computes the difference between local features and cluster centers, while the softmax part determines the weights of the points with respect to each cluster center. The advantage of PointNetVLAD is that it acquires a global descriptor for the whole point cloud, which has less redundancy compared with local feature extraction operators. However, the global descriptor may be disturbed by outliers and isolated points. We avoid these problems via our modified local-to-global fusion descriptor retrieval.
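The soft-assignment aggregation described above can be sketched in NumPy as follows (a simplified sketch of the NetVLAD layer, not the authors' TensorFlow code; the intra-normalization and final L2 step follow the standard NetVLAD formulation):

```python
import numpy as np

def netvlad(points, centers, w, b):
    """NetVLAD aggregation sketch: softly assign each of the N D-dim point
    features to K cluster centers, then sum the weighted residuals
    (p_l - c_k) into a (K*D)-dimensional global descriptor."""
    logits = points @ w.T + b                    # (N, K): w_k^T p_l + b_k
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)            # softmax soft assignment
    res = points[:, None, :] - centers[None, :, :]   # (N, K, D) residuals
    V = (a[:, :, None] * res).sum(axis=0)        # (K, D) aggregated residuals
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12  # intra-normalization
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)       # final L2 normalization

rng = np.random.default_rng(1)
N, D, K = 100, 16, 4
p = rng.standard_normal((N, D))
c = rng.standard_normal((K, D))
w = rng.standard_normal((K, D))
b = rng.standard_normal(K)
v = netvlad(p, c, w, b)
```

The output is a unit-norm (K · D)-dimensional vector, matching the descriptor size stated above.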

Proposed Method
We list here some notations used in our algorithm. For each scene, the source data (I, P̃) include the image information I and the corresponding 3D point cloud P̃. Note that P̃, acquired from the LiDAR, may have different sizes across scenes. To facilitate feature extraction in the deep neural network, we apply a down-sampling preprocess to the 3D source point cloud P̃ to get a fixed-size point cloud P = {p_l | p_l ∈ R^3}_{l=1}^{N} before feature extraction. We then use the neural network to generate a robust and compact representation for a specific place (I, P). To acquire this representation, we need to learn a proper mapping F(·) from the input data space S = {(I, P)} to a new space that facilitates place retrieval. The whole framework is shown in Figure 5. A good mapping should consist of two modules: an efficient feature extraction operator F (the green, yellow and blue blocks in Figure 5) and the best metric M to evaluate the similarity of feature descriptors (the red block in Figure 5). This process can be formulated as

x̃ = M(F(S)) = M(F_I(I) ⊕ F_P(P))

for a certain place S = (I, P). Firstly, we build a feature extraction operator F to represent the place accurately and robustly. Note that each place has an image and a corresponding 3D point cloud, so the feature extraction operator F has two branches F_I and F_P to extract the visual feature x_I and the 3D point cloud feature x_P respectively, that is, F = F_I ⊕ F_P. The camera and LiDAR inputs, according to their data patterns, are fed into the image feature extraction branch (the blue block) and the 3D spatial feature aggregation branch (the green and yellow blocks) to get the respective descriptors x_I and x_P. Specifically, we introduce a trimmed clustering method to extract the global descriptor x_P for the 3D point cloud by ignoring non-informative point feature clusters.
To overcome the shortcomings of both the visual and LiDAR information processes, a concatenation and a fully connected operation are applied to build the global fusion feature, that is, x = x_I ⊕ x_P. Furthermore, triplet constraints on the samples are used to learn the best metric M, finding a good viewpoint to describe the similarity of samples. Finally, the model parameters are optimized via end-to-end training to obtain an appropriate mapping between the input data and the fused global descriptor x̃. We introduce these modules in the following sections.

Spatial Feature Aggregation with Trimmed Strategy
In this part, we apply a local-to-global methodology to aggregate the point cloud features. This procedure begins with a local feature extraction (the green block) for the point cloud, followed by a global feature extraction (the yellow block).

Local Feature Extraction for Point Cloud
For the input point cloud P, a Multi-Layer Perceptron (MLP) and a feature transform are applied to extract local spatial feature information, mapping each 3-dimensional point p_l ∈ R^3 into a higher dimensional space R^D, for D ≫ 3; that is, each point is mapped as p_l → MLP(p_l) ∈ R^D. This part of the network (the green block) extracts local rotation-invariant spatial features for each single point p_l ∈ R^3 of the input point cloud P. Thus, each original 3D point is mapped into a higher dimensional local feature vector. This redundant representation lets the points retain much more of their own information, and is widely used as the feature extraction module in point cloud processing, such as in PointNet [50]. Some redundant information will be omitted to form a compact global descriptor in the subsequent process.
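A shared per-point MLP of this kind can be sketched as below (a PointNet-style sketch, not the paper's network; the 3 → 64 → 1024 layer sizes are illustrative assumptions, and the feature-transform sub-network is omitted for brevity):

```python
import numpy as np

def shared_mlp(points, weights, biases):
    """Shared per-point MLP sketch: the SAME weights map every 3-D point
    independently into a D-dimensional local feature, so the result is
    equivariant to point ordering."""
    x = points                          # (N, 3)
    for W, b in zip(weights, biases):
        x = np.maximum(x @ W + b, 0.0)  # ReLU after each shared layer
    return x                            # (N, D)

rng = np.random.default_rng(2)
pts = rng.standard_normal((6000, 3))
# illustrative 3 -> 64 -> 1024 layer sizes
Ws = [rng.standard_normal((3, 64)) * 0.1,
      rng.standard_normal((64, 1024)) * 0.1]
bs = [np.zeros(64), np.zeros(1024)]
feats = shared_mlp(pts, Ws, bs)
```

Because the weights are shared across points, each row of `feats` depends only on its own input point, which is what makes the subsequent global aggregation order-independent.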

Global Feature Extraction with Trimmed Clustering
Compared with PointNetVLAD [36], we introduce a novel trimmed VLAD block for point clouds in this module (the yellow block in Figure 5). Originally, K cluster centers K = {c_k | c_k ∈ R^D}_{k=1}^{K} and their corresponding weights W_k(·) = W_k(w_k, b_k)(·), k = 1, . . . , K are learned to represent K special and prominent parts of the current 3D space, so that the outputs related to these cluster centers are distinctive. However, this procedure is sensitive to the complex environments captured by a 3D LiDAR. To remove this redundant information and environmental disturbance robustly, we introduce a trimmed strategy that ignores the non-informative clusters of the 3D point cloud, that is, the trimmed VLAD block. The details of this new block are shown in Figure 6. Notice that not all points and clusters are useful for forming a compact and robust representation of the point cloud. For instance, some outliers or isolated points do not correctly reflect the features of the point cloud P, so they are not helpful to its representation. Moreover, points belonging to dynamic clusters in a scene may exert a negative influence on recognition precision. Under this consideration, we assume that there exist additional cluster centers G = {c_k | c_k ∈ R^D}_{k=K+1}^{K+G} with corresponding weights W_k(·) = W_k(w_k, b_k)(·), k = K + 1, . . . , K + G for the non-informative clusters, so as to separate out their influence.
For a given proper cluster size K and non-informative cluster size G, the trimmed weights of the input descriptors P = {p_l | p_l ∈ R^D}_{l=1}^{N} can be computed via

W_k(p_l) = exp(w_k^T p_l + b_k) / Σ_{k'=1}^{K+G} exp(w_{k'}^T p_l + b_{k'}), k = 1, . . . , K, (3)

where {w_k}, {b_k}, k = 1, . . . , K are the weight parameters of the proper clusters, and {w_k}, {b_k}, k = K + 1, . . . , K + G are the weight parameters of the non-informative clusters. All of them are trainable via the trimmed-weights sub-block. Note that the weight parameters of the non-informative clusters are included in the summation in the denominator of Equation (3), which makes these non-informative clusters contribute to the soft assignments in the same manner as the original clusters K. Moreover, only the trimmed weights of the meaningful clusters K are computed, since the weights of the non-informative clusters are ignored in the aggregation process.
For the input local descriptors, the global descriptor V = [V_1, . . . , V_K] is obtained after the partial aggregation, in which V_k can be written as

V_k = Σ_{l=1}^{N} W_k(p_l) (p_l − c_k),

where c_k is the proper cluster center, W_k(p_l) is the trimmed weight of the local descriptor p_l for the meaningful clusters K, and p_l − c_k represents the difference between the local descriptor of the sample and its corresponding meaningful cluster center c_k. The cluster centers and the differences (partial residuals) are computed in a single branch of the network. Then we assign the trimmed weights to the partial residuals of the meaningful clusters in the partial aggregation process to get the global descriptor V. This trimmed-weight based aggregation process is robust to complex LiDAR data, since the non-informative clusters are considered both in the weight computation and in the aggregation process. Moreover, we intra-normalize each vector V_k first, then concatenate them, followed by L2 normalization. Thus, a (K · D)-dimensional feature vector is generated. After the trimmed VLAD block, we use a fully connected layer to select useful spatial features, obtaining a Q-dimensional compact global descriptor.
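The trimmed scheme, where all K + G clusters compete in the softmax but only the K proper clusters are aggregated, can be sketched as follows (a NumPy sketch under the stated assumptions, not the authors' implementation):

```python
import numpy as np

def trimmed_vlad(points, centers_all, w_all, b_all, K):
    """Trimmed VLAD sketch: the first K clusters are 'proper', the
    remaining G are non-informative. All K+G clusters appear in the
    softmax denominator, but only the K proper clusters contribute
    residuals to the global descriptor."""
    logits = points @ w_all.T + b_all             # (N, K+G)
    logits -= logits.max(axis=1, keepdims=True)
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)             # soft assignment over K+G
    a = a[:, :K]                                  # drop non-informative weights
    res = points[:, None, :] - centers_all[None, :K, :]   # (N, K, D)
    V = (a[:, :, None] * res).sum(axis=0)         # (K, D) partial aggregation
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12  # intra-normalization
    v = V.ravel()
    return v / (np.linalg.norm(v) + 1e-12)        # final L2 normalization

rng = np.random.default_rng(3)
N, D, K, G = 200, 8, 64, 4                        # K=64, G=4 as in Section 4
p = rng.standard_normal((N, D))
c = rng.standard_normal((K + G, D))
w = rng.standard_normal((K + G, D))
b = rng.standard_normal(K + G)
v = trimmed_vlad(p, c, w, b, K)
```

Points that fall mostly into the G trailing clusters end up with near-zero weight on every proper cluster, which is how outliers are softly discarded.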

Image Feature Extraction
As depicted in Section 1, visual feature extraction is very important for place recognition, since vision is the main source of information for mankind. Here we choose the ResNet50 [27] for the following two reasons. Firstly, the images collected by a camera in outdoor large-scale scenes are natural images, and the ResNet performs well at recognizing natural images. Secondly, as a deep network, the ResNet50 has a good capability for extracting deep image features, and it uses a unique building block that first reduces the dimension of the features and then increases it; by continuously utilizing such blocks, the computation cost remains acceptable. Compared with ResNet101, the ResNet50 consumes less computation and already meets our demand, since our motivation is to verify the efficiency of our fused network, that is, that the additional LiDAR sensor information can improve place recognition. The original output of the ResNet is a 1000-dimensional vector; here we map it to Q dimensions. Details about the ResNet image feature extractor can be found in [27].
Here we use the ResNet as our image feature extractor, rather than the ResNet followed by a NetVLAD layer. The NetVLAD, inspired by VLAD, is a clustering algorithm utilizing soft assignment in the neural network instead of hard assignment. Images contain a large amount of information, and the ResNet is able to obtain sufficient deep features through its residual blocks. These appearance-based features interact with the structural point cloud features. However, rearranging these features through an additional NetVLAD layer may lose part of the information. As a result, we only use the ResNet to extract image features, followed by L2 normalization to give the image and point cloud components equal weight. The comparison of fusing these two kinds of image features with the point cloud features can be found in Section 4.3.4.

Metric Learning for Fused Global Descriptors
As mentioned, a good mapping (or descriptor) should consist of the efficient feature extraction and the best metric to evaluate sample similarity in the fused global feature space. Different from PointNetVLAD [36], our network needs to learn how to pick the appropriate viewpoint for the fused global descriptor of image and point cloud in the fusion process (the red block in Figure 5).
To deal with the fused information, we first roughly concatenate the two kinds of features, x_I and x_P, into one long vector. Then we use a fully connected layer to select the useful parts of this vector, which mixes image and point cloud features. It is worth mentioning that the global descriptor often needs to be normalized, i.e., ||x|| = 1, so an L2 layer is applied to balance each feature. This operation eliminates the negative influence of dimensionality. Thus, a coarse fused global descriptor x ∈ R^Q is acquired.
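The concatenate-select-normalize step can be sketched in a few lines (a minimal sketch; the FC weight shape and Q = 256 follow the experimental setup, while the weight values are of course illustrative):

```python
import numpy as np

def fuse(x_i, x_p, w_fc, b_fc):
    """Fusion sketch: concatenate the image and point cloud descriptors,
    select useful components with a fully connected layer, then
    L2-normalize so the fused descriptor satisfies ||x|| = 1."""
    cat = np.concatenate([x_i, x_p])      # rough concatenation, length 2Q
    x = cat @ w_fc + b_fc                 # FC layer picks a Q-dim mixture
    return x / (np.linalg.norm(x) + 1e-12)

rng = np.random.default_rng(4)
Q = 256
x_i = rng.standard_normal(Q); x_i /= np.linalg.norm(x_i)  # unit image feature
x_p = rng.standard_normal(Q); x_p /= np.linalg.norm(x_p)  # unit cloud feature
w_fc = rng.standard_normal((2 * Q, Q)) * 0.05
x = fuse(x_i, x_p, w_fc, np.zeros(Q))
```

Normalizing both inputs before concatenation is what keeps either modality from dominating the fused vector purely by scale.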
After we get a set of descriptors x, supervised labels can be introduced to evaluate the similarity and dissimilarity relationships between them in the fused global feature space. A metric is often used to evaluate the similarity of samples [56]. Meanwhile, finding the best metric is equivalent to finding the best mapping to a new space, in which we obtain a proper viewpoint to describe the data. So, metric learning is utilized to optimize the parameters of the network; i.e., it learns an appropriate mapping from the original fused descriptor x_i to the further updated descriptor x̃_i in this module. Place recognition here maps the original input point cloud and image into a new data space where the data have better description and discrimination.
To fit the data well and quantify the similarity of the fused global descriptors, we introduce a triplet constraint in an intuitive way, that is,

δ_ij + α < δ_ik,

where x_i is similar to x_j and dissimilar to x_k; δ_ij denotes the distance between x_i and x_j, and δ_ik the distance between x_i and x_k. It means that the distance between samples from different places should be as large as possible, while that between samples from the same place should be as small as possible. Note that supervised label information is needed in this triplet constraint. The traditional hinge loss function [57,58] is often used for the triplet constraint δ_ij + α < δ_ik as

L_H = [δ_ij + α − δ_ik]_+,

where [m]_+ = m if m ≥ 0 and 0 otherwise, and α is the margin value balancing the intra-class and inter-class distances. So minimizing the loss L_H makes x_i closer to x_j and maintains a margin between x_i and x_k. Furthermore, deep neural network based metric learning is utilized to learn an appropriate mapping M from the original fused descriptor x_i to the further updated descriptor x̃_i in this module. The parameters of the mapping (model) are optimized via deep network training. As in traditional triplet-constraint based metric learning, descriptors from the same place should get closer and those from different places farther apart. Here we want to further strengthen this constraint, since outdoor large-scale place recognition demands good discrimination of the descriptors. Thus, we apply the lazy triplet loss [36] to learn a discriminative mapping M.
Taking each current frame as the anchor frame S anc in the training dataset, its corresponding fused global descriptor is x anc , while x pos represents the descriptor of place that is similar to the anchor place, x neg represents a dissimilar descriptor. Then, we use the descriptors of anchor frame, that of the corresponding positive and negative frames to construct a set of tuples T = {x anc , {x pos }, {x neg }}.
Mathematically, the lazy triplet loss is calculated as

L_lazytrip(T) = [α + max(δ_pos) − min(δ_neg)]_+,

where δ_pos is the set of Euclidean distances between the fused global descriptor of the anchor frame and those in {x_pos}, that is, δ_pos = {d(x_anc, x_j), x_j ∈ {x_pos}}; δ_neg is defined in the same way, that is, δ_neg = {d(x_anc, x_k), x_k ∈ {x_neg}}. The objective of L_lazytrip is to minimize the supremum of {δ_pos} and maximize the infimum of {δ_neg}, which is equivalent to reducing the distance between the global descriptor x_anc and {x_pos}, and extending the distance between x_anc and {x_neg}. With the closest negative distance and the farthest positive distance, the parameters can be updated efficiently. After using metric learning to update the parameters of the network, x̃ becomes more discriminative.
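The lazy triplet loss above, hinging only on the hardest positive and hardest negative, can be sketched directly (a NumPy sketch; the toy 4-dimensional descriptors below are illustrative):

```python
import numpy as np

def lazy_triplet_loss(x_anc, x_pos, x_neg, alpha):
    """Lazy triplet loss sketch: [alpha + max(delta_pos) - min(delta_neg)]_+,
    i.e. the hinge is applied only to the farthest positive and the
    closest negative."""
    d_pos = np.linalg.norm(x_pos - x_anc, axis=1)   # distances to positives
    d_neg = np.linalg.norm(x_neg - x_anc, axis=1)   # distances to negatives
    return max(0.0, alpha + d_pos.max() - d_neg.min())

x_anc = np.zeros(4)
x_pos = np.array([[0.1, 0, 0, 0], [0.2, 0, 0, 0]])   # descriptors of the same place
x_neg = np.array([[3.0, 0, 0, 0], [2.0, 0, 0, 0]])   # descriptors of far places
loss = lazy_triplet_loss(x_anc, x_pos, x_neg, alpha=0.8)  # margin already satisfied -> 0
```

Here 0.8 + 0.2 − 2.0 < 0, so the loss is zero; only triplets that violate the margin produce a gradient.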
Finally, with a trained model, we can build the desired descriptors of all places in the dataset. When querying an unknown place, we map its source data S into the descriptor space and then search for its descriptor x̃ in the database to find candidates within a relatively small range.

Experiments and Results
To make a fair comparison among different methods, we compare our approach with the existing open-source algorithms NetVLAD [34] and PointNetVLAD [36] on the same device. The device is equipped with a Tesla P40 GPU with 24 GB memory and runs the Ubuntu 16.04 operating system with TensorFlow. For the hyperparameters of our network, we set the proper cluster size and the non-informative cluster size to 64 and 4 respectively, and the margin to α = 0.8. The dimension of the output descriptor of every part is uniformly set to 256. The training process of our method takes 40 h.

Datasets and Pre-Processing
We choose two large-scale outdoor datasets, KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute) [59] and KAIST (Korea Advanced Institute of Science and Technology) [60], for our experiments. The scenarios of the KITTI dataset are diverse, capturing real-world traffic situations ranging from freeways over rural areas to urban scenes with many static and dynamic objects. The KAIST dataset is composed of complex urban scenes. These two datasets satisfy our requirement that the frequencies of the camera and LiDAR data are the same or integer multiples of each other, which makes it possible to pair the image and point cloud of the same place. The KITTI dataset supplies 11 scenes with accurate odometry ground truth, written as KITTI 00, · · · , KITTI 10, and we utilize these scenes in our experiments. Each scene in the KAIST dataset has accurate GPS information, so we focus on 5 scenes, most of which contain loops for place recognition evaluation.

KITTI Dataset
The KITTI Odometry split consists of 22 sequences in which the images and point clouds are strictly matched. Half of the sequences provide ground truth poses for each frame. We use only these 11 sequences, which contain closed loops, in our experiment.
As mentioned in Section 3, the number of points in each point cloud differs, so we need to resample the points to a fixed number. Figure 7 shows the pre-processing of the point cloud in the KITTI dataset. After inspecting sample point clouds of the HDL-64 LiDAR, we consider the ground points redundant for forming a discriminative descriptor from the structural information of each frame; besides, removing them reduces memory consumption. As a result, the ground points are removed using the method in [61]. Then the point cloud is downsampled to a certain quantity. Here we use the downsample API in the PCL library, and the leaf size, which represents the size of the grid cells in the point cloud, is set to (0.3, 0.3, 0.1). Finally, we randomly pick N = 6000 points as our LiDAR input, which also applies a slight disturbance to the point cloud. For the image part, considering the size of the raw input, we resize the images to 180 × 600 on all sequences. We choose such a large input dimension to keep more information of the raw data while the network remains trainable on a single GPU.
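The downsample-then-sample pipeline can be sketched in NumPy as below (a sketch analogous to the PCL VoxelGrid filter, not the PCL API itself; the ground-removal step of [61] is omitted, and the stand-in cloud is synthetic):

```python
import numpy as np

def voxel_downsample(points, leaf=(0.3, 0.3, 0.1)):
    """Voxel-grid downsampling sketch (analogous to PCL's VoxelGrid):
    keep one centroid per occupied grid cell."""
    idx = np.floor(points / np.asarray(leaf)).astype(np.int64)
    _, inv = np.unique(idx, axis=0, return_inverse=True)
    inv = inv.ravel()
    n_cells = inv.max() + 1
    sums = np.zeros((n_cells, 3))
    counts = np.bincount(inv, minlength=n_cells)
    np.add.at(sums, inv, points)          # accumulate points per cell
    return sums / counts[:, None]         # per-cell centroids

def sample_fixed(points, n=6000, seed=0):
    """Randomly pick a fixed number of points, which also applies a
    slight disturbance (with replacement if the cloud is too small)."""
    rng = np.random.default_rng(seed)
    replace = len(points) < n
    return points[rng.choice(len(points), n, replace=replace)]

rng = np.random.default_rng(5)
cloud = rng.uniform(-20, 20, size=(50000, 3))   # stand-in for a LiDAR scan
down = voxel_downsample(cloud)
fixed = sample_fixed(down, n=6000)
```

The random selection step means two passes over the same frame yield slightly different inputs, which acts as mild data augmentation.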

KAIST Dataset
The KAIST dataset mainly focuses on urban scenes. Images, point clouds and GPS information are acquired at different frequencies, but timestamps are available for all sensors. We first generate data tuples depending on the timestamps of frames from the different sensors; the time difference within a data tuple is under 0.1 s. After obtaining sequential frames in the same data format as the KITTI dataset, the KAIST dataset can be processed in a similar way. Notice that the images in the KAIST dataset are in 8-bit Bayer format, so we demosaic them to recover RGB images. Besides, the point cloud in KAIST is acquired from two VLP-16 LiDARs; we use the extrinsics between the sensors and the carrier to transform the two point clouds into the carrier coordinate system. The merging process is shown in Figure 8.

Our Campus Data
The hardware platform is constructed from the HESAI Pandora and a Trimble GPS, shown in Figure 9. The Pandora contains a set of sensors: 4 greyscale cameras, a color camera and a 3D LiDAR, while the GPS consists of a receiver (SPS461) and 2 antennas (GA530). Here we use the GPS data as ground truth. The LiDAR in the Pandora has 40 scanners, so we can get a dense point cloud of the surroundings. As a result, our campus data can use the same pre-processing as the KITTI dataset. The related experiments can be found in Section 4.3.6.

Triplet Tuple
We use the ground truth trajectory to generate training tuples. Every tuple is composed of two parts, image and point cloud, and both contain the keys S_anc, {S_pos} and {S_neg}. For each anchor frame S_anc, {S_pos} is generated in terms of position distance and time: the distance between a candidate positive frame S_pos ∈ {S_pos} and S_anc is under 5 m and their time difference is under 10 s. Similarly, the distance between a candidate negative frame S_neg and S_anc is over 50 m. The input training tuple is finally generated through random selection from these candidate frames; this randomness also increases the robustness of the trained models. Additionally, we take the frames passing through the same place as the evaluation frames.
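The candidate selection above can be sketched as follows (the 5 m / 10 s / 50 m thresholds come from the text, while the function name, the toy trajectory, and the per-tuple sizes `n_pos`/`n_neg` are illustrative assumptions):

```python
import numpy as np

def build_tuple(anchor_idx, positions, times, d_pos=5.0, t_pos=10.0,
                d_neg=50.0, n_pos=2, n_neg=18, seed=0):
    """Training tuple sketch: positives lie within 5 m and 10 s of the
    anchor; negatives lie farther than 50 m. Candidates are then
    randomly subsampled into the tuple."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(positions - positions[anchor_idx], axis=1)
    dt = np.abs(times - times[anchor_idx])
    pos = np.where((d < d_pos) & (dt < t_pos)
                   & (np.arange(len(d)) != anchor_idx))[0]
    neg = np.where(d > d_neg)[0]
    pos = rng.choice(pos, min(n_pos, len(pos)), replace=False)
    neg = rng.choice(neg, min(n_neg, len(neg)), replace=False)
    return anchor_idx, pos, neg

# toy trajectory: one frame per second along a straight line at 1 m/s
positions = np.column_stack([np.arange(200.0), np.zeros(200), np.zeros(200)])
times = np.arange(200.0)
anc, pos, neg = build_tuple(100, positions, times)
```

The time threshold on positives prevents revisits of the same place from different traversals from leaking into {S_pos}, which are instead left to serve as evaluation frames.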

Place Recognition Results
Here we evaluate our approach on the KITTI and KAIST datasets. To the best of our knowledge, few methods combine both image and point cloud for the place recognition task, so we compare the proposed approach with the single-sensor methods mentioned before. NetVLAD is an image-based method which utilizes the concept of VLAD to develop a unique layer that aggregates image features extracted by a CNN architecture; here we use the ResNet50 to extract the visual information. Besides, the output of the ResNet itself is a vector which can also be seen as a compact representation of an image, so the plain ResNet is also included. PointNetVLAD is a point-based approach combining the concepts of PointNet and NetVLAD, applying an image-based aggregation method to the point cloud. PointNetVLAD with the trimmed strategy is also included. For a fair comparison, we use the same training datasets to train all models and then perform the retrieval.
The results are given in Figure 10 and Table 1. As mentioned in Section 1, the LiDAR and camera data complement each other. Intuitively, when both the image and point cloud parts distinguish similar and dissimilar places well, fusing the two features performs even better. Because of the lower recall rate of the LiDAR-only descriptor, in the curves of KITTI 05 and KAIST 30 the performance of our fusion feature is worse than the image-only method over certain segments. But combining visual information with point cloud information mitigates the poor discrimination of the LiDAR-only descriptor, which can also be observed in Table 1. Although the number of frames differs per scene, the results at the top 1% of candidates, used as our benchmark, show the average performance of all approaches. In the KAIST dataset, because the point cloud only contains part of the structure of the whole scene, the results of LiDAR are much lower than those of the image. Generally speaking, our fusion approach is more accurate and robust. Table 1. Comparison against the state-of-the-art approaches (@1%). The symbol @1% means recall (%) on the top 1% of candidates.

Analysis and Discussion
In this subsection, we discuss the results of our network in detail. We determine the parameters, show the impact of the non-informative clusters, compare the different image feature extractors mentioned before, further demonstrate the advantages of our fusion descriptor, and analyze the practicality of our method.

Number of Points
In Table 2, we first discuss the number of points in each point cloud P. The more points retained after downsampling the original point cloud, the better the model performs.
However, the more points used in the network, the more computing resources are occupied. As a compromise, we set a fixed number of points N = 6000 in our method. Table 3 shows the recall of the top 1 candidate for different numbers of cluster centers K. Intuitively, this parameter depends on the number of points N and the complexity of the scene: the more points, or the more complex the scenes, the more cluster centers are required. In our experiments, K = 64 achieves the best performance.

Effect of the Non-Informative Clusters
Here we use the recall rate of the top-N candidates in the database to examine whether the non-informative clusters improve PointNetVLAD. Figure 10 and Table 1 show the comparison between PointNetVLAD and our trimmed version; in conclusion, our trimmed strategy performs better than the original PointNetVLAD. Meanwhile, we rank the weights of the points over all clusters and define the cluster with the largest weight as the owner of each point. We then mark the points belonging to the non-informative clusters with different colors in Figure 11. It shows that the non-informative clusters are indeed not meaningful; specifically, the points of the abandoned clusters in our trimmed approach contain the outliers and isolated points. Therefore, the trimmed strategy makes the network learn to ignore these non-informative clusters, which helps improve the quality of the features computed by the trimmed VLAD block. Figure 11. Visualization of non-informative clusters. The marked points in the same color belong to the same non-informative cluster. For each non-informative cluster, most of the points are gathered together, but some points are far away from the center.

Different Image Features
We fuse two different image features with the point cloud part in our fusion network: features extracted directly from the ResNet50, and features extracted from the ResNet50 followed by a NetVLAD layer. The ResNet is a deep network with stacked residual blocks; its final layer is a fully connected layer, which has already downsampled the features and, to some extent, also performs local aggregation. Therefore, in our fusion network, fusing the ResNet features without the NetVLAD layer with the point cloud features is reasonable (Table 4). In contrast, point features are orderless and few methods are designed for point clouds, so using VLAD there efficiently aggregates the point features into a global point cloud feature. In Table 4, we use the KITTI dataset to further verify this observation, showing the results at the top 1% of candidates and the top 1 candidate to compare the effect of fusing the two image features with the spatial feature. In our fusion architecture, the ResNet without the NetVLAD layer performs better overall. Figure 12 shows the top two candidates of two test frames using our fusion method. Our method can still find the corresponding frames in the database, even when there are moving objects in the scenes. Besides, the top candidates of a test frame in the database are usually adjacent frames, indicating that our fused descriptor performs well on retrieval.

Effect of Learned Descriptors
Here we visualize the learned features in several scenes; the number of frames differs from scene to scene. In Figure 13, we use t-SNE to represent three kinds of global descriptors (image, point cloud, and fusion), generated by ResNet, PointNetVLAD, and our fusion method, respectively, and we draw a legend with gradient colors for each scene. In a t-SNE plot, the more points of different colors are mixed, the worse the discrimination of the descriptors. In other words, adjacent frames are intuitively similar, so their corresponding points in the plots should lie closer together. From this perspective, our method produces fewer mixed areas in all cases. In Figure 14, we select short segments to explain the t-SNE plots in detail; note that the marked area on the trajectory corresponds to the frames represented by the points in the plots. Our fusion method yields fewer isolated points than the other methods, which means our descriptor performs better in the feature space.
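The visual criterion above (adjacent frames should stay close in descriptor space) can also be checked numerically. The helper below is our own illustrative proxy, not part of the paper's pipeline: it compares the mean distance between temporally adjacent descriptors to the mean distance between random frame pairs; ratios well below 1 indicate the smooth, discriminative behavior the t-SNE plots show qualitatively.

```python
import numpy as np

def adjacency_ratio(descriptors, n_random=1000, seed=0):
    """Mean adjacent-frame distance divided by mean random-pair distance.
    descriptors: (T, D) array of per-frame global descriptors."""
    d = np.asarray(descriptors)
    # distances between consecutive frames along the trajectory
    adjacent = np.linalg.norm(d[1:] - d[:-1], axis=1).mean()

    # distances between randomly sampled frame pairs
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(d), size=n_random)
    j = rng.integers(0, len(d), size=n_random)
    mask = i != j
    random_pairs = np.linalg.norm(d[i[mask]] - d[j[mask]], axis=1).mean()
    return adjacent / random_pairs
```

For example, descriptors tracing a smooth loop give a ratio far below 1, while shuffled descriptors give a ratio near 1.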

Usability
Finally, we study the practical application of our method. In Table 5, we list the recall and the average processing time of the VLAD baseline and our method. Here the VLAD baseline detects ORB corners and aggregates these hand-crafted local features into a VLAD vector. Our method performs better in all scenes. Note that the VLAD method must compute the local descriptors in the image before aggregation, so its processing time includes the time to extract features and the corresponding local descriptors, whereas our method obtains a global descriptor directly from the model. As a result, our approach needs less time to form a descriptor. Compared with the VLAD algorithm, which has been proven to be practical, our method has the potential to be applied in autonomous driving or robotic systems.

Table 5. Comparison of VLAD and our approach. We list recall (%) @1 on the KITTI dataset and the processing time (in seconds) of extracting the global descriptor of each frame.

Besides, we use a heat map to visualize the minimal number of candidates required to find the corresponding frame in the database. In Figure 15, the tested frames are marked on the trajectory by specific colors, where the color represents the least number of candidates required to find the corresponding frame. With our method, most of the marks on the map are dark, and only a few points near intersections are bright. Compared with the hand-crafted method, our method is thus also capable of the place recognition task. We also test the proposed approach on our campus: Figure 16 compares our approach against state-of-the-art methods on our campus dataset, and these results on our own sensors, rather than the public datasets, show that our fusion method is competitive.
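The quantity behind the heat map (and behind recall@1) can be computed as sketched below. This is a generic brute-force retrieval illustration under the assumption of L2-normalized descriptors and a known ground-truth match per query; it is not the paper's evaluation code.

```python
import numpy as np

def min_required_candidates(queries, database, truth):
    """For each query descriptor, return how many nearest-neighbour
    candidates must be inspected before the true matching frame appears
    (1 = top-1 hit). truth[i] is the database index of the correct match
    for query i."""
    q = np.asarray(queries)
    db = np.asarray(database)
    # brute-force Euclidean distances, queries x database
    dists = np.linalg.norm(q[:, None, :] - db[None, :, :], axis=2)
    ranking = np.argsort(dists, axis=1)          # candidate order per query
    return np.array([int(np.where(ranking[i] == truth[i])[0][0]) + 1
                     for i in range(len(q))])

# recall@1 is then simply the fraction of queries whose rank equals 1:
#   recall_at_1 = (min_required_candidates(q, db, truth) == 1).mean()
```

Dark marks in Figure 15 correspond to ranks of 1 (or close to it); bright marks near intersections correspond to queries whose true match only appears deeper in the candidate list.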

Figure 16. Comparison of our approach with other approaches on our campus dataset. (a,b) Heat maps of the VLAD baseline and our method.

Conclusions
In this paper, we have proposed a novel network for place recognition that fuses the information from the image and the point cloud. We apply a trimmed strategy in the point cloud global feature aggregation to reduce the perturbations caused by complex environments. Moreover, we fuse this compact global descriptor of the point cloud with that of the corresponding image to obtain a robust fused global descriptor for each place. Finally, we learn a proper metric to describe the similarity of our fused global features, yielding an end-to-end place representation network. We evaluate our approach and several off-the-shelf methods on the open-source KITTI and KAIST datasets. Experiments and visualizations show that fusing the two kinds of sensors improves performance; that is, the descriptor generated from two sensors is more robust than that from a single sensor.
Author Contributions: Methodology, Y.P.; Investigation, S.Y.; Supervision, S.X.; Visualization, C.P. and K.L.; Writing-original draft, C.P. and K.L.; Writing-review & editing, S.X. and Y.P. All authors have read and agreed to the published version of the manuscript.