Self-Supervised Point Set Local Descriptors for Point Cloud Registration

Descriptors play an important role in point cloud registration. The current state of the art relies on the high regression capability of deep learning. However, recent deep learning-based descriptors require different levels of annotation and selection of patches, which makes the models hard to migrate to new scenarios. In this work, we learn local registration descriptors for point clouds in a self-supervised manner. In each training iteration, the input of the network is merely one unlabeled point cloud. Thus, the whole training requires neither manual annotation nor manual selection of patches. In addition, we propose to integrate keypoint sampling into the pipeline, which further improves the performance of our model. Our experiments demonstrate that our self-supervised local descriptor achieves even better performance than the supervised model, while being easier to train and requiring no data labeling.


Introduction
Point cloud registration (PCR) is an essential task in various applications, including 3D reconstruction and simultaneous localization and mapping (SLAM). Usually, the accuracy of the calculated transformation will dominate the performance of higher level tasks. Thus, researchers either make back-end optimization on the high level task, such as SLAM [1][2][3], or improve on the PCR side.
Various techniques have been invented for the point set registration problem. As discussed in [4], it is extremely hard to find the optimal transformation T and correspondence matrix P simultaneously. The problem is addressed in [4] by alternating the optimization of T and P. In recent decades, a multitude of algorithms have been proposed for 3D registration. They are divided into rigid and non-rigid algorithms [5] and either work iteratively, solving for the transformation matrix with repeatedly matched points [6][7][8][9][10], or treat the task as an optimization problem that avoids computing correspondences [11][12][13].
Although the learned point descriptors score better, their supervision usually requires extra labor to label the data. Those algorithms either obtain the correspondences from the matched point clouds [21][22][23][25], which is costly, or label the relation between point clouds [26,27], which is inefficient to train. Moreover, the existing models either train on patches, so they do not learn the scene globally [21,22,24], or use a training loss that works globally on the whole point cloud but is a triplet siamese loss that is not directly related to the true transformation [26].
In this paper, we propose to learn a point cloud local descriptor for registration without any annotation or selection of patches. The input of the network is a raw point cloud for each iteration of training. In addition, our loss function is directly related to the solved true transformation of the registration. To realize the self-supervision, we propose the Full Connection Form Solution (CF) to solve the PCR problem non-iteratively in one step without correspondences. It serves as a layer of the neural network at the end of the descriptor, so that the gradients are propagated back to the descriptor learner. Moreover, in our model, we use a keypoint detector to sample points in the sampling and grouping layer [28] to avoid learning on non-interesting points, which further improves the performance.
To summarize, the major contributions of this paper are:
• We propose a self-supervised method to learn point cloud descriptors that requires no manual annotation or selection during training.
• We propose keypoint sampling during training, which focuses on interesting points and further boosts the performance.
• Experiments show that our self-supervised local descriptor performs better than the supervised 3DFeatNet.
Experiments on various datasets, i.e., on the Oxford [29] and KITTI [30] datasets, demonstrate the performance of our descriptor.

Related Work
This section first reviews the technical advances in point cloud registration which are related to our registration layer. Then, it describes handcrafted registration descriptors and learned models.

Registration Model
The Iterative Closest Point (ICP) algorithm is the most famous registration method. It has been widely applied to various representations of 3D shapes [7] and is able to align a set of range images into a 3D model [8]. The generalized-ICP [9] even puts point-to-point ICP and point-to-plane ICP into one probabilistic framework. ICP consists of two steps, correspondence search and solving the transformation.
However, in ICP and related methods, the correspondences have to be recomputed in each iteration. To avoid this, the kernel correlation (KC) method [11] uses an objective function that fully connects the point clouds. In each term of the summation, a robust function, the Gaussian distance, is utilized. Similar to Maximum Mean Discrepancy (MMD), KC evaluates the distance between two distributions. Thus, it shows better sensitivity to noise and is more robust than ICP-like methods. Some recent publications do not rely on correspondences. Myronenko and Song [12] represented point clouds with Gaussian mixture models and solved the transformation by aligning the model centroids. Zheng et al. [13] built a continuous distance field for a fixed model and aligned the other point set model to minimize the energy iteratively. Yang et al. [31] reformulated the registration as a truncated least squares estimation (TEASER++), which is thus robust to many wrong correspondences. Resorting to the frequency domain, Huang et al. [32] decomposed the registration problem of seven DoFs into multiple subproblems, which they solved with a closed-form solution.
These methods either require correspondences, as is also the case in the frequency domain, or are solved iteratively. Hence, they cannot be applied as a differentiable layer in a deep neural network that solves the transformation without knowing the matches in advance. Thus, we propose a registration layer to fill this gap.

Descriptors
Point Feature Histograms (PFH) are known as the most typical local 3D descriptors. They encode the geometrical properties of the neighborhood with a multi-dimensional histogram [14]. For real time application, Fast Point Feature Histograms (FPFH) break the full interconnection of neighbors in PFH. Thus, they achieve a linear time complexity and gradually have become the most commonly used handcrafted 3D descriptor [10]. Apart from the descriptors from point set geometry, spin images (SI) [15] and unique shape context (USC) [16] split the spatial space into bins and count the number of points in each as a histogram descriptor. In addition, the authors of [17,18] transformed local scans into range images to extract features. Flint et al. [19] proposed to extend the 2D-SIFT onto 3D images. Wu et al. [20] introduced a SIFT-like descriptor on projected 3D patches.
However, finding correspondences from features requires a highly distinctive descriptor, and the performance of handcrafted descriptors usually varies across point sets. Therefore, data-driven descriptors have come into view. 3DMatch proposes to learn a volumetric patch descriptor from correspondence labels [21]. Based on PointNet [33], PPFNet introduces a local descriptor that is highly aware of global context [23]. It learns from the ground truth correspondence matrix. With a voxelized smoothed density value representation, 3DSmoothNet also trains the network with a triplet of anchor, positive, and negative samples [22]. Without using correspondence labels, PPF-FoldNet uses an encoder-decoder network to reconstruct the local patch fed in [24]. D3Feat proposes a joint learning of keypoint detector and descriptor [25]. D3Feat provides descriptors and keypoint scores globally for all points, which introduces extra cost during inference. Thus, this method is also unable to provide descriptors solely for a local patch of interest.
As the training loss merely works on pairs of patches, the point-wise supervised models are not learning globally for the entire scene within the dataset. We classify the feature learning models into two groups, point-wise and point cloud-wise supervised models, on whether they learn directly from the relation between point clouds. For point cloud-wise supervised models, the training loss works globally for an entire point cloud, which is more related to the registration application. This intuition directs us to learn our model with only raw point clouds.
Weakly supervised on the positive/negative relation between point cloud frames, 3DFeatNet learns descriptors without explicitly specifying the correspondences [26]. As a by-product of its attention-aware loss function, keypoints are extracted by applying non-maximum suppression to the attention values of all points. To tackle the speed issue, RSKDD proposes to use random sampling to replace the Farthest Point Sampling (FPS) of 3DFeatNet [27]. In addition, it embeds the chamfer loss and point-to-point loss from the keypoint detection model USIP [34] to jointly learn the keypoint detector and descriptor. Since its learned descriptor is not for the cluster center but for a shifted point instead, the detector and descriptor modules cannot be decoupled. Therefore, 3DFeatNet provides a good basis for feeding in a whole point cloud, as we demand. In addition, our model does not require any annotation and the loss function acts directly on the solved transformation.

Method
In this section, the registration layer and keypoint sampling are introduced. Then, we demonstrate the whole training pipeline to learn the descriptor model.

The Registration Layer
We intend to use both the full connection and the least squares form in this module. However, just replacing the kernel of KC with the quadratic distance will not work, because the distant pairs would dominate the loss. As discussed in [11], the gradient of the quadratic function is very sensitive to outliers, so a more robust function, the Gaussian kernel, has been utilized. However, with the Gaussian kernel, a solution in one step is impossible.
Thus, instead, our formula is a summation of weighted squared distances over all fully connected point pairs, which has a closed-form solution for registration. Assume we have two point clouds $P$ and $Q$ with $p_i \in P$, $i \in \{1, \dots, N\}$, and $q_j \in Q$, $j \in \{1, \dots, M\}$, where $p_i, q_j \in \mathbb{R}^3$ and $N$ and $M$ are the numbers of points in $P$ and $Q$, respectively. Then, the optimization task is

$$\min_{R,t} \sum_{i=1}^{N} \sum_{j=1}^{M} w_{i,j} \, \| R p_i + t - q_j \|^2, \qquad (1)$$

where $R$, $t$ are the rotation matrix and translation vector that transform $P$ into the coordinate system of $Q$. A weight $w_{i,j}$ in the range $(0, 1]$ is assigned to each term. Another problem of the Gaussian kernel distances in the KC method is that $\sigma$ in the Gaussian kernel has to be set properly according to the scale of the different data sources. We use the squared distance, as it is invariant to scale [35]. For the weighted function (1), there is a full connection with quadratic distance between every point $p \in P$ and $q \in Q$. Then, Equation (1) is reformulated with the full connections as correspondences. The new point sets $(\mathcal{X}, \mathcal{Y})$ are of size $N \times M$ and each pair is a connection.
The optimal solution is obtained with any algorithm that computes the transformation. We choose the SVD [36], as also detailed by Sorkine [37]. To make the paper self-contained, we briefly discuss this in Appendix A. Following Sorkine [37], we obtain a closed-form solution for the above formula by using a weighted SVD. However, to have the desired suppression effect on pairs, the weights cannot be chosen arbitrarily. To determine the weights, we use $f_X(x)$ to denote a function that extracts a feature descriptor of the point $x$ from the point cloud $X$. Then, the similarity of a pair, which serves as its weight $w_{i,j}$, is obtained from the distance between the two feature descriptors (Equation (3)). The lower the similarity, the lower the weight of the pair, and the smaller the effect of its term on the objective function. In this way, a pair of points with low similarity contributes only a little, as the two points have a large feature descriptor distance. The constant $\beta$ in Equation (3) scales the feature distance. It depends on the selected feature descriptor.
More details and tests of this CF registration are provided in Appendices A and B.
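The registration layer described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cf_registration` is hypothetical, and the weight form `exp(-beta * d**2)` is only one plausible choice that yields weights in (0, 1] which decrease with the descriptor distance; the exact form of Equation (3) is not reproduced here. The closed-form step follows the weighted SVD solution of Sorkine [37].

```python
import numpy as np

def cf_registration(P, Q, feat_P, feat_Q, beta=1.0):
    """Closed-form solve of min_{R,t} sum_ij w_ij ||R p_i + t - q_j||^2.

    P: (N, 3) points, Q: (M, 3) points,
    feat_P: (N, D) descriptors, feat_Q: (M, D) descriptors.
    """
    # Pairwise descriptor distances for all N*M fully connected pairs.
    d = np.linalg.norm(feat_P[:, None, :] - feat_Q[None, :, :], axis=-1)
    # ASSUMPTION: Gaussian-like similarity weight; the paper's Equation (3)
    # may use a different function of the descriptor distance.
    W = np.exp(-beta * d ** 2)                      # shape (N, M)
    w_sum = W.sum()
    # Weighted centroids over all pairs.
    p_bar = (W.sum(axis=1) @ P) / w_sum
    q_bar = (W.sum(axis=0) @ Q) / w_sum
    X = P - p_bar
    Y = Q - q_bar
    # Weighted cross-covariance S = sum_ij w_ij x_i y_j^T, shape (3, 3).
    S = X.T @ W @ Y
    U, _, Vt = np.linalg.svd(S)
    # Flip the last axis if the solution would otherwise be a reflection.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    t = q_bar - R @ p_bar
    return R, t
```

Because the cross-covariance is accumulated as `X.T @ W @ Y`, the N × M pair set never has to be materialized explicitly, which keeps the memory cost at the size of the weight matrix.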

Keypoint Sampling
To learn with a whole point cloud as input, subsampling is a standard operation for PointNet-like models. 3DFeatNet uses FPS to sample points that are evenly distributed over the scene. RSKDD-Net uses random sampling to speed up processing on large-scale datasets.
However, both sampling methods may select non-interesting points, e.g., points that are not distinctive and do not contribute to the registration success, which forces the model to devote part of its feature patterns to those ordinary points. In the matching step, only interesting points are involved. This means that both training effort and feature space are wasted on non-interesting points.
Thus, in this work, we propose to use keypoint detectors in the sampling and grouping layer. Since the descriptors are learned for a specific detector, during inference, with the same detector, our descriptor scores better compared to the version with non-interesting points included.
We use the handcrafted ISS keypoint detector and the learned 3DFeatNet keypoints (3DF kpt), because ISS keypoints are widely used handcrafted keypoints and 3DF kpt distributes points specifically on the walls, as shown in Figure 1.
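For reference, the FPS baseline that keypoint sampling replaces can be sketched as below. This is a generic textbook version of farthest point sampling, not the implementation used in 3DFeatNet; the function name and the `seed` parameter are illustrative assumptions. It shows why FPS spreads centers evenly over the scene regardless of how distinctive the selected points are.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Return indices of k points chosen by greedy farthest point sampling."""
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    selected[0] = rng.integers(n)          # random starting point
    # Squared distance from every point to its nearest selected point.
    min_dist = np.full(n, np.inf)
    for i in range(1, k):
        diff = points - points[selected[i - 1]]
        min_dist = np.minimum(min_dist, np.einsum("ij,ij->i", diff, diff))
        # Pick the point that is farthest from all points selected so far.
        selected[i] = int(np.argmax(min_dist))
    return selected
```

With keypoint sampling, the indices returned here would instead come from a keypoint detector such as ISS or 3DF kpt, so that only interest points become cluster centers.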

Network Architecture
We demonstrate the pipeline of the training process in Figure 2. The DESC module in between is the $f$ we want to extract. The whole training process consists of four parts. In the first part, with a point cloud PC1 as input, we apply a random transformation to generate PC2. For both PC1 and PC2, we sample $k$ points from a specific keypoint detector as centers. Then, neighbors are grouped around each center to obtain clusters. Next, these clusters are fed into the descriptor network $f$. Each cluster is processed separately and yields a descriptor vector for the cluster center. After that, in the registration layer, CF (Section 3.1) solves Equation (2) for the transformation between the sampled centers of the two point clouds, using the distance between their descriptors as weights according to Equation (3). Finally, the Rotation Matrix Distance Module computes the error between the solved $R$ and $R_{gt}$, which is the loss function for our model.
Given the ground truth transformation $R_{gt}$, $t_{gt}$, the loss function measures the deviation of the rotation between the solved $R$ and $R_{gt}$ from the identity matrix [38]. When training the network, we only supervise the rotation, because also including the translation in the loss would introduce additional hyperparameters to balance the effects of rotation and translation. Furthermore, the rotation is more important when performing the registration task.
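A rotation loss of this kind can be sketched as follows. The exact formula is an assumption: the sketch uses the Frobenius norm of $I - R R_{gt}^T$, a common "deviation from the identity matrix" metric for rotations, and the function name is hypothetical. The paper's loss may differ in norm or in the order of the product.

```python
import numpy as np

def rotation_identity_deviation(R_est, R_gt):
    """Deviation of R_est @ R_gt.T from the identity (Frobenius norm).

    ASSUMPTION: the norm and product order are illustrative choices.
    The value is 0 when both rotations agree and grows with the angle
    between them (it equals 2*sqrt(2)*sin(theta/2) for an angle theta).
    """
    return np.linalg.norm(np.eye(3) - R_est @ R_gt.T, ord="fro")
```

Since the expression only involves matrix products and a norm, it stays differentiable with respect to the solved rotation, which is what allows gradients to flow back to the descriptor network.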
With the above four network components, the system merely requires one raw point cloud as input for each training iteration. Since the whole pipeline is differentiable, the parameters in the descriptor network are updated with gradient back-propagation. Given a random rotation, we minimize its distance to the solved rotation by optimizing $f$. We call our model a self-supervised learning model because we generate the labels ($R_{random}$) from nothing and train on the unlabeled data in a supervised way. The model is learned from the raw point cloud itself.

Datasets
The Oxford RobotCar dataset [29] was used for network training and testing. Additionally, the KITTI dataset [30] was also used for testing the model.

Oxford RobotCar Dataset
The Oxford dataset contains repeated traverses through the Oxford city center from May 2014 to December 2015 that were collected with the Oxford RobotCar platform. We used the pre-processed data from [26], which have 35 trajectories for training and another 5 trajectories for testing. The points were scanned from 2D LIDARs and are accumulated into 3D point clouds, using the GPS/INS poses. Those poses were refined with ICP. The training point clouds were then downsampled to about 50,000 ± 20,000 points and the test point clouds to exactly 16,384 points. In this way, 21,875 training and 828 testing point cloud sets were obtained.

KITTI Dataset
Additionally, we tested our model on the 11 training sequences from the KITTI dataset [30] and processed them in the above-mentioned manner. The parts of the KITTI dataset used in the experiments include Velodyne laser point clouds, ground truth poses, and calibration files. The point clouds were also downsampled with a grid size of 0.2 m. We obtained 2369 point clouds in the end.
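Grid downsampling with a 0.2 m cell size can be sketched as below. This is a generic voxel-grid filter that replaces all points in a cell by their centroid; whether the original preprocessing used the centroid or another representative per cell is an assumption, and the function name is illustrative.

```python
import numpy as np

def voxel_downsample(points, grid_size=0.2):
    """Replace all points falling into the same grid cell by their centroid."""
    # Integer cell index of every point.
    cells = np.floor(points / grid_size).astype(np.int64)
    _, inverse, counts = np.unique(
        cells, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # guard against NumPy version differences
    # Accumulate the coordinates per occupied cell, then average.
    sums = np.zeros((counts.size, points.shape[1]))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]
```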

Setting
Our implementation makes use of the open source release (https://github.com/yewzijian/3DFeatNet) of 3DFeatNet [26]. In our pipeline, the descriptor directly uses the descriptor body of 3DFeatNet. Since this descriptor only considers a z-axis rotation, our provided $R_{random}$ is generated by rotating around the z-axis with $\varphi \sim \mathcal{N}(0, \sigma_r^2)$. We used $\sigma_r = 0.6$. In addition, we applied a 3D jitter with $\Delta p \sim \mathcal{N}(0, \sigma_p I)$ ($\sigma_p = 0.01$) to each point in PC1 and PC2.
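The generation of a self-supervised training pair described above can be sketched as follows. The function names are hypothetical, and two details are assumptions: the rotation angle is taken to be in radians, and $\sigma_p$ is used as the per-coordinate standard deviation of the jitter.

```python
import numpy as np

def rot_z(phi):
    """Rotation matrix for an angle phi (radians) around the z-axis."""
    c, s = np.cos(phi), np.sin(phi)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def make_training_pair(pc, sigma_r=0.6, sigma_p=0.01, rng=None):
    """Create (PC1, PC2, R_random) from one unlabeled point cloud pc (N, 3)."""
    rng = np.random.default_rng() if rng is None else rng
    # Random z-axis rotation; this is the generated label.
    phi = rng.normal(0.0, sigma_r)
    R_random = rot_z(phi)
    # Independent jitter on both clouds (sigma_p as standard deviation
    # is an ASSUMPTION).
    pc1 = pc + rng.normal(0.0, sigma_p, size=pc.shape)
    pc2 = pc @ R_random.T + rng.normal(0.0, sigma_p, size=pc.shape)
    return pc1, pc2, R_random
```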
During the training of our model, we used a batch size of 6, the Adam optimizer, and a 32-dimensional descriptor. The training point clouds were randomly subsampled to 4096 points before being fed into the pipeline. We used the ISS and 3DFeatNet detectors (3DF kpt) to provide the keypoints as cluster centers for training. The settings of the 3DF kpt, e.g., $\beta_{attention}$ and $r_{nms}$, followed Yew and Lee [26]. We chose 256 keypoints from each point cloud to align the batch. We also used FPS to sample points for comparison. FPS samples 512 points, which is the same as in 3DFeatNet. In the cluster, each point is of dimension $d$, which can be 3 (xyz), 6 (xyzrgb), etc. We used $d = 3$; thus, we only used the xyz location of each point.
The authors of 3DFeatNet state that it is hard to train. It takes 2 epochs to pretrain the 3DFeatNet descriptor, and the whole model can be trained in 70 epochs with $lr = 10^{-5}$. In contrast, our network is easy to train: without any pre-training, our model is randomly initialized and saved after 10 or 20 epochs of training with a learning rate of $lr = 10^{-3}$.
We used a PCL implementation of ISS to provide the ISS kpt and the released TensorFlow [39] checkpoint to obtain the network weights of 3DFeatNet, which provide the keypoints and are used to run the evaluation. We compared our method with the handcrafted descriptors FPFH, SI, USC, and CGF and the learned descriptors 3DMatch and 3DFeatNet.

Precision Test
Using exhaustive search as in [26], this test searched for the nearest descriptor neighbor in the paired model for each keypoint. Then, the Euclidean distance between the neighbor location and the ground truth location was computed. We show the plot in Figure 3. The x-axis is the threshold below which a pair is considered correct and the y-axis is the proportion of correct pairs. For both the 3DFeatNet descriptor and our descriptor, the test with 3DF kpt works better than with ISS kpt. Without using keypoint sampling (with FPS instead), our proposed unsupervised model achieves a similar result to the 3DFeatNet descriptor on 3DF kpt and a better result on ISS kpt. We used the x = 1 m line as a cut. Both the 3DFeatNet descriptor and our descriptor achieve around 15% precision, which is close to the best score reported by Yew and Lee [26].
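The precision computation described above can be sketched as follows. The function name and argument layout are illustrative assumptions; the sketch matches each keypoint of one cloud to its nearest descriptor neighbor in the other cloud by exhaustive search and reports the fraction of matches whose location error falls below each threshold.

```python
import numpy as np

def matching_precision(kp_a, desc_a, kp_b, desc_b, R_gt, t_gt, thresholds):
    """Fraction of nearest-descriptor matches within each distance threshold.

    kp_a: (Na, 3), desc_a: (Na, D), kp_b: (Nb, 3), desc_b: (Nb, D).
    R_gt, t_gt map coordinates of cloud A into the frame of cloud B.
    """
    # Exhaustive search for the nearest descriptor neighbor in B.
    d = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=-1)
    nn = np.argmin(d, axis=1)
    # Ground truth location of every A keypoint in the frame of B.
    kp_a_in_b = kp_a @ R_gt.T + t_gt
    err = np.linalg.norm(kp_b[nn] - kp_a_in_b, axis=1)
    return np.array([(err < th).mean() for th in thresholds])
```

Evaluating the function on a range of thresholds gives one curve of the kind plotted in Figure 3; the value at a threshold of 1 m corresponds to the cut used in the text.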
While using the keypoint sampling, we learned our ISS descriptor and our 3DF descriptor. ISS kpt + our ISS descriptor scores similar to our descriptor that used FPS. Both of our descriptors are better than 3DFeatNet descriptor on ISS keypoints. However, using ISS keypoint sampling during training does not improve our learned descriptor in the precision test. As shown in Figure 1, the ISS keypoints are evenly distributed in Oxford point cloud, which may introduce similar points as FPS. On the x = 1 m line, with 3DF kpt, our 3DF descriptor learned the pattern and scores best. It is around 2% higher than the supervised descriptor.
Overall, our proposed unsupervised method scores better than 3DFeatNet, from which we borrow the descriptor part of the model.

Geometric Registration
With ISS keypoints and 3DFeatNet keypoints, we evaluated the descriptors on geometric registration. The registration uses nearest neighbor matching and RANSAC to estimate the transformation. RANSAC iterations were limited to 10,000 and adjusted with 99% confidence. The Relative Rotation Error (RRE) and Relative Translation Error (RTE), with respect to the ground truth, were computed to evaluate the accuracy of the registration. A registration was counted as a success when RTE < 2 m and RRE < 5°. The speed of convergence is reflected by the average number of iterations. Since we used the same datasets (Oxford and KITTI) as the 3DFeatNet experiment [26], we compared to the results from their table.
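The success criterion can be sketched as below. The RRE formula is an assumption: the sketch uses the geodesic angle between the estimated and ground truth rotations, whereas the evaluation in [26] may define RRE differently (for example, via Euler angles). The function names are illustrative.

```python
import numpy as np

def relative_rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the residual rotation R_est^T R_gt (ASSUMED metric)."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def relative_translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground truth translation."""
    return np.linalg.norm(np.asarray(t_est) - np.asarray(t_gt))

def is_success(R_est, t_est, R_gt, t_gt, max_rte=2.0, max_rre_deg=5.0):
    """Success as in the text: RTE < 2 m and RRE < 5 degrees."""
    return bool(
        relative_translation_error(t_est, t_gt) < max_rte
        and relative_rotation_error_deg(R_est, R_gt) < max_rre_deg
    )
```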
The evaluation on the Oxford data is shown in Table 1. The first eight rows are taken from [26] and the last six rows are from our own experiments. We observe that, firstly, except for PN++, the handcrafted descriptors cannot exceed the learned descriptors. Secondly, our unsupervised learned descriptor achieves the best result on RRE and the success rate with ISS and the best result on RTE and average iterations with 3DF kpt. Thirdly, by training merely on interest points, our keypoint sampling indeed improves the performance.
An example of a registration is shown in Figure 4. We observe that our 3DF descriptor has more inlier correspondences than the 3DFeatNet descriptor when using 3DF keypoints, which is revealed by the denser connection of the red lines.

Then, the model was transferred to another outdoor dataset, the KITTI dataset. The registration results are shown in Table 2. The first six rows of the results are taken from [26] and the last six rows are from our experiments. In the table, we observe that, firstly, our unsupervised model exceeds the supervised model. Secondly, ISS + our ISS descriptor achieves the best accuracy. Its RRE even decreases by about 0.041 compared to the FPS version (ISS + our descriptor). Thirdly, with 3DF kpt, our 3DF descriptor also achieves better results. A further example of a registration is shown in Figure 5. One can see that our ISS descriptor achieves much denser matching compared to the 3DFeatNet descriptor. Overall, without using keypoint sampling, our unsupervised model achieves similar or even better performance than the supervised 3DFeatNet that uses the same descriptor body. In addition, by training only on interest points, our keypoint sampling indeed helps the model to learn more representative descriptors.

Conclusions
In this paper, we propose a novel self-supervised learning model to learn local descriptors for registration. We realize this goal by using a registration layer at the end of the network. Thus, the randomly generated rotation of a single input point cloud serves as supervision. In addition, we use keypoint sampling to make our model focus on interest points, in order to learn more expressive descriptors. In our pipeline, borrowing the same descriptor body as 3DFeatNet, our model is much easier to train, because this self-supervised method does not require any manual effort for annotation, and, without any pre-training, it converges with a higher learning rate, requiring far fewer iterations. Moreover, the experimental evaluation shows that our descriptor achieves much better performance on precision and geometric registration than the supervised 3DFeatNet descriptor.
As future work, we want to embed our model into a SLAM framework to enable an annotation-free, data-driven descriptor for arbitrary scenes.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. CF Registration Model
The formula is a summation of weighted squared distances over all fully connected point pairs:

$$\min_{R,t} \sum_{i=1}^{N} \sum_{j=1}^{M} w_{i,j} \, \| R p_i + t - q_j \|^2. \qquad (A1)$$

Figure A1 illustrates the full connection, where the weights are set according to a similarity measure.

Figure A1. Full connection between two point sets. Each edge is a weighted squared Euclidean distance term in our objective function, given a proper $w_{i,j}$ to scale the cost term of the pair $(i, j)$. The thickness of the lines reflects the similarity (weight) of the pairs.

Solving the Transformation
For the weighted function (A1), there is a full connection with quadratic distance between every point $p \in P$ and $q \in Q$. Equation (A1) is reformulated with the full connections as correspondences. The new point sets $(\mathcal{X}, \mathcal{Y})$ are of size $N \times M$ and each pair is a connection. Let $\mathcal{X} = \{p_1, \dots, p_{NM}\}$ and $\mathcal{Y} = \{q_1, \dots, q_{NM}\}$; the problem is formulated as

$$\min_{R,t} \sum_{i=1}^{NM} w_i \, \| R p_i + t - q_i \|^2$$

with known weights $w_i > 0$. We cancel $t$ by computing the weighted means and centering the point clouds:

$$\bar{p} = \frac{\sum_{i} w_i p_i}{\sum_{i} w_i}, \qquad \bar{q} = \frac{\sum_{i} w_i q_i}{\sum_{i} w_i}, \qquad \tilde{p}_i = p_i - \bar{p}, \qquad \tilde{q}_i = q_i - \bar{q}.$$

We can then compute the weighted covariance matrix and its singular value decomposition,

$$S = \sum_{i=1}^{NM} w_i \, \tilde{p}_i \tilde{q}_i^{T} = U \Sigma V^{T},$$

and the optimal rotation is computed by

$$R = V U^{T}.$$

When the solution consists of a reflection, i.e., $|V||U| < 0$, the last column of $V$ is multiplied by $-1$ before computing the rotation. Finally, the translation is given as

$$t = \bar{q} - R \bar{p}.$$

A schematic diagram of solving the registration is shown in Figure A2.

We first sampled the point clouds from the meshes using Meshlab [41]. For CPD, the open source C++ implementation from the original project [12] was used. We set its scale and reflection parameters to false. For DARE, we used the Python implementation of Järemo Lawin et al. [40]. Its color label and feature label were disabled. We also used TEASER++ from the implementation [31]. We implemented CF and CFK using the Point Cloud Library (PCL) [42], where we used its FPFH descriptor and the SIFT keypoint detector. The ICP experiments were also done with PCL. The normal and feature computation in CF and CFK were performed with the same settings, i.e., searching $k$ neighbors. In our implementation, we fixed $k$ to 150. In addition, the $\beta$ used in Equation (3) was fixed to 100.
For TEASER++, we used the same settings as for the feature descriptor FPFH. In the matcher of TEASER++, the options absolute_scale and crosscheck were selected. The solver used GNC_TLS with a 1.4 gnc factor, 0.005 rotation cost threshold, and 1000 max iterations.
In our experiments, the registration was done using two point clouds $PC_a$ and $PC_b$, which were generated by adding noise or outliers to the original point cloud, as described in more detail below. We then translated and rotated $PC_b$ to get $PC_b'$. Thus, $PC_a$ was our PC1 and $PC_b'$ was our PC2, and our task was to align PC1 to PC2 by solving for the transformation.
In the following experiments, $PC_b$ was transformed in two distinct ways to generate $PC_b'$. Firstly, we applied just a small, random rotation around the point cloud's centroid. For the second type of data, we applied a large random rotation around the origin of the dataset, which is not the centroid.
The rotation vector is a concise axis-angle representation, for which both the rotation axis and angle are represented in the same three-vector. The rotation angle is the length of this vector.
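The conversion from such a rotation vector to a rotation matrix can be sketched with Rodrigues' formula. The function name is illustrative; the formula itself is the standard one for the axis-angle representation described above.

```python
import numpy as np

def rotvec_to_matrix(rotvec):
    """Convert a rotation vector (axis * angle, radians) to a 3x3 matrix."""
    rotvec = np.asarray(rotvec, dtype=float)
    theta = np.linalg.norm(rotvec)       # rotation angle = vector length
    if theta < 1e-12:
        return np.eye(3)                 # no rotation
    k = rotvec / theta                   # unit rotation axis
    # Skew-symmetric cross-product matrix of the axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    # Rodrigues' rotation formula.
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```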

Appendix B.2. Sensitivity to Noise
In this experiment, we evaluated the effects of different levels of noise on the registration. Each level was tested with 30 generated point clouds. Just for this experiment, we fixed the large and small rotation angles to two specific values, to be able to concentrate on the effects of the noise levels and draw the diagrams of Figure A5. PC1 and PC2 used 500 points subsampled from the original point cloud. Then, we rotated PC2 and added zero-mean Gaussian noise to each point.
Following the definition of sensitivity [11], we logged the mean average shift to evaluate the performance, and the standard deviation was utilized as the metric. The noise scale was within the range (0, 0.02]. Because the size of the bunny does not exceed 0.3, too much noise would result in dysfunctional feature descriptors. We present the noisy data with different noise scales in Figure A4. The results are given in Figure A5. In the small angle case of Figure A5, the TEASER++ curve breaks off because a low number of correspondences led to failure. For the centered small rotation, ICP, CPD, and DARE achieve better average shifts and less sensitivity to noise. Among the feature-based methods, our CF and CFK perform very similarly to TEASER++.
However, for the large rotation data, ICP, CPD, and DARE fail to align the point clouds, while the feature-based methods CF, CFK, and TEASER++ are able to align with good performance.

Appendix B.3. Robustness to Outliers
Similar to above, we also used 500 randomly selected points from the bunny object and performed small and large rotations. Additionally, 100 random points were uniformly drawn in a sphere and added to the rotated point set PC2 (with radius 0.2, around the center of sampled point clouds).
Because the first 500 points in each set are also from the same sampled index, we actually know the correspondence in the non-outlier parts. To quantify the robustness, we computed the average shift as in Appendix B.2.
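The outlier generation described above can be sketched as follows. The function name is hypothetical, and two details are assumptions: "uniformly drawn in a sphere" is interpreted as uniform in the volume of the ball, and the sphere is centered at the centroid of the sampled points. The outliers are appended after the original points, so the first 500 indices keep their known correspondences.

```python
import numpy as np

def add_sphere_outliers(points, n_outliers, radius, rng):
    """Append n_outliers points drawn uniformly inside a ball around the centroid."""
    center = points.mean(axis=0)
    # Random directions: normalized Gaussian samples are uniform on the sphere.
    directions = rng.normal(size=(n_outliers, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Cube-root scaling of the radius makes the samples uniform in volume.
    r = radius * rng.random(n_outliers) ** (1.0 / 3.0)
    outliers = center + directions * r[:, None]
    # Inliers stay first so that their correspondences remain known.
    return np.vstack([points, outliers])
```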
For both large and small rotations, we ran 100 tests to record the mean and standard deviation. The quantitative evaluation is given in Table A1. All experiments were made using randomly drawn rotation vectors.
CPD achieves extremely precise solutions for small rotations, while feature-based methods (TEASER++, CF, and CFK) are similar and better than ICP and DARE. DARE gives the largest error and standard deviation. For the large rotation case, the feature-based methods (TEASER++, CF, and CFK) perform best, and the errors of the remaining methods are several times worse and unstable, since they yield large standard deviations. The performance of our one-step methods is close to TEASER++, even though its truncated least squares is theoretically more insensitive to spurious data.