NrtNet: An Unsupervised Method for 3D Non-Rigid Point Cloud Registration Based on Transformer

Self-attention networks have revolutionized natural language processing and have made impressive progress in image analysis tasks. CorrNet3D proposed first obtaining the point cloud correspondence in point cloud registration. Inspired by these successes, we propose NrtNet, an unsupervised network for non-rigid point cloud registration and the first network to use a transformer for unsupervised large-deformation non-rigid point cloud registration. Specifically, NrtNet consists of a feature extraction module, a correspondence matrix generation module, and a reconstruction module. Given a pair of point clouds, our model first learns pointwise features and feeds them to the transformer-based correspondence matrix generation module, which learns the correspondence probability between pairs of point sets; the correspondence probability matrix is then normalized to obtain the point set correspondence matrix. We then permute the point clouds and learn the relative drift of the point pairs to reconstruct the point clouds for registration. Extensive experiments on synthetic and real datasets of non-rigid 3D shapes show that NrtNet outperforms state-of-the-art methods, including methods that use grids as input and methods that directly compute point drift.


Introduction
3D representations are highly flexible, and with the continuous development of 3D sensing technology in recent years, 3D point clouds have been widely used in fields such as virtual reality [1], autonomous driving [2], and augmented reality [3]. Since LiDAR-scanned point clouds do not correspond with each other point by point, downstream tasks such as point cloud classification [4,5], segmentation [6,7], registration [8,9], and reconstruction [10,11] are greatly inconvenienced.
Non-rigid point cloud registration can be divided into similarity registration [12,13] and affine registration [14,15]. Similarity registration is mostly based on ICP and improves the registration of point clouds by changing the optimization objective function and increasing the correspondence, while affine registration ensures that parallel lines remain parallel during the transformation. CorrNet3D [16] proposes first finding the correspondence between the point clouds and then reconstructing the point clouds, which inspired our work. However, most of these methods require large-scale labeled data, and labeling is time-consuming and costly, which has promoted the development of unsupervised methods. In our work, we focus on unsupervised large-deformation non-rigid point cloud registration, which means that only 3D point cloud data is required as input. Figure 1 illustrates our idea: if we can align two point clouds A and B, the registration between them becomes easy. We permute the point clouds with a transformer because transformers have proven effective at learning correspondences between sequence elements in natural language processing. Based on this idea, we propose an unsupervised transformer-based registration network (NrtNet) for large-deformation non-rigid point clouds, built around a transformer-based permutation process. Specifically, this permutation process uses the encoder and decoder of the transformer to generate a point set correspondence matrix, which represents the correspondence between the source point cloud A and the target point cloud B. During training, the global features of the target point cloud and the permuted source point cloud $A_{reorder}$ are fed to the reconstruction module to obtain the reconstructed point cloud.
The reconstruction module drives the learning of the correspondence matrix and the relative drift by optimizing the reconstruction error and additional regularization terms to achieve registration.
In general, our main contributions are:
• We propose a transformer-based point cloud correspondence learning framework for learning dense correspondences between point clouds, and we are the first to introduce a transformer into the field of non-rigid point cloud registration.
• Our network eliminates the reliance on ground truth and achieves unsupervised learning of non-rigid point cloud registration in an end-to-end manner, and has a better registration effect for different objects.
• Experiments demonstrate that NrtNet has significant advantages in non-rigid point cloud registration. In particular, it is superior to methods that directly compute the drift of coherent points between point clouds and methods that use a grid as input.

Related Work
In this section, we introduce the application of point clouds in deep learning, the study of non-rigid point cloud registration, and the development of transformer-based deep learning.

Deep Learning on Point Cloud
Compared with well-developed image-based deep learning methods, point cloud-based deep learning methods are more challenging and still at a developing stage because point clouds are irregular and unordered. Three-dimensional data can be represented in various forms, such as 2D multi-views, unstructured point clouds, and voxelized volumes. Voxelization methods convert 3D data into regular occupancy voxels, resulting in structured volumes that are well suited for 3D CNNs. Early point cloud methods used end-to-end 3D convolutional networks [19,20,21]. Because 3D data is sparse and 3D convolutions are expensive, voxelized representations are limited in resolution; the methods in [22,23] effectively address this resolution problem. Qi et al. [24] projected 3D data into multiple 2D views and processed them with popular 2D CNNs.
PointNet [25] learns features directly from the point cloud: it maps the points to a higher-dimensional space before aggregation and applies a symmetric operation in that space. Mapping to higher dimensions generates redundant information, so the max-pooling operation can aggregate the features without losing geometric information. PointNet uses only MLPs and max-pooling and cannot capture local structures, a limitation that PointNet++ [26] addresses. DGCNN [27] designs EdgeConv, which efficiently extracts features of local point cloud shapes while still maintaining permutation invariance. Later, researchers investigated merging global and pointwise features [26] or using more sophisticated RNN-based methods to extract features [28,29]. MortonNet [30] extracts more effective features by learning from an ordered sequence of points. FoldingNet [31] learns to deform predefined regular 2D grids into 3D shapes; AtlasNet [32] and 3D-CODED [33] are also deformation-based networks that use fixed template deformations to reconstruct the point cloud or mesh.

Non-Rigid Point Cloud Registration
Registration optimization algorithms, which refine geometric transformations over successive iterations, have attracted the attention of many researchers. The Iterative Closest Point (ICP) algorithm [34] is the classic example in rigid registration. ICP initializes an estimate of the rigid transformation and then iteratively selects corresponding points to revise the transformation. However, ICP is sensitive to its initial values and cannot handle non-rigid point cloud variations efficiently. By the type of target transformation, non-rigid point cloud registration can be divided into parametric and nonparametric registration. Among parametric methods, the TPS-RPM algorithm [35] estimates the parameters of the non-rigid transformation with a penalty on the second derivative.
Among classical nonparametric methods, coherent point drift (CPD) [36] fits a Gaussian mixture likelihood that aligns the source point set with the target point set. Ma et al. [37] highlighted the importance of exploiting both local and global structures in non-rigid point set registration. CPD-Net [38] uses deep neural networks to fit functions that can adapt to geometric transformations of varying complexity. DispVoxNets [39] converts point clouds to voxels for nonlinear deformation in a supervised manner. PR-Net [40] introduces point-set shape features that determine the correlation between the source and target point sets to predict the transformation, allowing the source and target point sets to be statistically aligned. CorrNet3D [16] uses a new, efficient de-smoothing module to optimize the point set pairs with better results. Ma et al. [41] used a robust transformation estimation method based on manifold regularization for non-rigid point set registration, in which the spatial transformation between two point sets is estimated by iteratively recovering the point correspondence. However, existing methods do not perform well for point cloud registration with large deformations, and most of them rely on ground truth. These methods also work poorly for data with non-corresponding point sets. Our method eliminates the reliance on ground truth and achieves better registration results for large deformations and non-corresponding data sets.

Deep Learning Based on Transformer
The CNN is the standard network model in computer vision [42]; with the introduction of AlexNet [43], CNNs became the dominant network model. Transformer and self-attention models revolutionized natural language processing [44,45], and some studies have used self-attention and transformers to replace some or all of the spatial convolutional layers in the popular ResNet [46]. The encoder-decoder design of the transformer has recently been applied to object detection and instance segmentation tasks [47], and ViT [48] directly applies a transformer to non-overlapping medium-sized image patches for image classification.
AiR [49] is the first transformer-based image registration method. Point Transformer [50] is the first to introduce a transformer into the 3D point cloud domain; it proposes a highly expressive Point Transformer layer and uses it to construct a high-performance network for point cloud classification and dense prediction. Point Cloud Transformer [51] proposes a new transformer-based point cloud learning framework, PCT, which uses an implicit Laplace operator and normalization refinement to build offset-attention. Our method uses the transformer to derive correspondences between points to improve the effectiveness of the final registration.

Overview
As shown in Figure 2, NrtNet is composed of three modules: the feature extraction module, the transformer module, and the point cloud reconstruction module. First, the source point cloud $A \in \mathbb{R}^{n \times 3}$ and the target point cloud $B \in \mathbb{R}^{n \times 3}$ are fed into the feature extraction module to generate the pointwise features $F_a \in \mathbb{R}^{n \times d}$ and $F_b \in \mathbb{R}^{n \times d}$, where $d$ is the feature dimension; the pointwise features are obtained by setting $d$ equal to the number of points in the point cloud. The pointwise features $F_a$ and $F_b$ are then fed into the transformer module, which finds the correspondence between the source and target point clouds. The transformer module is composed of a transformer encoder and a transformer decoder, and it outputs the point set correspondence matrix $P \in \mathbb{R}^{n \times n}$, where $p_{ij} = 1$ indicates that the $i$-th point $a_i$ of the source point cloud corresponds to the $j$-th point $b_j$ of the target point cloud. The source point cloud $A$ is permuted using $P$ to obtain $A_{reorder} \in \mathbb{R}^{n \times 3}$. Finally, the global feature of the target point cloud $V_b \in \mathbb{R}^{d}$, which is aggregated from the pointwise features, and the permuted source point cloud $A_{reorder}$ are fed into the reconstruction module to obtain $A_{last}$, which is similar to the target point cloud $B$. As in most papers, we optimize our model parameters by minimizing the difference between the reconstructed point cloud and the target point cloud. To better learn the correspondence between point sets, we also regularize the point set correspondence matrix and minimize this regularization term to obtain the optimal correspondence. In the resulting objective, $A_{last} \in \mathbb{R}^{n \times 3}$ and $B_{last} \in \mathbb{R}^{n \times 3}$ are the point clouds after registration, $\|\cdot\|_F$ denotes the Frobenius norm, and $G(P)$ is a regularization operation on the correspondence matrix.

Figure 2.
NrtNet is an unsupervised, end-to-end network for non-rigid point cloud registration. The source point cloud $A \in \mathbb{R}^{n \times 3}$ and the target point cloud $B \in \mathbb{R}^{n \times 3}$ are fed into the feature extraction module and the transformer module to generate the point set correspondence matrix $P \in \mathbb{R}^{n \times n}$. Then, the permuted point cloud is fed into the reconstruction module to generate a point cloud $A_{last}$ that closely matches $B$, which achieves the purpose of registration.

Feature Extraction Module
For the feature extraction module, instead of using the traditional PointNet or PointNet++, we use a DGCNN with shared parameters to map the points of A and B to high-dimensional features. DGCNN uses edge convolution (EdgeConv) to dynamically build a graph on each layer of the network: each point is treated as a center, its edges to the neighboring points are characterized by edge features, and these features are aggregated to obtain a new representation of that point. First, DGCNN defines the edge feature through a function $h$ that considers both the global information $x_i$ and the local neighborhood information $x_i - x_j$, where $x_i \in \mathbb{R}^{1 \times d}$ is the feature of the $i$-th point fed into the edge convolution. Then, the edge features of the $l$-th layer are aggregated to obtain the feature $e^{l+1}_{ij}$ by an aggregation operation $\chi$ consisting of an MLP and max-pooling, taken over $\Omega$, the set of point pairs formed between the center point $x_i$ and its neighboring points. After the multi-layer edge convolution, MLP, and max-pooling operations, we obtain the pointwise features $F_a \in \mathbb{R}^{n \times d}$ and $F_b \in \mathbb{R}^{n \times d}$. The pointwise features are fed into a max-avg-pooling layer to get the global features $V_a \in \mathbb{R}^{d}$ and $V_b \in \mathbb{R}^{d}$, which are used later for reconstruction.
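The edge-feature construction described above can be illustrated with a minimal sketch. It is not the DGCNN implementation: the learned function h and the MLP are replaced by plain concatenation and a channel-wise maximum, and the helper names (`knn_indices`, `edge_features`, `max_aggregate`) are hypothetical, chosen only for this example.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of point i (excluding i itself)."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return [j for _, j in dists[:k]]

def edge_features(feats, points, k):
    """For each point i and each neighbour j, build the pair [x_i, x_i - x_j].

    This concatenation stands in for the learned edge function h."""
    out = []
    for i, x_i in enumerate(feats):
        edges = []
        for j in knn_indices(points, i, k):
            diff = [a - b for a, b in zip(x_i, feats[j])]
            edges.append(list(x_i) + diff)
        out.append(edges)
    return out

def max_aggregate(edges_per_point):
    """Channel-wise max over neighbours (stand-in for MLP + max-pooling)."""
    return [[max(col) for col in zip(*edges)] for edges in edges_per_point]

# Toy cloud: in the first layer the features are the coordinates themselves.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
edges = edge_features(points, points, k=2)
new_feats = max_aggregate(edges)  # one 6-dimensional feature per point
```

Because the maximum is taken over an unordered set of neighbours, the result does not depend on the order in which the points are listed, which is the property the aggregation relies on.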

Transformer Module
Because the transformer is effective at finding word correspondences in NLP, we use it to find correspondences between point sets. As shown in Figure 3, the transformer module consists of three parts: the transformer encoder, the transformer decoder, and a smoothing module. The transformer module takes the features of the source and target point clouds as input and learns the point set correspondence matrix $P \in \mathbb{R}^{n \times n}$, which explicitly represents the correspondence between any two points in A and B: $p_{ij} = 1$ indicates that the $i$-th point $a_i$ of the source point cloud A corresponds to the $j$-th point $b_j$ of the target point cloud B. This matrix is a permutation matrix. A pair of points either corresponds or does not, so the matrix should contain only 0s and 1s. The points in the source point cloud should correspond one-to-one to the points in the target point cloud, so each row and each column of the matrix should contain exactly one 1. The transformer module uses the transformer to measure the similarity between the point clouds, i.e., the probability matrix $P_{rand}$, which represents the probability of correspondence between points. Finally, a smoothing process is applied to this probability matrix to obtain an exact binary permutation matrix $P$.
The transformer encoder, which follows Point Transformer [50], is shown in Figure 4. The feature $f^{l}_{i} \in \mathbb{R}^{1 \times d}$ of the $i$-th point is fed into the standard scalar dot product attention layer. In this layer, $\phi$, $\psi$, and $\alpha$ are feature transform layers (MLPs), $\delta$ is a position encoder, and $\gamma$ is a mapping function whose output vector represents the global features of the point cloud. The mapping function $\gamma$ consists of an MLP, two linear layers, and a ReLU activation function. Attention vectors are generated for later feature aggregation: the transformed features $\phi$ and $\psi$ are subtracted to obtain the vector relationship between them. Finally, the transformed features $y_i$ are obtained through a softmax normalization function.

Figure 3. Transformer module. The probability matrix $P_{rand}$ is obtained by feeding the high-dimensional features of the point clouds $F_a \in \mathbb{R}^{n \times d}$ and $F_b \in \mathbb{R}^{n \times d}$ into the transformer encoder and transformer decoder, respectively. Then, the probability matrix $P_{rand}$ is fed into the smoothing module to obtain the exact permutation correspondence matrix $P$.
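For reference, the attention layer of Point Transformer [50], on which this encoder is based, takes the form below. It is reproduced from that work as an assumption about the expression intended here, since it uses the same symbols $\phi$, $\psi$, $\alpha$, $\gamma$, and $\delta$; the softmax normalization $\rho$, the neighborhood $\mathcal{X}(i)$ of point $i$, and the position-encoding function $\theta$ applied to point coordinates $p_i$, $p_j$ follow the notation of [50].

```latex
y_i = \sum_{x_j \in \mathcal{X}(i)}
      \rho\bigl(\gamma(\phi(x_i) - \psi(x_j) + \delta)\bigr)
      \odot \bigl(\alpha(x_j) + \delta\bigr),
\qquad
\delta = \theta(p_i - p_j)
```

In this form the attention weight is a vector, so each feature channel of $\alpha(x_j) + \delta$ is weighted separately before the sum over the neighborhood.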
Due to the disordered nature of point clouds and their irregular embedding in the vector space, self-supervision is performed using the positions of the points themselves. The positional encoding $\delta$ is added to the transformed feature $\alpha$. The transformed feature is computed over $F_a(i) \subseteq F_a$, the features of the $k$ neighboring points around the query point $f^{a}_{i} \in \mathbb{R}^{1 \times d}$, so that self-attention is applied within the local neighborhood of each point. In 3D point cloud alignment, the point cloud itself carries position information, and the trainable parametric position encoder is computed by a function $\theta$ from $p_i$ and $p_j$, where $p_i$ is the $i$-th point, $p_j$ is the $j$-th point around the $i$-th point, and $\theta$ has the same structure as $\gamma$. This position encoder improves both attention generation and feature transformation.

As shown in Figure 5, the transformed features are fed into the transformer decoder and smoothing module to generate the point set correspondence matrix $P$. First, the global features are obtained from the pointwise features: each transformed feature $f^{a,trans}_{i}$ is summed to obtain a one-dimensional global feature $f_{aa}$, which gives the global features $F_{aa} \in \mathbb{R}^{n \times 1}$ and $F_{bb} \in \mathbb{R}^{n \times 1}$ of the source and target point clouds. To obtain the probability matrix, we first compute the distances between $F_{aa}$ and $F_{bb}$ using a $1 \times n$ vector of ones $I$. Equation (8) returns an $n \times n$ point set distance matrix. The larger the distance between two points, the smaller the probability that they correspond, so the probability matrix $P_{rand}$ is obtained by inverting the distance matrix. Since a pair of points either corresponds or does not, the probability matrix cannot by itself represent the correspondence between point sets.
It can only represent the probability of correspondence between point sets. We therefore smooth this matrix, following CorrNet3D [16]. Each row of the probability matrix should follow a normal distribution with mean $\mu_i$ and standard deviation $\sigma_i$, i.e., $p^{rand}_{ij} \sim N(\mu_i, \sigma_i^2)$. To better filter out incorrect correspondences, we standardize this distribution as $z_{ij} = (p^{rand}_{ij} - \mu_i)/\sigma_i$, so that $z_{ij}$ follows the standard normal distribution, $z_{ij} \sim N(0, 1)$. Finally, we select the corresponding point set according to the threshold $\tau$. The number of points selected as correct correspondences is $z_{num}$. Points close to the middle should find a larger number of corresponding points, so $z_{num}$ should also follow a normal distribution and obey the three-sigma rule: the probability that a value lies in $[\mu_{z_{num}} - 3\sigma_{z_{num}}, \mu_{z_{num}} + 3\sigma_{z_{num}}]$ is 0.9973, which is almost 1. Applying a softmax operation to $z_{num}$ then yields the point set correspondence matrix $P$.
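The path from per-point global features to a near-binary correspondence matrix can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the distance is taken as an absolute difference, the "inversion" is assumed to be 1/(1 + d), the threshold step with τ is omitted, and the softmax temperature is a hypothetical parameter added so that the output is close to one-hot.

```python
import math

def distance_matrix(f_a, f_b):
    """n x n matrix of absolute differences between scalar point features."""
    return [[abs(a - b) for b in f_b] for a in f_a]

def invert_distances(dist):
    """Assumed inversion: a large distance gives a small correspondence score."""
    return [[1.0 / (1.0 + d) for d in row] for row in dist]

def standardize_rows(mat):
    """Row-wise z-score: z_ij = (p_ij - mu_i) / sigma_i."""
    out = []
    for row in mat:
        mu = sum(row) / len(row)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in row) / len(row)) or 1.0
        out.append([(v - mu) / sigma for v in row])
    return out

def softmax_rows(mat, temperature=0.1):
    """Row-wise softmax; a low temperature pushes each row toward one-hot."""
    out = []
    for row in mat:
        m = max(row)
        exps = [math.exp((v - m) / temperature) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Toy features: f_bb is a shuffled copy of f_aa.
f_aa = [0.1, 0.9, 0.5]
f_bb = [0.5, 0.1, 0.9]

P_rand = invert_distances(distance_matrix(f_aa, f_bb))
P_soft = softmax_rows(standardize_rows(P_rand))
matches = [row.index(max(row)) for row in P_soft]  # target index per source point
```

In this toy case the largest entry of each row recovers the shuffle, and each target index is selected exactly once, which is the one-to-one structure the smoothing step is meant to produce.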

The Reconstruction Module
When the correct correspondence labels between point clouds A and B are given, the shape feature relationship between them can be learned well, and it is then easy to learn the drift between the point sets. FoldingNet [31] and AtlasNet [32] reconstruct shapes by concatenating the global features with the points of a 2D grid. CPD-Net [38] learns point-to-point drift by concatenating point and global features. As shown in Figure 6, we propose a reconstruction module based on point correspondence.
Point clouds A and B are permuted by the point set correspondence matrix $P$. After permutation, A and B can be expressed as $A_{reorder} = P^{T} A$ and $B_{reorder} = P B$. The points of the permuted source point cloud $A_{reorder}$ correspond one-to-one to the points of B, so that large-deformation registration can be learned, which CPD-Net cannot do. The relative drifts between $A_{reorder}$ and B are learned by using the global features. As shown in Figure 6, $A_{reorder}$ and the global feature $V_b \in \mathbb{R}^{d}$ are concatenated, and the drift of each point is then learned through three MLPs. The reconstructed point cloud $A_{last}$ is the source point cloud A plus the drift. The module is able to efficiently learn the drift between points for the purpose of registration. The reconstruction module learns a displacement field function to estimate the geometric transformation and is able to predict the geometric transformation that aligns the objects.
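A small worked example of the permutation $A_{reorder} = P^{T} A$ followed by the drift step is sketched below. The function `predict_drift` is hypothetical: it stands in for the three MLPs applied to the concatenation of each point with the global feature, and here simply returns a constant offset. Adding the drift to the permuted points (rather than to the unpermuted A) is also an assumption made for this illustration.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(x, y):
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

# P[i][j] = 1 means source point a_i corresponds to target point b_j.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
A = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
B = [[2.0, 1.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

# Row j of A_reorder is the source point matched to b_j.
A_reorder = matmul(transpose(P), A)

def predict_drift(point, global_feature):
    """Hypothetical stand-in for the learned drift network."""
    return [0.0, 1.0, 0.0]

V_b = [0.0]  # placeholder global feature of B, unused by the stand-in
A_last = [[c + d for c, d in zip(p, predict_drift(p, V_b))] for p in A_reorder]
```

After the permutation, row j of `A_reorder` lines up with row j of `B`, so the residual between them is a per-point displacement that a network can regress directly.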

Unsupervised Loss Function
After registration, the reconstructed source point cloud $A_{last}$ should be similar to the target point cloud B. The Euclidean distance loss between $B_{last}$ and A is added to the standard loss, which helps the network better learn the relationship between A and B. The distance loss $L_{dis}$ is therefore defined from the similarity of the source and target point clouds after deformation. Since the points in A and B should be in one-to-one correspondence, their correspondence matrix should be a permutation matrix, and the product of the transpose of a permutation matrix with the matrix itself should be very close to the identity matrix. Based on this property, the matrix optimization loss $L_{mat}$ is defined in terms of the correspondence matrix $P$ and the $n \times n$ identity matrix $I_n$. The target point cloud B and the permuted source point cloud $A_{reorder}$ have similar local features, and likewise the source point cloud A and the permuted target point cloud $B_{reorder}$ have similar local features. Based on this property, the proximity similarity loss $L_{pro}$ is defined using $\Omega^{a}_{i}$, the set of $k$ indices around the $i$-th point in A, and the rearranged points $b^{re}_{i} \in B_{reorder}$ and $a^{re}_{i} \in A_{reorder}$.
Finally, we aggregate these losses into the final loss, in which $\lambda > 0$ and $\eta > 0$ are hyperparameters that regulate the balance between the individual losses.
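The loss terms can be sketched in a few lines. The exact expressions are assumptions inferred from the description: the matrix loss is taken as the squared Frobenius norm of $P^{T}P - I_n$, the distance loss as a squared Frobenius norm between index-aligned point clouds, and the final loss as a weighted sum in which λ scales the matrix loss and η scales the proximity loss (the assignment of the two weights is a guess). The proximity loss itself is not implemented and is passed in as a number.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(x, y):
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

def frobenius_sq(m):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(v * v for row in m for v in row)

def matrix_loss(P):
    """Assumed form of L_mat: || P^T P - I_n ||_F^2, zero for a permutation."""
    n = len(P)
    PtP = matmul(transpose(P), P)
    return frobenius_sq([[PtP[i][j] - (1.0 if i == j else 0.0)
                          for j in range(n)] for i in range(n)])

def distance_loss(X, Y):
    """Assumed form of one L_dis term: || X - Y ||_F^2 for index-aligned clouds."""
    return frobenius_sq([[a - b for a, b in zip(p, q)] for p, q in zip(X, Y)])

def total_loss(l_dis, l_mat, l_pro, lam=0.1, eta=0.01):
    """Assumed weighting: L = L_dis + lam * L_mat + eta * L_pro."""
    return l_dis + lam * l_mat + eta * l_pro

perm = [[0.0, 1.0], [1.0, 0.0]]
soft = [[0.5, 0.5], [0.5, 0.5]]
loss_perm = matrix_loss(perm)  # exact permutation
loss_soft = matrix_loss(soft)  # uniform, uninformative matrix
```

The matrix loss vanishes for an exact permutation and is positive for the uniform matrix, so minimizing it pushes the soft correspondence toward a one-to-one assignment.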

Experiment
In this section, the experimental results of NrtNet's non-rigid point cloud registration are presented. Section 4.1 describes the datasets and experimental parameters used for training and testing and briefly introduces the evaluation method. Section 4.2 compares the non-rigid point cloud registration results of different networks. Section 4.3 discusses the results of non-rigid methods on rigid registration. Section 4.4 presents the registration results of NrtNet on small deformation datasets. Section 4.5 compares the effects of different losses. Section 4.6 shows the registration results of NrtNet on real scan data.

Experimental Setup
Dataset. We use the 200k samples from Surreal [33] as the unsupervised training data and divide these 200k samples into 100 random pairs for registration training. We used the 300 pairs from Shrec [52] as the test dataset. We downsampled Shrec's data to 1024 grids and took the grid vertices as input to keep the variables constant. To evaluate robustness across datasets, we also used the dataset of Bednarik, J. et al. [53], which includes small deformation datasets of paper, t-shirt, sweater, and cloth, to train on different data and verify the reliability of NrtNet.
Evaluation. Most prior work, including CPD-Net [38], DispVoxNets [39], and PR-Net [40], evaluates registration by directly comparing the CD loss or by subjectively comparing plots of the results after registration. Almost none of these methods register by first finding point-pair correspondences as we do, so we follow the evaluation method of CorrNet3D [16] and assess the model by whether the point sets correspond to each other. The point correspondence rate is computed from the predicted correspondence matrix $P$ and the ground-truth correspondence matrix $P_{gt}$ using the Hadamard product $\odot$ and the $\ell_1$ norm $\|\cdot\|_1$. To compare the methods, we report the percentage of correct correspondences under different tolerances, where $r$ is the error tolerance radius.

Experimental parameters and configuration. We set the hyperparameters $\lambda = 0.1$ and $\eta = 0.01$. Our method was implemented in PyTorch, and training and testing were run on an NVIDIA GTX 1080 GPU. The learning rate was $1 \times 10^{-4}$, the batch size was two, and we trained for 50 epochs on the large Surreal dataset [33] and 500 epochs on the small deformation dataset [53].
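The point correspondence rate from the Evaluation paragraph can be sketched as follows. The normalization by the number of points and the tolerance rule (a predicted match counts as correct if it lies within radius r of the ground-truth target point) are assumptions based on the description, and the function names are hypothetical.

```python
import math

def correspondence_rate(P, P_gt):
    """Assumed form: || P (Hadamard) P_gt ||_1 / n for binary matrices."""
    n = len(P)
    hits = sum(P[i][j] * P_gt[i][j] for i in range(n) for j in range(n))
    return hits / n

def correspondence_rate_with_tolerance(P, P_gt, B, r):
    """Fraction of source points whose predicted target lies within r of the true one."""
    n = len(P)
    correct = 0
    for i in range(n):
        j_pred = P[i].index(max(P[i]))
        j_true = P_gt[i].index(max(P_gt[i]))
        if math.dist(B[j_pred], B[j_true]) <= r:
            correct += 1
    return correct / n

# Toy case: two target points are nearly coincident and get swapped.
B = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.05, 0.0, 0.0)]
P_gt = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
P = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]

exact = correspondence_rate(P, P_gt)
tolerant = correspondence_rate_with_tolerance(P, P_gt, B, r=0.1)
```

The exact rate penalizes the swap of the two nearby points, while the tolerant rate accepts it, which is why reporting the rate over a range of radii gives a fuller picture than the exact rate alone.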

Experimental Evaluation of Non-Rigid Point Cloud Registration
NrtNet was compared with unsupervised FlowNet3D [54], unsupervised CorrNet3D [16], and unsupervised CPD-Net [38]. Figure 7 and Table 1 show a quantitative comparison of the different methods; our method consistently outperforms the other unsupervised methods. In particular, the performance advantage is more significant over FlowNet3D and CPD-Net, and there is also some improvement over CorrNet3D. The point set correspondence rate of CPD-Net is low, and its registration effect is poor for large deformation datasets. Although FlowNet3D has a high correspondence rate, its registration effect depends heavily on the dataset: it registers well on some test datasets and poorly on others. Only NrtNet and the recently published CorrNet3D achieve better registration results. Because NrtNet uses a transformer, which is better than CorrNet3D at point correspondence, it can still achieve better registration results for some datasets with larger deformations.

Figure 8 shows the qualitative comparison results. NrtNet is less affected by large deformations in unsupervised non-rigid point cloud registration and can generate a point cloud with accurate correspondence. In contrast, CPD-Net and FlowNet3D are affected by large deformation: their correspondences deviate, and they cannot achieve effective registration when the target point cloud differs greatly from the source point cloud. NrtNet learns the point set correspondence between the target and source point clouds and thus registers more effectively. Our network can further enhance its robustness to the degree of deformation by learning the specific type.

Experimental Evaluation of Rigid Point Cloud Registration
We apply the non-rigid registration methods to rigid point clouds and compare our method with FlowNet3D, CorrNet3D, and CPD-Net on rigid point cloud registration. Figure 9 and Table 2 compare our method with the other methods under different fault tolerances. The table shows that our method has the best results under the same fault tolerance; CorrNet3D improves greatly over FlowNet3D, and our method improves further over CorrNet3D. Compared with its non-rigid results, the unsupervised registration of CPD-Net improves little on rigid point clouds, while NrtNet registers well in both the rigid and non-rigid settings.

Comparison between Different Datasets
NrtNet was trained and tested on the dataset of Bednarik, J. et al. [53] to evaluate its stability on different datasets. The dataset was divided into a training set and a test set in a ratio of 8:2. As shown in Figure 10, NrtNet registers small deformation datasets well, both by learning the deformed part efficiently and by handling rigid transformations of the deformed point clouds. Compared with existing methods, NrtNet has a good registration effect not only on large deformation datasets but also on small deformation datasets. Obtaining the point set correspondence through the transformer first makes the registration more effective.

Figure 10. Registration performance of NrtNet on the small deformation datasets paper, cloth, sweater, and t-shirt.

Comparison of Different Losses
In this experiment, we compared the Euclidean distance loss $L_{dis}$ with the CD loss, and we also measured how much each of them improves when the optimization losses $L_{mat} + L_{pro}$ are added. Figure 11 and Table 3 show the point-set correspondence rate under the different losses; the loss of NrtNet achieves the best results under the same fault tolerance. Comparing $L_{dis}$ with $L_{dis} + L_{mat} + L_{pro}$ shows that $L_{mat} + L_{pro}$ improves $L_{dis}$ markedly. Comparing the CD loss with $L_{dis}$ alone, the CD loss gives a certain improvement. Comparing the CD loss with CD loss $+ L_{mat} + L_{pro}$, the added terms have little effect, and the increase in correspondence rate is minimal. These experiments show that the loss of NrtNet achieves the best experimental results.
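One way to see why the two distance losses interact differently with the correspondence terms is to compare them on a point cloud whose points are correct but listed in the wrong order. The sketch below uses common textbook forms of both losses (squared index-wise Euclidean distance and a symmetric mean-squared Chamfer distance); these forms are assumptions and may differ in scaling from the ones used in the experiments.

```python
import math

def euclidean_loss(X, Y):
    """Sum of squared distances between points that share the same index."""
    return sum(math.dist(x, y) ** 2 for x, y in zip(X, Y))

def chamfer_distance(X, Y):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both ways."""
    def one_way(S, T):
        return sum(min(math.dist(s, t) ** 2 for t in T) for s in S) / len(S)
    return one_way(X, Y) + one_way(Y, X)

B = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
A_wrong_order = [B[1], B[2], B[0]]  # same shape, wrong point order

cd = chamfer_distance(A_wrong_order, B)
ed = euclidean_loss(A_wrong_order, B)
```

The Chamfer distance is zero here because it ignores point order, whereas the index-wise Euclidean loss is large, so only the latter produces a training signal that depends on the correspondence being correct.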

Real Scan Data
This section shows the effect of NrtNet registration on real data. The experiments used Shrec's real human scan dataset [52]; since no ground truth is available, it is hard to evaluate the results quantitatively. Figure 12 therefore shows the final registration results for a qualitative analysis. NrtNet is able to effectively align the poses and shapes of the point clouds, and it can align any data without ground truth. In Figure 12, the same color represents corresponding points, and NrtNet produces good correspondences between point set pairs. Although there are registration errors in some details, the experimental results are already much better than those of traditional non-rigid registration networks. The method achieves good registration results for each movement.

Conclusions
We propose NrtNet, an unsupervised transformer-based registration architecture, which learns the correspondence between pairs of largely deformed point sets to effectively improve registration performance. NrtNet is much better than FlowNet3D in large deformation point cloud registration and also outperforms the state-of-the-art CorrNet3D. This suggests that NrtNet can be used for most large deformation registration applications. We also show registration results on real scan data in the absence of ground truth, where NrtNet still registers well. NrtNet is an important step forward in large deformation non-rigid point cloud registration and eliminates the reliance on ground truth for non-rigid point cloud registration.
In future work, NrtNet can be extended to voxels for non-rigid point cloud registration. Our correspondence may be inaccurate for points that are far apart. To address this, we will sort the point cloud in future experiments and then use the transformer to find the point set correspondence, analogous to the ordering of words in NLP; we believe this can improve the registration effect to a certain extent. We also believe that NrtNet can help with large-scene point cloud registration, as well as human motion analysis and animal and plant growth analysis. Meanwhile, the model size of NrtNet can be further optimized to reduce training time.