VPRNet: Virtual Points Registration Network for Partial-to-Partial Point Cloud Registration

Abstract: With the development of high-precision and high-frame-rate scanning technology, scan data of various large-scale scenes can be obtained quickly. As a form of information fusion, point cloud registration is of great significance in various fields, such as medical imaging, autonomous driving, and 3D reconstruction. The Iterative Closest Point (ICP) algorithm, the most classic approach, searches for corresponding points by nearest-neighbor matching and is the pioneer of correspondence-based methods. Recently, deep learning-based algorithms such as Deep Closest Point (DCP) and DeepVCP have extracted deep features to compress point cloud information, computed corresponding points from them, and finally output the optimal rigid transformation. However, in partial-to-partial registration, the partiality of the point clouds hinders the acquisition of enough corresponding points. To this end, we propose the Virtual Points Registration Network (VPRNet) for this intractable problem. We first design a self-supervised virtual point generation network (VPGnet), which uses the attention mechanisms of a Transformer and Self-Attention to fuse the geometric information of two partial point clouds and combines them with a Generative Adversarial Network (GAN) structure to produce the missing points. A registration network is then appended to the end of VPGnet to estimate rich corresponding points. Unlike existing methods, our network tries to eliminate the side effects of incompleteness on registration, so our method is resilient to the initial rotation and to sparsity. Various experiments indicate that the proposed algorithm shows advanced performance compared with recent deep learning-based and classical methods.


Introduction
Point cloud registration is a fundamental task that has been widely used in many computational fields, such as object pose estimation [1], SLAM [2], and 3D reconstruction [3]. In its most common form, the problem reduces to estimating point correspondences and computing a rigid transformation (rotation and translation), both of which can be misled by noise and partiality.
Iterative Closest Point (ICP) [4], the most representative method, is the gold standard for solving registration problems. It iteratively obtains point correspondences by nearest-neighbor search and estimates the rigid transformation by Singular Value Decomposition (SVD). The ICP algorithm does not require any prior information about the original point clouds. However, convergence to the global minimum places strict requirements on the initial poses, because the accuracy and locality of convergence depend heavily on the proportion of the overlapping area [5,6]. Besides, noise and outliers also hinder the estimation of the rigid transformation. Therefore, many works have been proposed to overcome the shortcomings of ICP [7][8][9][10]. The Point-to-Plane ICP algorithm [9] modifies the
• A self-supervised virtual point generation network (VPGnet) based on GAN is proposed. VPGnet focuses on the shape information of point clouds and can effectively complete the partial point cloud.
• A combination strategy of virtual point generation and corresponding point estimation is proposed, which can reduce the negative effect of partiality during registration.
• Various experiments demonstrate advanced performance compared with other state-of-the-art approaches.
The rest of this paper is organized as follows: Section 2 reviews previous literature. Section 3 describes the architecture of our proposed network. Section 4 presents the experiments. Section 5 discusses the experimental results. Finally, Section 6 summarizes our work.

Related Work
Point cloud registration aims to find a rigid transformation, consisting of a rotation matrix and a translation vector, and to apply it to align the source point cloud with the target point cloud. Over the past few decades, many solutions to this fundamental task have been proposed. Using time as the dividing line, we split registration methods into traditional and deep learning-based methods. Before 2017, most scholars focused on conventional methods because of the sparsity and disorder of point clouds. After 2017, benefiting from the landmark PointNet [13] and PointNet++ [14], a large number of researchers turned to deep learning-based methods [24]. The following text summarizes point cloud registration methods from these two aspects.

Traditional Methods
The most seminal method for solving registration problems is ICP [4]. This algorithm alternates between finding correspondences and updating a rigid transformation matrix in a coarse-to-fine manner. Specifically, after obtaining the corresponding point sets, the ICP algorithm employs the least-squares method to solve for the transformation parameters. As a fine registration algorithm, ICP can obtain accurate results, but some shortcomings deserve attention. The ICP algorithm needs a good initial value as input; otherwise, it easily converges to a local optimum [25]. Moreover, its registration accuracy depends heavily on the overlap rate of the point clouds [5,26]. Besides, the ICP algorithm requires many iterations to find the optimal corresponding point pairs, which is time-consuming [27,28].
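The alternation described above can be illustrated with a minimal point-to-point ICP sketch. It is not the implementation used in any of the cited works: it assumes NumPy, uses a brute-force nearest-neighbor search, and the function names `best_rigid_transform` and `icp` are hypothetical.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares R, t mapping paired points src -> dst (both (n, 3))."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that det(R) = +1 (rotation, not reflection).
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t


def icp(source, target, max_iters=50, tol=1e-8):
    """Point-to-point ICP with brute-force nearest-neighbour search."""
    R_total, t_total = np.eye(3), np.zeros(3)
    moved = source.copy()
    prev_err = np.inf
    for _ in range(max_iters):
        # Step 1: correspondences = closest target point for every source point.
        d2 = ((moved[:, None, :] - target[None, :, :]) ** 2).sum(axis=-1)
        nearest = d2.argmin(axis=1)
        # Step 2: closed-form least-squares update of the transformation.
        R, t = best_rigid_transform(moved, target[nearest])
        moved = moved @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
        # Mean squared distance of the correspondences found in this iteration.
        err = d2[np.arange(len(moved)), nearest].mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_total, t_total
```

Because each step can only lower the nearest-neighbor error, the loop converges, but only to the optimum closest to the initial pose, which is the sensitivity to initialization noted above.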
These drawbacks prohibit the application of the ICP algorithm in real-time and large-scale scenarios, and several solutions have been proposed. On the one hand, because coarse registration makes no assumption about the initial poses of the point clouds, using the result of coarse registration as the initial value of the ICP algorithm has become the consensus for the registration task [29]. A popular approach uses the RANSAC method to find corresponding triples [30]. The complexity of the RANSAC algorithm regularly degrades to its worst case of $O(n^3)$ in the number $n$ of data samples [29,31]. As improvements to RANSAC, the 4-Points Congruent Sets (4PCS) algorithm [32] and the Super 4PCS algorithm [31] perform the registration with four selected point pairs instead of three, reducing the computational complexity to $O(n^2)$ and $O(n)$, respectively. Moreover, Super Edge 4PCS uses the edges of point clouds to perform the registration, thus greatly reducing the running time [33]. On the other hand, some ICP variants were proposed with distances defined as point-to-plane [9,34], point-to-triples [35], and plane-to-plane [36]. In addition to changing the objective function, improving the search strategy is also a meaningful improvement. Eggert [37] and Vlaminck [38] employed two search structures, the kd-tree and the octree, to speed up correspondence acquisition. These classical methods still either fall easily into local optima or are time-consuming, which limits their application in large-scale scenarios that require real-time registration [18].

Learning-Based Methods
Learning-based methods have gradually been accepted since 3DMatch [39] was proposed in 2017. After PointNet [13] and PointNet++ [14], scholars could apply convolutional neural networks directly to unordered points. As a result, deep learning methods have developed considerably. Correspondence-based methods and correspondence-free methods form the two main branches of learning-based methods [24].

Correspondence-Free Methods
The critical step of correspondence-free methods is to regress the global high-dimensional features generated by a deep neural network and output the rigid transformation parameters. PointnetLK [16] modifies the traditional Lucas and Kanade (LK) algorithm and unrolls it with PointNet into a trainable deep network framework. However, this method relies on extensive derivation rather than simply concatenating the global features to solve R and t, which inevitably lowers computational efficiency [18,19]. As an improvement to PointnetLK, PCRnet [18] replaces the approximation of the Jacobian with a data-driven technique, namely a deep feature alignment layer that outputs the transformation parameters directly. Although PCRnet improves efficiency and robustness compared with PointnetLK, the latter shows better generalization across object categories [18]. Feature-Metric Registration (FMR) assumes that the features extracted from point clouds with different poses differ, and solves the transformation iteratively by calculating the differences between global features [40]. Although the above correspondence-free methods follow a straightforward end-to-end network architecture, their performance depends heavily on the feature extraction block [24]. Likewise, we follow an analogous end-to-end network architecture but mix the extracted embeddings with other geometric information from the opposite point cloud.

Correspondence-Based Methods
Compared with the straightforward structure of correspondence-free methods, correspondence-based methods often have a more complex network architecture. Although representing point clouds with voxels for network training is not as popular as the PointNet-based methods because of the vast memory requirement and the loss of quality [41], some related voxel-based methods are still worth discussing [24]. As a pioneering approach, 3DMatch [39] maps the local area around each interest point to a 512-dimensional feature vector. Besides, the Perfect Match [42] employs Smoothed Density Value (SDV) voxelization to extract features computed with a Gaussian smoothing kernel. Recently, Huang et al. [43] designed an overlap attention module in the feature encoding stage for early information exchange, which improves registration accuracy and suits low-overlap scenes.
Inspired by the PointNet framework, PPFnet [44] defines point pair features, which include the coordinates and normals of point pairs, to describe the local regions around oriented 3D points. This feature processing yields rotation invariance but depends excessively on normal estimation [24]. Another representative method, DeepVCP [21], employs a mini-PointNet++ [44,45] composed of three consecutively stacked fully connected layers and max-pooling layers to extract features and avoid the interference of dynamic targets. The generated corresponding points boost registration accuracy [21]. In addition to feature extraction, outlier rejection also leverages PointNet's advantages. 3DRegnet [46] provides a classification block and a registration block, which extend the deep ResNet [47] to extract meaningful features and eliminate incorrect correspondences. However, none of the above methods pays attention to the corresponding points in the non-overlapping area, which affects the accuracy of the corresponding points [12].

Under Partial Overlap
Among point cloud registration tasks, partially overlapping cases pose a considerable challenge to deep learning methods because of the drastic differences in global information [24]. Consequently, some algorithms focus mainly on partial-to-partial registration. As a successful case of applying the attention mechanism to registration, Deep Closest Point (DCP) [22] employs the Transformer [48] to absorb information from the two point clouds and generates corresponding point pairs via soft pointers. However, the mapping M produces blurred correspondences in exchange for this differentiability. PRnet [23] extends the DCP algorithm to an iterative pipeline and uses Gumbel-Softmax sampling to define a sharp mapping function that still permits backpropagation. A corresponding point generation method similar to DCP appears in [49]. RPMnet proposes a subnetwork to predict two annealing parameters and uses them together with Sinkhorn normalization to generate a match matrix [41]. The above methods contain no targeted measures for partial point clouds. Paying attention to the negative effect of non-overlapping points, OMNet [12] learns an overlapping mask and achieves state-of-the-art performance. Song et al. propose a partial point cloud registration network that employs a graph attention module to predict key points [50]. Similarly, Eduardo et al. apply a RANSAC procedure after correspondence matching [51]. In all these methods, the generation of corresponding point pairs relies only on fusing features mapped from the original partial point clouds. Based on a brand-new idea, we propose an end-to-end network framework called VPRNet, which includes virtual point generation (VPGnet) and registration (Regnet). VPRNet uses the GAN architecture to continuously generate missing points and applies an attention mechanism to weight the correspondences, ensuring correspondence quality.

Overview
Our VPRNet is divided into two parts: VPGnet and Regnet. VPGnet is designed to generate virtual points, and Regnet registers the completed point clouds. Figures 1 and 2 show the frameworks of VPGnet and the registration network, respectively. Structurally, the algorithm follows the GAN framework with a self-supervised training strategy: the ground-truth missing part is separated from the original complete point cloud, and no auxiliary labeled data are added to the training process. The generator and discriminator compete with each other until the discriminator cannot judge whether a virtual point produced by the generator is a ground-truth point or a fake one. The generator extracts the features of the three groups of sampled point clouds with PointNet and DGCNN. Then, the core Transformer and Self-Attention (SA) modules combine the two hybrid features from the original point clouds. Finally, the missing parts are generated by MLP and Reshape operations. Regnet first combines the virtual points generated by VPGnet with the original point clouds to form the complete point clouds $X$ and $Y$. We then transform $X$ into $X'$ and $Y$ into $Y'$ according to the rotation matrix $R_{i-1}$ and translation vector $t_{i-1}$ from the previous iteration. After the hybrid features are extracted from $X'$ and $Y'$, the probability volume is calculated by applying softmax to the features. Then, the corresponding matrix $\Sigma$ is obtained as the weighted sum of probabilities and point coordinates. Finally, the SVD module is applied to generate the new rotation matrix $R_i$ and translation vector $t_i$.

Figure 1. Architecture of our VPGnet. The self-supervised network is mainly composed of two parts, the generator and the discriminator. The generator sub-network extracts features through Self-Attention and Transformer; MLP and Reshape operations are then used to generate virtual points. Next, the features of the generated points and the ground truth are extracted by the DGCNN of the discriminator and compared with each other. Finally, the probability that the input point cloud is the ground truth is output.

Multi-Resolution Feature Extraction
The first step is to represent the point cloud as embedded features. Deng et al. [52][53][54] perform convolution operations on the entire point cloud and then duplicate the global features n times, where n is the number of points. The mixed point feature is then formed by concatenating the global and local features. Despite its simplicity, this scheme provides no transition between point coordinates and global information. In contrast, Qi et al. [13,55] pointed out that local and global features extracted at different scales describe the point cloud more efficiently. Consequently, we employ the multi-resolution feature extraction architecture proposed in PointNet++. As shown in Figure 1, we first perform Farthest Point Sampling (FPS) on the source and target point clouds. Inspired by LRANet [56] and PF-Net [57], we perform FPS three times on the original point clouds. Then, the shared DGCNN encodes the points and their neighbors into latent vectors $F_l^i$, where $i \in [1,3]$ is the scale number and $l \in [1,5]$ is the index of the convolution layer. DGCNN integrates the local neighborhood information of the point cloud, which is not available in PointNet [22]. After the first four convolution layers, the dimensions of the feature vectors are [64, 64, 128, 256]. Before the fifth convolution layer, the four feature vectors are concatenated into a 512-dimensional latent vector. We then pass this latent vector through the fifth convolution layer to get a 1024-dimensional feature vector $F_5$. Putting all $F_l^i$ together, we get a 3 × 1024 latent map. In addition to the local embeddings, we want the embeddings to capture the information of the entire point cloud rather than only the neighborhood of a given point. Therefore, we choose the PointNet [13] architecture to obtain the global information of the input point clouds. The points are encoded into successive dimensions [64, 128, 256, 512, 1024]. After the max-pooling operation, we obtain the 1024-dimensional global feature $F_g$. The combination of $F_g$ and $F_l^i$ balances details and overall information. The feature encoding process can be summarized as: where $D$ is DGCNN, $x \in \mathbb{R}^{n \times 3}$ is the original point cloud, and $n$ is the number of points in $x$; $\mathrm{FPS}_i$ is the $i$-th farthest point sampling with a different sampling size; $F_g$ and $F_l^i \in \mathbb{R}^d$ represent the global and local features, with $d = 1024$; and $\oplus_m$ means repeating the DGCNN and FPS operations $m$ times and concatenating the resulting vectors.
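The farthest point sampling step that produces the multi-resolution inputs can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code: the function names are hypothetical, and the three sampling sizes are placeholders, since the sizes actually used are not stated in this section.

```python
import numpy as np


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    n = points.shape[0]
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = start
    # Squared distance from every point to its closest already-chosen point.
    min_d2 = np.full(n, np.inf)
    for i in range(1, k):
        last = points[chosen[i - 1]]
        min_d2 = np.minimum(min_d2, ((points - last) ** 2).sum(axis=1))
        chosen[i] = int(min_d2.argmax())
    return chosen


def multi_resolution(points, sizes=(768, 384, 192)):
    """Sample the cloud at several resolutions; the sizes are illustrative only."""
    return [points[farthest_point_sampling(points, min(s, len(points)))] for s in sizes]
```

Each resolution would then be passed through the shared DGCNN encoder, giving one 1024-dimensional local feature per scale.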

Attention
Both input point clouds lack part of their geometric information. We therefore use the shape information of one point cloud to complete the other, which means that the embeddings from the two point clouds must be merged instead of decoding the two latent maps independently. Inspired by a recent article [58], we attach two attention mechanisms to steer the encoder's attention: the Transformer and Self-Attention (SA) modules.
The Transformer is the first sequence transduction model that relies entirely on SA to compute the representations of its input and output [48]. It was first used in natural language processing (NLP) to solve sequence-to-sequence problems such as machine translation. The Transformer consists of an encoder module and a decoder module, each of which is a stack of identical sub-encoders or sub-decoders. Every sub-encoder has the same structure, consisting of two parts: SA and a feed-forward neural network. The SA module helps the current element incorporate the semantics of its context. Compared with the encoder, the decoder additionally contains a masked self-attention that hides later elements, which helps the decoder focus on the relevant part of the input sequence. Reviewing the complete encoding and decoding process: the embedding $E_1$, obtained after position encoding of the sequence $S_1$, is fed into the encoder, which outputs a new embedding $E_1'$ after SA. There is a residual connection in the sublayers of each encoder, so the output of the encoder is: In the decoding process, the new $E_1'$ is first decoded to obtain the sequence $S_2$, which is then encoded by the decoder and merged with $E_1'$ to output a new sequence $S_3$.
We draw inspiration from the use of the Transformer for sequence-to-sequence problems: the Transformer combines two sequences so that the encoder and decoder modules can learn co-contextual information. Consequently, we use the Transformer as the first attention method to supplement the semantic information of one point cloud with that of the other. The calculation of the Transformer can be summarized as the following equations: Assume that the latent maps obtained from the input point clouds are $M_x$ and $M_y$, that $r$ is the number of latent vectors obtained by DGCNN and PointNet (here 4), and that $d$ is the dimension of the latent vectors. $\Theta_x$ and $\Theta_y$ are the high-dimensional features output by the Transformer $\Omega \in \mathbb{R}^{r \times d}$. It is worth noting that $\Omega$ is not a symmetric function: $\Omega(x, y) \neq \Omega(y, x)$. The decoder realizes a meaningful fusion of the information contained in the two sequences.
However, this scheme presumes that the missing parts of the current point cloud are known before the cloud is completed. Therefore, we use a separate Self-Attention module as a sibling attention mechanism alongside the Transformer, aiming to make each point cloud aware of its own distinctive shape. The structure of SA can be described by the following equation: where $\varphi$ represents the softmax function, and $C_q$ and $C_k$ are the convolutions that generate the query and key vectors. These two vectors are used to score the high-dimensional feature vectors generated from the coordinates of the other points in the point cloud. The scores determine how strongly each feature is expressed. They are multiplied by the value vector generated by $C_v$ to divert attention from points with less correlation. The entire SA process also follows the structure of residual connections. The latent map obtained by the Transformer is passed through a max-pooling layer of [1024-512] and then concatenated with the embedding vectors obtained by SA, which finally yields a latent map with 1536 dimensions. The subsequent MLP layers encode the embedding vectors obtained by the attention mechanism into 192 dimensions, so that the final reshape layer can output 256 virtual points.
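To make the query/key/value scoring with a residual connection concrete, the following NumPy sketch implements a generic single-head self-attention over a latent map. It is an illustration under stated assumptions, not the authors' layer: the convolutions $C_q$, $C_k$, and $C_v$ are modelled as plain weight matrices `Wq`, `Wk`, and `Wv` (a pointwise convolution acts as a matrix product on each feature vector), and the scaling by the square root of the key dimension follows the standard Transformer formulation, which this section does not confirm.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(F, Wq, Wk, Wv):
    """F: (r, d) latent map; Wq, Wk: (d, dk); Wv: (d, d). Returns an (r, d) map."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    scores = softmax(Q @ K.T / np.sqrt(Q.shape[1]))   # (r, r); each row sums to 1
    return F + scores @ V                              # residual connection
```

With $r = 4$ latent vectors, the score matrix is only 4 × 4, so this attention step is cheap compared with the feature extraction that precedes it.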

Discriminator
As the other important component of the GAN, the Discriminator judges the virtual points generated by the Generator. Its working mode can be described as: where $x$ represents the input point cloud (either generated virtual points or the ground truth), $D$ is the DGCNN operation, $\xi$ is the max-pooling layer, $\phi$ is the Leaky ReLU activation function, and $L_i$ represents the $i$-th linear layer. The Discriminator takes the virtual points generated by the Generator and the ground-truth missing point clouds as inputs, and outputs the predicted probability that the received point cloud is ground truth. It calculates the adversarial loss between the predicted and the actual labels and feeds it back to the Generator. This game is repeated until the predicted probability for the virtual points is close to 0.5, which means that the Discriminator can no longer tell the generated point cloud from the ground truth.
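The role of the 0.5 equilibrium can be illustrated with the standard binary cross-entropy form of a discriminator loss. This is a sketch under an assumption: the exact adversarial loss of VPGnet is not reproduced in this section, so the standard GAN formulation is used instead, and the helper names `bce` and `discriminator_loss` are hypothetical.

```python
import math


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one prediction; label is 1 (real) or 0 (generated)."""
    prob = min(max(prob, eps), 1.0 - eps)      # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(p_real, p_fake):
    """Average loss over real samples (label 1) and generated samples (label 0)."""
    losses = [bce(p, 1) for p in p_real] + [bce(p, 0) for p in p_fake]
    return sum(losses) / len(losses)
```

When the discriminator outputs 0.5 for every input, this loss equals ln 2, the value reached when generated and ground-truth point clouds can no longer be told apart.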

Correspondences Calculation
After the virtual points are obtained, they are first combined with the original points to form the complete point clouds $PC_g$. Then, applying the rotation matrix $R$ and translation vector $t$ from the previous iteration to $PC_g$ gives the input of the current iteration. Next, DGCNN and the Transformer are used to extract and fuse features, as in VPGnet. The Transformer in Regnet forces the encoder to pay more attention to the spatial information of the other point cloud, that is, its orientation and position. The dimension of the embedding vectors obtained after the Transformer is $n \times 1024$, where $n$ is the number of points in the point cloud. To obtain the corresponding points in the target complete point cloud, we calculate the correlation between each pair of points in the two combined point clouds $PC_g^x$ and $PC_g^y$, which is expressed as: where $\Theta_x$ and $\Theta_y \in \mathbb{R}^{n \times 1024}$ denote the high-dimensional feature maps after the Transformer. The dimension of $\Sigma$ is $n \times m$, where $n$ and $m$ are the sizes of the source and target point clouds, respectively. Each element $\Sigma_{ij}$ represents the correlation between the $i$-th point in the source complete point cloud $PC_g^x$ and the $j$-th point in the target complete point cloud $PC_g^y$. The corresponding points in the target point cloud are then calculated as $\Sigma \cdot PC_g^y$.
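A minimal sketch of this soft correspondence step is given below. It assumes that $\Sigma$ is a row-wise softmax of the feature inner products, which matches the softmax-based probability volume mentioned in the overview but is not confirmed by an equation here; the function name is hypothetical.

```python
import numpy as np


def soft_correspondences(feat_x, feat_y, pc_y):
    """feat_x: (n, d), feat_y: (m, d), pc_y: (m, 3) -> Sigma (n, m), matched (n, 3)."""
    logits = feat_x @ feat_y.T                            # pairwise feature similarity
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilise the exponentials
    sigma = np.exp(logits)
    sigma /= sigma.sum(axis=1, keepdims=True)             # row-wise softmax
    return sigma, sigma @ pc_y                            # weighted target coordinates
```

Each matched point is a convex combination of target points, so it always lies inside the convex hull of the target cloud; the sharper a row of $\Sigma$, the closer the match is to an actual target point.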

SVD Module
Now, for each point $x_i$ in the source complete point cloud, there are $m$ weighted corresponding points $y_j$ in the target complete point cloud. To reduce the burden of network training, we employ the SVD module to calculate the final rotation matrix $R_{xy}$ and translation vector $t_{xy}$. We define the centroids of $PC_g^x$ and $PC_g^y$ as: The covariance matrix can be expressed as: Then, singular value decomposition is performed on $H \in \mathbb{R}^{3 \times 3}$: where $U$ and $V \in SO(3)$ are the matrices formed by the eigenvectors of $HH^T$ and $H^TH$, respectively. $S$ is a diagonal matrix whose diagonal elements are the singular values of $H$. Finally, the rotation matrix $R_{xy}$ and the translation vector $t_{xy}$ can be calculated according to Equation (10):
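The SVD step can be sketched with the standard closed-form solution for paired points (centroids, cross-covariance $H$, decomposition $H = USV^T$, then $R = VU^T$ and $t = -R\bar{x} + \bar{y}$). This is the textbook formulation, assumed here to correspond to the module described above; the reflection check and the function name are additions for illustration.

```python
import numpy as np


def svd_rigid_transform(x, y):
    """x, y: (n, 3) paired points. Returns R (3, 3), t (3,) with y ~ R x + t, and S."""
    x_bar, y_bar = x.mean(axis=0), y.mean(axis=0)   # centroids
    H = (x - x_bar).T @ (y - y_bar)                 # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(H)                     # H = U diag(S) Vt
    V = Vt.T
    if np.linalg.det(V @ U.T) < 0:                  # reflection case: flip the weakest axis
        V[:, -1] *= -1.0
    R = V @ U.T
    t = -R @ x_bar + y_bar
    return R, t, S
```

The returned `S` holds the singular values of $H$; their squares are the eigenvalues of $H^TH$, which is why the columns of $U$ and $V$ can be read as eigenvectors of $HH^T$ and $H^TH$.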

Loss Functions
The first loss function is the adversarial loss of the Discriminator, $L_d$, in VPGnet. We consider four groups of adversarial losses, corresponding to the ground-truth x point cloud, the generated x virtual points, the ground-truth y point cloud, and the generated y virtual points, so $L_d$ is: Each $L_d^j$ is defined as: where $x_i$ is the $i$-th point cloud, $GT_i$ is the $i$-th ground-truth missing point cloud, and $N$ is the number of input point clouds. $D(\cdot)$ and $G(\cdot)$ represent the Discriminator and Generator. Fan et al. proposed two permutation-invariant metrics to calculate the distance between two point clouds: Chamfer Distance (CD) and Earth Mover's Distance (EMD). CD calculates the average closest-point distance between two input point clouds, as shown in Equation (13). The first term represents the sum of the minimum distances from each point $x$ in $S_1$ to $S_2$, and the second term plays the symmetric role. The two sets $S_1$ and $S_2$ do not need to be the same size. EMD was first proposed in [59] as a histogram similarity measure based on transportation efficiency. It calculates the minimum distance from one distribution to another. Unlike CD, the calculation of EMD requires that the two sets $S_1$ and $S_2$ have the same size. The calculation method is shown in Equation (14): We calculate the CD and EMD between the generated missing parts (the virtual point clouds) and the ground truth. In addition, the CD between the combined point clouds and the ground-truth complete point clouds is employed to ensure that the former have a similar shape and structure to the latter. Therefore, the loss function of the Generator can be summarized as follows: where $V_x$ and $V_y$ are the virtual point clouds generated from the source partial cloud $X$ and the target partial cloud $Y$; $GV_x$ and $GV_y$ are the ground-truth missing regions of the two input point clouds; $PC^x$ and $PC^y$ are the complete point clouds consisting of the original partial clouds and the generated virtual points; and $PC_{gt}^x$ and $PC_{gt}^y$ are the ground-truth complete point clouds. $L_g^y$ is calculated with the same method and symmetrical parameters.
The last loss function is the registration loss. We directly measure the deviation of the predicted $R$ and $t$ from the ground-truth $R_g$ and $t_g$, which are recorded during the preprocessing of the original point clouds. Equation (17) shows the last loss term: Here, the subscript $g$ denotes ground truth, and $k$ represents the total number of iterations. Therefore, the total loss can be summarized as follows:
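The two point-set distances can be sketched as follows. The normalization is an assumption: the Chamfer Distance below averages squared Euclidean distances in both directions, and the EMD averages Euclidean distances under the best one-to-one matching, which may differ from Equations (13) and (14) by constant factors. The brute-force EMD tries every bijection and is only meant for tiny sets; practical implementations use approximate solvers. Both function names are hypothetical.

```python
from itertools import permutations

import numpy as np


def chamfer_distance(s1, s2):
    """Symmetric Chamfer Distance between s1 (n, 3) and s2 (m, 3); n may differ from m."""
    d2 = ((s1[:, None, :] - s2[None, :, :]) ** 2).sum(axis=-1)   # (n, m) squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()


def emd_bruteforce(s1, s2):
    """Exact EMD for tiny equal-sized sets by trying every bijection (factorial cost)."""
    if len(s1) != len(s2):
        raise ValueError("EMD needs sets of equal size")
    dist = np.linalg.norm(s1[:, None, :] - s2[None, :, :], axis=-1)
    rows = np.arange(len(s1))
    return min(dist[rows, list(p)].mean() for p in permutations(range(len(s2))))
```

The size check in `emd_bruteforce` mirrors the requirement stated above: a bijection only exists between sets of equal size, whereas the Chamfer Distance accepts sets of different sizes.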

Implementation Details
First, we set the training batch size to 64 and the number of epochs to 250. Adam is the selected optimizer, with a learning rate of 0.0002 and a weight decay of 0.001, to perform gradient descent stably and efficiently. To speed up the training of the GAN, we first train the generator (G network) for 50 epochs so that it can produce relatively accurate virtual points after a short training period. The total number of iterations in Regnet is three. The α in Equation (18) is set to 0.05.

Dataset
We trained and evaluated VPRNet on the Modelnet40 dataset, which comprises 12,311 meshed CAD models grouped into 40 man-made object categories. We follow the original division into training and testing sets in Modelnet40, that is, 9843 models for training and 2468 for testing. In the test on unseen categories, we use the first 32 categories in the shape names file of Modelnet40 for training and the last 8 categories for testing. Coincidentally, the resulting ratio of the training set to the testing set is close to 8:2, with 9907 training models and 2404 test models. We did not use the half-and-half data split provided by Modelnet40, because we added a new processing step to the original dataset, namely the separation of a point cloud patch: we arbitrarily select a point inside a point cloud and exclude its nearest k points to construct the original partial point clouds. Here, k is set to 256. This augmentation makes the training of the baseline algorithms more difficult than on clean data. Besides, our algorithm employs a GAN structure, and the final generation quality improves with more samples [60,61]. Consequently, we adjust the ratio of the training set to the testing set to 8:2. For training and testing VPGnet, 1024 points were uniformly sampled from each Modelnet40 sample. We applied the augmentation strategy to all sampled point clouds: a rotation about each coordinate axis by a randomly selected angle within [0, 45°] and a translation along each axis by a distance drawn from [−0.5, 0.5].
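The patch separation and pose augmentation can be sketched as follows. Several details are assumptions made for illustration: the seed point is counted among the k removed points, the three axis rotations are composed in z-y-x order, and all function names are hypothetical.

```python
import numpy as np


def remove_patch(points, k=256, rng=None):
    """Drop a randomly chosen seed point and its nearest neighbours (k points in total)."""
    rng = np.random.default_rng() if rng is None else rng
    seed = points[rng.integers(len(points))]
    order = np.argsort(((points - seed) ** 2).sum(axis=1))   # nearest first
    return points[order[k:]], points[order[:k]]              # (partial cloud, missing patch)


def random_rigid_transform(rng, max_angle_deg=45.0, max_shift=0.5):
    """Random rotation about each axis in [0, max_angle] and shift in [-max_shift, max_shift]."""
    ax, ay, az = np.deg2rad(rng.uniform(0.0, max_angle_deg, size=3))
    Rx = np.array([[1, 0, 0], [0, np.cos(ax), -np.sin(ax)], [0, np.sin(ax), np.cos(ax)]])
    Ry = np.array([[np.cos(ay), 0, np.sin(ay)], [0, 1, 0], [-np.sin(ay), 0, np.cos(ay)]])
    Rz = np.array([[np.cos(az), -np.sin(az), 0], [np.sin(az), np.cos(az), 0], [0, 0, 1]])
    t = rng.uniform(-max_shift, max_shift, size=3)
    return Rz @ Ry @ Rx, t
```

With 1024 sampled points and k = 256, each partial cloud keeps 768 points, and the removed patch serves as the ground-truth missing part for the self-supervised training of VPGnet.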

Metrics
We evaluate the network framework with five registration metrics: MAE, MSE, RMSE, R_loss, and T_loss. Equations (19)-(21) show the calculation of the first three metrics, which evaluate the distance between two vectors. $M$ is the length of the two vectors, and $x_i$, $y_i$ are their corresponding elements. The smaller the value, the better the registration. We adopt the $L_2$ norm between the ground-truth rigid transformation parameters and the predicted results to evaluate the accuracy of the rotation and translation. The calculation of R_loss and T_loss is shown in Equations (22) and (23), where $R_{pre}$ and $t_{pre}$ are the predicted rotation and translation, and $R_{gt}$ and $t_{gt}$ are the corresponding ground truth. Finally, Reg_loss is defined as the sum of R_loss and T_loss. All angular measurements in our results are in radians.
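The first three metrics can be sketched with their standard definitions, which are assumed here to match Equations (19)-(21); the inputs are two equal-length sequences, for example predicted and ground-truth Euler angles or translation components.

```python
import math


def mae(x, y):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def rmse(x, y):
    """Root mean squared error, i.e. the square root of the MSE."""
    return math.sqrt(mse(x, y))
```

Because RMSE squares the errors before averaging, a single badly registered axis raises it more than it raises MAE, which is why the two are reported side by side.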

Baseline Algorithms
To evaluate the proposed network framework more comprehensively, this section divides the baseline algorithms into two categories. The first consists of the most representative traditional algorithms, including ICP [4], Generalized ICP [36], Point-to-Plane ICP [9], and Fast Global Registration (FGR) [62]; the second consists of state-of-the-art deep learning-based algorithms proposed in recent years, including OMNet [12], PointnetLK [16], DCP [22], and RPMnet [41]. All networks are trained on an NVIDIA Tesla V100 GPU and tested on an AMD Ryzen 7 4800H CPU.

Traditional Algorithms
For the traditional methods, we first choose a feature-based registration algorithm, namely Fast Global Registration (FGR) [62]. It uses the Fast Point Feature Histograms (FPFH) of the point clouds to return corresponding point pairs with similar geometric structures. The others are ICP [4] and its variants GO-ICP [10] and Point-to-Plane ICP [9]. As a classical point cloud registration algorithm, ICP can complete the registration task accurately provided that a good initial value is available. GO-ICP tries to avoid ICP's tendency to fall into local optima by employing the branch-and-bound method to search for the optimum over the global range. Point-to-Plane ICP (ICP-plane) changes the definition of distance from point-to-point to point-to-plane. The implementations of ICP, ICP-plane, and FGR are available in Open3D. GO-ICP is called from the pygoicp library, with the DT size and Factor parameters set to 300 and 2.0. ICP and its variant ICP-plane are initialized with the identity transformation, and the distance threshold is set to 1.

Deep Learning Algorithms
The deep learning algorithms we choose are PointnetLK [16], DCP [22], RPMnet [41], and OMNet [12]. As the first deep learning-based registration algorithm, the strategy that uses MLP to extract point cloud features for pose estimation in PointnetLK became attractive after being proposed. This algorithm is compared by many papers [12,22,40], so we chose it as the first baseline algorithm belonging to deep learning. Besides, DCP removes the relevant calculation of Lie algebra in PointnetLK and applies the Transformer to extract hybrid features. Then, the rotation matrix and translation vector are estimated by SVD for corresponding point pairs. As an advanced algorithm for applying the attention mechanism to the registration task, the DCP algorithm expresses competitive performance on the Modelnet40 dataset, so we treat it as the second baseline algorithm based on corresponding point pairs. Moreover, RPMnet utilizes a subnetwork to predict the annealing parameters according to the PPF feature. Then, a sinkhorn normalization is concatenated to the match matrix module, thus outputting a matching matrix. Finally, OMNet is proposed to specially deal with partially overlapping registration tasks with the critical mask prediction module. Although pre-trained models of the above networks are delivered by the original authors, the division of training and testing set of those models is not consistent with the design in this paper. Therefore, we retrained other all deep learning methods with the same dataset as ours. For a fair comparison, we use the parameter values recommended in the official introduction of baseline algorithms to ensure baseline algorithms achieve the best effect. Some important parameters of all deep learning methods used in training and testing are introduced in Table 1. Note that DCP does not employ an iterative strategy. Thus, it does not contain the Iter_num parameter.  
Table 2 shows the statistical results of the registration indicators of all algorithms on point clouds of unseen categories. For comparative purposes, we define the relative error rate to normalize indicators of different orders of magnitude. It is calculated as $\varepsilon = |M_1 - M_2| / M_1$. As shown in Table 2, the accuracy of the deep learning methods significantly exceeds that of the traditional algorithms: the average relative error ratios of RMSE(R) and RMSE(t) are reduced by 54.21% and 32.40%, respectively. Thanks to the high-dimensional feature maps extracted by deep learning methods, the calculation of corresponding points is more accurate than in the traditional algorithms. Our algorithm is competitive in accuracy with all the deep learning-based algorithms. Compared with DCP and PointnetLK, the average relative error ratio of our algorithm in registration loss is reduced by 68.25% and 80.68%, respectively. We have to admit that our algorithm does have a certain gap with RPMnet and OMNet in some aspects. However, a detailed analysis shows that the differences are small enough to be acceptable. For example, in terms of MAE(R), our value is only 5.84 larger than that of RPMnet, that is, the average rotation error over the three rotation axes is only 0.10°. Relative to the differences between RPMnet and DCP (32.57) and between RPMnet and PointnetLK (22.17), the error rates reach 82.07% and 73.66%, respectively. Besides, for translation estimation, the disparity between ours and RPMnet becomes smaller. In numerical terms, both RMSE(t) values are equal to 0.16, and the MAE(t) of our method is 0.01 lower than that of RPMnet. Moreover, the subsequent experiments show that RPMnet is less robust than our algorithm. Our algorithm and OMNet both target partial registration, but they adopt totally different ideas.
Ours attempts to complete the missing part, while OMNet tries to mask the meaningless part. In the results, the difference in MAE(R) is 5.08 (0.08°); relative to the differences between OMNet and DCP and between OMNet and PointnetLK, the error rates reach 84.03% and 76.27%, respectively, which shows that the gap in MAE(R) between ours and OMNet is far less sharp than the gaps for DCP and PointnetLK. In summary, our algorithm achieves competitive performance in the unseen-category tests. We infer that the self-supervised VPRNet first generates virtual corresponding points with the Transformer and Self-Attention, which compensates for the negative impact of the incompleteness of the point clouds. The visualization of samples after registration is shown in Figure 3. Colors represent the distance to the closest point: the closer to blue a point is, the closer its closest point in the opposite point cloud. The histogram on the right shows the proportions of the different colors. It is worth noting that OMNet needs a ground-truth pose matrix to calculate the overlapping mask, so its registration parameters cannot be calculated from the residual clouds alone. Therefore, the registration of OMNet is excluded from the results in Figure 3. The case of unseen categories is shown in sub-figure (a) of Figures 3 and 4. It can be clearly seen from the figure that the color of our registration result tends toward blue. Besides, there is no obvious visual difference between the registration results of our algorithm and RPMnet, despite RPMnet's lead in the numerical results.
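The relative error rate used throughout this comparison can be reproduced with a few lines of Python. The function name is hypothetical, and taking the larger gap as the reference value M_1 is an assumption; it is the reading that reproduces the 82.07% and 73.66% quoted above for the MAE(R) gaps.

```python
def relative_error_rate(m1, m2):
    """Relative error rate eps = |m1 - m2| / m1, with m1 as the reference value."""
    if m1 == 0:
        raise ValueError("reference value m1 must be non-zero")
    return abs(m1 - m2) / m1

# MAE(R) gaps quoted in the text: ours vs. RPMnet is 5.84, while
# RPMnet vs. DCP is 32.57 and RPMnet vs. PointnetLK is 22.17.
gap_ours = 5.84
eps_dcp = relative_error_rate(32.57, gap_ours)
eps_pointnetlk = relative_error_rate(22.17, gap_ours)
print(f"{eps_dcp:.2%}, {eps_pointnetlk:.2%}")  # prints "82.07%, 73.66%"
```

Because the rate is normalized by M_1, it lets indicators of very different magnitudes, such as rotation and translation errors, be compared on one scale.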

Robustness Test
The following three experiments tested the resistance of the proposed algorithm and baseline algorithms to noise, sparsity levels, and initial rotation angles.

Noise Test
We randomly sampled jittering noise from N(0, 0.002) and clipped it to [−0.05, 0.05]. All the deep learning-based algorithms were retrained with noisy point clouds. The results are shown in Table 3. The registration and completion results on noisy data are summarized in Figures 3b and 4b, which show that the proposed algorithm can still complete and register point clouds under the influence of noise, with no great visual deficiency. The precise analysis of the noisy data is given below.

Table 3. Results on noisy point clouds in ModelNet40. Bold numbers are the smallest in the current column and represent the best performance. Lower is better.

Our algorithm is in the front rank on all measurements. Table 3 shows that our algorithm has a significant lead in rotation estimation over PointnetLK and DCP, as demonstrated by relative error rates of MAE(R) of 83.04% and 80.24%. In addition, our algorithm is still not far behind RPMnet: it trails RPMnet by only 2.00 on MAE(R), that is, the average rotation error over the three rotation axes is only 0.03°. Compared with the 28.80 and 34.31 of the DCP and PointnetLK algorithms, the error rates are reduced by 93.06% and 94.17%, respectively. Additionally, the translation estimation of VPRNet is better than that of RPMnet; for example, the MAE(t), MSE(t), and RMSE(t) of VPRNet are all 0.01 lower than those of RPMnet. Compared with OMNet, there are still some gaps; for example, the error rates of RMSE(R) and T_loss are 53.45% and 75.50%, respectively. Nevertheless, these gaps are smaller than on the clean data. For MAE(R), the error between our algorithm and OMNet under noise interference is reduced by 22.05% compared with that under clean data, and for T_loss it is reduced by 10.00%.
Therefore, our algorithm is closer to the advanced OMNet algorithm on noisy data than on clean data. Besides, compared with clean data, the error rate of our algorithm's Reg_loss is reduced by 12.24%, which makes ours the only deep learning method whose registration results under noise interference are better than those on clean data. In rotation estimation especially, the relative error rate of the MAE(R) of RPMnet is 10.70 times higher than on clean data. OMNet, like RPMnet, also performs worse under noise than on clean data; for example, the change error rate of its MAE(R) between clean and noisy data is 54.92%. Consequently, our method demonstrates competitive robustness among all deep learning methods on noisy data. As for the underlying reasons, we infer that the extracted high-dimensional embeddings contain some wrong position information, which biases the virtual points and the estimated registration parameters. However, the feature fusion of the Transformer and Self-Attention in VPGnet and Regnet can still focus on the more relevant parts of noisy data. Thus, these two modules reduce the impact of global noise on completion and registration.
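The noise model described at the start of this test, Gaussian jitter clipped to [−0.05, 0.05], can be sketched as below. The text does not say whether 0.002 in N(0, 0.002) is a standard deviation or a variance; the sketch assumes a standard deviation, and the function name `jitter_points` and the toy cloud are hypothetical.

```python
import random

def jitter_points(points, sigma=0.002, clip=0.05, seed=None):
    """Add clipped Gaussian noise independently to every coordinate.

    sigma is treated as a standard deviation (an assumption); each offset
    is clipped to [-clip, clip] before it is added.
    """
    rng = random.Random(seed)
    noisy = []
    for point in points:
        noisy.append(tuple(
            c + max(-clip, min(clip, rng.gauss(0.0, sigma))) for c in point
        ))
    return noisy

cloud = [(0.0, 0.0, 0.0), (0.5, -0.25, 1.0), (1.0, 1.0, 1.0)]
noisy_cloud = jitter_points(cloud, seed=0)
```

With a standard deviation of 0.002 the clipping bound is 25 standard deviations away, so it only acts as a safeguard; it would matter more if 0.002 were read as a variance or if a larger noise level were used.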

Sparsity Level Test
Subsequently, we tested the influence of different sparsity levels on the predicted rotation and translation metrics. We first performed farthest point sampling (FPS) on the two original point clouds and retained four sparsity levels: 0.5, 0.25, 0.125, and 0.0625. The statistical performance of all algorithms under different sparsity levels is shown in Figures 5 and 6. The registration and completion results of sparse point clouds are shown in Figures 3c and 4c. As can be seen from the figures, only our algorithm and RPMnet can register sparse and partial point clouds, and our algorithm can complete point clouds with a sparsity level of 0.5. To show the impact of increasing sparsity intuitively, we set the x-axis to 0.5 − x, where x is the sparsity level. Although the point clouds become sparser, our algorithm can still alleviate the limitation of sparseness and guarantee registration quality. The detailed analysis is as follows.
We can see from Figure 5 that, no matter how the sparsity level changes, the predicted rotation and translation errors of our algorithm consistently rank among the best of all methods. Among the traditional algorithms, only the rotation estimation of ICP is close to ours. ICP surpasses our algorithm when the sparsity level is 0.0625, but the average error rate between the two algorithms is only 3.90%. Meanwhile, the minimum average error rate of MAE(R) between the remaining traditional algorithms and our algorithm is 54.48%. Therefore, we conclude that our algorithm is ahead of all the traditional algorithms in the accuracy of rotation estimation. Among the deep learning methods, our algorithm, RPMnet, and OMNet always maintain a leading position. Notably, our algorithm performs significantly better on MAE(R) than DCP and PointnetLK, with average error rates across sparsity levels of 68.07% and 61.75%, respectively. Our algorithm still has a gap in overall accuracy with respect to RPMnet, reflected in an average error rate of 49.77%, which is a focus for future improvement. Fortunately, the mean gap expressed in degrees is only 0.09°. Besides, the variance of the MAE(R) of our algorithm across sparsity levels is 10.39, which is close to the 12.22 of RPMnet, and the average error rate is only 14.98%. In particular, for registration problems with a sparsity level greater than 0.25, the average variation of our algorithm is 3.68, which is close to the 3.70 of RPMnet. Moreover, at the extreme sparsity level of 0.0625, the MAE(R) of our algorithm increases by 4.31 compared with the previous sparsity levels, which is lower than the 4.66 of RPMnet. Compared with the previous sparsity level, the error growth rate of our algorithm under extremely sparse conditions is 38.90%, while the error growth rates of RPMnet and OMNet are 73.71% and 49.47%, respectively.
Therefore, the above data indicate that our algorithm shows outstanding robustness among deep learning-based methods, especially under extremely sparse conditions. We infer that the additional virtual point completion in this paper lets source points without ground-truth correspondences produce virtual corresponding points, thus making up for the shape information lost as the point cloud becomes sparser. As for the translation estimation in Figure 6, except for PointnetLK and GO-ICP, the estimated values of the algorithms are relatively close, as confirmed by the 0.01 variance of the mean value across sparsity levels. Nevertheless, our algorithm is still the third-best performer, with an average of 0.13. Although RPMnet ranks highly with an average value of 0.08, its overall variance is 0.001 higher than ours; in other words, our translation estimation is more stable than that of RPMnet. Finally, the translation estimation error of our algorithm is only 0.09 higher than that of OMNet across sparsity levels. We attribute this performance to the VPG module, which enriches the shape information of partial point clouds: more conjugate point pairs provide a stronger guarantee for translation estimation.
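The sparsification step used in this test can be sketched with a plain greedy farthest point sampling routine. This is a generic textbook version, not the authors' implementation; the function name, the choice of the first point, and the reading of the sparsity level as the fraction of points retained are assumptions.

```python
import math

def farthest_point_sampling(points, ratio, start_index=0):
    """Keep about ratio * len(points) points, chosen greedily to be far apart."""
    if not points:
        return []
    n_keep = min(len(points), max(1, round(ratio * len(points))))
    selected = [start_index]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start_index]) for p in points]
    while len(selected) < n_keep:
        # The next sample is the point farthest from the current selection.
        next_index = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(next_index)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[next_index]))
    return [points[i] for i in selected]

# A 4 x 4 planar grid (16 points) sampled at the four sparsity levels of the test.
grid = [(float(x), float(y), 0.0) for x in range(4) for y in range(4)]
levels = (0.5, 0.25, 0.125, 0.0625)
sizes = [len(farthest_point_sampling(grid, level)) for level in levels]
```

On the 16-point grid the four levels keep 8, 4, 2, and 1 points. Because each new sample maximizes the distance to the points already chosen, the overall shape stays covered even at the lowest level, unlike uniform random subsampling.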

Initial Rotation Angles Test
Following the suggestion of FMR, we evenly divided the initial rotation angle range of 0-60° into six groups with an interval of 10°. Then, we calculated the indicators of predicted rotation in each group to explore the robustness of the algorithms to different initial rotation angles. The registration statistics are shown in Figure 7, where the broken lines of different colors represent the performance of different algorithms at various initial rotation angles. Figures 3d and 4d show the registration and completion results with an initial rotation angle of 30-40°. It can be seen from the two figures that the completion and registration of our algorithm at this initial rotation angle are visually reasonable.
Overall, the prediction errors of all algorithms rise sharply as the initial rotation angle increases, because the overlapping region between the point clouds shrinks as the rotation angle grows. Besides, the FPFH feature used in the FGR algorithm is rotation-sensitive, which weakens its registration ability under different initial rotation angles. As can be seen from Figure 7, although our algorithm lags slightly behind OMNet, it is ahead of the other algorithms at all initial rotation angles, with a mean value of 11.04. For a more detailed analysis, we divide the initial rotation angles into small angles of 0-40° and large angles of 40-60°. For small-angle registration, the average MAE(R) of our algorithm drops to 5.10, which is still ahead of the other algorithms; for example, the relative error rate with respect to RPMnet is 58.27%. This indicates that our method is well suited to registration with a large overlap rate. However, at rotation angles of 40-60°, the MAE(R) of all the deep learning methods rises, with different degrees of steepness. PointnetLK is the most seriously affected: its average error at initial rotation angles of 40-60° is 193.36% higher than its average at 0-40°. Our algorithm keeps its lead over the algorithms other than OMNet even at large rotation angles: the relative error rate between the MAE(R) of ours and that of RPMnet at 40-60° is 27.63%. Moreover, the average amplitude of RPMnet at large rotation angles is 12.43, which is greater than our 11.97. Therefore, the proposed algorithm achieves more stable and accurate performance than all methods except OMNet in terms of resistance to various initial rotation angles.
A visible gap with OMNet can still be observed in Figure 7, but the average discrepancy between the two is only 0.06° at large rotation angles and 0.17° at small rotation angles. Consider in particular the case where the initial rotation angle is 50-60°. Here, the error increase rate of our algorithm is 57.14%, which is lower than the 88.99% of OMNet. Therefore, our method achieves accuracy close to that of OMNet and is more robust to different initial rotation angles. Notably, its stability even exceeds that of OMNet in the extreme case of initial rotation angles of 50-60°. Our interpretation is that the VPG module enriches the corresponding point pairs after completing the missing points. In addition, the Transformer in Regnet takes the position information of the opposite cloud into account, so the corresponding points generated by the registration network are keenly aware of position changes of the opposite cloud.
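The grouping used in this test, six bins of 10° over 0-60° with an error indicator averaged per bin, can be sketched as follows. The function name is hypothetical and the angle and error values are made-up illustrative numbers, not results from the paper.

```python
from collections import defaultdict

def mean_error_by_angle_bin(initial_angles_deg, errors, bin_width=10.0, max_angle=60.0):
    """Average an error metric within consecutive initial-rotation-angle bins."""
    n_bins = int(max_angle // bin_width)
    grouped = defaultdict(list)
    for angle, error in zip(initial_angles_deg, errors):
        if not 0.0 <= angle <= max_angle:
            continue  # outside the tested range
        # The upper edge (exactly max_angle) is folded into the last bin.
        index = min(int(angle // bin_width), n_bins - 1)
        grouped[index].append(error)
    return {
        (i * bin_width, (i + 1) * bin_width): sum(v) / len(v)
        for i, v in sorted(grouped.items())
    }

# Illustrative values only: initial angles in degrees and a per-sample rotation error.
angles = [5.0, 8.0, 35.0, 52.0, 60.0]
errors = [1.0, 3.0, 6.0, 10.0, 14.0]
per_bin = mean_error_by_angle_bin(angles, errors)
```

Averaging the per-bin values over the 0-40° and 40-60° ranges then gives the small-angle and large-angle summaries discussed above.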

Ablation Study
We conduct several ablation studies in this section to dissect VPRNet. Specifically, we replace important modules with alternatives to better understand how the various components influence the performance of the proposed algorithm. All experiments are performed in the same setting as the experiments in Section 4.2.

Without VPGnet
Firstly, we exclude VPGnet and retain only Regnet to test the effectiveness of our completion subnetwork. The network is retrained according to Section 3.5, and the number of iterations is 1. The comparison between the retrained network and the original VPRNet is shown in Table 4. As seen from the table, VPRNet with GAN has lower rotation and translation errors than the network without GAN. In particular, R_Loss and t_Loss are 55.56% and 66.67% lower than those of the network without the VPG module. Therefore, the network structure we designed, which completes before registering, plays a positive role in registering partial point clouds.

Next, we exclude the Transformer module in Regnet and explore its significance for feature fusion. The new hybrid features are extracted from the source point cloud and the target point cloud independently, with no communication between the feature information. The results are shown in Table 5. From Table 5, we can see that Regnet without the Transformer module is inferior to the original one in both the rotation and the translation measurements. Especially in rotation estimation, the relative error ratios ε of our RMSE(R) and MAE(R) are reduced by 30.45% and 30.72%, respectively. We infer from the data that the Transformer provides not only the shape information of the opposing point cloud but also its position information. Combining one point cloud's feature map with the other's makes the matching of corresponding points more accurate.

Change Iteration Numbers
Finally, we tested the influence of different numbers of iterations in Regnet on the registration results; the specific results are shown in Table 6. As can be seen from the table, the best performance is obtained with 3 iterations, so we use 3 iterations in the final registration network. Too many iterations make the registration network rely too heavily on the training data and generalize worse to the test data, so a moderate number of iterations preserves generalization ability.

Discussion
From the above extensive experiments, it can be concluded that VPRNet is a novel and competitive registration algorithm for partial registration tasks. The main findings are summarized below.

Generalizability Test
We tested the registration on point clouds of unseen categories by dividing the dataset into training and testing sets according to category. The experiment shows that the accuracy of the deep learning methods significantly exceeds that of the traditional algorithms. Specifically, our algorithm ranks high in registration accuracy among all algorithms and is close to the better-performing RPMnet and OMNet. Therefore, we believe that our algorithm is competitive in generalization ability.

Noise Test
We added N(0, 0.002) jittering noise to the original data to test the robustness of the baselines and our method to noise. The results in Table 3 show that the accuracy of the proposed algorithm under noise is still ahead of DCP and PointnetLK. Although it is slightly inferior to OMNet and RPMnet, the gap to these two methods is smaller than on clean data. Moreover, it is the only method that produces more accurate registration under noise interference than on clean data. Therefore, the proposed algorithm shows advanced robustness in point cloud registration under noise interference.

Sparsity Test
We performed FPS on the original data with different ratios to construct data with different sparsity levels. In the sparsity test, our algorithm, RPMnet, and OMNet maintain a leading position at all sparsity levels. Specifically, our algorithm shows the best robustness and stability among deep learning-based methods under extremely sparse conditions.

Initial Rotation Angle Test
We divided the initial rotation angles into six groups at an interval of 10° and tested the influence of different rotation angles on the registration results. Experiments show that our algorithm has better results than RPMnet at all initial rotation angles. Specifically, our registration at large initial rotation angles of 50-60° is more stable than that of OMNet.

Ablation Study
We excluded VPGnet and the Transformer in turn in the ablation study to explore the role of each component in the network. Besides, we varied the number of iterations to determine which performs best. Experimental results show that VPGnet and the Transformer both contribute positively to the final registration, and that the registration is most accurate when the number of iterations is three.

Conclusions
We have proposed a novel neural network architecture called VPRNet to solve the partial-to-partial point cloud registration task. The network first generates virtual points to complete the partial point clouds via a self-supervised VPGnet. Then, an iterative Regnet is designed to estimate the registration parameters. Various experimental results obtained on ModelNet40 indicate that our algorithm holds a leading position in generality and robustness when compared with traditional and advanced deep learning algorithms. Therefore, we conclude that the proposed VPRNet achieves advanced performance for partial-to-partial registration. In the future, we plan to improve the algorithm in the following aspects: (1) We will add other loss functions and effective modules to improve the accuracy of the completion. (2) We will try to incorporate our method into a large system such as SLAM to ensure the completeness and accuracy of reconstructed scenes.

Acknowledgments: The public dataset used in this article is ModelNet40: http://modelnet.cs.princeton.edu/ (accessed on 1 January 2020).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: