Abstract
Transformers have recently become widely adopted in point cloud registration. Nevertheless, the Transformer is ill suited to handling dense point clouds directly because of resource constraints and the sheer volume of data. We propose a method for directly regressing the rigid relative transformation of dense point cloud pairs. Specifically, we divide the dense point clouds into blocks according to the down-sampled superpoints. During training, we randomly select point cloud blocks with varying overlap ratios, and during testing, we introduce the overlap-aware Rotation-Invariant Geometric Transformer Cross-Encoder (RIG-Transformer), which predicts superpoints situated within the common area of the point cloud pairs. The dense points corresponding to these superpoints are fed into the Transformer Cross-Encoder to estimate their correspondences. Through the fusion of our RIG-Transformer and the Transformer Cross-Encoder, we propose Transformer-to-Transformer Regression (TTReg), which leverages dense point clouds from overlapping regions for both the training and testing phases and computes the relative transformation of the dense points from the predicted correspondences without random sample consensus (RANSAC). We have evaluated our method on challenging benchmark datasets, including 3DMatch, 3DLoMatch, ModelNet, and ModelLoNet, demonstrating an improvement of up to 7.2% in registration recall. The improvements are attributed to our RIG-Transformer module and regression mechanism, which make the features of superpoints more discriminative.
1. Introduction
Point cloud registration is a critical research area within the realms of computer vision and robotics, serving a pivotal function in diverse applications including 3D object reconstruction, scene comprehension, and robotic manipulation [,]. Achieving precise alignment of point clouds enables the amalgamation of data from varied sources, thereby supporting activities such as environmental modeling, object identification, and augmented reality. Enhancing the efficiency and precision of point cloud registration algorithms empowers researchers to elevate the performance of autonomous systems, robotic perception, and augmented reality applications, consequently driving progress across sectors spanning industrial automation to immersive virtual reality.
Recently, there has been a notable increase in research within the domain of point cloud registration focusing on deep learning methodologies. These innovative strategies utilize neural networks to directly acquire descriptions from 3D points, eliminating the necessity for manual feature engineering and tackling issues like varying point density and noise. Fully Convolutional Geometric Features (FCGF) [] is a deep learning method that seeks to extract geometric features directly from point clouds. Through the application of fully convolutional neural networks, FCGF can effectively capture both local and global geometric details, facilitating precise point cloud registration amidst noise and partial overlap. FCGCF [] incorporates color data from point clouds into the FCGF network structure, merging geometric structural details with color features for enhanced representation. By fusing geometric and color information, the feature descriptors are better able to distinguish points with high similarity in three-dimensional geometric structure. UDPReg [] proposes a distribution consistency loss function based on a Gaussian mixture model to supervise the network in learning its posterior distribution probabilities. It combines this approach with the Sinkhorn algorithm [] to handle partial point cloud registration, aiding the network in extracting discriminative local features. Through unsupervised learning, UDPReg achieves label-free point cloud registration. GeoTransformer [] introduces a method to extract global geometric features from the position coordinates of superpoints. It presents a geometric Transformer for learning global features and introduces the overlap circle loss function, treating superpoint feature learning as metric learning. By combining this approach with the Sinkhorn method, GeoTransformer achieves point cloud registration without the need for RANSAC []. RoITr [] introduces a network based on the Transformer architecture utilizing channel-shared weights to leverage the global properties of the Transformer. Building upon the GeoTransformer framework, it embeds geometric features from self-attention modules into cross-attention modules to achieve rotation invariance in the Transformer structure. RegTR [] utilizes a superpoint correspondence projection function to directly constrain the features interacting with the Transformer Cross-Encoder and the voxelized superpoint coordinates. This method replaces RANSAC and directly regresses the relative transformation matrix. RORNet [] divides point clouds into several small blocks and learns the latent features of overlapping regions within these blocks. This approach reduces the feature uncertainty caused by global contrast and subsequently selects highly confident keypoints from the overlapping regions for point cloud registration. HR-Net [] introduces a dense point matching module to refine the matching relationships of dense points and utilizes a recursive strategy to globally match the superpoints of point clouds and locally adjust dense point clouds layer by layer, thereby estimating a more accurate transformation matrix. RoReg [] addresses the point cloud registration challenge by focusing on oriented descriptors and local rotation techniques. The oriented descriptors are decomposed into rotation-equivariant and rotation-invariant components.
Equivariance requires that descriptors transform consistently with changes in the relative point positions within the point cloud, whereas invariance ensures that registration outcomes are insensitive to changes in scale, rotations, or translations of the point cloud. A local rotation approach is devised to integrate rough rotations for significant angle adjustments with precise rotations for minor angle variations, aiming to ascertain the optimal rotation amount and improve registration precision.
Combining the 3D coordinates and features of superpoints, RegTR [] employs a Transformer to directly perform global information interaction on superpoints. However, the coordinates of superpoints are sparse, and the computation on superpoints is voxelized around the centers of point cloud blocks, introducing errors in superpoint coordinates, especially for point clouds with small areas of overlap. We seek to leverage the global properties of the Transformer to extract and incorporate global information from dense point clouds. Nevertheless, owing to the limitations of the Transformer in terms of data length and computational resources, directly processing dense point clouds is not feasible. Through multiple experiments and data analysis, we discovered that the similarity between the neighborhoods of points outside the overlapping region and those inside the overlapping region has a significant influence on point cloud registration. Points with a uniform local structure contribute little to registration, even when they lie within the overlapping region. Therefore, it is crucial to select the overlapping region, and within it the features with higher discriminative power, to enhance registration effectiveness. Drawing inspiration from previous studies [,,], we divide the point cloud registration procedure into two distinct stages. First, we leverage the Transformer's global modeling capability to differentiate the overlapping and non-overlapping zones, thereby converting the point-to-point matching challenge into a classification task across these areas. Second, we select representative dense keypoints within the overlapping region and use a Transformer Cross-Encoder to directly regress the relative transformation.
2. Materials and Methods
2.1. Problem Setting
Our objective is to utilize dense point clouds to compute the relative rigid transformation matrix between a pair of point clouds and by minimizing Equation (1), defined as follows:
where is the set of predicted dense correspondences, is a pair of corresponding points, and denotes the norm.
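For clarity, the objective in Equation (1) can be illustrated with a short sketch that evaluates the summed squared distances of the predicted dense correspondences under a candidate rigid transform; the snippet below is illustrative only, and the variable names `R`, `t`, `src`, and `tgt` are our own rather than the paper's notation.

```python
import torch

def registration_residual(R: torch.Tensor,       # (3, 3) rotation
                          t: torch.Tensor,       # (3,) translation
                          src: torch.Tensor,     # (N, 3) points of the first cloud
                          tgt: torch.Tensor) -> torch.Tensor:  # (N, 3) matched points of the second cloud
    """Sum of squared Euclidean distances over the predicted dense correspondences."""
    transformed = src @ R.T + t                   # apply the candidate rigid transform
    return ((transformed - tgt) ** 2).sum()
```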
2.2. Overview of Our Method
Our approach, named TTReg, utilizes a global Transformer to select dense correspondences related to sparse superpoints within the common area to estimate the transformation (see Figure 1). TTReg consists of an encoder–decoder feature extraction module, a sparse superpoint matching module, and a dense point matching module (see Figure 2). The encoder–decoder utilizes the KPConv [] backbone as a feature extraction module and computes downsampled points at different levels (Section 2.3). The sparse superpoint matching module utilizes our RIG-Transformer to select matching superpoints located in overlapping regions to generate dense point clouds (Section 2.4). We partition dense points and superpoints into spatially clustered blocks. During training, we randomly select point cloud blocks with varying overlap ratios, and during testing, we choose dense points corresponding to superpoints selected by RIG-Transformer. Then, the dense point matching module directly regresses the correspondences of the input dense point clouds, enabling the computation of the relative transformation between dense point cloud pairs (Section 2.5). We introduce different loss functions to supervise the superpoint matching module and the dense point matching module to learn the correspondences and predict the transformation (Section 2.6). Our contributions are summarized as follows:
Figure 1.
Our TTReg predicts dense correspondences in the overlap region and estimates the transformation of point clouds with regions of low overlap. Points in red and green represent point clouds and , respectively, and gray lines represent the relationship of correspondences.
Figure 2.
Overview of our TTReg architecture. and are features of superpoints and . and represent features of dense points and . Our RIG-Transformer serves as the superpoint matching module for selecting the optimal matching superpoint pairs within the overlap area. The point matching module encodes the feature and of dense points and corresponding to , and predicts the dense correspondences . Finally, the relative transformation matrices and are calculated utilizing dense correspondences .
- We propose a Rotation-Invariant Geometric Transformer Cross-Encoder module (RIG-Transformer) that combines the geometric features and positional encoding of superpoint coordinates to extract more distinctive features for predicting superpoints located in the overlapping region.
- Through the fusion of our RIG-Transformer and Transformer Cross-Encoder, we introduce a Transformer-to-Transformer dense regression (TTReg) that leverages dense point clouds from overlapping regions for both training and testing phases to compute the transformation matrix.
- Through extensive experiments, our method showcases strong matching capabilities on the public 3DMatch and ModelNet benchmarks, with a notable improvement of 7.2% in matching recall on datasets with small overlap ratios.
2.3. Feature Extraction and Correspondences Sampling
We utilize KPConv [] as our feature extractor. The original point cloud pairs, represented as and , are voxelized to calculate the downsampled 3D points and , where and denote the number of 3D points obtained at each convolutional layer. Unlike Farthest Point Sampling (FPS) [,], we calculate the centroids of adjacent points within a voxel radius to derive the downsampled point clouds and . These downsampled point clouds are then utilized for feature extraction in the subsequent KPConv layers, resulting in feature representations and .
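As an illustration of the grid-based downsampling described above (centroids of neighboring points within a voxel, in contrast to FPS), the following is a minimal sketch; it is a simplified stand-in for the KPConv grid subsampling, and all names are assumptions.

```python
import torch

def grid_subsample(points: torch.Tensor, voxel_size: float) -> torch.Tensor:
    """Average all points that fall into the same voxel cell."""
    coords = torch.floor(points / voxel_size).long()                   # (N, 3) integer voxel indices
    _, inverse = torch.unique(coords, dim=0, return_inverse=True)      # voxel id per point
    num_voxels = int(inverse.max().item()) + 1
    sums = torch.zeros(num_voxels, 3, dtype=points.dtype).index_add_(0, inverse, points)
    counts = torch.zeros(num_voxels, dtype=points.dtype).index_add_(
        0, inverse, torch.ones(len(points), dtype=points.dtype))
    return sums / counts.unsqueeze(1)                                  # (M, 3) voxel centroids
```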
The architecture for 3DMatch and 3DLoMatch illustrated in Figure 2 is adopted. We apply a three-layer strided KPConv convolutional structure to the original point cloud, involving three downsampling steps. Conversely, we perform only two upsampling steps during the upsampling stage, resulting in one fewer upsampling step than downsampling steps. This is because the point clouds are dense and require uniform voxelization to adapt to our correspondence loss function. The choice of upsampling steps follows the settings in Predator []. For the ModelNet and ModelLoNet datasets, a two-layer downsampling and one-layer upsampling encoder–decoder structure is employed.
To illustrate the sampling and aggregation method between sparse superpoints and dense point clouds used in our architecture of 3DMatch and 3DLoMatch (as shown in Figure 2), we first perform downsampling and feature extraction using the KPConv network. This process yields the lowest-level sparse 3D point cloud superpoints and , along with their corresponding features and . We adopt the data grouping method proposed in [,], where each superpoint serves as the center of a circle to divide the dense point clouds and into 3D data blocks. The Euclidean distances between the dense points in and , as well as and , are computed. The dense points that are closest to the superpoints are assigned to the corresponding data blocks. The grouping method for mapping the dense points and to superpoints is defined by Equation (2):
where represents a superpoint obtained by downsampling the point cloud , and denotes a dense 3D point of that needs to be grouped. The symbol denotes the Euclidean distance of 3D points. The same grouping strategy is applied to the other point cloud .
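The point-to-node grouping of Equation (2) amounts to assigning every dense point to its nearest superpoint. A minimal sketch, with illustrative names, is given below.

```python
import torch

def group_to_superpoints(dense: torch.Tensor,       # (N, 3) dense points
                         superpoints: torch.Tensor  # (M, 3) downsampled superpoints
                         ) -> torch.Tensor:
    """Return, for each dense point, the index of its nearest superpoint (its block)."""
    dists = torch.cdist(dense, superpoints)          # (N, M) pairwise Euclidean distances
    return dists.argmin(dim=1)                       # (N,) block assignment
```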
After the grouping process of dense 3D point clouds, as shown in Figure 3, we calculate the neighborhood points for each superpoint and by considering the points in and . For each superpoint, we compute the distance to its farthest neighboring point, which is used to measure the overlap region. By applying the relative transformation, we transform the superpoints and the dense points from point clouds into the coordinates, denoted as , from . We then measure the overlap between superpoint pairs and . The overlap is determined based on whether the dense points , are contained within the overlap region of superpoints , or not. The threshold for the overlapping region is represented as , which is used to select the dense point pairs. The selected superpoints in the overlap region of and are denoted as and , respectively, and the dense points related to and are denoted as and , respectively. We select dense points around superpoints based on the size of the overlapping region of the aligned point clouds. Dense point block pairs with larger overlapping regions are chosen to train our network architecture. In Figure 3, dense point block pairs of (a) are considered to be located in the overlapping region due to a large overlap area, while those in (b) and (j) are discarded as they either have no overlap or a small overlap region.
Figure 3.
The selection of a superpoint and its corresponding dense points. (a) represents the selected dense matching point cloud block with a relatively large overlapping region, while (b) and (j) represent dense point cloud blocks with no or small overlapping region.
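The block-pair selection illustrated in Figure 3 can be sketched as follows; the threshold value and variable names are illustrative assumptions rather than the exact implementation.

```python
import torch

def block_pair_overlaps(block_p: torch.Tensor,      # (Np, 3) dense points of a block of the first cloud
                        block_q: torch.Tensor,      # (Nq, 3) dense points of a block of the second cloud
                        sp_q: torch.Tensor,         # (3,) the superpoint (block centre) of the second cloud
                        R_gt: torch.Tensor, t_gt: torch.Tensor,  # ground-truth pose, first -> second
                        tau: float = 0.3) -> bool:  # overlap-ratio threshold (illustrative value)
    """True if the warped block of the first cloud sufficiently overlaps the block of the second."""
    radius_q = torch.norm(block_q - sp_q, dim=1).max()        # distance to the farthest block member
    warped_p = block_p @ R_gt.T + t_gt                        # warp into the second cloud's frame
    inside = torch.norm(warped_p - sp_q, dim=1) <= radius_q   # warped points inside the block radius
    return bool((inside.float().mean() >= tau).item())
```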
2.4. Superpoint Matching Module
We propose a Rotation-Invariant Geometric Transformer Cross-Encoder module, referred to as RIG-Transformer. Figure 4a depicts the computational flowchart, and Figure 4b,c depict our RIG-self-attention and RIG-cross-attention, respectively. When incorporating geometric features and positional encoding, we start by adding and normalizing the input features, a slight deviation from the typical attention mechanism. We compute the geometric features and , as well as the positional encoding and of superpoints and . These values are then combined with the feature vectors and of the superpoints and fed into RIG-Transformer to calculate the interaction features of the point cloud pairs.
Figure 4.
Overview of our RIG-Transformer module: (a) depicts an overall computation of the RIG-Transformer module, (b,c) depict the RIG-self-attention and RIG-cross-attention.
For instance, considering the point cloud , we compute the feature vectors for RIG-self-attention, following the process and dimensions depicted in Figure 4. Utilizing the superpoint and feature vector extracted by KPConv [] as input, we not only calculate the initial attention [] components , , , but also compute weighted geometric feature encodings and . Subsequently, we derive the geometric branch feature and contextual feature based on the attention weights. The definitions of the individual variables are provided as follows:
where , represent the learnable weights for geometric features; , , and denote the learnable self-attention weights for the features of superpoints and ; and indicates the dimension of the features and of and . It is noteworthy that , , , , and share weights between point clouds and . Similarly, taking and as inputs to the self-attention module RIG-self-attention, we replace the corresponding input variables according to the computation methods in Equations (3)–(9) to calculate and for subsequent feature interactions.
We input the calculated , , , and from the above computations into RIG-cross-attention, as illustrated in Figure 4c. We compute the rotation-invariant geometric feature after the interaction, with the calculation of variables defined as follows:
where , and are the learnable shared weights for point clouds and . By swapping the input order of and , and in Figure 4c, we obtain the output features . As shown in Figure 2, the obtained and are further processed through the FFN module, which consists of two layers of linear transformation units, to facilitate multiple interactions and the fusion of features. The calculated features serve as new and . We iterate the computation of RIG-Transformer times to enhance the correlation between intrinsic and interaction features of the point clouds, resulting in the final output feature vectors and with geometric properties and rotation invariance.
To extract the optimal matching dense 3D points , during the training phase, we compute neighboring points from and and randomly select matching superpoints , followed by selecting densely related points for training our network. During the testing phase, we calculate the optimal matching superpoints and then select the densely related points in the dense point matching module. To select the best matching superpoints, we first normalize the features and mixed with geometric features and positional encoding. Subsequently, we compute the Gaussian correlation matrix for the normalized features, with the Gaussian correlation formula for a pair of superpoints defined as follows:
To enhance the discriminative feature matching of superpoints, we normalize the correlation matrix using bidirectional normalization [,,]. Bidirectional normalization refers to normalizing the feature correlation matrix in rows and columns separately. The formula is defined as follows:
After bidirectional normalization, we select the top K best-matching pairs of superpoints based on the matching scores and extract the related dense points. These dense matching points not only possess richer geometric features but are also located in overlapping regions. This reduces the impact on point cloud registration performance from points structurally similar but located outside the overlapping regions.
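The superpoint matching step described above (Gaussian feature correlation, bidirectional normalization, and top-K selection) can be sketched as follows; the bandwidth and K values are illustrative.

```python
import torch

def match_superpoints(feat_p: torch.Tensor,     # (M, d) normalized superpoint features of one cloud
                      feat_q: torch.Tensor,     # (N, d) normalized superpoint features of the other
                      sigma: float = 1.0, k: int = 32):
    # Gaussian correlation of every superpoint pair
    dist2 = torch.cdist(feat_p, feat_q) ** 2                      # (M, N) squared feature distances
    corr = torch.exp(-dist2 / (2.0 * sigma ** 2))
    # bidirectional (row- and column-wise) normalization of the correlation matrix
    scores = (corr / corr.sum(dim=1, keepdim=True).clamp_min(1e-12)) * \
             (corr / corr.sum(dim=0, keepdim=True).clamp_min(1e-12))
    # keep the top-K entries as candidate superpoint correspondences
    topk = torch.topk(scores.flatten(), k)
    rows = topk.indices // scores.shape[1]
    cols = topk.indices % scores.shape[1]
    return rows, cols, topk.values
```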
2.5. Point Matching Module
Because the computational and memory cost of the attention mechanism in a Transformer grows quadratically with the data length, existing methods are unable to process dense point clouds directly with a Transformer. Furthermore, numerous dense point pairs correspond to points in the overlapping areas. Thus, we choose the top K matching point pairs using the predicted superpoints, which carry global information. Subsequently, we index the dense 3D matching point pairs located in overlapping regions based on the relationship between superpoints and dense points. The predicted dense matching points are denoted as , containing the dense three-dimensional points and of point clouds and , along with their respective features and .
Since and are dense point cloud features extracted by the KPConv network, they have more accurate coordinates compared to sparse superpoints. We utilize a Transformer Cross-Encoder to interactively exchange information among the dense point features located within the overlapping regions outputted by the RIG-Transformer module. In the interaction, we fuse the positional encoding [,] of the respective dense point coordinates. The incorporation of positional encoding not only enhances the robustness of the features but also allows us to use them for subsequent prediction of corresponding point coordinates and determining if the predicted points reside in overlapping areas. Based on the predicted dense corresponding points, we calculate the transformation matrix.
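As a hedged illustration of the coordinate encoding fused with the dense features before the Transformer Cross-Encoder, the sketch below uses a standard sine–cosine scheme applied per axis; the frequency schedule and the additive fusion are assumptions.

```python
import torch
import torch.nn.functional as F

def sine_position_encoding(xyz: torch.Tensor, d_model: int) -> torch.Tensor:
    """Encode (N, 3) coordinates into (N, d_model) sine-cosine features."""
    bands = max(d_model // 6, 1)                                      # sin+cos bands per axis
    div = torch.pow(10000.0, torch.arange(bands, dtype=xyz.dtype) / bands)
    angles = xyz.unsqueeze(-1) / div                                  # (N, 3, bands)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)   # (N, 3, 2*bands)
    enc = enc.flatten(1)                                              # (N, 6*bands)
    return F.pad(enc, (0, d_model - enc.shape[1]))                    # zero-pad up to d_model

# the features entering the cross-encoder are the dense features plus this encoding, e.g.:
# fused_feats = dense_feats + sine_position_encoding(dense_xyz, dense_feats.shape[1])
```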
Our dense point cloud matching module is primarily composed of a Transformer Cross-Encoder module and an output decoding module. The Transformer Cross-Encoder module consists of a self-attention module, a cross-attention module, and an FFN module. Figure 5a depicts the overall calculation process of attention, Figure 5b represents the calculation of self-attention for feature extraction, and Figure 5c illustrates the attention interaction calculation between the two point clouds. The FFN module comprises two layers of linear transformation units, serving to transform and integrate feature channels in the interaction of multi-layer attention.
Figure 5.
The dense matching module structure: (a) depicts the overall computation of the dense matching module, (b,c) depict the calculation of the self-attention and cross-attention modules.
Taking dense spatial coordinates and the features of point cloud located in the overlapping region as an example, we illustrate the feature fusion and interaction of dense point feature pairs in Figure 5b,c. Firstly, are encoded using sine encoding, followed by the fusion with the feature to calculate the self-attention context feature . The formulas for feature fusion and the self-attention of , , and and the context feature are defined as follows:
where the variables , , and of self-attention are shared. The corresponding learnable weights , , and are shared by the dense point clouds and of point clouds and . Similarly, we replace the input variables of the self-attention module (b) in the Transformer Cross-Encoder in Figure 5 with and and compute the global context feature vector of point cloud based on Equations (16)–(19). Consequently, we have obtained the context feature vectors and of the matching point pairs of dense point clouds and within the overlapping region, which serve as inputs to the cross-attention module for further feature fusion.
The computational process of the cross-attention is illustrated in Figure 5c. Subsequently, we input and into the cross-attention module (c) in sequential order (first and then ), and compute the interacted feature , which integrates more accurate encoding of dense spatial coordinates and contains global feature interaction information of the point cloud to be matched, thus possessing better discriminative power. The corresponding formula for this computation is defined as follows:
where , , and are the learnable parameters of the cross-attention variables , , and . Similarly, we interchange the input order of and in Figure 5c and the corresponding computation Formulas (20)–(23). By first computing in the sequence of and then , we obtain the feature for dense point pairs located in the overlapping region. The resulting and contain positional encoding and feature information from the other, representing dense global features with enhanced discriminative power. These features will be directly utilized in the output decoding module to predict corresponding point coordinates and overlap scores.
We feed and as inputs to the output decoding module, which consists of four linear layers. Among these layers, three are utilized for predicting the corresponding coordinates of points, while the fourth linear layer is responsible for predicting scores indicating whether the matched points are within the overlapping area. The network structure is illustrated in Figure 6, and the detailed mathematical expressions are presented as follows:
where , , , , , , , and are the learnable weights of the corresponding point and overlap prediction linear layers in the output decoding module. We concatenate the dense points located in the common area with the predicted corresponding points . Similarly, we concatenate the dense points within the common area with the network-predicted corresponding points . After concatenation, we obtain dense matched points and in the overlapping region with a data length of . The corresponding overlap score is denoted as , and the formulas are defined as follows:
We calculate the relative transformation matrices and of the point clouds to be matched by minimizing the loss function of dense matching points:
where and represent the data lengths of and , , and , respectively. , , and are the estimated corresponding points and overlap score weights from Equation (26). We obtain our estimated relative pose transformation matrix by solving Equation (27) using the Kabsch–Umeyama algorithm [,] for the optimal solution.
Figure 6.
The overview of output decoder structure and transformation calculation.
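The closed-form solution of Equation (27) with overlap-score weights can be sketched with a weighted Kabsch–Umeyama estimator as below; this is an illustrative implementation with assumed variable names, not the authors' code.

```python
import torch

def weighted_kabsch(src: torch.Tensor,     # (N, 3) dense points in the overlap of the first cloud
                    tgt: torch.Tensor,     # (N, 3) predicted corresponding points in the second cloud
                    w: torch.Tensor):      # (N,) predicted overlap-score weights
    w = (w / w.sum().clamp_min(1e-12)).unsqueeze(1)        # (N, 1) normalized weights
    src_c, tgt_c = (w * src).sum(0), (w * tgt).sum(0)      # weighted centroids
    A, B = src - src_c, tgt - tgt_c                        # centred point sets
    H = A.T @ (w * B)                                      # (3, 3) weighted cross-covariance
    U, _, Vt = torch.linalg.svd(H)
    D = torch.eye(3, dtype=src.dtype)
    D[2, 2] = torch.sign(torch.linalg.det(Vt.T @ U.T))     # keep a proper rotation (det = +1)
    R = Vt.T @ D @ U.T
    t = tgt_c - R @ src_c
    return R, t
```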
2.6. Loss Function
Our optimization target loss comprises two parts: the superpoint loss and the dense point loss , i.e., . The superpoint loss function constrains and predicts superpoint correspondences located in the overlapping region, while the dense point loss function directly enforces the architecture to predict dense correspondences.
2.6.1. Superpoint Correspondences Loss Function
We utilize the overlapping circle loss from [] to choose the K best pairs of superpoints with the greatest similarity scores. The function is defined as follows:
where represents the distance between feature vectors and and are the features of superpoints in point clouds and , respectively. Following the three-level downsampling of point clouds as shown in Figure 2, we have and and , where denotes the degree of overlap between dense points corresponding to the superpoint pair and . We use and to weight the matching points and non-matching points, respectively, enhancing the discriminative power of the function, where we set the hyperparameters and as suggested in []. Similarly, the loss function for is computed using a similar method, resulting in the complete superpoint loss function .
2.6.2. Point Correspondences Loss Function
The dense point loss function comprises three components: overlap loss, corresponding point loss, and feature loss, i.e., , which, respectively, constrain the overlapping regions in three-dimensional space, match corresponding points in three-dimensional space, and supervise the learning of the joint network for feature vector space matching.
Overlap Loss
Through the superpoint matching module in Section 2.4, we obtained dense point pairs located in the overlapping region. To facilitate the network in acquiring additional characteristics of corresponding areas, we divide the point cloud match pairs into overlapping and non-overlapping regions. We further constrain the dense points corresponding to matching superpoints with similar structures but low overlap rates using a cross-entropy loss function. This helps reduce matching errors and improve matching accuracy, and the formula for overlap score constraint is defined as follows:
where represents the likelihood score predicted by the network of whether a point belongs to the overlapping area; we calculate it using the method proposed in [,], defined by the formula:
where represents the rigid transformation matrix of the ground truth relative pose change for the pair of point clouds, is the threshold to determine whether a pair of dense matching points match, and denotes the spatial nearest neighbor calculation. Similarly, we can derive the dense overlap loss function of , and the complete dense point overlap loss function is given by .
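A hedged sketch of producing the ground-truth overlap labels and the corresponding cross-entropy term is given below: dense points are warped with the ground-truth pose, and a point is labeled as overlapping if its nearest neighbor in the other cloud lies within the distance threshold. Function names and the binary-cross-entropy wrapper are assumptions.

```python
import torch
import torch.nn.functional as F

def overlap_labels(src: torch.Tensor, tgt: torch.Tensor,
                   R_gt: torch.Tensor, t_gt: torch.Tensor,
                   radius: float) -> torch.Tensor:
    """Label a source point 1 if its ground-truth-warped nearest neighbour in tgt is within radius."""
    warped = src @ R_gt.T + t_gt                              # warp src into tgt's frame
    nn_dist = torch.cdist(warped, tgt).min(dim=1).values      # nearest-neighbour distances
    return (nn_dist < radius).float()

def overlap_loss(pred_scores: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between predicted overlap probabilities and the labels."""
    return F.binary_cross_entropy(pred_scores, labels)
```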
Corresponding Point Loss
We constrain the three-dimensional points in the overlapping region by minimizing the Euclidean distance of the corresponding points in three-dimensional space. For dense points outside the overlapping region, we use the overlap degree as a weight. The formula is defined as follows:
where is the ground truth overlap degree, and are a pair of dense matching points as in Equation (26), and is the number of dense three-dimensional points within the overlapped region in point cloud . Similarly, we calculate the corresponding point loss function for the point cloud . The overall corresponding point loss function is given by .
Feature Loss
We utilize the infoNCE loss [] to supervise the network learning of dense point feature vectors, encouraging the network to learn more similar features for matching points. Here, and represent a pair of dense matching points as in Equation (26). The definition of the feature loss function infoNCE is as follows:
We measure the similarity of features using a logarithmic bilinear function [], where the function is defined as:
where , with representing the dense feature obtained from the three times downsampled and two times upsampled feature extraction network structure shown in Figure 2. is the feature vector of its corresponding point. and indicate whether the point in matches , with the matching determined by the positive boundary and negative boundary , where is the radius of the voxel size. is a learnable weight matrix that is diagonal and symmetric. Similarly, the overall feature loss is given by .
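The feature loss can be illustrated with a simplified InfoNCE sketch using a log-bilinear similarity with a learnable diagonal weight, as described above; the sampling of positives and negatives by the distance boundaries is omitted here, and all names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class InfoNCEFeatureLoss(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Parameter(torch.ones(dim))      # diagonal of the learnable bilinear weight matrix

    def forward(self, anchors: torch.Tensor,        # (N, d) dense features from one cloud
                candidates: torch.Tensor,           # (K, d) candidate features from the other cloud
                pos_idx: torch.Tensor) -> torch.Tensor:  # (N,) index of each anchor's positive
        sim = (anchors * self.w) @ candidates.T     # (N, K) log-bilinear similarities f^T diag(w) g
        return nn.functional.cross_entropy(sim, pos_idx)
```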
3. Results
3.1. Datasets
3.1.1. Indoor Benchmarks: 3DMatch and 3DLoMatch
The 3DMatch [] and 3DLoMatch [] datasets were introduced to address the challenges of 3D scene understanding and alignment. The datasets consist of RGB-D scans of various indoor scenes and provide aligned point clouds and RGB images for each scene, along with ground truth transformations that represent the accurate relative poses between pairs of point clouds. One key feature of the datasets is their diversity in terms of scene types, object categories, and sensor noise. The scenes include different indoor environments, such as living rooms, kitchens, and offices, with varying levels of clutter and occlusion. This diversity helps to analyze the generalization capabilities of point cloud registration algorithms. The 3DLoMatch dataset contains significant geometric variations, occlusions, and partial overlaps (between 10% and 30%); its overlapping regions are smaller than those in 3DMatch (>30%), making accurate alignment and pose estimation difficult. This makes the dataset suitable for evaluating the performance of various point cloud registration methods under realistic conditions.
3.1.2. Synthetic Benchmarks: ModelNet and ModelLoNet
We also utilize the ModelNet40 [] benchmark to further assess our model. We follow the dataset settings proposed by [,,] to obtain ModelNet and ModelLoNet, respectively. These datasets exhibit different average overlap ratios, approximately 73.5% for ModelNet and 53.6% for ModelLoNet. The ModelNet40 dataset provides a well-balanced distribution of object categories, including chairs, tables, airplanes, cars, and more, guaranteeing representation from diverse classes. The objects within this dataset are captured from multiple angles and poses, offering a realistic and comprehensive depiction of real-world objects. This diversity presents challenges for algorithms, as they need to handle the different orientations, partial views, and inherent noise present in the data.
3.2. Experiment Details
For 3DMatch and ModelNet40, we set the voxel size to 0.025 m and 0.015 m, respectively, with the voxel size doubling at each downsampling step. Training is conducted only on the 3DMatch and ModelNet datasets, and evaluation is performed not only on 3DMatch and ModelNet but also on 3DLoMatch and ModelLoNet. We select 32 superpoints, with a maximum of 64 dense points associated with each superpoint, in the training and testing phases of 3DMatch. For the ModelNet40 dataset, we unify the number of training superpoints, testing superpoints, and the maximum number of dense points associated with each superpoint to 128. We utilize the AdamW optimizer with a consistent initial learning rate of 0.0001. For 3DMatch, the learning rate is halved every 20 epochs, whereas for ModelNet, it is halved every 100 epochs. Training concludes upon reaching 900k iterations. The training and testing processes are carried out on an Nvidia RTX 3090Ti GPU. We set the batch size to 1 for both 3DMatch and ModelNet40.
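For reference, the optimization schedule stated above (AdamW with an initial learning rate of 0.0001, halved every 20 epochs on 3DMatch) corresponds to a setup along the following lines; `model` is a placeholder, and the scheduler choice is an assumption.

```python
import torch

model = torch.nn.Linear(3, 3)  # placeholder for the TTReg network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# halve the learning rate every 20 epochs (3DMatch); step_size=100 would correspond to ModelNet
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)
```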
3.3. Evaluation
3.3.1. Evaluation of 3DMatch and 3DLoMatch
In order to evaluate our approach’s effectiveness, we utilize the registration recall (RR) metric configuration proposed in [,,] for measuring the success rate of registration. Moreover, we apply the relative rotation error (RRE) to evaluate the accuracy of rotation matrices and the relative translation error (RTE) to assess discrepancies in translation vector estimations, commonly employed for analyzing transformation matrix errors. RegTR [] directly regresses poses using sparse superpoints, which serves as our baseline for assessment (see Table 1).
Table 1.
The registration performance on 3DMatch and 3DLoMatch datasets.
In Table 1, the methods above the line are based on RANSAC, and those below are non-RANSAC methods. Our method notably enhances point cloud registration performance within limited overlapping regions, with an increase of over 7 percentage points and lower registration errors. Moreover, it also demonstrates better registration performance for highly overlapping point clouds.
To showcase the exceptional capability of our suggested model, we illustrate the validation set’s test performance curves throughout the training phase, as displayed in Figure 7. The initial row illustrates the test curves for the 3DMatch dataset, whereas the subsequent row exhibits the test curves for ModelNet. These graphs depict the RR, RRE, and RTE. Our model converges rapidly, attains superior registration recall rates, and demonstrates reduced matching errors.
Figure 7.
The evaluation curves during the training process for 3DMatch (a–c) and ModelNet (d–f).
In Figure 8, we present the point cloud registration capability on the 3DLoMatch dataset, focusing on pairs with small degrees of overlap and high structural similarity outside the overlapping regions. It can be observed that our method generates more feature correspondences, concentrated predominantly within the overlap regions, whereas the baseline method also produces correspondences within the overlap regions but includes some erroneous matches outside these regions, which significantly impacts registration capability.
Figure 8.
The performance of our method on 3DLoMatch. Each column corresponds to a different pair of point clouds. The red and green points signify point clouds and . Row (a) shows the superpoint correspondences obtained by the baseline, row (b) displays the dense point correspondences computed by our method, row (c) illustrates the registration of the baseline, row (d) depicts the registration of our method, and row (e) showcases the registration using ground truth poses.
3.3.2. Evaluation of ModelNet and ModelLoNet
In the case of the ModelNet and ModelLoNet benchmarks, we refer to the relevant method [,] to evaluate point cloud registration error using the RRE, RTE, and Chamfer distance (CD). Since RR is a key metric for assessing the success of point cloud registration, we further assess the performance of our method in terms of its RR.
Similarly, we provide detailed demonstrations of the registration performance on ModelNet and ModelLoNet in Table 2 and Figure 7 and Figure 9. The experimental results show that the proposed TTReg not only accomplishes strong performance in registration in real-world scenarios but also achieves significant improvements on synthesized datasets. The dense matching points computed by our method are mainly concentrated within the overlap regions, effectively enhancing the registration performance.
Table 2.
The registration performance on ModelNet and ModelLoNet datasets.
Figure 9.
The performance of our method on ModelLoNet. Columns correspond to different point cloud pairs. The red and green points signify point cloud and . Row (a) shows the superpoint correspondences obtained by the baseline method, row (b) displays the dense point correspondences computed by our method, row (c) illustrates the registration of the baseline, row (d) depicts the registration of our method, and row (e) showcases the registration using ground truth poses.
3.4. Ablation
To corroborate the efficacy of our TTReg, we evaluate the impact of the number of repetitions of the proposed RIG-Transformer on the 3DMatch and ModelNet benchmarks, as well as on the low-overlap 3DLoMatch and ModelLoNet benchmarks. Following prior works [,,], we assess the RR, RRE, and RTE for 3DMatch and 3DLoMatch, while for the ModelNet and ModelLoNet datasets, we evaluate the CD, RRE, and RTE. The quantitative performance metrics for 3DMatch and 3DLoMatch are presented in Table 3, while those for ModelNet and ModelLoNet are shown in Table 4.
Table 3.
The ablation performance on 3DMatch and 3DLoMatch datasets.
Table 4.
The ablation performance on ModelNet and ModelLoNet datasets.
We consider values of , with the maximum value limited to 5 due to computational constraints. From Table 3 and Table 4, we observed that increasing appropriately leads to improved matching performance, with optimal results achieved at . Further increasing may offer additional improvements. However, due to computational limitations, we do not test cases where . Notably, our method incurs significantly lower costs when increasing compared to previous methods, as we only match the dense points with the highest correspondence in the overlapping regions, greatly reducing the computational resources required.
4. Discussion
To further investigate the impact of repetition times of RIG-Transformer during the training process on registration performance, we visualize the evaluation curves for 3DMatch and ModelNet (see Figure 10). The RR improves with increasing , while the RRE and RTE decrease as increases. The evaluation curves during the training process align with the registration performance presented in Table 3 and Table 4.
Figure 10.
The impact of RIG-Transformer layer on registration performance during the training process for 3DMatch (a–c) and ModelNet (d–f).
Furthermore, we analyze the distribution of dense points in the overlapping regions predicted by the RIG-Transformer module. The corresponding dense points are illustrated in Figure 11 and Figure 12. We first compute the sparse matching keypoints predicted by the baseline [] and the dense corresponding points obtained by our RIG-Transformer module. Then, we align the point clouds and using the ground truth relative pose transformation. The gray connecting lines in the figures link the predicted matching corresponding points. In point cloud pairs with high overlap ratios, sparse corresponding keypoints are predominantly located in the overlapping regions. Conversely, for point cloud pairs with low overlap ratios, numerous unmatched keypoints appear in the non-overlapping areas.
Figure 11.
Predicted 3DLoMatch overlap area. Points in red and green represent point clouds and , respectively; gray lines represent the connection relationship between corresponding points. The first row (a) shows the correspondence of sparse matching keypoints from the baseline, and the second row (b) displays the correspondence of dense points predicted by our model located in the overlapping area, with each row representing a pair of point clouds to be matched.
Figure 12.
Predicted ModelLoNet overlap area, where points in red represent point cloud , points in green represent point cloud , and gray lines represent the relationship between corresponding points. The first row (a) shows the correspondence of sparse matching keypoints from the baseline, and the second row (b) displays the correspondence of dense points predicted by our model located in the overlapping area, with each row representing a pair of point clouds to be matched.
Our model predicts dense points that are primarily clustered within the overlapping regions, particularly in regions with low overlap ratios. This outcome is credited to our model’s enhanced capacity to thoroughly investigate the structural characteristics of point cloud sets, leading to improved registration performance by directly predicting the relative pose transformation of point clouds.
5. Conclusions
Our proposed method calculates the relative transformation of point clouds by regressing the corresponding point coordinates of dense point clouds using two cascaded Transformers. We divide point cloud registration into two steps. First, we divide the points of a pair of point clouds into overlapping and non-overlapping regions. The proposed RIG-Transformer is used to distinguish the best-matching sparse superpoints located in the overlapping region, which transforms point-to-point matching into a binary classification and reduces the difficulty of the task. The proposed RIG-Transformer integrates point cloud geometric features and positional encoding and possesses rotational invariance. By extracting richer geometric features and improving the robustness of feature matching, RIG-Transformer can effectively filter out incorrect superpoint correspondences that have high structural similarity but lie outside the overlapping area. Subsequently, the dense point clouds are indexed through the spatial clustering relationship between superpoints and dense points. The dense point clouds located in the overlapping region play a key role in point cloud registration and have high spatial coordinate accuracy. By using the Transformer Cross-Encoder, corresponding point coordinates can be regressed with higher precision, thereby enhancing the estimated transformation accuracy. By combining RIG-Transformer with a Transformer Cross-Encoder, we directly regress the transformation between dense points within the overlapping region. Our approach leverages both the geometric properties of features and the precision of the point coordinates in dense point clouds. Importantly, our regression mechanism avoids the time overhead incurred by using RANSAC. However, due to constraints on computational resources, we did not conduct extensive testing on the number of interaction iterations of RIG-Transformer.
Author Contributions
Conceptualization, Y.Z.; methodology, Y.Z.; software, Y.Z.; validation, Y.Z.; formal analysis, Y.Z.; investigation, Y.Z.; resources, Y.Z.; data curation, Y.Z.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z., L.C., Q.Z., J.Z., H.W. and M.R.; visualization, Y.Z.; supervision, Y.Z. and M.R.; project administration, Y.Z. and M.R.; funding acquisition, M.R. All authors have read and agreed to the published version of the manuscript.
Funding
This research was supported in part by the National Natural Science Foundation of China under Grant 61703209 and Grant 62272231, in part by the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (No. 19KJB520022), in part by the Qing Lan Project of Jiangsu Province (2021), and in part by the Cultivation Object of Major Scientific Research Project of CZIMT (No. 2019ZDXM06). The APC was funded by Nanjing University of Science and Technology.
Data Availability Statement
All datasets used in this study are publicly available. The 3DMatch dataset is available at https://share.phys.ethz.ch/~gsg/pairwise_reg/3dmatch.zip (accessed on 15 August 2023), and the ModelNet dataset is available at https://shapenet.cs.stanford.edu/media/modelnet40_ply_hdf5_2048.zip (accessed on 15 August 2023).
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| RIG-Transformer | Rotation-Invariant Geometric Transformer Cross-Encoder |
| TTReg | Transformer-to-Transformer Regression |
| RANSAC | Random Sample Consensus |
| FPS | Farthest Point Sampling |
| RR | Registration Recall |
| RRE | Relative Rotation Error |
| RTE | Relative Translation Error |
| CD | Chamfer Distance |
References
- Chen, Y.; Mei, Y.; Yu, B.; Xu, W.; Wu, Y.; Zhang, D.; Yan, X. A robust multi-local to global with outlier filtering for point cloud registration. Remote Sens. 2023, 15, 5641. [Google Scholar] [CrossRef]
- Sumetheeprasit, B.; Rosales Martinez, R.; Paul, H.; Shimonomura, K. Long-range 3D reconstruction based on flexible configuration stereo vision using multiple aerial robots. Remote Sens. 2024, 16, 234. [Google Scholar] [CrossRef]
- Choy, C.; Park, J.; Koltun, V. Fully convolutional geometric features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8958–8966. [Google Scholar]
- Han, T.; Zhang, R.; Kan, J.; Dong, R.; Zhao, X.; Yao, S. A point cloud registration framework with color information integration. Remote Sens. 2024, 16, 743. [Google Scholar] [CrossRef]
- Mei, G.; Tang, H.; Huang, X.; Wang, W.; Liu, J.; Zhang, J.; Van Gool, L.; Wu, Q. Unsupervised deep probabilistic approach for partial point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13611–13620. [Google Scholar]
- Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. Adv. Neural Inf. Process. Syst. 2013, 26, 2292–2300. [Google Scholar]
- Qin, Z.; Yu, H.; Wang, C.; Guo, Y.; Peng, Y.; Xu, K. Geometric transformer for fast and robust point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11143–11152. [Google Scholar]
- Fischler, M.A.; Bolles, R.C. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 1981, 24, 381–395. [Google Scholar] [CrossRef]
- Yu, H.; Qin, Z.; Hou, J.; Saleh, M.; Li, D.; Busam, B.; Ilic, S. Rotation-invariant transformer for point cloud matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 5384–5393. [Google Scholar]
- Yew, Z.J.; Lee, G.H. Regtr: End-to-end point cloud correspondences with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 6677–6686. [Google Scholar]
- Wu, Y.; Zhang, Y.; Ma, W.; Gong, M.; Fan, X.; Zhang, M.; Qin, A.; Miao, Q. Rornet: Partial-to-partial registration network with reliable overlapping representations. IEEE Trans. Neural Netw. Learn. Syst. 2023. [Google Scholar] [CrossRef] [PubMed]
- Zhao, Y.; Chen, L.; Hu, B.; Wang, H.; Ren, M. HR-Net: Point cloud registration with hierarchical coarse-to-fine regression network. Comput. Electr. Eng. 2024, 113, 109056. [Google Scholar] [CrossRef]
- Wang, H.; Liu, Y.; Hu, Q.; Wang, B.; Chen, J.; Dong, Z.; Guo, Y.; Wang, W.; Yang, B. Roreg: Pairwise point cloud registration with oriented descriptors and local rotations. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10376–10393. [Google Scholar] [CrossRef] [PubMed]
- Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
- Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
- Arya, S.; Mount, D.M.; Netanyahu, N.S.; Silverman, R.; Wu, A.Y. ANN: A library for approximate nearest neighbor searching. ACM Trans. Math. Softw. (TOMS) 1999, 26, 469–483. [Google Scholar]
- Huang, S.; Gojcic, Z.; Usvyatsov, M.; Wieser, A.; Schindler, K. Predator: Registration of 3d point clouds with low overlap. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 4267–4276. [Google Scholar]
- Li, J.; Chen, B.M.; Lee, G.H. SO-Net: Self-organizing network for point cloud analysis. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9397–9406. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
- Rocco, I.; Cimpoi, M.; Arandjelović, R.; Torii, A.; Pajdla, T.; Sivic, J. Neighbourhood consensus networks. Adv. Neural Inf. Process. Syst. 2018, 31, 1651–1662. [Google Scholar]
- Sun, J.; Shen, Z.; Wang, Y.; Bao, H.; Zhou, X. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 8922–8931. [Google Scholar]
- Kabsch, W. A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. Sect. A Cryst. Phys. Diffr. Theor. Gen. Crystallogr. 1976, 32, 922–923. [Google Scholar] [CrossRef]
- Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380. [Google Scholar] [CrossRef]
- Lu, F.; Chen, G.; Liu, Y.; Zhang, L.; Qu, S.; Liu, S.; Gu, R. Hregnet: A hierarchical network for large-scale outdoor lidar point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 16014–16023. [Google Scholar]
- Zeng, A.; Song, S.; Nießner, M.; Fisher, M.; Xiao, J.; Funkhouser, T. 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1802–1811. [Google Scholar]
- Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1912–1920. [Google Scholar]
- Yew, Z.J.; Lee, G.H. Rpm-net: Robust point matching using learned features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11824–11833. [Google Scholar]
- Bai, X.; Luo, Z.; Zhou, L.; Fu, H.; Quan, L.; Tai, C.L. D3feat: Joint learning of dense detection and description of 3d local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6359–6367. [Google Scholar]
- Gojcic, Z.; Zhou, C.; Wegner, J.D.; Wieser, A. The perfect match: 3d point cloud matching with smoothed densities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5545–5554. [Google Scholar]
- Xu, H.; Liu, S.; Wang, G.; Liu, G.; Zeng, B. Omnet: Learning overlapping mask for partial-to-partial point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 19–25 June 2021; pp. 3132–3141. [Google Scholar]
- Choy, C.; Dong, W.; Koltun, V. Deep global registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2514–2523. [Google Scholar]
- Cao, A.Q.; Puy, G.; Boulch, A.; Marlet, R. PCAM: Product of cross-attention matrices for rigid registration of point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Nashville, TN, USA, 19–25 June 2021; pp. 13229–13238. [Google Scholar]
- Aoki, Y.; Goforth, H.; Srivatsan, R.A.; Lucey, S. Pointnetlk: Robust & efficient point cloud registration using pointnet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7163–7172. [Google Scholar]
- Wang, Y.; Solomon, J.M. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3523–3532. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).