Article

GraM: Geometric Structure Embedding into Attention Mechanisms for 3D Point Cloud Registration

Pin Liu, Lin Zhong, Rui Wang, Jianyong Zhu, Xiang Zhai and Juan Zhang
1 School of Information Engineering, China University of Geosciences, Beijing 100083, China
2 School of Computer Science and Engineering, Beihang University, Beijing 100191, China
3 Department of Computer, North China Electric Power University, Beijing 102206, China
4 China Reasset Management Ltd., Beijing 100033, China
5 Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne NE1 8ST, UK
* Author to whom correspondence should be addressed.
Electronics 2024, 13(10), 1995; https://doi.org/10.3390/electronics13101995
Submission received: 18 February 2024 / Revised: 24 April 2024 / Accepted: 25 April 2024 / Published: 20 May 2024
(This article belongs to the Special Issue Machine Intelligent Information and Efficient System)

Abstract
3D point cloud registration is a crucial technology for 3D scene reconstruction and has been successfully applied in domains such as smart healthcare and intelligent transportation. Through theoretical analysis, we find that geometric structural relationships are essential for 3D point cloud registration: a registration method achieves excellent performance only when it fuses local and global features with geometric structure information. Based on these findings, we propose GraM, a 3D point cloud registration method that embeds geometric structure into the attention mechanism. GraM extracts local features of non-critical points and global features of corresponding points, both enriched with geometric structure information. Given these local and global features, a simple regression operation suffices to obtain the transformation matrix of a point cloud pair, thereby avoiding feature semantics that ignore geometric structural relationships. GraM surpasses the state-of-the-art results by 0.548° and 0.915° in relative rotation error on ModelNet40 and LowModelNet40, respectively.

1. Introduction

Three-dimensional reconstruction builds digital 3D models of real-world scenes, promoting the development of applications such as augmented reality, autonomous driving, and digital twins [1,2,3,4,5]. Point cloud data are fundamental for building such digital 3D models, and 3D point cloud registration is the core technology for achieving 3D reconstruction by providing stereoscopic models [6,7]. In recent years, the rapid development of sensor technologies has enabled 3D point cloud data to visualize the world, further promoting technological innovation in practice.
Three-dimensional point cloud registration is a crucial step in the 3D reconstruction process. It aims to learn the local information of an object from multiple perspectives and integrate it into a unified view to obtain objects with rich global information. Specifically, it evaluates the correlation of corresponding points across point cloud samples by learning the characteristics of each sample under multiple viewing angles. It then estimates the rigid transformation parameters that allow all corresponding points to be transformed directly in 3D space and obtains the transformation matrix of the point cloud pair, thereby achieving the goal of 3D point cloud registration.
In recent research, many methods utilize attention mechanisms to learn point cloud features [8,9]. For instance, the Transformer can extract the local and global features of the corresponding points between two clouds to obtain the transformation matrix [10,11,12]. However, the features extracted by the traditional Transformer do not contain the geometric information that is critical for point cloud registration, which limits the achievable performance. Therefore, extracting features that contain geometric structure information remains a significant challenge for 3D point cloud registration.
To address this challenge, we propose a 3D point cloud registration method that embeds geometric structure into the attention mechanism, forming an end-to-end registration framework. More specifically, REGTR [13] is a classic 3D point cloud registration model; its key innovation is to replace traditional feature matching with the attention mechanism of the Transformer, which can effectively capture global and local features. Nonetheless, the extracted features do not include geometric structure information, which is extremely important for registering point cloud data. To this end, we take REGTR as the primary architecture and introduce two embedded modules to extract geometric structure. The two modules are bound to the two original self-attention structures of REGTR, respectively, to learn richer features that contain geometric structure information. Fusing global and local features that include geometric structure information improves the accuracy of point cloud registration. Extensive experimental validation demonstrates that the proposed method significantly outperforms state-of-the-art methods.
The main contributions of this paper are summarized as follows:
  • To the best of our knowledge, this is the first work to embed geometric structure into an improved REGTR network. The proposed GraM effectively integrates local features enriched with geometric structure information and global features.
  • We introduce the attention mechanism to the point cloud registration task and optimize the feature extraction on the REGTR network, significantly improving the accuracy and efficiency of the low-overlap point cloud registration task.
  • Comprehensive experiments on the reconstructed ModelNet40 and KITTI datasets show that GraM obtains better accuracy than state-of-the-art methods.
The remainder of this paper is organized as follows. Section 2 illustrates the related work on 3D point cloud registration technology. Descriptions of the problem definition and the core technology used are in Section 3. Our research methodology and specific implementation steps are introduced in detail in Section 4. Section 5 evaluates the performance of our proposed 3D point cloud registration method. Finally, we summarize the paper and present future research in Section 6.

2. Related Work

Extensive research has been conducted on 3D point cloud registration technologies, which include optimization-based registration, feature learning-based registration, and end-to-end learning-based registration.
Optimization-based point cloud registration. Besl et al. [14] proposed the classic Iterative Closest Point (ICP) algorithm, which iteratively estimates corresponding points between two point clouds and their transformation matrix to achieve registration. IMLP was proposed in [15] to improve the corresponding point estimation of ICP by incorporating measurement noise into the transformation estimation. Segal et al. [16] proposed a generalized version of ICP that allows for the inclusion of arbitrary covariance matrices in ICP variants using point-to-plane metrics. Zhu et al. [17] proposed a graph registration method that simultaneously considers vertices and edges to find point-to-point correspondences between two graphs. Huang et al. [18] introduced a novel pruning module to enhance deep learning-based point cloud registration in low overlap scenarios (Predator), resulting in significant performance improvements. However, the computational efficiency and registration accuracy were significantly decreased when these methods were used to deal with large-scale datasets and low-overlap scenarios.
Feature learning-based point cloud registration. Zeng et al. [19] introduced 3DMatch, a parallel network trained from RGB-D images, to extract features by combining the local structure around critical points and further capture the local characteristics of the 3D point cloud. The network 3DFeatNet [20] uses a weakly supervised approach to learn feature correspondences from 3D point clouds. RPMNet [21] can obtain soft correspondences of points in partially overlapping point clouds from a mixture of features learned from spatial coordinates and local geometry. A dynamic graph convolutional neural network is employed in Deep Closest Point (DCP) [22] for feature extraction, which then uses an attention module to learn the correspondence between two point clouds; it still utilizes an SVD module to calculate the rotation matrix and translation vector required for the transformation. These algorithms cannot optimize post-processing operations through learning during training, resulting in significant limitations in performance. Wang et al. [23] proposed a novel local descriptor-based framework (YOHO). It employs rotation-equivariant descriptors to achieve robust and efficient point cloud registration with superior performance compared to conventional methods. Recently, Zhang et al. [24] presented a novel approach utilizing rotation-invariant features and spatial geometric consistency for robust partial-to-partial point cloud registration, outperforming existing methods, particularly in handling large rotations. Liu et al. [25] proposed a group-wise contrastive learning (GCL) scheme to extract density-invariant geometric features.
End-to-end learning-based point cloud registration. The core idea of these methods is to add the transformation matrix to the learning network to avoid the impact of post-processing operations on the algorithm’s performance. Deng et al. [26] proposed a relative pose regression network that can directly estimate the relative pose of point clouds based on features learned from local descriptors. Yasuhiro et al. [27] proposed PointNetLK by combining PointNet with the Lucas–Kanade (LK) algorithm [28]. DGR [29] utilized fully convolutional geometric features (FCGFs) [30] for feature extraction from point clouds. The six-dimensional convolutional network structure was employed to estimate point correspondences, and a weighted Procrustes model was used to estimate the transformation. Yu et al. [31] employed rotation-invariant and globally aware descriptors for robust point cloud registration (RIGA), surpassing state-of-the-art methods, especially in managing large rotations across diverse datasets. Recently, Yew et al. [13] applied Transformer for the first time in the point cloud registration. It extracted global and local features through a multi-headed attention mechanism (REGTR), alleviating the problem of point cloud registration at low overlap. These methods can achieve higher accuracy but require more complex network structures and greater computational power.
Although the above methods can register point cloud data to a certain extent, they ignore the most critical geometric structure information. Only when the extracted features include geometric structure information will the obtained 3D coordinates be more accurate, further improving registration accuracy.

3. Preliminary

3.1. Problem Definition

The 3D point cloud registration task can be described as follows: given two point clouds to be registered, $X \in \mathbb{R}^{M \times 3}$ and $Y \in \mathbb{R}^{N \times 3}$, $X$ represents the source point cloud and $Y$ represents the target point cloud, where $M$ and $N$ are the numbers of points in the source and target point clouds, respectively. The task of 3D point cloud registration is to utilize a rigid transformation composed of a rotation matrix $R \in SO(3)$ and a translation vector $t \in \mathbb{R}^3$ to align the source point cloud $X$ with the target point cloud $Y$. Therefore, 3D point cloud registration is the process of finding the optimal rotation matrix $R$ and translation vector $t$.
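As a concrete illustration of this problem setup, the short sketch below (a toy example, not code from the paper's repository) applies a known rigid transform $(R, t)$ to a source cloud; a registration method must recover such a transform when it is unknown.

```python
# Toy illustration of the rigid-transform model y_i = R x_i + t (not the paper's code).
import numpy as np

def apply_rigid_transform(X, R, t):
    """X: (M, 3) source points; R: (3, 3) rotation; t: (3,) translation."""
    return X @ R.T + t  # row-vector convention

theta = np.deg2rad(30.0)                       # example: 30-degree rotation about z
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.1, -0.2, 0.05])
X = np.random.rand(100, 3)                     # stand-in source point cloud
Y = apply_rigid_transform(X, R, t)             # perfectly aligned target in this toy case
```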

3.2. Transformer Model

The Transformer represents a paradigm shift in sequence modeling within the domain of deep learning. Its core concept lies in utilizing self-attention mechanisms for processing sequence data. This mechanism enables the model to dynamically allocate attention weights to various elements within the sequence without needing fixed window sizes or recurrent structures. Such flexibility allows the Transformer model to better capture global and local data information. Additionally, its effectiveness has been demonstrated in the field of computer vision. The self-attention mechanism comprises scaled dot-product attention and multi-head attention.
Scaled dot-product attention. Within the self-attention mechanism, attention weights are computed by scaling the dot product of the query and key vectors, applying a softmax, and using the result to weight the value vectors:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$ (1)
where $Q$, $K$, and $V$ represent the query, key, and value vectors, respectively, with $d_k$ denoting the dimensionality of the keys.
Multi-head attention. Multi-head attention augments the model's representative capacity by applying multiple query, key, and value projection sets in parallel. The results from the attention heads are concatenated and linearly transformed to obtain the final output:
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O}$ (2)
where each attention head $\mathrm{head}_i$ is computed as $\mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ represent the weight matrices for the linear transformations.
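The following NumPy sketch illustrates Equations (1) and (2); it is a minimal didactic implementation in which the projection matrices are passed in explicitly, not the Transformer used in GraM.

```python
# Illustrative NumPy sketch of scaled dot-product and multi-head attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v) -> (n, d_v)."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # attention weights over the m keys
    return weights @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of per-head projection matrices; W_o: output projection."""
    heads = [scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1, ..., head_h) W^O
```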

4. The Proposed Method

Transformation estimation from the corresponding points in two point cloud samples is crucial in point cloud registration. Corresponding point estimation involves identifying the correspondence between points in two point cloud frames corresponding to the same object or scene in the same location. Transformation estimation determines the rotation and translation operations required to align the corresponding positions of two point cloud frames seamlessly. The problem definition, the network architecture, and the loss function are presented and discussed in detail accordingly.

4.1. The Overall Network Architecture

REGTR is an end-to-end network based on the Transformer, which predicts, for each point in the source point cloud, the probability of lying in the overlapping region and its corresponding position in the target point cloud [13]. It effectively extracts global and local features and performs well in predicting the rigid transformation. To overcome the problem that REGTR does not capture the essential geometric structure, we propose the overall network architecture of GraM, which optimizes REGTR. Specifically, we bind two geometric structure embedding modules to the two self-attention layers in the Transformer cross-encoder, respectively. Consequently, they can learn local features that include geometric structure information.
The architecture of GraM is devised by embedding the end-to-end point cloud registration network with a geometric structure, as shown in Figure 1. The architecture contains three core modules: (1) Feature extraction module. We utilize the kernel point convolution backbone [32] to extract critical points’ features from the source and target point clouds while downsampling the input point cloud. (2) Cross-encoder module. A cross-encoder embedded with a geometric structure receives the features. It utilizes a multi-head self-attention layer to learn features of non-critical points within the point cloud itself and a multi-head cross-attention layer to learn features corresponding to points to be registered. (3) Output module. The output decoder obtains the predicted corresponding critical point positions and transformation matrices between the two point clouds using simple regression operations.

4.1.1. Feature Extraction

KPConv (Kernel Point Convolution) exhibits excellent spatial preservation when extracting features from 3D point cloud data. Owing to this spatial preservation, it can handle point cloud data of varying densities and shapes, showing outstanding performance across multiple tasks. We therefore optimize the feature extraction process for the preprocessed point cloud data using KPConv networks. Specifically, we build a double-input KPConv network with shared weights to extract homogeneous features from the source and target point cloud data; since both inputs come from the same dataset, the shared-weight design also allows them to be swapped. The network iteratively applies a series of residual blocks, including convolution, kernel point convolution, strided convolution, normalization modules, and LeakyReLU activation functions. Multiple rounds of feature extraction and downsampling are first performed on the input source point cloud $X \in \mathbb{R}^{M \times 3}$ and target point cloud $Y \in \mathbb{R}^{N \times 3}$, which are transformed into the critical point sets $\tilde{X} \in \mathbb{R}^{M \times 3}$ and $\tilde{Y} \in \mathbb{R}^{N \times 3}$, along with their features $F_{\tilde{X}} \in \mathbb{R}^{M \times D}$ and $F_{\tilde{Y}} \in \mathbb{R}^{N \times D}$.
With this design, we apply the feature extraction network separately to the source and target point clouds, obtaining critical-point features for both. This supports the subsequent feature learning in the cross-encoder.
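The sketch below illustrates the shared-weight (Siamese) idea: the same backbone processes both clouds so that their features live in one embedding space. A plain per-point MLP stands in for the KPConv residual backbone; it is an illustrative assumption, not the actual feature extractor used in GraM.

```python
# Shared-weight feature extraction sketch; an MLP stands in for the KPConv backbone.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self, out_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.LeakyReLU(0.1),
            nn.Linear(64, 128), nn.LeakyReLU(0.1),
            nn.Linear(128, out_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 3) -> per-point features (N, out_dim)
        return self.mlp(points)

backbone = SharedBackbone()
src, tgt = torch.rand(4096, 3), torch.rand(4096, 3)
src_feats = backbone(src)   # the same weights ...
tgt_feats = backbone(tgt)   # ... are applied to both clouds
```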

4.1.2. Cross-Encoder

We employ a Transformer cross-encoder network to learn the features of points within each point cloud and their correspondences with points in the other point cloud. Because the outputs from different depths of the feature extraction network have different dimensions, we introduce linear feature projection functions to reduce the output dimensions before passing them into the cross-encoder. Figure 2 shows the cross-encoder structure, which extracts both local and global features: the local features of non-critical points within each point cloud itself, and the global features that describe the correlation between the two point clouds.
Although the classical Transformer applies sinusoidal positional encoding to embed coordinate information, such coordinate-based encoding is not pose-invariant. As a result, when executing point cloud registration, the point cloud coordinates change if different initial poses are used for the same point cloud pair, and coordinate-based encoding fails in this case [33].
In this paper, we replace the sinusoidal positional encoding of point clouds in the cross-encoder with geometric structure position encoding. This modification enables the cross-encoder to learn the geometric structure relations between critical points before learning self-features and cross-features, further improving point cloud registration accuracy. Geometric structure position encoding includes pairwise distance embedding and triangular embedding: the former encodes the distance between a pair of critical points, and the latter encodes the angles formed by triplets of critical points.
Pairwise distance embedding. We assume that $\hat{P}_i$ and $\hat{P}_j$ are given points and that the distance between the two points is $d_{i,j} = \|\hat{P}_i - \hat{P}_j\|_2$. The pairwise distance embedding $r^{D}_{i,j}$ should satisfy
$r^{D}_{i,j,2k} = \sin\left(\dfrac{d_{i,j}/\sigma_d}{10000^{2k/d_t}}\right), \qquad r^{D}_{i,j,2k+1} = \cos\left(\dfrac{d_{i,j}/\sigma_d}{10000^{2k/d_t}}\right)$ (3)
where $d_t$ is the feature dimension and $\sigma_d$ is a temperature coefficient controlling the sensitivity to distance changes.
Triangular embedding. The triangular embedding is computed in the same manner. Assuming that the given angle is $\alpha^{k}_{i,j}$, the triangular embedding $r^{A}_{i,j,k}$ is calculated as
$r^{A}_{i,j,k,2x} = \sin\left(\dfrac{\alpha^{k}_{i,j}/\sigma_a}{10000^{2x/d_t}}\right), \qquad r^{A}_{i,j,k,2x+1} = \cos\left(\dfrac{\alpha^{k}_{i,j}/\sigma_a}{10000^{2x/d_t}}\right)$ (4)
where $\sigma_a$ is another temperature coefficient controlling the sensitivity to angle variations.
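A minimal sketch of the pairwise distance embedding in Equation (3) is given below; the feature dimension `d_t` and temperature `sigma_d` are illustrative values rather than the settings used in the released code, and the triangular embedding in Equation (4) follows the same pattern with angles in place of distances.

```python
# Sinusoidal pairwise-distance embedding sketch (Equation (3)).
import numpy as np

def pairwise_distance_embedding(points: np.ndarray, d_t: int = 256, sigma_d: float = 0.2):
    """points: (K, 3) critical points -> (K, K, d_t) distance embeddings."""
    diff = points[:, None, :] - points[None, :, :]
    d = np.linalg.norm(diff, axis=-1)           # d_{i,j} = ||p_i - p_j||_2
    k = np.arange(d_t // 2)
    div = 10000.0 ** (2 * k / d_t)              # 10000^{2k / d_t}
    angles = (d[..., None] / sigma_d) / div     # (K, K, d_t/2)
    emb = np.empty(d.shape + (d_t,))
    emb[..., 0::2] = np.sin(angles)             # even channels: r^D_{i,j,2k}
    emb[..., 1::2] = np.cos(angles)             # odd channels:  r^D_{i,j,2k+1}
    return emb
```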

4.1.3. Output Decoder

This paper’s output decoder differs from the original Transformer’s decoder in architecture. Since the cross-encoder has already learned the local and global features, there is no need to use attention mechanisms for decoding. Instead, simple regression operations are sufficient to estimate the corresponding positional coordinates and transformation matrices.
To estimate the corresponding positional coordinates, we use a two-layer MLP to regress the required coordinates. The corresponding positions $\hat{Y} \in \mathbb{R}^{M \times 3}$ of the source point cloud's critical points $\tilde{X}$ in the target point cloud are given by
$\hat{Y} = \mathrm{ReLU}\left(\bar{F}_{\tilde{X}} W_1 + b_1\right) W_2 + b_2$ (5)
where $W_1$, $W_2$, $b_1$, and $b_2$ are learnable weights and biases, respectively. The predicted positions $\hat{X} \in \mathbb{R}^{N \times 3}$ are obtained in the same way from the critical points $\tilde{Y}$ of the target point cloud. Simultaneously, we utilize a fully connected layer with a sigmoid activation function to predict the overlap confidences $\hat{O}_X \in \mathbb{R}^{M \times 1}$ and $\hat{O}_Y \in \mathbb{R}^{N \times 1}$. This design eliminates interference from points outside the overlapping region, whose corresponding relationships cannot be predicted accurately, significantly improving the accuracy of the transformation matrix estimation. After obtaining the predicted corresponding coordinates, the transformation matrix can be estimated. Concatenating the predicted correspondences of the two point clouds yields an $(M+N)$-dimensional corresponding point set, as shown in Equation (6):
$\hat{X}_{corr} = \begin{bmatrix} \tilde{X} \\ \hat{X} \end{bmatrix}, \qquad \hat{Y}_{corr} = \begin{bmatrix} \hat{Y} \\ \tilde{Y} \end{bmatrix}, \qquad \hat{O}_{corr} = \begin{bmatrix} \hat{O}_X \\ \hat{O}_Y \end{bmatrix}$ (6)
where $\tilde{X}$ and $\tilde{Y}$ are the critical point sets, $\hat{X}$ denotes the predicted positions in the source point cloud $X$ of the critical points of the target point cloud $Y$, and $\hat{Y}$ denotes the predicted positions in the target point cloud $Y$ of the critical points of the source point cloud $X$. The rigid transformation is then estimated by solving
$\hat{R}, \hat{t} = \underset{R, t}{\arg\min} \sum_{i}^{M+N} \hat{o}_i \left\| R \hat{x}_i + t - \hat{y}_i \right\|^2$ (7)
where $\hat{x}_i$, $\hat{y}_i$, and $\hat{o}_i$ denote the $i$-th rows of $\hat{X}_{corr}$, $\hat{Y}_{corr}$, and $\hat{O}_{corr}$, respectively. $R$ is the rotation matrix, $t$ is the translation vector, and $\hat{R}$ and $\hat{t}$ are the predicted values of $R$ and $t$ that minimize Equation (7). Following the approach in [21,34], we use a differentiable weighted Kabsch–Umeyama algorithm [35,36] to solve this equation and obtain the rotation matrix and translation vector.
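The closed-form solve behind Equation (7) can be sketched as a weighted Kabsch/Umeyama step; the NumPy version below is illustrative (the paper uses a differentiable implementation inside the training graph).

```python
# Weighted Kabsch/Umeyama sketch: recover (R, t) from weighted correspondences.
import numpy as np

def weighted_kabsch(X: np.ndarray, Y: np.ndarray, w: np.ndarray):
    """X, Y: (K, 3) corresponding points; w: (K,) non-negative overlap weights."""
    w = w / (w.sum() + 1e-12)
    mu_x = (w[:, None] * X).sum(axis=0)          # weighted centroids
    mu_y = (w[:, None] * Y).sum(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    H = (Xc * w[:, None]).T @ Yc                 # weighted cross-covariance (3, 3)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ S @ U.T                           # rotation with det(R) = +1
    t = mu_y - R @ mu_x
    return R, t
```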

4.2. The Loss Function

Three loss functions for the supervised training of an end-to-end network incorporating attention mechanisms are used in this paper: the feature loss function, the overlap loss function, and the correspondence loss function.
The feature loss. To capture the geometric structure when computing the correspondences of critical points, we apply an InfoNCE loss [37] to the features of both the current point cloud and the other point cloud. Considering the correspondence between a critical point $x \in \tilde{X}$ of the source point cloud and the critical point set $\tilde{Y}$ of the target point cloud, the InfoNCE loss for the source point cloud is
$L_f^{X} = -\mathbb{E}_{x \in \tilde{X}}\left[\log \dfrac{f(x, p_x)}{f(x, p_x) + \sum_{n_x} f(x, n_x)}\right]$ (8)
We follow the work of Oord et al. [37], where the function $f(\cdot)$ in Equation (8) is a log-linear model, expressed as
$f(x, c) = \exp\left(\bar{f}_x^{\,T} W_f \bar{f}_c\right)$ (9)
where $\bar{f}_x$ denotes the conditional feature of point $x$, and $p_x$ and $n_x$ denote the sets of critical points in the target critical point set $\tilde{Y}$ that match and do not match $x$, respectively; that is, the positive and negative sample sets. These two sets are determined by the positive and negative margins $(r_p, r_n)$, which are set to $(m, 2m)$, where $m$ is the voxel distance used in the final downsampling layer of KPConv. All points falling outside the negative margin are included in the negative set $n_x$.
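A per-anchor sketch of this InfoNCE-style feature loss (Equations (8) and (9)) is shown below; the projection matrix `W_f` and the positive/negative feature sets are assumed to be given, and in practice the loss is averaged over all critical points.

```python
# InfoNCE-style feature loss for a single anchor point (illustrative sketch).
import torch

def infonce_loss(f_x, f_pos, f_negs, W_f):
    """f_x: (D,) anchor feature; f_pos: (D,) positive; f_negs: (n, D) negatives."""
    def score(a, b):                       # log-linear critic f(x, c) = exp(a^T W_f b)
        return torch.exp(a @ W_f @ b)
    pos = score(f_x, f_pos)
    neg = torch.stack([score(f_x, n) for n in f_negs]).sum()
    return -torch.log(pos / (pos + neg))   # loss for this anchor
```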
The overlap loss. To calculate the overlap rate of the point clouds and predict the corresponding point relationships while avoiding redundant critical point extraction, we use the binary cross-entropy loss as the overlap loss. The overlap loss for the source point cloud $X$ is
$L_o^{X} = -\dfrac{1}{M} \sum_{i}^{M} \left[ o^{*}_{\tilde{x}_i} \cdot \log \hat{o}_{\tilde{x}_i} + \left(1 - o^{*}_{\tilde{x}_i}\right) \cdot \log\left(1 - \hat{o}_{\tilde{x}_i}\right) \right]$ (10)
To obtain the ground-truth overlap labels $o^{*}_{\tilde{x}_i}$, we employ the approach proposed by Huang et al. [18] to calculate the truth labels on the original dense point cloud. Thus, the truth label for a point $x_i \in X$ is defined as
$o^{*}_{\tilde{x}_i} = \begin{cases} 1, & \left\| T^{*}(x_i) - \mathrm{NN}\left(T^{*}(x_i), Y\right) \right\| < r_o \\ 0, & \text{otherwise} \end{cases}$ (11)
where $T^{*}(x_i)$ applies the ground-truth transformation $\{R^{*}, t^{*}\}$, $\mathrm{NN}(\cdot)$ denotes the spatial nearest neighbor, and $r_o$ is a predefined overlap threshold. Subsequently, average pooling with the same parameters as the pooling operation in the KPConv downsampling step is employed to obtain the ground-truth overlap labels for the downsampled critical points.
The loss $L_o^{Y}$ for the target point cloud $Y$ is obtained in a similar manner, so the total overlap loss is $L_o = L_o^{X} + L_o^{Y}$.
The correspondence loss. Matching the critical points in the overlapping region is used to calculate the overlap rate. Therefore, we apply a correspondence loss $L_c^{X}$ to the predicted positions of critical points in the overlapping region. The $L_c^{X}$ loss for the source point cloud $X$ is defined as
$L_c^{X} = \dfrac{1}{\sum_i o^{*}_{\tilde{x}_i}} \sum_{i}^{M} o^{*}_{\tilde{x}_i} \left| T^{*}(\tilde{x}_i) - \hat{y}_i \right|$ (12)
The loss $L_c^{Y}$ on the target point cloud is defined similarly to $L_c^{X}$, and the overall correspondence loss is $L_c = L_c^{X} + L_c^{Y}$. Therefore, the final loss is a weighted sum of the three components: $L = L_c + \lambda_o L_o + \lambda_f L_f$, where $\lambda_o = 1.0$ and $\lambda_f = 0.1$.
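The sketch below summarizes how the overlap, correspondence, and total losses could be assembled (Equations (10)-(12)); tensor shapes and reductions are simplified relative to the actual training code.

```python
# Illustrative assembly of the overlap, correspondence, and total losses.
import torch
import torch.nn.functional as F

def overlap_loss(pred_overlap, gt_overlap):
    # Binary cross-entropy between predicted and ground-truth overlap labels (Eq. (10)).
    return F.binary_cross_entropy(pred_overlap, gt_overlap)

def correspondence_loss(pred_pos, gt_pos, gt_overlap):
    # L1 error on predicted corresponding positions, restricted to overlapping points (Eq. (12)).
    per_point = (pred_pos - gt_pos).abs().sum(dim=-1)
    return (gt_overlap * per_point).sum() / (gt_overlap.sum() + 1e-12)

def total_loss(L_c, L_o, L_f, lambda_o=1.0, lambda_f=0.1):
    # Weighted sum with the weights stated in the text.
    return L_c + lambda_o * L_o + lambda_f * L_f
```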

5. Experiments

This section presents the dataset, metrics, baselines, experimental setup, main results, and ablation studies. The code is available at https://github.com/liupin-source/CSR-RegTR (accessed on 11 May 2024).

5.1. Dataset

We conducted extensive experiments on the representative ModelNet40 and KITTI datasets. To address issues such as insufficient data volume and information in the ModelNet40 dataset and inaccuracies in some truth labels in the KITTI dataset, we reconstructed a dataset more suitable for 3D point cloud registration tasks.
ModelNet40. ModelNet40 is a subset of the ModelNet dataset built by Princeton University and includes 40 categories of point cloud data. We sampled the initial ModelNet40 point cloud dataset [38] twice, completely at random, to generate source and target point clouds that do not have exactly corresponding points, selecting 4096 points in each sampling. Finally, we applied segmentation operations to the point cloud data to generate datasets with overlap rates of 70% and 50%, named ModelNet40 and LowModelNet40, respectively; sample point cloud data are shown in Figure 3 and Figure 4.
KITTI. The KITTI dataset was jointly created by the Karlsruhe Institute of Technology (KIT) and the Toyota Technological Institute at Chicago (TTI-C), with data collected by a vehicle equipped with a Velodyne lidar with 0.09° angular resolution. KITTI contains data for multiple tasks, such as 3D object detection and visual odometry; we only use the point cloud data for the 3D registration task. To address the problem that some ground-truth labels in the KITTI dataset [39] are inaccurate, we perform manual matching to calibrate them. Because of the large scale of the KITTI dataset, we employ voxel filtering with a grid size of 0.3 m for downsampling, which preserves the density of the 3D point cloud after subsampling. An example of the preprocessed KITTI point cloud data is shown in Figure 5.
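For reference, a minimal voxel-grid downsampling routine in the spirit of the 0.3 m filtering described above might look as follows (one centroid kept per occupied voxel); this is an illustrative sketch, not the preprocessing script used for the experiments.

```python
# Voxel-grid downsampling sketch: keep one centroid per occupied voxel.
import numpy as np

def voxel_downsample(points: np.ndarray, voxel_size: float = 0.3) -> np.ndarray:
    """points: (N, 3) -> downsampled (K, 3)."""
    keys = np.floor(points / voxel_size).astype(np.int64)        # voxel index per point
    _, inverse = np.unique(keys, axis=0, return_inverse=True)    # voxel id of each point
    counts = np.bincount(inverse).astype(float)
    out = np.zeros((inverse.max() + 1, 3))
    for dim in range(3):                                         # centroid per voxel
        out[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return out
```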

5.2. Evaluation Metrics

Relative rotation error (RRE). RRE is the angular difference in degrees between the predicted rotation matrix and the true rotation matrix, used to measure the error of the rotation matrix:
$RRE = \cos^{-1}\left(\dfrac{\mathrm{trace}\left(R^{T}\bar{R}\right) - 1}{2}\right)$ (13)
where $R$ represents the true rotation matrix and $\bar{R}$ represents the predicted rotation matrix.
Relative translation error (RTE). RTE refers to the Euclidean distance between the predicted translation vector and the true translation vector, quantifying the error in the translation vector:
$RTE = \left\| t - \bar{t} \right\|_2$ (14)
where $t$ represents the true translation vector and $\bar{t}$ represents the predicted translation vector.
Registration recall (RR). RR measures the accuracy with which a point cloud registration algorithm predicts the transformation matrix; the larger the value, the higher the accuracy of the transformation matrix. RR refers to the average ratio of correspondences correctly matched in the overlapping region to the total number of correspondences, where a correct match occurs when the source point cloud is registered with the target point cloud using the predicted transformation matrix:
$RR = \dfrac{1}{M} \sum_{i=1}^{M} \mathbb{1}\!\left[ \dfrac{1}{|e^{*}|} \sum_{(p^{*}_{x_i},\, q^{*}_{y_i}) \in e^{*}} \left\| \hat{T} p^{*}_{x_i} - q^{*}_{y_i} \right\|_2^2 < \tau_3 \right]$ (15)
where $\mathbb{1}[\cdot]$ is the indicator function, $e^{*}$ represents the point pairs with corresponding relationships in the true labels, $(p^{*}_{x_i}, q^{*}_{y_i})$ denotes a pair of true corresponding points, $\hat{T} \in SE(3)$ represents the predicted transformation matrix, and $\tau_3$ represents the error threshold between the predicted and true values.
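The RRE and RTE definitions above translate directly into code; the sketch below is a straightforward NumPy rendering of Equations (13) and (14).

```python
# RRE (degrees) and RTE (metres) between ground-truth and predicted transforms.
import numpy as np

def relative_rotation_error(R_gt: np.ndarray, R_pred: np.ndarray) -> float:
    cos_angle = np.clip((np.trace(R_gt.T @ R_pred) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))   # reported in degrees

def relative_translation_error(t_gt: np.ndarray, t_pred: np.ndarray) -> float:
    return float(np.linalg.norm(t_gt - t_pred))      # Euclidean distance
```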

5.3. Baselines

3DFeatNet [20]: A network for learning feature correspondences from 3D point clouds using weak supervision methods.
RPMNet [21]: A deep learning-based point cloud registration method that is less sensitive to initialization and more robust.
DCP [22]: A learning-based approach that includes a point cloud feature extraction network, point cloud matching prediction based on attention mechanisms, and a differentiable singular value decomposition layer.
PointNetLK [27]: A 3D point cloud registration method that combines PointNet with the LK algorithm.
REGTR [13]: An end-to-end 3D point cloud registration network that utilizes attention mechanisms.
DGR [29]: A differentiable network architecture designed for actual point cloud data.
Predator [18]: A point cloud registration method explicitly designed to handle low overlap scenarios.

5.4. Experimental Setup

For the ModelNet40 and LowModelNet40 datasets, the training sets included 6316 pairs of point cloud data; the former's test set contained 5995 pairs, and the latter's test set contained 12,311 pairs. The convolution radius of KPConv was 2.75, and the initial sampling radius was 0.0375. During training, we used the AdamW [40] optimizer with an initial learning rate of 0.0001 and a weight decay of 0.0001. Training ran for 80 epochs, and the network was evaluated on the validation set after each training iteration. For the KITTI dataset, the training and test sets contained 1358 and 555 pairs of point cloud data, respectively. The convolution radius of KPConv was 4.5, and the initial sampling radius was 0.3. The optimizer settings were consistent with those for the ModelNet40 dataset. Training consisted of 200 epochs, with a learning rate decay of 0.5 every 50 epochs; after each training iteration, the network was tested on the validation set.
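The optimizer and schedule described above correspond to a standard PyTorch configuration, sketched below with a placeholder module standing in for the full network; it shows the stated hyperparameters only, not the complete training loop.

```python
# Optimizer/scheduler configuration sketch matching the stated hyperparameters.
import torch

model = torch.nn.Linear(3, 3)  # placeholder module standing in for the full network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

# For KITTI, training runs for 200 epochs with the learning rate halved every 50 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
```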

5.5. Convergence Analysis

The core idea of GraM is to extract global and local features containing geometric structure information to improve 3D point cloud registration. To analyze whether it achieves this goal, we recorded the losses during training, producing feature-loss, overlap-loss, and correspondence-loss curves on the ModelNet40 and KITTI datasets, as shown in Figure 6. On the one hand, all loss curves converge quickly (within about 10 epochs on ModelNet40 and 40 epochs on KITTI), which shows that GraM can quickly locate and learn the features related to geometric structure information. On the other hand, all loss curves are relatively stable, without oscillation, indicating that the designed loss functions accurately constrain each part of the feature learning.

5.6. Comparison with State-of-the-Art Methods

To verify the superiority of GraM, we compare it with the feature learning-based methods RPMNet and DCP, as well as end-to-end methods such as PointNetLK and the basic REGTR. Table 1 shows the experimental results on the ModelNet40, LowModelNet40, and KITTI datasets in terms of the RRE and RTE metrics. The results in Table 1 indicate that GraM holds a modest advantage over the basic REGTR but a significant advantage over DCP and PointNetLK. The main reason is that our processing of the ModelNet40 and LowModelNet40 datasets involves only quantitative changes: notably, the number of points increases, allowing the network to capture as much information as possible. In conclusion, our algorithm shows a clear performance improvement, as indicated by the registration results in Figure 7, Figure 8 and Figure 9.

5.7. Analysis Sensitivity of Sampling Radius

Downsampling is a critical procedure in 3D point cloud data processing with KPConv, and the sampling radius is an essential parameter of this process. An appropriate sampling radius can effectively reduce the scale of the point cloud data and facilitate feature learning in the subsequent network structures. To analyze the sensitivity to and effectiveness of the sampling radius, we conducted experiments using GraM with different sampling radii. Table 2 displays the results for the three metrics and the time consumption of model training on the KITTI dataset. From the results in Table 2, we observe that a large sampling radius does not achieve high accuracy for our geometry-embedded 3D point cloud registration algorithm, because an overly large sampling radius may overlook critical point information and prevent the cross-encoding network from learning sufficient features. In addition, the model training time decreases as the sampling radius increases; the core reason is that a small sampling radius makes the data volume large, which increases the number of network parameters and eventually increases the model training time.

5.8. Ablation Studies

The following subsections discuss the ablation studies, including the effectiveness of each component of GraM and performances with different loss functions.

5.8.1. Effectiveness of GraM’s Each Component

GraM takes the REGTR architecture as a carrier and introduces a shared-weight feature extraction module (KPConv) and a geometric structure embedding (GSE) module. To analyze the effectiveness of these two modules, we carried out an ablation study and recorded the results on the ModelNet40 and KITTI datasets in Table 3. Overall, our final GraM (REGTR+KPConv+GSE) achieved optimal performance on all metrics. The table also shows that both introduced modules effectively improve 3D point cloud registration performance. Comparing the GSE-based improvement (i.e., ‡ relative to †) with the KPConv-based improvement (i.e., † relative to *), the former achieves the larger gain on three of the four metrics. This illustrates that geometric structure embedding contributes more to the performance improvement than KPConv, further verifying that geometric structure embedding effectively extracts geometric structure information that is valuable for 3D point cloud registration.

5.8.2. Effectiveness of GraM with Different Loss Functions

We conducted experiments using GraM with different combinations of loss functions and recorded the results in Table 4. We can observe from Table 4 that using any one or any two of the three loss functions cannot achieve satisfactory accuracy; the algorithm achieves the best performance only when all three loss functions are used simultaneously for network training. Comparing GraM with ($L_c$ + $L_f$) to GraM with ($L_c$ + $L_o$), the former achieves better results on all four error metrics, which shows that $L_f$ contributes more to the performance improvement than $L_o$ (i.e., $L_f$ > $L_o$). Similarly, comparing GraM with ($L_c$ + $L_f$) to GraM with ($L_o$ + $L_f$) gives $L_c$ > $L_o$, and comparing GraM with ($L_o$ + $L_f$) to GraM with ($L_c$ + $L_o$) gives $L_f$ > $L_c$. Therefore, the contributions of the three loss functions to point cloud registration performance can be ordered as $L_f$ > $L_c$ > $L_o$.

6. Conclusions

This paper proposes a 3D point cloud registration method, GraM, which embeds geometric structure into the attention mechanism to form an end-to-end registration framework. The framework effectively extracts local and global features containing geometric structure information. With these features, simple regression is sufficient to obtain the corresponding position coordinates and the transformation matrix, thereby improving the registration accuracy of the point cloud. Extensive experiments show that the proposed method outperforms existing state-of-the-art methods. In the future, we will explore techniques to achieve better registration accuracy for large-scale point cloud datasets with low overlap rates using lightweight models.

Author Contributions

Methodology, P.L.; Software, X.Z.; Formal analysis, L.Z.; Writing—review and editing, R.W. and J.Z. (Juan Zhang); Visualization, J.Z. (Jianyong Zhu); Supervision, R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the Fundamental Research Funds for the Central Universities (No. 2-9-2022-062).

Data Availability Statement

Data not available due to commercial restrictions. Due to the nature of this research, participants of this study did not agree for their data to be shared publicly, so supporting data is not available.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Azuma, R.T. A survey of augmented reality. Presence Teleoperators Virtual Environ. 1997, 6, 355–385.
  2. Carmigniani, J.; Furht, B.; Anisetti, M.; Ceravolo, P.; Damiani, E.; Ivkovic, M. Augmented reality technologies, systems and applications. Multimed. Tools Appl. 2011, 51, 341–377.
  3. Billinghurst, M.; Clark, A.; Lee, G. A survey of augmented reality. Found. Trends Hum.-Comput. Interact. 2015, 8, 73–272.
  4. Liu, D.; Long, C.; Zhang, H.; Yu, H.; Dong, X.; Xiao, C. ARShadowGAN: Shadow generative adversarial network for augmented reality in single light scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8139–8148.
  5. Popișter, F.; Popescu, D.; Păcurar, A.; Păcurar, R. Mathematical Approach in Complex Surfaces Toolpaths. Mathematics 2021, 9, 1360.
  6. Luo, K.; Yang, G.; Xian, W.; Haraldsson, H.; Hariharan, B.; Belongie, S. Stay Positive: Non-Negative Image Synthesis for Augmented Reality. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10050–10060.
  7. Joseph, K.; Khan, S.; Khan, F.S.; Balasubramanian, V.N. Towards open world object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 5830–5840.
  8. Merickel, M. 3D reconstruction: The registration problem. Comput. Vis. Graph. Image Process. 1988, 42, 206–219.
  9. Izadi, S.; Kim, D.; Hilliges, O.; Molyneaux, D.; Newcombe, R.; Kohli, P.; Shotton, J.; Hodges, S.; Freeman, D.; Davison, A.; et al. KinectFusion: Real-time 3D reconstruction and interaction using a moving depth camera. In Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology, Santa Barbara, CA, USA, 16–19 October 2011; pp. 559–568.
  10. Pan, X.; Xia, Z.; Song, S.; Li, L.E.; Huang, G. 3D object detection with Pointformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7463–7472.
  11. Shi, X.; Ye, Q.; Chen, X.; Chen, C.; Chen, Z.; Kim, T.K. Geometry-based distance decomposition for monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 15172–15181.
  12. Zou, Z.; Ye, X.; Du, L.; Cheng, X.; Tan, X.; Zhang, L.; Feng, J.; Xue, X.; Ding, E. The devil is in the task: Exploiting reciprocal appearance-localization features for monocular 3D object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2713–2722.
  13. Yew, Z.J.; Lee, G.H. REGTR: End-to-end point cloud correspondences with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 6677–6686.
  14. Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. Proc. SPIE 1992, 1611, 586–606.
  15. Billings, S.D.; Boctor, E.M.; Taylor, R.H. Iterative most-likely point registration (IMLP): A robust algorithm for computing optimal shape alignment. PLoS ONE 2015, 10, e0117688.
  16. Segal, A.; Haehnel, D.; Thrun, S. Generalized-ICP. In Proceedings of the Robotics: Science and Systems, Seattle, WA, USA, 28 June–1 July 2009; Volume 2, p. 435.
  17. Zhu, H.; Guo, B.; Zou, K.; Li, Y.; Yuen, K.V.; Mihaylova, L.; Leung, H. A review of point set registration: From pairwise registration to groupwise registration. Sensors 2019, 19, 1191.
  18. Huang, S.; Gojcic, Z.; Usvyatsov, M.; Wieser, A.; Schindler, K. Predator: Registration of 3D point clouds with low overlap. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4267–4276.
  19. Zeng, A.; Song, S.; Nießner, M.; Fisher, M.; Xiao, J.; Funkhouser, T. 3DMatch: Learning local geometric descriptors from RGB-D reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1802–1811.
  20. Yew, Z.J.; Lee, G.H. 3DFeat-Net: Weakly supervised local 3D features for point cloud registration. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 607–623.
  21. Yew, Z.J.; Lee, G.H. RPM-Net: Robust point matching using learned features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11824–11833.
  22. Wang, Y.; Solomon, J.M. Deep closest point: Learning representations for point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3523–3532.
  23. Wang, H.; Liu, Y.; Dong, Z.; Wang, W. You only hypothesize once: Point cloud registration with rotation-equivariant descriptors. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 1630–1641.
  24. Zhang, Y.; Zhang, W.; Li, J. Partial-to-partial point cloud registration by rotation invariant features and spatial geometric consistency. Remote Sens. 2023, 15, 3054.
  25. Liu, Q.; Zhu, H.; Zhou, Y.; Li, H.; Chang, S.; Guo, M. Density-invariant features for distant point cloud registration. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 18215–18225.
  26. Deng, H.; Birdal, T.; Ilic, S. 3D local features for direct pairwise registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 3244–3253.
  27. Aoki, Y.; Goforth, H.; Srivatsan, R.A.; Lucey, S. PointNetLK: Robust & efficient point cloud registration using PointNet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 7163–7172.
  28. Baker, S.; Matthews, I. Lucas–Kanade 20 years on: A unifying framework. Int. J. Comput. Vis. 2004, 56, 221–255.
  29. Choy, C.; Dong, W.; Koltun, V. Deep global registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2514–2523.
  30. Choy, C.; Park, J.; Koltun, V. Fully convolutional geometric features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8958–8966.
  31. Yu, H.; Hou, J.; Qin, Z.; Saleh, M.; Shugurov, I.; Wang, K.; Busam, B.; Ilic, S. RIGA: Rotation-invariant and globally-aware descriptors for point cloud registration. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 3796–3812.
  32. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. KPConv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420.
  33. Qin, Z.; Yu, H.; Wang, C.; Guo, Y.; Peng, Y.; Ilic, S.; Hu, D.; Xu, K. GeoTransformer: Fast and robust point cloud registration with geometric transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 9806–9821.
  34. Gojcic, Z.; Zhou, C.; Wegner, J.D.; Guibas, L.J.; Birdal, T. Learning multiview 3D point cloud registration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1759–1769.
  35. Kabsch, W. A solution for the best rotation to relate two sets of vectors. Acta Cryst. 1976, 32, 922–923.
  36. Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 376–380.
  37. van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748.
  38. Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1912–1920.
  39. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
  40. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
Figure 1. The overall architecture of the proposed GraM. ×L represents multiple cross-encoder network layers.
Figure 2. The network architecture of the cross-encoder based on geometric structure embedding.
Figure 3. Example from the ModelNet40 dataset.
Figure 4. Example from the LowModelNet40 dataset.
Figure 5. Example from the KITTI dataset.
Figure 6. The convergence of our GraM on the ModelNet40 and KITTI datasets; feature_loss, corre_loss, overlap_loss, and total_loss represent the feature loss, correspondence loss, overlap loss, and their total loss, respectively.
Figure 7. Example of a registration result using GraM on the ModelNet40 dataset.
Figure 8. Example of a registration result using GraM on the LowModelNet40 dataset.
Figure 9. Example of a registration result using GraM on the KITTI dataset.
Table 1. The comparison of GraM to the state-of-the-art approaches, where the best results are in bold.

| Method | ModelNet40 RRE (°) | ModelNet40 RTE (m) | LowModelNet40 RRE (°) | LowModelNet40 RTE (m) | KITTI RRE (°) | KITTI RTE (m) |
| --- | --- | --- | --- | --- | --- | --- |
| RPMNet | 1.712 | 0.018 | 7.342 | 0.124 | 1.021 | 0.633 |
| DCP | 11.975 | 0.171 | 16.501 | 0.300 | 0.965 | 0.583 |
| PointNetLK | 29.725 | 0.297 | 48.567 | 0.507 | 2.352 | 0.936 |
| REGTR | 1.473 | 0.014 | 3.930 | 0.087 | 0.482 | 0.425 |
| 3DFeatNet | 2.057 | 0.039 | 4.026 | 0.073 | **0.254** | 0.259 |
| Predator | 1.948 | 0.026 | 3.568 | 0.072 | 0.277 | **0.068** |
| DGR | 2.004 | 0.024 | 3.627 | 0.069 | 0.373 | 0.320 |
| GraM | **0.925** | **0.010** | **2.653** | **0.049** | 0.270 | 0.110 |
Table 2. Experimental results using GraM with different sampling radii on the KITTI dataset. TC represents the time consumption (hours) of model training. The best results are in bold.

| Method | RRE (°) | RTE (m) | RR (%) | TC (h) |
| --- | --- | --- | --- | --- |
| radius-0.4 | 0.352 | 0.214 | 97.3 | 4.27 |
| radius-0.5 | 0.413 | 0.325 | 96.1 | 6.24 |
| radius-0.3 | **0.270** | **0.110** | **99.8** | **3.51** |
Table 3. Registration results using different components on the ModelNet40 and KITTI datasets. GSE indicates the geometric structure embedding network used to extract geometric structure information. The best results are in bold. *, †, and ‡ refer to the baseline, our GraM, and our final GraM, respectively. ↓ means the RRE and RTE are reduced.

| Method | ModelNet40 RRE (°) | ModelNet40 RTE (m) | KITTI RRE (°) | KITTI RTE (m) |
| --- | --- | --- | --- | --- |
| Baseline (REGTR) * | 1.473 | 0.014 | 0.482 | 0.425 |
| Our GraM (REGTR+KPConv) † | 1.248 | 0.013 | 0.324 | 0.301 |
| Our Final GraM (REGTR+KPConv+GSE) ‡ | **0.925** | **0.010** | **0.270** | **0.110** |
| † relative to * | 0.225↓ | 0.001↓ | 0.158↓ | 0.124↓ |
| ‡ relative to † | 0.323↓ | 0.003↓ | 0.054↓ | 0.191↓ |
Table 4. Registration results of GraM using different loss functions on the ModelNet40 and KITTI datasets. The best results are in bold.

| Method | ModelNet40 RRE (°) | ModelNet40 RTE (m) | KITTI RRE (°) | KITTI RTE (m) | KITTI RR (%) |
| --- | --- | --- | --- | --- | --- |
| Baseline ($L_c$ loss in Equation (12)) | 2.442 | 0.020 | 0.302 | 0.174 | 98.9 |
| Our GraM ($L_o$ + $L_f$) | 2.206 | 0.016 | 0.345 | 0.142 | 98.6 |
| Our GraM ($L_c$ + $L_f$) | 2.125 | 0.015 | 0.342 | 0.139 | 99.1 |
| Our GraM ($L_c$ + $L_o$) | 2.241 | 0.017 | 0.351 | 0.153 | 98.0 |
| Our Final GraM ($L_c$ + $L_o$ + $L_f$) | **1.623** | **0.013** | **0.270** | **0.110** | **99.8** |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
