GACM: A Graph Attention Capsule Model for the Registration of TLS Point Clouds in the Urban Scene

Abstract: Point cloud registration is the foundation and a key step for many vital applications, such as the digital city, autonomous driving, passive positioning, and navigation. Differences between spatial objects and the structural complexity of object surfaces are the main challenges for the registration problem. In this paper, we propose a graph attention capsule model (named GACM) for the efficient registration of terrestrial laser scanning (TLS) point clouds in the urban scene, which fuses graph attention convolution and a three-dimensional (3D) capsule network to extract local point cloud features and obtain 3D feature descriptors. These descriptors can take into account the differences of spatial structure and point density in objects and make the spatial features of ground objects more prominent. During the training process, we used both matched points and non-matched points to train the model. In the test process of the registration, the points in the neighborhood of each keypoint were sent to the trained network to obtain feature descriptors, and the rotation and translation matrices were calculated after constructing a K-dimensional (KD) tree and applying the random sample consensus (RANSAC) algorithm. Experiments show that the proposed method achieves more efficient registration results and higher robustness than other frontier registration methods in the pairwise registration of point clouds.


Introduction
Three-dimensional laser scanning is a technology that employs lasers to efficiently acquire spatial 3D data [1]. The data collected by laser scanning feature high precision, rich information, and true three-dimensionality, so the technology is widely used in urban planning [2], autonomous driving [3], high-precision mapping [4], and smart cities [5]. The efficient registration of point clouds scanned from the urban scene is a basic and vital requirement in point cloud processing, which is commonly employed in point cloud-based simultaneous localization and mapping (SLAM) [6], positioning and navigation [7,8], 3D city reconstruction [9], and digital twins [10].
In the process of capturing point clouds in the urban scene, each scan has its own view angle and direction, and the scanned scenes only partially (or barely) overlap. Besides, the urban scene is usually complex, and some objects have a high degree of similarity in their local structure; for example, the spatial structures of different floors of the same building can be almost identical. Moreover, making learned descriptors robust to rotations by augmenting the training data with rotated samples would enlarge the training set and increase the training time. From this point of view, the input of 3D point capsule networks [33] is four-dimensional point pair features [30,34], instead of direct 3D point coordinates.
Our research goal in this paper is to construct an efficient registration model for terrestrial laser scanning (TLS) point clouds in the urban scene, using only spatial coordinate information. We combine graph attention convolution with a 3D capsule network, for the first time, to form a new neural network model (namely GACM) that extracts discriminative TLS point cloud features in the urban scene. The proposed GACM integrates the anti-rotation ability of graph attention convolution and the part-whole relationships represented by capsule vectors. The final feature descriptor can effectively determine corresponding points, thus improving the registration of TLS point clouds in the urban scene. Compared with several frontier registration methods of recent years, our method achieves a higher registration success rate on TLS point clouds of the urban scene. The main contributions of our research are as follows: (1) A new neural network model (namely GACM) is proposed to learn 3D feature descriptors in the urban scene, which can fully represent the features of 3D urban objects by fusing the advantages of the graph attention convolution network and the 3D capsule network. (2) We combine the GACM into a new, efficient registration framework for TLS point clouds in the urban scene and successfully apply the learned 3D urban feature descriptors to the high-quality registration of TLS point clouds.

Handcrafted Three-Dimensional Feature Descriptors of Point Clouds
Researchers have designed many feature descriptors based on existing human experience and knowledge. However, these descriptors are relatively low-level features: they are basically designed from geometric or statistical rules, so their expressive abilities are not strong. For instance, Johnson and Hebert [39] designed the spin image feature, which projects points in a cylindrical coordinate system onto a two-dimensional rotated image and determines the intensity of each grid cell (namely pixel) of the image according to the number of points falling into the cell; the intensity values of the image then form a feature vector that characterizes the feature point and its neighborhood. Based on the 2D shape context [40], Frome et al. [41] designed the 3D shape context (3DSC) feature. The 3DSC feature relies on the number of points in each sphere grid built around the point. Since the north pole direction of the sphere is estimated from the surface normal of the point, a degree of freedom remains along the azimuth angle, so multiple versions of the descriptor must be computed and stored for each feature point to express the feature in the reference azimuth direction, which greatly increases the storage and computation burden. To solve this problem, Tombari et al. [42] proposed the unique shape context (USC) feature, which establishes a local coordinate system for each point to avoid calculating multiple versions per point as in 3DSC, thereby greatly improving the efficiency. Rusu et al. [18] designed the PFH feature, which extracts an optimal set to describe the features of point clouds by estimating a group of robust 16D geometric shape features and analyzing the persistence of the features at different scales. Based on the above work, Rusu et al. [21] modified the mathematical expression of PFH to design the FPFH feature, which greatly reduces the computation by caching previously calculated values. Both PFH and FPFH make use of paired combinations of surface normals to describe the curvature around a point. In addition to the above-mentioned handcrafted feature descriptors, there exist other approaches, such as the rotational projection statistics (RoPS) [43], the signature of histograms of orientations (SHOT) [44], and the normal aligned radial feature (NARF) [45]. These methods work well in scenarios with medium/high, evenly-distributed point density and little noise; as the point density decreases and the noise increases, their performance degrades dramatically. In the real world, the point density on urban buildings, roads, and other objects is mostly uneven, which makes the feature histograms of the same object vary widely. In short, the features of hand-designed descriptors are generally low-level geometric features or relatively single-type features, which often fail to effectively identify local areas when dealing with a complex, realistic urban scene.

Learnable 3D Feature Descriptors of Point Clouds
Because of the irregular nature of point clouds, most point cloud processing methods either voxelize the points or process them directly [27]. For 3D data represented by RGB-D, the representative learnable 3D descriptor extraction networks include 3DMatch [46], PPFNet [30], PPF-FoldNet [31], etc. In the original papers, these methods were tested on indoor data sets, and some of the inputs [30,31] contained features such as PPF [30,34] instead of point coordinates. Recently, scholars have proposed several new network models tested on more types of data. For example, Zhang et al. [47] designed a Siamese network to learn deep feature descriptors for sparse point clouds based on a voxel representation, tested on indoor and urban subway tunnel data sets; 3DFeat-Net [29] designed a three-branch Siamese architecture to detect interest points and learn descriptors in a weakly supervised manner, with a descriptor extraction module implemented based on PointNet and tested on both indoor and outdoor datasets. Li and Lee [28] proposed an unsupervised learning method (unsupervised stable interest point, USIP), which uses the structure of PointNet in both interest point selection and feature extraction. DeepVCP [48] is a registration method applied to autonomous driving, which directly implements an end-to-end, high-precision registration network on the point cloud; to determine the corresponding points, the PointNet structure is used in the feature embedding layer to obtain a detailed local feature description. Beyond these learnable descriptors obtained with a single network type or convolution method, fully convolutional geometric features (FCGF) [49] adopt the Minkowski convolution to extract the feature of each point. Besides, the perfect match [50] encodes the unstructured 3D point cloud into a smoothed density value (SDV) grid that is convenient for convolution and designs a Siamese CNN to learn the descriptors. D3Feat [51] uses the kernel point convolution (KPConv) [52] to process irregular point clouds and learn feature descriptors. Studies on learnable descriptors with fused features are less abundant; here, we list two representative works. The first is the compact geometric features (CGF) [53], in which hand-crafted descriptors are input into a fully-connected network with five hidden layers to obtain learning-based descriptors. Additionally, Yang et al. [54] designed a descriptor of fused features, obtained by multiple combinations of various hand-designed descriptors with low-level features.

Method
The method combines graph attention convolution and 3D capsules into a new neural network model (GACM). The learned, robust, local features of GACM are used to conduct efficient registration of TLS point clouds in the urban scene. The framework of the proposed method is shown in Figure 1. In the training process, we used the farthest point sampling (FPS) algorithm to sample z points in each scan of the point cloud as keypoints, and then adopted the k-nearest neighbors (k-NN) algorithm to search the k nearest neighbors of each keypoint in the corresponding point cloud, so that each keypoint corresponds to a point set that is a local point patch, namely P_1, . . . , P_z. Next, each point set was subsampled by FPS, one by one (4 times in total), and combined with graph attention convolution to obtain the feature vector of each point in the point set after subsampling. Eventually, a 3D capsule module was added to further extract the features of the point sets, based on these features, to obtain the final deep feature of each point set (dimension size: 32 × 16) through the dynamic routing mechanism. During training, we introduced the triplet loss [55] to train the designed network model.

Figure 1. The framework of our research consists of training and registration (test). In the training section, the training data set, including matching and non-matching point sets, is constructed. N is the number of points in a point set; r is the number of operations. FC stands for the fully-connected operation. The trained GACM is used in the registration process. In the test process, two scans of point clouds P and S are used as the inputs. The points obtained by farthest point sampling (FPS) are used as the keypoints, and the corresponding point sets are described by the features extracted by the trained GACM. Finally, the corresponding points are determined based on the Euclidean distance of the features, and the rigid transformation matrices (R and T) are calculated after eliminating the wrong corresponding points.

The registration process is as follows:
The first step is to generate z keypoints from the target point cloud (reference point cloud) and the source point cloud (point cloud to be registered) with FPS. The k-NN algorithm is then used to obtain the point set corresponding to each keypoint. The point sets are then sent to the trained network to extract the deep feature of each keypoint (the dimension of each extracted deep feature is 32 × 16, corresponding to the point set).

The second step is the construction of a KD-tree using the obtained deep features. The correspondence between keypoints in different point clouds is determined according to the Euclidean distance between the deep features of different point sets. The keypoints corresponding to the two point sets with the smallest Euclidean distance between their features are regarded as an initial matching keypoint pair.

The third step is to use RANSAC [46] to eliminate the incorrectly matched keypoint pairs from the initial matching keypoint pairs obtained in the previous step. The keypoint pairs with a distance greater than an empirical threshold (0.5 m in the test of Dataset I and 0.2 m in the test of Dataset II) are eliminated as outliers, and the correctly matched keypoint pairs are retained.

The fourth step is to calculate the translation matrix (T) and the rotation matrix (R) according to the retained keypoint pairs, after eliminating the low-quality matching points, and to register the source and target point clouds.
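As an illustration of steps two to four, the following is a minimal sketch (ours, not the authors' released code) that matches descriptors with a KD-tree, rejects outliers with RANSAC, and estimates R and T by singular value decomposition. The names (`feat_src`, `kp_src`, etc.) are hypothetical; descriptors are assumed to be flattened to vectors, and the inlier threshold follows the Dataset I setting.

```python
import numpy as np
from scipy.spatial import cKDTree

def estimate_rigid_transform(src, tgt):
    """Least-squares rigid transform (R, T) with R @ src + T ~= tgt (Kabsch/SVD)."""
    src_c, tgt_c = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - src_c).T @ (tgt - tgt_c)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                   # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, tgt_c - R @ src_c

def match_and_register(feat_src, feat_tgt, kp_src, kp_tgt,
                       n_iter=1000, inlier_thresh=0.5):
    """Steps 2-4: KD-tree feature matching, RANSAC, SVD transform estimation."""
    tree = cKDTree(feat_tgt)                   # KD-tree over target descriptors
    _, idx = tree.query(feat_src, k=1)         # nearest feature -> initial pairs
    src, tgt = kp_src, kp_tgt[idx]

    rng = np.random.default_rng(0)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(n_iter):                    # RANSAC over minimal 3-point samples
        pick = rng.choice(len(src), size=3, replace=False)
        R, T = estimate_rigid_transform(src[pick], tgt[pick])
        resid = np.linalg.norm(src @ R.T + T - tgt, axis=1)
        inliers = resid < inlier_thresh        # e.g., 0.5 m threshold on Dataset I
        if inliers.sum() > best.sum():
            best = inliers
    return estimate_rigid_transform(src[best], tgt[best])
```

Three correspondences are the minimum that determine a rigid transform, which keeps each RANSAC iteration cheap; the final transform is re-estimated from all inliers of the best hypothesis.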

Network Structure
The network structure for the feature extraction of a point set is depicted in Figure 1 (the green background in the training process). The point set input into the network has dimension k × d (k is the number of points, and d = 3 represents the 3D spatial coordinates of each point). The structure of the proposed GACM includes a graph attention convolution (GAC) module and a 3D capsule module (3DCaps).

Graph Attention Convolution Module
The graph attention convolution module is depicted in the yellow dashed box of the training process of Figure 1. Let the input data before the first GAC operation be a point set $P = [p_1, p_2, \ldots, p_k] \in \mathbb{R}^{3 \times k}$, which is a local point patch. Each point in the point set only contains 3D spatial coordinates, without any other information. Firstly, we take each point in the point set P as a central node, and a graph G(V, E) is constructed by randomly sampling (n − 1) neighboring points within the sphere of radius R around each central node, where V is the node set and E is the set of edges formed by the central node and each neighboring node. If the number of points in the sphere of radius R is less than n, the insufficient part is filled by randomly sampling and replicating points within the sphere. Finally, a centering operation is performed on the spatial coordinates of the n points in each graph G (that is, the coordinates of the neighboring points minus the corresponding coordinates of the center node), and the centered coordinates are used as the initial features of the n points in the corresponding graph. The GAC operation is then performed; the details can be found in the next paragraphs. After the r-th central node selection by FPS and the r-th GAC operation (as shown in the training process of Figure 1, r = 1, 2, 3, 4, so GAC operations are performed four times in total), the point set $P_r = \{p_{r,1}, p_{r,2}, \ldots, p_{r,k_r}\}$ is obtained (where $k_r$ is the number of points retained after the r-th subsampling), with feature dimension $D_r = 64 \times 2^{r-1}$. The input of the (r + 1)-th central node selection (sampling by FPS) and GAC operation is a matrix that contains the coordinates of each current point and its corresponding features. In the training process of Figure 1, the arrow of central node transferring indicates that the information obtained at the previous level is transferred to the next level. When r is equal to 2, 3, or 4 (i.e., except for r = 1), the GAC operation uses the features passed from the previous layer as its input features.
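Before the attention computation detailed next, each local patch is organized into such graphs. A minimal sketch of this construction, under our reading of the text (the function name and the NumPy-based formulation are our own), is:

```python
import numpy as np

def build_graphs(points, radius, n=32, seed=0):
    """For each central node, sample n nodes within `radius` (replicating
    points when fewer than n fall in the sphere) and center the coordinates
    on the node; the centered coordinates are the initial node features."""
    rng = np.random.default_rng(seed)
    k = len(points)
    graphs = np.empty((k, n, 3))
    for i, center in enumerate(points):              # every point is a central node
        d = np.linalg.norm(points - center, axis=1)
        cand = np.flatnonzero(d < radius)            # points inside the sphere (incl. i)
        idx = rng.choice(cand, n, replace=len(cand) < n)
        graphs[i] = points[idx] - center             # centering operation
    return graphs
```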
The specific process of GAC is as follows. Assuming that the current GAC operation is performed for the r-th time, the GAC operation completes the mapping of the central node feature from $\mathbb{R}^{D_{r-1}}$ to $\mathbb{R}^{D_r}$. We define S(i) as the set of sequence numbers of the points in the neighboring point set formed by the central node i and its neighboring nodes. The schematic diagram of the graph attention convolution of the central node i and its neighbors is shown in Figure 2; in the experiments, the number of nearest neighboring points is set as n = 32.

Figure 2. An illustration of the graph attention convolution process for the central node i and its neighborhood. We randomly select n points (n = 4 in the figure, including the node itself) within the radius R of the central node as the target nodes to calculate the feature of node i. The attention mechanism is used to express the degree of association between the central node and the neighboring nodes in the local feature space.
In order to complete the mapping process, a shared attention matrix $\alpha \in \mathbb{R}^{(3+D_r) \times D_r}$ is established. Through this learnable parameter matrix, the convolution operation can reflect differences in significance among neighbors. Particularly, the attention weight $\hat{a}_{ij}$ between the i-th central node and its neighboring points takes into account both the position and the feature differences of spatial points. The formula is shown as:

$$\hat{a}_{ij} = \alpha^{\mathrm{T}}\left(\Delta p_{ij} \oplus \Delta g_{ij}\right), \quad (1)$$

where $\oplus$ represents the concatenation operation, $\Delta p_{ij} = p_j - p_i$ is the spatial position difference between the central node i and its neighboring node j, $\Delta g_{ij} = g(f_j) - g(f_i)$ is the corresponding feature difference, and $g(\cdot)$ is a multi-layer perceptron with three layers that maps the feature dimension of each node in the graph structure from $\mathbb{R}^{D_{r-1}}$ to $\mathbb{R}^{D_r}$. Therefore, $\hat{a}_{ij} = [\hat{a}_{ij,1}, \hat{a}_{ij,2}, \ldots, \hat{a}_{ij,D_r}] \in \mathbb{R}^{D_r}$, where $\hat{a}_{ij,d}$ is the attention weight of the j-th neighboring node to the center node i on the d-th channel.
We then normalize the attention weights on each channel across the neighboring nodes, as shown in Equation (2):

$$a_{ij,d} = \mathrm{softmax}\left(\hat{a}_{ij,d}\right) = \frac{\exp\left(\hat{a}_{ij,d}\right)}{\sum_{l \in S(i)} \exp\left(\hat{a}_{il,d}\right)}, \quad (2)$$

The softmax(·) function is applied to the n-dimensional input tensor and scales it, so that each output element is in the range [0, 1] and all elements sum up to 1.
Finally, the central node i completes the feature update, as shown in Equation (3):

$$f_i^{\,r} = \sum_{j \in S(i)} a_{ij} \circ g(f_j) + b_i, \quad (3)$$

where the symbol "$\circ$" represents the Hadamard product, i.e., the corresponding elements of the two vectors are multiplied, and $b_i \in \mathbb{R}^{D_r}$ is a learnable offset.
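Putting Equations (1)-(3) together, one GAC operation can be sketched in PyTorch as follows. The module name, the hidden widths of the three-layer MLP g(·), and the tensor layout are our assumptions; only the structure follows the text.

```python
import torch
import torch.nn as nn

class GraphAttentionConv(nn.Module):
    """Sketch of one GAC operation (Equations (1)-(3))."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.g = nn.Sequential(                  # 3-layer MLP g: R^{D_{r-1}} -> R^{D_r}
            nn.Linear(d_in, d_out), nn.ReLU(),
            nn.Linear(d_out, d_out), nn.ReLU(),
            nn.Linear(d_out, d_out))
        self.alpha = nn.Linear(3 + d_out, d_out, bias=False)  # shared attention matrix
        self.bias = nn.Parameter(torch.zeros(d_out))          # learnable offset b_i

    def forward(self, p_center, p_neigh, f_center, f_neigh):
        # p_center: (B, k, 3); p_neigh: (B, k, n, 3); f_*: same layout with d_in channels
        dp = p_neigh - p_center.unsqueeze(2)              # spatial difference Δp_ij
        g_neigh = self.g(f_neigh)                         # mapped neighbor features g(f_j)
        dg = g_neigh - self.g(f_center).unsqueeze(2)      # feature difference Δg_ij
        a_hat = self.alpha(torch.cat([dp, dg], dim=-1))   # Eq. (1): attention logits
        a = torch.softmax(a_hat, dim=2)                   # Eq. (2): softmax over n neighbors
        return (a * g_neigh).sum(dim=2) + self.bias       # Eq. (3): Hadamard, sum, offset
```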

Three-Dimensional Capsule Module

Capsule Network
Sabour et al. [38] and Zhao et al. [33] proved that the capsule network has a great feature extraction capability. Following this inspiration, we introduced a 3D capsule module to further extract features of point clouds in the urban scene. We set up the dynamic routing [38] mechanism in the process of extracting primary point capsules to obtain the final high-level features (as shown in the purple dashed box of Figure 1). We set $u_i$ as the output vector of capsule i in the primary point capsules and $w_{ij}$ as the affine transformation matrix from capsule i to its higher-level capsule j, so that the prediction vector is expressed as $\hat{u}_{j|i} = w_{ij} u_i$. The dynamic routing procedure (namely Algorithm 1, following [38]) is as follows:

Algorithm 1. Dynamic routing.
1: Input: the prediction vectors $\hat{u}_{j|i}$ and the number of routing iterations.
2: For every capsule i in the primary point capsule layer and every capsule j in the output feature layer: $b_{j|i} \leftarrow 0$.
3: For the given number of routing iterations, repeat steps 4-7:
4: For every capsule i in the primary point capsule layer: $c_{j|i} = \mathrm{softmax}\left(b_{j|i}\right)$.
5: For every capsule j in the output feature layer: $s_j = \sum_i c_{j|i}\,\hat{u}_{j|i}$.
6: For every capsule j in the output feature layer: $v_j = \mathrm{squash}\left(s_j\right) = \mathrm{squash}\left(\sum_i c_{j|i}\,\hat{u}_{j|i}\right)$.
7: For every capsule i in the primary point capsule layer and the capsule j in the output feature layer: $b_{j|i} = b_{j|i} + \hat{u}_{j|i} \cdot v_j$.
8: Return $v_j$.

In the above algorithm, the squash(·) function is defined as:

$$\mathrm{squash}\left(s_j\right) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}, \quad (4)$$

The nonlinear function squash(·) compresses the vector, so that the length of the output vector can represent the probability of the entity.
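A compact PyTorch sketch of Algorithm 1 and the squash(·) function of Equation (4) is given below, assuming the prediction vectors $\hat{u}_{j|i}$ are stacked into a tensor of shape (batch, primary capsules, output capsules, capsule dimension):

```python
import torch

def squash(s, dim=-1, eps=1e-8):
    """Equation (4): shrink the vector so its length lies in [0, 1)."""
    n2 = (s ** 2).sum(dim=dim, keepdim=True)
    return (n2 / (1.0 + n2)) * s / torch.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iter=3):
    """Algorithm 1 on u_hat of shape (B, n_in, n_out, d_out)."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # logits b_{j|i}, step 2
    for _ in range(n_iter):                                # step 3
        c = torch.softmax(b, dim=2)                        # coupling c_{j|i}, step 4
        v = squash((c.unsqueeze(-1) * u_hat).sum(dim=1))   # v_j, steps 5-6
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)       # agreement update, step 7
    return v                                               # (B, n_out, d_out), step 8
```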

The Specific Operation in 3D Capsule Module
The 3D capsule module is shown in the purple dotted box of Figure 1. In this module, we used the output $F_{GAC} \in \mathbb{R}^{512 \times \frac{k}{8}}$ of the graph attention convolution module as the input of the 3D capsule module. Specifically, the feature dimension of the k/8 points processed by the graph attention convolution was converted from 512 to 1024 through a fully-connected layer (i.e., the FC in the purple dotted box of Figure 1), and the converted feature matrix ($F \in \mathbb{R}^{1024 \times \frac{k}{8}}$) was mapped to multiple independent convolutional layers with different weights (we designed 16 independent convolutional layers in the experiments). Max pooling was used to obtain a global feature in each independent convolutional layer. Next, the global features of the 16 independent convolutional layers were concatenated into primary point capsules (the size of the primary point capsules we used in the experiments was 1024 × 16). Finally, the primary point capsules generated a high-level feature with a dimension of 32 × 16 through the dynamic routing mechanism, and we took this high-level feature as the final constructed deep feature.
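The module can be sketched as follows, reusing `dynamic_routing` from the previous sketch. The layer sizes (512 → 1024 FC, 16 branches, 1024 × 16 primary capsules, 32 × 16 output) follow the text; the use of 1 × 1 convolutions for the independent branches and the weight initialization are our assumptions.

```python
import torch
import torch.nn as nn

class CapsuleModule3D(nn.Module):
    """3D capsule module sketch: FC, 16 independent conv branches with max
    pooling, primary point capsules, then dynamic routing (defined above)."""
    def __init__(self, n_branches=16, d_in=512, d_mid=1024, n_out=32, d_out=16):
        super().__init__()
        self.fc = nn.Linear(d_in, d_mid)              # the FC in Figure 1: 512 -> 1024
        self.branches = nn.ModuleList(                # independent convolutional layers
            [nn.Conv1d(d_mid, d_mid, kernel_size=1) for _ in range(n_branches)])
        # affine transforms w_ij from primary capsule i to output capsule j
        self.w = nn.Parameter(0.01 * torch.randn(d_mid, n_out, n_branches, d_out))

    def forward(self, f):                             # f: (B, k/8, 512) from the GAC module
        x = self.fc(f).transpose(1, 2)                # (B, 1024, k/8)
        prim = torch.stack([br(x).max(dim=-1).values  # max pooling per branch
                            for br in self.branches], dim=-1)  # primary capsules (B, 1024, 16)
        u_hat = torch.einsum('bid,ijde->bije', prim, self.w)   # prediction vectors
        return dynamic_routing(u_hat)                 # final deep feature (B, 32, 16)
```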

Loss Function
The training process is shown in Figure 3. We used the triplet loss [55] as the loss function to train the model. During the training process, this function reduces the distance between the features of matching points and enlarges the distance between the features of non-matching points. Through this optimization function, more prominent features of urban point clouds can be obtained. We define the anchor point set and the positive point set as $P_{anc}$ and $P_{pos}$, respectively; they form the matching point pairs in the two sets of points. Similarly, the anchor point set $P_{anc}$ and the negative point set $P_{neg}$ form the non-matching point pairs. The generation processes of $P_{anc}$, $P_{pos}$, and $P_{neg}$ are described in the next section ("The Construction of Training Point Pairs"). The triplet loss is calculated according to Formula (5):

$$L_{triplet} = \max\left(D_{anc,pos} - D_{anc,neg} + M,\; 0\right), \quad (5)$$

where $D_{anc,pos}$/$D_{anc,neg}$ represents the Euclidean distance between the features of matching/non-matching point pairs, and M stands for the margin value between the positive and negative pairs.
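Equation (5) translates directly into PyTorch; flattening each 32 × 16 descriptor into a vector before taking Euclidean distances is our assumption:

```python
import torch

def triplet_loss(f_anc, f_pos, f_neg, margin=0.2):
    """Equation (5): pull matching descriptors together and push
    non-matching ones at least `margin` apart (M = 0.2 in the experiments)."""
    d_pos = torch.norm(f_anc - f_pos, dim=-1)   # D_{anc,pos}
    d_neg = torch.norm(f_anc - f_neg, dim=-1)   # D_{anc,neg}
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```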

The Construction of Training Point Pairs
The point cloud of each scan in the urban scene used in the following operations is in the global coordinate system, which can be used as the ground truth of registration. In the training stage, we first used FPS to obtain interest points in the training scene (we denote the scan number as τ), and the obtained interest points were taken as keypoints to form a set of keypoints $P^{\tau} = \{p^{\tau}_1, p^{\tau}_2, \ldots, p^{\tau}_z\}$, where $p^{\tau}_i$ is the i-th keypoint in scan #τ, with i = {1, 2, . . . , z}. Then, in the two adjacent scans, i.e., scan #(τ + 1) and scan #(τ + 2), we queried whether there were corresponding keypoints whose Euclidean distance from $p^{\tau}_i$ was less than a threshold (0.05 m in the experiments). If there were multiple such points, the nearest point was selected as the corresponding point; if no such point existed, the point was removed from $P^{\tau}$. After the above process, the point sets $P^{\tau,\tau+1}_{\tau}$ and $P^{\tau,\tau+1}_{\tau+1}$ correspond one-to-one, thus forming pairs of corresponding points. Taking $P^{\tau,\tau+1}_{\tau}$ as the anchor point set $P^{\tau,\tau+1}_{anc}$, we rotated the point set $P^{\tau,\tau+1}_{\tau+1}$ randomly from 0 to 360 degrees and translated it randomly from 0 to 100 m to obtain the positive point set $P^{\tau,\tau+1}_{pos}$. The points in $P^{\tau,\tau+1}_{pos}$ were then shuffled into random order to form the negative point set $P^{\tau,\tau+1}_{neg}$. Similarly, scans #τ and #(τ + 2) construct the anchor $P^{\tau,\tau+2}_{anc}$, positive $P^{\tau,\tau+2}_{pos}$, and negative $P^{\tau,\tau+2}_{neg}$ point sets. In the same way, we use the anchor, positive, and negative point sets of different scans (such as scans #0, #1, . . . , #19) to complete the construction of the training set.
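For one pair of overlapping scans in the global frame, the construction above can be sketched with SciPy as follows; the function name and the per-axis reading of the 0-100 m translation are our assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.transform import Rotation

def build_triplet_sets(kp_a, kp_b, dist_thresh=0.05, seed=0):
    """Anchor/positive/negative keypoint sets from keypoints kp_a (scan #tau)
    and kp_b (a later scan), both given in the global coordinate system."""
    rng = np.random.default_rng(seed)
    d, idx = cKDTree(kp_b).query(kp_a, k=1)       # nearest candidate in the other scan
    keep = d < dist_thresh                        # 0.05 m correspondence threshold
    anc, pos = kp_a[keep], kp_b[idx[keep]]        # one-to-one corresponding points

    R = Rotation.random(random_state=rng).as_matrix()  # random rotation (0-360 deg)
    t = rng.uniform(0.0, 100.0, size=3)                # random translation per axis
    pos = pos @ R.T + t                                # transformed positive set

    neg = pos[rng.permutation(len(pos))]          # shuffled order -> non-matching pairs
    return anc, pos, neg
```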
During the training process, for each keypoint in the anchor, positive, and negative point sets, its k nearest neighboring points in the corresponding scan are searched through the k-NN algorithm, and a point set (a local point patch) is constructed as the input of the proposed network. Finally, we obtain the deep features corresponding to each point in the anchor, positive, and negative point sets and use the triplet loss to optimize the model.

Point Cloud Registration
We used the trained model to test the registration of TLS point clouds in the urban scene. Firstly, we used FPS to obtain z interest points, namely the keypoints, in each scan. Secondly, we searched the k nearest neighbors of each interest point in the respective scan, forming the corresponding point sets of the z interest points. Thirdly, the z point sets were used as input for the trained network model. For each point set, four consecutive graph attention convolution operations were performed to update and obtain the features of the nodes in the point set (the dimension of the features in each point set is $\frac{k}{8} \times 512$). After that, in the proposed capsule module, the features of the point set were fed into independent convolutional layers with different weights to produce global features. The global features of the different independent convolutional layers were used to construct the primary point capsules. Finally, high-level features were obtained through dynamic routing; each high-level feature corresponds to one keypoint.
After the feature extraction of each keypoint was completed, the extracted features were used to construct a KD-tree. If the Euclidean distance between the features of two keypoints in the two point clouds is the shortest, we regarded the two keypoints as a matching point pair. The RANSAC algorithm was then used to eliminate mismatched point pairs. In this process, the point pairs with a distance below the threshold (0.5 m in the experiments) were saved as inliers, and the point pairs with a distance above the threshold were eliminated as outliers. The inliers were used to calculate the rotation matrix R and translation matrix T by singular value decomposition, in order to further realize the registration of two scans of point clouds in the urban scene.

Experiments and Results
First, we performed a sensitivity analysis of the model parameters; then, we compared our method with four other point cloud registration methods to deeply analyze its performance. The implementation details of our model are as follows: PyTorch 1.1.0 and an NVIDIA GeForce RTX 2080Ti. In the training phase, we used the Adam optimizer with the learning rate set to 0.0001. The margin of the triplet loss (namely the M in Equation (5)) is set to 0.2. The training samples include a set of 19,000 pairs of point sets. The batch size is 15, and the epoch number is 100.

Datasets
We used four datasets (Datasets I-IV) for the experiments, all of which come from the ETH open datasets [56]. Datasets I-IV are ETH Hauptgebaude, Stairs, Apartment, and Gazebo (winter), respectively. The specific information of the four datasets is shown in Table 1. These four datasets provide TLS point clouds in the base frame (the coordinate origin is at the center of the scanner) and TLS point clouds in the global frame (the point clouds are moved to a global reference coordinate system). During point cloud registration, scans #0 to #19 of Dataset I were used for training, and scans #20 to #35 of Dataset I were used to test the registration of TLS point clouds. Since adjacent scans in the test scene have certain overlapping areas, scan #(t + 1) in the base frame is registered to scan #t in the global frame during the registration test of Dataset I (20 ≤ t ≤ 34, t ∈ N; a total of 15 pairs of registrations are completed). Similarly, there are 31 scans in Dataset II, and scan #(t + 1) in the base frame is registered to scan #t in the global frame to complete the registration test of Dataset II (0 ≤ t ≤ 29, t ∈ N; a total of 30 pairs of registrations are completed). For Datasets III and IV, we only ran the three deep learning methods to register scan #1 to scan #0.

Parameter Sensitivity Analysis
We performed registration tests on the 15 pairs of scans of Dataset I that were not used during training to explore the impacts of the keypoint number (z) and the point number of each point set (k) on the registration accuracy of our method. We set z to 3000, 5000, and 8000, and k to 128, 256, and 512, respectively. The root mean squared errors (RMSEs) of registration under different values of z and k are shown in Figure 4. The RMSE is the error value over all points in the scan. We can see that, as k increases, the RMSE generally becomes smaller. In terms of the mean and standard deviation, the worst RMSE in Figure 4 is obtained when k = 128 and z = 3000, and better registration results are obtained when k = 512 and z = 3000. To fully test the performance of our method, in the following experiments, we use both k = 128, z = 3000 and k = 512, z = 3000 to compare with the other methods.

Comparison with Other Methods
In order to verify the performance of our method, we compared its registration results with those of four other methods on Datasets I and II. The first method (Method I) is Super4PCS [19], which is a variant of 4PCS and mainly uses the affine invariance of four coplanar points for the global registration of point clouds. The second method (Method II) is fast global registration [17], which uses FPFH to determine the correspondence between points. The third method (Method III) is 3DFeat-Net [29], which constructs deep features in a weakly supervised way to perform point cloud registration. The fourth method (Method IV) [47] designs a voxel-based 3D convolutional neural network to construct 3D deep features of point clouds. Methods III and IV are deep learning methods.
We refer to the evaluation indicators of [57], namely the relative rotational error (RRE) and the relative translational error (RTE), to evaluate the registration. RRE is calculated as follows:

$$RRE = \sum_{i=1}^{3}\left|F\left(R_T^{-1} R_E\right)_i\right|, \quad (6)$$

where F(·) transforms a rotation matrix into three Euler angles, $R_T$ is the ground-truth rotation matrix, and $R_E$ is the estimated rotation matrix. RRE is the sum of the absolute differences of the three Euler angles. RTE is calculated as follows:

$$RTE = \left\|T_T - T_E\right\|_2, \quad (7)$$

where $T_T$ is the ground-truth translation vector and $T_E$ is the estimated translation vector.
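Both metrics are straightforward to compute; below is a sketch using SciPy's rotation utilities (the 'xyz' Euler convention is an assumption, as the text does not fix one):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rre_rte(R_gt, R_est, t_gt, t_est):
    """Equations (6)-(7): RRE sums the absolute Euler angles (degrees) of
    R_gt^{-1} @ R_est; RTE is the Euclidean distance between translations."""
    angles = Rotation.from_matrix(R_gt.T @ R_est).as_euler('xyz', degrees=True)
    return np.abs(angles).sum(), np.linalg.norm(t_gt - t_est)
```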

Test Results on Dataset I
On Dataset I, each method performs registration on 15 pairs of adjacent scans. Table 2 shows the results of the 15 pairs of registrations. The values of RRE and RTE are errors computed over all points in the scans. Besides, we set three standards for a successful registration: RTE < 1 m and RRE ≤ 10°; RTE < 1 m and RRE ≤ 5°; and RTE < 0.5 m and RRE ≤ 2.5°. During the experiments, the keypoint number (z) for both our method and Method IV is set to 3000. As shown in Table 2, our method achieves the highest successful registration rate under all three standards.
A pair of registration results was randomly selected for visual comparison (scan #34 registered to scan #33). As shown in Table 3 and Figure 5, the visual results of our method are better than those of Methods I-IV, which further illustrates the capability of the proposed GACM.

Table 3. Registration results of each method from scan #34 to scan #33 in Dataset I.

To further compare the performance of extracting rotation-invariant features among the three deep learning-based methods (namely Methods III, IV, and our method), we rotated each point cloud to be registered in the 15 pairs of point clouds by 30° and 60° around the z-axis, respectively. The registration success rates of the three deep learning-based methods are shown in Table 4. It shows that the registration success rates of Methods III and IV decreased after the point cloud to be registered was rotated by 30° or 60°, while our method still maintains a robust performance, verifying the anti-rotation capability of GACM.

Test Results on Dataset II
During training, none of the learning-based methods (i.e., Methods III, IV, and our method) uses Dataset II (namely Stairs). To verify the generalization capability of our method, we tested the model on Dataset II using only the parameters learned from Dataset I and compared it with the other four methods. Unlike Dataset I, many adjacent scans of point clouds in Dataset II change greatly, and the initial orientation difference is also large. Figure 6 also shows that the point density in many local areas is extremely low, making registration even more challenging. To adequately extract long-range contextual information of point clouds in the urban scene, we use the k-NN algorithm to construct neighborhoods in the point clouds to ensure a robust expression of the region (described in the section "Method"). The registration starts from scan #1 to scan #0, and the registrations of 30 pairs of adjacent scans are completed. The keypoint number z for both our method and Method IV is set to 3000.

As shown in Table 5, under the three successful registration standards, our method achieves higher registration success rates with k = 128 and k = 512 than the compared methods, and k = 512 obtains higher success rates than k = 128. The success rate of Method I on Dataset II is about 70% under the laxest constraint (RTE < 1 m and RRE ≤ 10°), and it decreases sharply when the constraint becomes more severe (RTE < 1 m and RRE ≤ 5°, and RTE < 0.5 m and RRE ≤ 2.5°). The success rate of Method II on Dataset II is significantly lower than that on Dataset I, which is consistent with the fact that Dataset II is more complex and the overlap between adjacent scans is smaller than in Dataset I. The performance of the deep learning methods (Methods III and IV) on Dataset II is worse than that on Dataset I, which reflects their limited generalization.

The registration results of a pair of scans were randomly selected for visualization (scan #25 registered to scan #24), as shown in Figure 6. Figure 6a shows that some points of the two unregistered scans are far away from the main part (highlighted in the yellow boxes); the registration results of each method are shown in Figure 6b-f, with enlarged displays of the main part. As shown in Table 6, our method achieves successful registration under the strictest standard (RTE < 0.5 m and RRE ≤ 2.5°) for both k = 128 and k = 512, while the other four methods do not even reach the laxest standard (RTE < 1 m and RRE ≤ 10°). The effect of the orientation difference on the registration of the compared methods is large, especially for Method IV, which has a serious direction error with an RRE close to 180°.

In further analysis, we found that each compared method almost always fails in the registration of certain pairs of scans. As shown in Table 7, among the 30 pairs of registrations, we chose 6 pairs of scans in which most comparison methods failed under the laxest registration success standard (RTE < 1 m and RRE ≤ 10°); our method basically maintained the optimal RTE and RRE in the corresponding registrations. In the registration from scan #15 to scan #14, our method failed when k = 512, but its RRE value was close to the registration requirement (RRE ≤ 10°).
Compared with the other two deep learning registration methods (Methods III and IV), our method copes with the effects of initial orientation differences and remains relatively stable, which illustrates the ability of the proposed GACM to extract rotation-invariant features.

Test Results on Dataset III and Dataset IV
Our method and the two compared deep learning methods (Methods III and IV) were not trained on Datasets III and IV. Here, our goal was to further test the robustness of each deep learning method in different urban scene environments after training only on Dataset I. As shown in Figure 7, the initial rotation deviation of Datasets III and IV was small, but the point density of Dataset III and the geometric shapes of Dataset IV are quite different from the training set (Dataset I). Figure 7e shows that the scanned ground points seriously interfere with the visual result, so we filtered out the ground: Figure 7f,g shows the registration results of each method after removing the ground points. As can be observed from the red boxes in Figure 7, the registration results of Methods III and IV still have large translation deviations, and our results are better.

Ablation Experiments
In order to further study the performance of the proposed GACM, we conducted ablation experiments. In Table 8, GAC represents the graph attention convolution module, and 3DCaps represents the 3D capsule module. When using the GAC alone (corresponding to the yellow dotted box in Figure 1), we added max pooling to obtain the 512-dimensional features as the output. When using the 3DCaps alone (corresponding to the purple dotted box in Figure 1), an MLP (multi-layer perceptron) was used to complete the 128-dimensional generation, according to the encoder section of 3D point capsule networks [33]. In the experiments, except for the above differences in network structure, all other hardware and software conditions were the same. The number of keypoints in GAC and GAC+3DCaps was z = 3000, with k = 128 in both. The results show that the network structure of our method (GAC+3DCaps, namely GACM) tends to have a higher registration success rate under various conditions than the GAC network structure alone. When the input consists of coordinates alone and the training set is not greatly expanded, 3DCaps cannot be used for registration on its own. This proves that the proposed GACM is beneficial to the registration results.

In the experiments, we found that the unsuccessful registrations of each method often occur when the RRE exceeds the constraint condition, mainly in the registration of the last six pairs of scans. To further understand the effects of the 3DCaps module, we compared and analyzed the RRE values of the two network structures (GAC and GAC+3DCaps) in the registration of the last six pairs of scans in Dataset II. The RRE values of the registration results are shown in Table 9. We see that the rotation errors of the registration are reduced with the GAC+3DCaps structure (namely GACM), which is especially true for the scans where the registration fails when only using GAC. We conclude that the descriptors obtained by combining GAC and 3DCaps (namely GACM) are more effective than the descriptors obtained by using GAC alone.

Discussion
The four comparison methods include two traditional methods and two deep learning methods. Table 2 shows that, on Dataset I, whose scenes were also used for training, the deep learning methods have a higher registration success rate than the traditional methods. However, Tables 3 and 5 show that the two compared deep learning methods lose this advantage over the traditional ones when the scene difference between the training data and the test data is large. The compared network models (Methods III and IV) are thus limited, while our method maintains a robust performance in the case of large scene differences. Even with a small scene difference (e.g., the experiments in Table 2), the registration results of the two compared deep learning methods (Methods III and IV) are not better than ours.
In the parameter sensitivity analysis, the number of keypoints z, between 3000 and 8000, has little effect on the RMSE. This may be caused by the RANSAC threshold (in the third step of the registration process) that we set. If the threshold were set to 0.1 m with z = 3000, the keypoints in some local regions of a pair of registered point clouds might not have corresponding points, whereas with z = 8000, these regions might have corresponding keypoints within the 0.1 m threshold. For the range from z = 3000 to z = 8000, with the threshold of 0.5 m, there are always corresponding keypoints within 0.5 m.
During the test of Dataset II, our method, as well as the four comparison methods, all have large errors in the last six scans. We believe this is affected by the following factors: (1) the six scans are all similar to the situation in Figure 6a, but with a larger initial rotation deviation, and some noise points are far away from the main body; (2) the small clusters on the object surfaces (for example, the yellow box in Figure 6a) cannot provide an effective geometric shape; (3) the overall density distributions of these scans are extremely uneven.

Conclusions
This research proposes a new network model (namely GACM) to extract 3D feature descriptors of TLS point clouds in the urban scene. The GACM combines graph attention convolution and a 3D capsule network to obtain a discriminative, anti-rotation descriptor that can represent the features of a complex urban scene and be utilized to realize high-quality registration. The experimental results on four public datasets prove that our method has higher pairwise registration performance than the other four frontier methods, without requiring heavy training augmentation, especially for point clouds with relatively large orientation and angle differences. We also achieve the highest registration success rates under the different standards. Under the strictest standard (RTE < 0.5 m and RRE ≤ 2.5°), our registration success rates are 30% higher than those of the four comparison methods on the test datasets. Although our method performs better than the other two deep learning-based methods in the tests on untrained datasets, it cannot complete a successful registration (RTE < 1 m and RRE ≤ 10°) of all scan tests there, as it does in the trained dataset tests. In our method, farthest point sampling is adopted to obtain keypoints, in order to ensure that the spatial distribution of keypoints is relatively uniform. However, when the overlap region of two sets of point clouds is very small, the calculation of the descriptors of the non-overlapping regions is invalid and unnecessary, which we have not yet solved.
In the future, we will explore the automatic determination of the overlapping area between the target and source point clouds to reduce the calculation of descriptors and obtain a better registration effect.