LRF-Net: Learning Local Reference Frames for 3D Local Shape Description and Matching

The local reference frame (LRF) plays a critical role in 3D local shape description and matching. However, most existing LRFs are hand-crafted and suffer from limited repeatability and robustness. This paper presents the first attempt to learn an LRF via a Siamese network that needs only weak supervision. In particular, we argue that each neighboring point in the local surface gives a unique contribution to LRF construction and measure such contributions via learned weights. Extensive analysis and comparative experiments on three public datasets addressing different application scenarios demonstrate that LRF-Net is more repeatable and robust than several state-of-the-art LRF methods (LRF-Net is trained on one dataset only). We show that LRF-Net achieves 0.686 MeanCos performance on the UWA 3D modeling (UWA3M) dataset, outperforming the closest method by 0.18. In addition, LRF-Net can significantly boost local shape description and 6-DoF pose estimation performance when matching 3D point clouds.


Introduction
The local reference frame (LRF) is a canonical coordinate system established on a 3D local surface, which is a useful geometric cue for 3D point clouds. An LRF possesses two intriguing traits. One is that rotation invariance can be achieved if the local surface is transformed with respect to the LRF [1]. The other is that useful geometric information can be mined with an LRF [2]. These make LRFs popular in many geometry-relevant tasks, especially local shape description and six-degree-of-freedom (6-DoF) pose estimation.
For local shape description, two corresponding local surfaces can be converted into the same pose and full 3D geometric information can be employed, which is beneficial to improving the performance of local descriptors. Some hand-crafted local shape descriptors, e.g., signature of histograms of orientations (SHOT) [3] and rotational projection statistics (RoPS) [1], estimate an LRF from the local surface and then translate local geometric information with respect to the estimated LRF into distinctive and rotation-invariant feature representations. Some learned local descriptors, e.g., [4,5], leverage LRFs to overcome the sensitivity of geometric deep learning networks to rotations. Therefore, the LRF is critical for both traditional and learned local shape descriptors. For 6-DoF pose estimation, an LRF can significantly improve efficiency. Traditional 6-DoF pose estimation is usually performed via RANSAC [6], which randomly selects inlier correspondences from an initial correspondence pool for pose prediction. Such a random sampling method is neither efficient nor reliable when the outlier ratio is high.

Our main contributions are as follows:
• LRF-Net, based on a Siamese network that needs only weak supervision, is proposed; it achieves state-of-the-art repeatability under the impacts of noise, varying mesh resolutions, clutter, and occlusion. To the best of our knowledge, we are the first to design an LRF for local surfaces with deep learning.
• LRF-Net can significantly boost the performance of local shape description and 6-DoF pose estimation.

The rest of this paper is organized as follows. Section 2 presents a detailed description of our proposed LRF-Net. Section 3 presents the experimental evaluation of LRF-Net on three public datasets with comparisons against several state-of-the-art methods. Concluding remarks are drawn in Section 4.

Related Works
Various methods for building LRFs have been proposed in the literature. Most of them can be categorized into two classes: covariance-analysis (CA)-based methods and point-spatial-distribution (PSD)-based methods. Given a local surface with a spherical support of radius r centered at the keypoint p, they compute a 3 × 3 matrix as its LRF.

CA-Based LRF Methods
Most CA-based methods are based on the eigenvectors of the covariance matrix, which is usually generated by the points or triangles in the support region.
Mian et al. [9]: This method directly calculates the unit vectors of the LRF by performing covariance analysis on the radius neighbors of the keypoint; the three eigenvectors of the covariance matrix are defined as the x-, y-, and z-axes, respectively. While the eigenvectors of the covariance matrix define the principal directions of the local surface, their signs are still ambiguous [10]. Mian et al. disambiguate the sign of the z-axis through the inner product between n(p) (the normal of keypoint p) and the two possible vectors, i.e., z(p) and −z(p), where z(p) denotes the z-axis. However, the remaining axes still suffer from sign ambiguity.
SHOT [3]: This method leverages a weighted covariance matrix for the computation of the LRF, which assigns smaller weights to more distant points. The weighted covariance matrix is calculated as:

C = \frac{1}{\sum_{q \in Q} w_q} \sum_{q \in Q} w_q (q - p)(q - p)^T, \quad w_q = R - \|q - p\|,

where R denotes the support radius and \|\cdot\| represents the L2 norm. This weighting strategy improves repeatability in the presence of clutter in 3D object recognition scenarios. To eliminate all sign ambiguities of the LRF axes, a technique similar to [12] is applied to the eigenvectors of the weighted covariance matrix: the sign of an eigenvector is reoriented to be coherent with the majority of the vectors. This technique is applied to the x-axis and z-axis; the remaining y-axis is obtained by the cross product of the z-axis and the x-axis.

RoPS [1]: This method does not calculate a single covariance matrix for the local surface; instead, it aggregates covariance matrices computed for every triangle of the local surface into a comprehensive one to enhance robustness. This method requires a mesh representation of the 3D local surface. For a triangle τ ∈ ψ(p) with vertices q_{τ1}, q_{τ2}, and q_{τ3}, its covariance matrix is calculated as:

C_τ = \frac{1}{12} \sum_{i=1}^{3} \sum_{j=1}^{3} (q_{τi} - p)(q_{τj} - p)^T + \frac{1}{12} \sum_{i=1}^{3} (q_{τi} - p)(q_{τi} - p)^T.

Then, the comprehensive covariance matrix is calculated as:

C_{rops} = \sum_{τ \in ψ(p)} w_1 w_2 C_τ,

where w_1 is the ratio of the area of τ to the total area of the local surface, which alleviates the impact of mesh-resolution variations, and w_2 = (r - \|(q_{τ1} + q_{τ2} + q_{τ3})/3 - p\|)^2, which improves robustness to clutter and occlusion [8]. Based on the eigenvalue decomposition of C_{rops}, the three axes of the LRF can be calculated. To disambiguate the signs, the x-axis and z-axis (taking the x-axis as an example) are further adjusted via x(p) = x(p) · sign(h), where h orients the axis toward the majority of the local scatter vectors, in its simplest form h = \sum_{q \in Q} (q - p) \cdot x(p). Once the x-axis and z-axis are determined, the y-axis is calculated via their cross product.
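For illustration, a minimal NumPy sketch of the SHOT-style distance-weighted covariance construction (the function name is ours; axis assignment follows the common convention of the smallest eigenvalue for the z-axis and the largest for the x-axis):

```python
import numpy as np

def shot_weighted_covariance(points, p, R):
    """Distance-weighted covariance matrix: closer points get larger weights.

    points: (n, 3) neighbors of keypoint p within support radius R.
    Returns the covariance C and candidate x- and z-axes (signs not disambiguated).
    """
    d = points - p                            # scatter vectors q - p
    w = R - np.linalg.norm(d, axis=1)         # w_q = R - ||q - p||
    C = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(0) / w.sum()
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    _, vecs = np.linalg.eigh(C)
    z_axis, x_axis = vecs[:, 0], vecs[:, 2]   # least / most variation directions
    return C, x_axis, z_axis
```

The sign-disambiguation step described above would then flip each axis so it agrees with the majority of the scatter vectors.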

PSD-Based LRF Methods
As for PSD-based LRF methods, they calculate the three axes of the LRF successively.

PS [13]: This method places a sphere of radius r at the keypoint p and obtains a contour at its intersection with the local surface. The point with the largest signed projection distance to the tangent plane of the keypoint is selected to compute the x-axis, while the tangent plane is determined by the z-axis, which is taken directly as the normal of the keypoint. The y-axis is calculated via the cross-product operation.
Board [2]: This method collects a small subset of the local surface for the estimation of the z-axis, which achieves robust performance under occlusion. The x-axis is calculated from the points lying in the border region: the point in the border region with the largest deviation angle between its normal and the z-axis is chosen to compute the x-axis. The y-axis is computed by the cross product between the z-axis and the x-axis.
SD [10]: This method is a modified version of Board [2]. It improves the repeatability of the LRF by employing the point with the largest local depth instead of the largest deviation angle. It achieves more repeatable performance than Board on 3D registration and recognition data. However, both methods show weak robustness to large-scale noise.
TOLDI [11]: This method takes the normal of the keypoint, calculated from a subset of the radius neighbors, as the z-axis. Then, the tangent plane of the keypoint with respect to the z-axis is determined and all radius neighbors of the keypoint are projected onto the tangent plane, yielding projection vectors v_i = \overrightarrow{pq_i} - (\overrightarrow{pq_i} \cdot z(p)) z(p). A weighting strategy is applied to each projection vector to calculate the x-axis:

w_{i1} = (r - \|p - q_i\|)^2, (7)
w_{i2} = (\overrightarrow{pq_i} \cdot z(p))^2, (8)

where p denotes the keypoint and q_i is one of its radius neighbors within support radius r. w_{i1} is a distance-related weight designed to improve the robustness of the LRF to clutter, occlusion, and incomplete border regions [11]. w_{i2} is a local-depth-related weight designed to provide high repeatability on flat regions [11]. The x-axis is calculated as:

x(p) = \frac{\sum_{i=1}^{k} w_{i1} w_{i2} v_i}{\left\|\sum_{i=1}^{k} w_{i1} w_{i2} v_i\right\|}, (9)

where k is the number of radius neighbors of keypoint p and v_i denotes one of the projection vectors.
The y-axis is computed by the cross-product between z-axis and x-axis.
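The TOLDI x-axis computation above can be sketched in a few lines of NumPy (a minimal illustration under the weight definitions in Eqs. (7)-(9); the function name and array layout are our own):

```python
import numpy as np

def toldi_x_axis(points, p, z, r):
    """Weighted vector-sum x-axis in the TOLDI style.

    points: (k, 3) radius neighbors q_i of keypoint p; z: unit z-axis.
    """
    pq = points - p
    v = pq - (pq @ z)[:, None] * z[None, :]     # projections onto the tangent plane
    w1 = (r - np.linalg.norm(pq, axis=1)) ** 2  # distance weight: favors near points
    w2 = (pq @ z) ** 2                          # local-depth weight: favors depth
    s = ((w1 * w2)[:, None] * v).sum(0)         # weighted vector sum
    return s / np.linalg.norm(s)
```

Because every v_i lies in the tangent plane, the resulting x-axis is orthogonal to z by construction.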

Methods
This section presents the details of our proposed LRF-Net for 3D local surfaces. We first introduce the technical approach for calculating the three axes of an LRF and then describe a weakly supervised approach for training LRF-Net.

A Learned LRF Proposal
The whole architecture of LRF-Net is shown in Figure 2a. LRF-Net predicts the directions of the three axes successively. For a local surface, we first estimate its z-axis via its normal vector computed over a small subset of the local point set. Then, unique weights are learned for each point in the local surface. The x-axis is calculated by integrating projection vectors with learned weights using a vector-sum operation. Finally, the y-axis is calculated by the cross product between the z-axis and the x-axis.
LRF definition: Given a local surface Q centered at keypoint p, the LRF at p (denoted by L_p) can be represented as:

L_p = [x(p), y(p), z(p)],

where x(p), y(p), and z(p) denote the x-axis, y-axis, and z-axis of L_p, respectively. As the three axes are orthogonal, the estimation of an LRF therefore consists of two parts: estimation of the z-axis and of the x-axis.

Figure 2. The architecture of LRF-Net. The input to LRF-Net is a local surface and we calculate its normal as the z-axis of the LRF. Then, the local surface is converted to a set of rotation-invariant attributes. Next, a projection weight for every point is computed with an MLP. Finally, the x-axis is calculated by the weighted vector sum of all the projection vectors and the y-axis by the cross product between the z-axis and the x-axis. The LRF is formed as the combination of the x-, y-, and z-axes.
A naive way to learn an LRF for the local surface is to train a network that directly regresses the axes. The premise is that ground-truth LRFs are labeled for local surfaces. Unfortunately, a network trained in this manner meets two difficulties. The first is that the definition of ground-truth LRFs for local surfaces remains an open issue in the community [8]. The second, which is more important, is that the orthogonality of the three axes cannot be guaranteed. We therefore suggest estimating the z-axis and the x-axis independently.
Z-axis: We take the normal of the keypoint as the z-axis, which has been confirmed [2] to be quite repeatable. To resist the impact of clutter and occlusion, we collect a small subset of the local surface to calculate the normal. For more details, readers are referred to [11].
X-axis: Once the z-axis is determined, the remaining task is to compute the x-axis. Compared with z-axis, x-axis is more challenging due to the influence of noise, clutter, and occlusion [8]. We argue that each neighboring point in the local surface gives a unique contribution to LRF construction. Hence, we predict a weight for each neighboring point and leverage all neighboring points with learned weights for x-axis prediction. The main steps are as follows.
First, to make the estimated LRF invariant to rigid transformations, our network consumes invariant geometric attributes rather than point coordinates. In particular, two attributes, i.e., the relative distance a_dist and the surface variation angle a_angle, are used in LRF-Net, as illustrated in Figure 2b. For a neighbor q_i of p, the two attributes are computed as:

a_{dist} = \frac{\|q_i - p\|}{r}, \quad a_{angle} = \frac{(q_i - p) \cdot z(p)}{\|q_i - p\|},

where \|\cdot\| is the L2 norm and r represents the support radius of the local surface. The ranges of a_angle and a_dist are [−1, 1] and [0, 1], respectively. Thus, every radius neighbor is represented by two attributes that will later be encoded into a weight value by LRF-Net. These two attributes have at least two merits. First, the unique spatial information of a radius neighbor in the local surface is well represented, as shown in Figure 3; the two attributes are complementary to each other. Second, both attributes are calculated with respect to the keypoint and are therefore rotation invariant, which makes the learned weights rotation invariant as well.

Second, with the geometric attributes as input, we use a simple network with multilayer perceptron (MLP) layers only to predict weights for neighboring points. The details of the network are illustrated in Figure 4. The network is very simple, yet sufficient to predict stable and informative weights for neighboring points (as verified in the experiments).

Third, because the x-axis is orthogonal to the z-axis, we project each neighbor q_i onto the tangent plane S of the z-axis and compute a projection vector for q_i as:

v_i = (q_i - p) - ((q_i - p) \cdot z(p)) z(p).

We integrate all projection vectors in a weighted vector-sum manner:

x(p) = \frac{\sum_{i=1}^{n} w_i v_i}{\left\|\sum_{i=1}^{n} w_i v_i\right\|},

where n denotes the total number of radius neighbors of keypoint p and w_i is a weight learned by LRF-Net. Another way to determine the x-axis from these weights is to choose the vector with the maximum weight, as in many PSD-based LRFs [2,10]. However, this fails to leverage all neighboring information; we will show in the experiments that it is inferior to the vector-sum operation.
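The pipeline above can be sketched in PyTorch. The layer sizes and Softplus activation below are our assumptions, not the exact LRF-Net architecture; the attribute forms follow the definitions above:

```python
import torch
import torch.nn as nn

class WeightMLP(nn.Module):
    """Minimal stand-in for the per-point weight network (layer sizes assumed)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Softplus())   # positive per-point weights

    def forward(self, attrs):                  # attrs: (n, 2)
        return self.net(attrs).squeeze(-1)     # (n,)

def learned_x_axis(points, p, z, r, mlp):
    """x-axis from rotation-invariant attributes and learned weights."""
    pq = points - p
    dist = pq.norm(dim=1)
    a_dist = dist / r                           # relative distance in [0, 1]
    a_angle = (pq @ z) / dist.clamp(min=1e-9)   # cosine to z-axis in [-1, 1]
    w = mlp(torch.stack([a_dist, a_angle], dim=1))
    v = pq - (pq @ z)[:, None] * z              # projections onto tangent plane
    s = (w[:, None] * v).sum(0)                 # weighted vector sum
    return s / s.norm()
```

Since both attributes are scalar functions of distances and angles relative to the keypoint, the learned weights, and hence the x-axis, are rotation invariant by construction.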
Y-axis: Based on the calculated z-axis and x-axis, the y-axis can be computed by the cross-product between them.

Weakly Supervised Training Scheme
Our training data consist of a series of corresponding local surface patches. The correspondence relationship is obtained from the ground-truth rigid transformation between two whole point clouds. Notably, LRF-Net needs only the correspondence relationships between local surface patches, rather than ground-truth LRFs and/or exact pose variation information between patches. Therefore, our network can be trained in a weakly supervised manner.
We train our LRF-Net with two streams in a Siamese fashion, where each stream independently predicts an LRF for a local surface. Specifically, the two streams take the local surfaces of keypoints p_m and p_s as inputs, respectively. Here, p_m and p_s are two corresponding keypoints sampled from the model and scene point clouds. Both streams share the same architecture and underlying weights. We use the LRFs L_m and L_s predicted by the two streams to transform the local surfaces Q_m and Q_s into the coordinate systems of the two LRFs, yielding Q'_m and Q'_s. Then, we calculate the Chamfer Distance [14] between the two transformed local surfaces as the loss function to train LRF-Net:

Loss = CD(Q'_m, Q'_s), \quad CD(A, B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \|a - b\|_2 + \frac{1}{|B|} \sum_{b \in B} \min_{a \in A} \|a - b\|_2.

Our opinion is that it is difficult to define a "good" LRF for a single local surface. For 3D shape matching, LRFs that can align the poses of two local surface patches are judged as repeatable. This motivates us to consider two local patches simultaneously and employ the Chamfer Distance to train the network.
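A minimal NumPy sketch of this training signal (function names are ours; the brute-force pairwise distance is for illustration only; we assume each LRF is a 3 × 3 matrix whose columns are the axes):

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetric Chamfer Distance between point sets A (n, 3) and B (m, 3)."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # (n, m) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def lrf_alignment_loss(Qm, Qs, pm, ps, Lm, Ls):
    """Express each patch in its own LRF and compare: repeatable LRFs on
    corresponding patches align the patches and give a small value."""
    return chamfer_distance((Qm - pm) @ Lm, (Qs - ps) @ Ls)
```

If a patch is rotated by R and its predicted LRF rotates consistently (L → R L), the transformed coordinates are unchanged, so the loss penalizes only LRFs that fail to follow the surface.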

Experiments
In this section, we first evaluate the repeatability of our LRF-Net on three standard datasets, including the Bologna retrieval (BR) dataset [15], the UWA 3D modeling (UWA3M) dataset [16], and the UWA object recognition (UWAOR) dataset [17], together with a comparison with other state-of-the-art LRFs. Second, we apply our LRF-Net to local shape description and 6-DoF pose estimation to verify the practicability of our method. Third, analysis experiments are conducted to improve the explainability of the proposed LRF-Net.

Experimental Setup
The details of our experiments, including the description of the datasets and all compared methods, are introduced before the evaluation. The experiments were conducted on a Windows server with an Intel Xeon E5-2640 2.39 GHz CPU and 96 GB of RAM. We train our LRF-Net in PyTorch using a batch size of 512 local surface pairs and the ADAM optimizer with an initial learning rate of 1 × 10^{-4}, which decays by 5% every epoch. Each sampled local surface contains 256 points. The maximum epoch count is set to 20.

Datasets
Our experiments use three standard datasets with different application scenarios. The variety among these public 3D datasets helps us evaluate the performance of our method in a comprehensive manner. Figure 5 displays two exemplar models and scenes without noise from each dataset. The main properties of these datasets are summarized in Table 1.
These datasets are also injected with five levels of Gaussian noise (i.e., from 0.1 mr to 0.5 mr standard deviation) and four levels of mesh decimation (i.e., 1/2, 1/4, 1/8, and 1/16 of the original mesh resolution). Here, the unit mr denotes mesh resolution. Remarkably, the noise-free BR dataset is used to train our LRF-Net; the remaining noisy data in the BR dataset and the data in the UWA3M and UWAOR datasets are used for testing. # denotes a quantitative attribute of the dataset (e.g., the number of models).
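The noise injection above can be sketched as follows (a small illustration assuming the noise standard deviation equals the stated level times the mesh resolution; the function name is ours):

```python
import numpy as np

def add_gaussian_noise(points, mr, level, seed=0):
    """Inject zero-mean Gaussian noise with std = level * mesh resolution (mr).

    level in {0.1, ..., 0.5} reproduces the paper's five noise settings.
    """
    rng = np.random.default_rng(seed)
    return points + rng.normal(scale=level * mr, size=points.shape)
```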

Compared Methods
We compare our LRF-Net with several existing LRF methods for a thorough evaluation. Specifically, the compared methods were proposed by Mian et al. [9], Tombari et al. [3], Petrelli et al. [10], Guo et al. [1], and Yang et al. [11]; we dub them Mian, Tombari, Petrelli, Guo, and Yang, respectively. For a fair comparison, we keep the support radius of all LRFs at 15 mr. The properties of these LRFs are shown in Table 2.
To evaluate the local shape description performance of our method, we replace the LRF in four LRF-based descriptors (i.e., snapshots [18], SHOT [3], RoPS [1] and TOLDI [11]) and assess the performance variations. To measure the 6-DoF pose estimation performance of our method, we adapt LRF-Net to the RANSAC pipeline and compare with the original RANSAC [19].

Repeatability Performance
We evaluate the repeatability of all LRFs via the popular MeanCos [3] metric, which measures the overall angular error between two LRFs. The MeanCos criterion is computed as:

MeanCos = \frac{Cos(X) + Cos(Z)}{2}, \quad L'_s = GT * L_s,

where L_m and L_s denote two corresponding LRFs between model and scene, L'_s represents the transformed L_s obtained via the ground-truth transformation GT, and * denotes the matrix product. Cos(Z) is the cosine of the angle between the z-axes of L_m and L'_s, and Cos(X) is the corresponding term for the x-axes. Because the y-axis can always be calculated from the other two axes via the cross product, it does not need to be included in the MeanCos calculation [2].

In our evaluation, we first randomly select 1000 points from each model and collect the corresponding points in the scenes via the ground-truth transformation for each model-scene pair. Then, we calculate the LRF for every local surface centered at a selected point in the model and the scene. Finally, the average MeanCos value over all corresponding LRFs between each model-scene pair is calculated as the final result for a dataset. Note that the MeanCos of two perfectly corresponding LRFs equals 1.

The repeatability results of the evaluated LRFs are shown in Figures 6 and 7. Several observations can be made from these figures. First, as witnessed by Figure 6, our LRF together with Tombari, Petrelli, and Yang achieves decent performance on the BR dataset; on the UWA3M and UWAOR datasets, our LRF-Net achieves the best performance. Second, as shown in Figure 7a, LRF-Net and Tombari achieve comparably stable performance on the BR dataset with respect to different levels of Gaussian noise. Figure 7b,c indicate that LRF-Net achieves the best performance under all levels of Gaussian noise on the UWA3M and UWAOR datasets, surpassing the others by a significant gap. Note that the UWA3M and UWAOR datasets also include nuisances such as clutter, self-occlusion, and occlusion.
Third, results in Figure 7d-f suggest that LRF-Net is the best competitor with 1 2 , 1 4 , and 1 8 mesh decimation on all datasets.
These results clearly demonstrate the strong robustness of our LRF-Net with respect to Gaussian noise, mesh decimation, clutter, and occlusion. The reasons are at least twofold. One is that all points are leveraged to generate the critical x-axis, which guarantees robustness to Gaussian noise and low-level mesh decimation. The other is that LRF-Net can learn stable and informative weights for neighboring points, which improves its robustness to common nuisances.
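The MeanCos computation described above can be sketched as follows (assuming LRFs are stored as 3 × 3 matrices whose columns are the x-, y-, and z-axes, and that the ground-truth transformation is applied as a rotation on the scene LRF; both are our layout assumptions):

```python
import numpy as np

def mean_cos(Lm, Ls, R_gt):
    """MeanCos between corresponding LRFs: average of the x- and z-axis cosines
    after mapping the scene LRF into the model frame by the ground-truth rotation."""
    Ls_t = R_gt @ Ls                       # L'_s = GT * L_s
    cos_x = float(Lm[:, 0] @ Ls_t[:, 0])   # cosine between corresponding x-axes
    cos_z = float(Lm[:, 2] @ Ls_t[:, 2])   # cosine between corresponding z-axes
    return 0.5 * (cos_x + cos_z)
```

Two perfectly corresponding LRFs give a value of 1, matching the note in the text.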

Local Shape Description Performance
We further evaluate our LRF-Net by replacing the LRFs in four LRF-based descriptors (i.e., snapshots, SHOT, RoPS, and TOLDI) with our LRF-Net. Then, we compare their descriptor matching performance measured via the recall vs. 1-precision curve (RPC) [3,20]. Recall is defined as:

recall = \frac{N_{true}}{N_{corr}}, (18)

where N_true denotes the number of correct matches and N_corr is the total number of corresponding features. 1-precision is defined as:

1 - precision = \frac{N_{false}}{N_{match}}, (19)

where N_false represents the number of false matches and N_match is the total number of matches. Notably, the original LRF methods employed by snapshots, SHOT, RoPS, and TOLDI are Mian, Tombari, Guo, and Yang, respectively. We conduct this experiment on the original BR, UWA3M, and UWAOR datasets.

Figure 8 reports the RPC results of all tested descriptors. As witnessed by Figure 8 and Table 3, most LRF-based descriptors equipped with our LRF-Net outperform their original versions. Specifically, snapshots achieves a dramatic performance improvement with our LRF-Net on the BR dataset; the performance of SHOT also climbs significantly on the UWA3M and UWAOR datasets with the help of the proposed LRF-Net. We can therefore conclude that the LRF plays an important role in local shape description: a repeatable LRF can effectively improve the performance of an LRF-based descriptor without changing its feature representation. It also indicates that the proposed LRF-Net can bring positive impacts to a number of existing local shape descriptors. Bold indicates the best performance among the compared methods.
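For illustration, one point on the RPC can be computed from the match counts in Eqs. (18) and (19) (a trivial sketch; the function name is ours):

```python
def rpc_point(n_true, n_false, n_corr):
    """One (recall, 1-precision) point from match counts.

    n_true: correct matches; n_false: false matches;
    n_corr: total number of corresponding features.
    """
    n_match = n_true + n_false          # total matches produced by the matcher
    recall = n_true / n_corr            # Eq. (18)
    one_minus_precision = n_false / n_match  # Eq. (19)
    return recall, one_minus_precision
```

Sweeping the matching threshold and recomputing this point traces out the full curve.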

6-DoF Pose Estimation Performance
A general 6-DoF pose estimation process with local descriptors consists of correspondence generation and pose estimation from correspondences with potential outliers [6]. RANSAC is arguably the de facto 6-DoF pose estimator in many applications. However, a key limitation of RANSAC is that its computational complexity is O(n^3) and estimating a reasonable pose requires a huge number of iterations. With LRFs, a single correspondence is able to generate a 6-DoF pose (shown in Figure 9), decreasing the computational complexity from O(n^3) to O(n). Therefore, we apply LRF-Net to 6-DoF pose estimation, following a RANSAC-fashion pipeline; the difference is that we sample one correspondence per iteration.

Two criteria, i.e., the rotation error err_r between our predicted rotation R and the ground-truth one R_GT, and the translation error err_t between the predicted translation vector T and the ground-truth one T_GT [16], are employed for evaluating 6-DoF pose estimation. err_r and err_t are defined as:

err_r = \arccos\left(\frac{\mathrm{trace}(R') - 1}{2}\right), \quad R' = R_{GT} R^{-1},

err_t = \frac{\|T - T_{GT}\|}{mr},

where mr denotes the mesh resolution. The initial feature correspondence set is generated by first matching TOLDI (equipped with our LRF-Net) descriptors and keeping the 100 correspondences with the highest similarity scores. 100 and 1000 iterations are assigned to our method and RANSAC, respectively. The average rotation and translation errors of the two estimators on the three experimental datasets are shown in Table 4. Two salient observations can be made from the table. First, both RANSAC and our method manage to achieve accurate pose estimation results on the BR dataset, which contains point cloud pairs with large overlap ratios; however, our method needs only 1/10 of the iterations required by RANSAC. Second, on the more challenging datasets, i.e., UWA3M and UWAOR, our method significantly outperforms RANSAC. This demonstrates that LRF-Net can simultaneously improve the accuracy and efficiency of RANSAC-based 6-DoF pose estimation.
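A minimal NumPy sketch of single-correspondence pose generation and the two error criteria (the construction R = L_m L_s^T follows from aligning the two orthonormal frames and assumes the LRF columns are the axes; function names are ours):

```python
import numpy as np

def pose_from_one_correspondence(pm, ps, Lm, Ls):
    """6-DoF pose from one correspondence with LRFs (columns = axes):
    R maps the scene LRF onto the model LRF, T aligns the keypoints."""
    R = Lm @ Ls.T            # orthonormal frames: Ls^{-1} = Ls^T
    T = pm - R @ ps
    return R, T

def pose_errors(R, T, R_gt, T_gt, mr):
    """Rotation error in degrees and translation error in mesh-resolution units."""
    Rp = R_gt @ np.linalg.inv(R)
    err_r = np.degrees(np.arccos(np.clip((np.trace(Rp) - 1) / 2, -1.0, 1.0)))
    err_t = np.linalg.norm(T - T_gt) / mr
    return err_r, err_t
```

Because one correspondence fully determines a pose, each hypothesize-and-verify iteration needs a single sample instead of a triplet, which is the source of the O(n^3) to O(n) reduction.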

Verifying the Rationality of LRF-Net
To verify the rationality of the main technical components of LRF-Net, we conduct the following experiments. As mentioned above, our LRF-Net contains two main parts: estimating the z-axis and the x-axis. First, in order to verify the choice of the normal vector for z-axis calculation, we replace the normal vector with a network that directly regresses the z-axis, shown in Figure 10 (dubbed "DR"). Second, to confirm the advantage of our x-axis technique, we perform analysis experiments from three aspects. (1) To prove the advantage of invariant geometric attributes, we replace them with the combination of original points and the z-axis (i.e., [q_i, z(p)]) and then calculate the x-axis in a weighted vector-sum manner; the former is dubbed "Sum1" and the latter "Sum2". (2) To verify the choice of the weighted vector-sum operation for x-axis calculation, we test the approach using the vector with the maximum weight as the x-axis (dubbed "Max"). (3) To demonstrate that the axes of an LRF are not suitable for direct regression, we compare our method with one that regresses the x-axis via a network (DR). There are eight different combinations in total. All of them are tested on the BR, UWA3M, and UWAOR datasets. The results are shown in Tables 5-7. Clearly, LRF-Net (Normal + Sum1) achieves the best performance among the tested methods. This verifies that learning weights from invariant geometric attributes is more reasonable than directly learning axes. In addition, the vector sum is more appropriate than the maximum for integrating projection vectors with learned weights. Bold indicates the best performance among the compared methods.

Resistance to Rotation
To evaluate the robustness of LRF-Net to rotation, we manually rotate the test data. Specifically, we rotate the scene point clouds about the z-axis by a set of angles (i.e., 30, 60, 90, and 120 degrees). Then, we measure their MeanCos performance. Figure 11 displays the results of the eight different combinations. As shown in Figure 11, LRF-Net, Normal+Max, and Normal+Sum2 achieve very stable performance, while the combinations that include the "DR" part are less robust.
These results support two conclusions. One is that it is hard to achieve rotation invariance by relying only on raw points; a guidance signal (e.g., the normal vector) is necessary. The other is that the invariant attributes are not strictly indispensable: a simple combination (e.g., original points plus the normal vector) can also achieve rotation invariance. However, the invariant attributes do boost the performance of our network.

Figure 12 shows the MeanCos performance of six LRF methods under varying support radii on the three public datasets without noise. From Figure 12, we can see that our LRF-Net achieves stable and outstanding performance on the BR dataset. On the UWA3M and UWAOR datasets, our LRF-Net outperforms the other LRF methods when the support radius exceeds 7.5 mr. Another observation is that the performance of our LRF-Net tends toward stability as the support radius increases, while some other LRF methods present a downward trend. This verifies that our LRF-Net can obtain a stable LRF from a local surface that contains enough points to guarantee statistical significance and uniqueness.

Figure 13 visualizes the weights learned by our LRF-Net for several sample local surfaces, which presents two interesting findings. First, closer points do not necessarily make greater contributions. Many existing CA- and PSD-based LRF methods, including Tombari, Guo, and Yang, assume that closer points should have greater weights; however, they are inferior to our LRF-Net in terms of repeatability. Second, x-axis estimation is generally determined by a particular area, rather than by a single salient point as employed by many PSD-based methods, e.g., Petrelli. These visualization results also support our opinion that each neighboring point in the local surface gives a unique contribution to LRF construction.

Conclusions
In this paper, we proposed LRF-Net, a learned LRF for 3D local surfaces that is repeatable and robust to a number of nuisances. LRF-Net assumes that each neighboring point in the local surface gives a unique contribution to LRF construction and measures such contributions via learned weights. Experiments showed that our LRF-Net outperforms many state-of-the-art LRF methods on datasets addressing different application scenarios. In addition, LRF-Net can significantly boost local shape description and 6-DoF pose estimation performance.

Future Work
In the future, we plan to pursue two interesting directions. The first is to consider the texture of 3D objects: RGB information can provide powerful guidance when 3D models lack sufficient geometric features but have photometric cues. The other is to take multi-scale geometric information into consideration.