A Robust Rigid Registration Framework of 3D Indoor Scene Point Clouds Based on RGB-D Information

Rigid registration of 3D indoor scenes is a fundamental yet vital task in various fields, including remote sensing (e.g., 3D reconstruction of indoor scenes), photogrammetry, geometry modeling, etc. Nevertheless, state-of-the-art registration approaches still have defects when dealing with low-quality indoor scene point clouds derived from consumer-grade RGB-D sensors. The major challenge is accurately extracting correspondences between a pair of low-quality point clouds that contain considerable noise, outliers, or weak texture features. To solve this problem, we present a point cloud registration framework based on RGB-D information. First, we propose a point normal filter that effectively removes noise while maintaining sharp geometric features and smooth transition regions. Second, we design a correspondence extraction scheme based on a novel descriptor encoding both textural and geometric information, which can robustly establish dense correspondences between a pair of low-quality point clouds. Finally, we propose a point-to-plane registration technique with a nonconvex regularizer, which further diminishes the influence of false correspondences and produces an accurate rigid transformation between a pair of point clouds. Intensive experimental results demonstrate that, compared to existing state-of-the-art techniques, our registration framework excels visually and numerically, especially when dealing with low-quality indoor scenes.


Introduction
In recent years, 3D point clouds of indoor scenes have been regarded as the most appropriate data source for generating a building information model (BIM), which has become a crucial tool for construction and architecture professionals [1]. In the past, indoor scene point clouds were usually acquired by static terrestrial laser scanners (TLS). Although TLS offer high scanning precision, they are expensive, which has greatly restricted their application and popularization. Fortunately, the recent prevalence of consumer-grade RGB-D cameras (e.g., Microsoft Kinect, Asus Xtion, etc.) has allowed ordinary people to easily capture 3D indoor scene point clouds from the real world through scanning and reconstruction processes [2]. During the reconstruction process, as a fundamental task, 3D point cloud registration aims at registering individual scans in a unified coordinate system to produce a complete 3D point cloud of the target indoor scene [1,3]. Typically, the rigid registration of point clouds consists of finding corresponding points and minimizing the sum of residuals over all of them to estimate the rigid transformation, which consists of a 3 × 3 matrix representing the rotation and a 3 × 1 vector denoting the translation. It is more challenging to register point clouds acquired using RGB-D cameras; in the rest of this paper, we refer to such point clouds as RGB-D point clouds. The main reason is that this type of point cloud is inevitably corrupted by comparatively large noise and outliers, due to many factors including sensor errors, occlusions, etc. Thus, robustly registering RGB-D point clouds has become a crucial problem in geometry modeling, computer vision, and so on.
Researchers have extensively studied the rigid registration of 3D point clouds and proposed many remarkable methods. For example, Besl and McKay [4] introduced the popular iterative closest point (abbreviated as ICP) scheme, which iteratively estimates correspondences and computes the transformation. Due to its simplicity, researchers have applied the ICP algorithm to develop 3D reconstruction systems, such as the famous KinectFusion system [5,6]. Nevertheless, the main drawbacks of the ICP algorithm are its slow convergence and sensitivity to noise; moreover, it is computationally expensive. Researchers have therefore designed many registration methods, in terms of both correspondence extraction and point clouds alignment, to address these issues. To extract correspondences accurately, researchers first introduced methods based on point features, including the fast point feature histograms (abbreviated as FPFH) feature [7] and the integral volume descriptor [8]. Then, to enhance the robustness of these methods, more approaches were proposed using higher-level information, such as color information [9][10][11][12][13][14], planar structure [15,16], hybrid structure [17], etc. For point clouds alignment, many remarkable methods have been presented in terms of efficiency and robustness. For efficiency, Bylow et al. [18] investigated the point-to-point metric and the point-to-plane metric, and the Anderson acceleration strategy has been adopted to improve the convergence rate of the ICP algorithm [19,20]. For robustness against noise and outliers, researchers have developed numerous advanced methods based on techniques including least trimmed squares [21], ℓp sparsity optimization [22], nonconvex optimization [20,23], the maximum correntropy criterion (MCC) [24], and the branch-and-bound scheme [25]. Recently, deep-learning-based registration methods [26][27][28][29][30] have demonstrated promising results. However, their performance is limited by the completeness of the training data sets.
However, the aforementioned methods still fail to register RGB-D point clouds with considerable noise and weak texture. Motivated by this observation, we present a new rigid registration framework to effectively deal with RGB-D point clouds. The key idea is to fully utilize the texture and geometry information computed from RGB-D images to accurately build correspondences between point clouds. Specifically, the proposed framework consists of three consecutive stages, i.e., point normal estimation, correspondence point extraction, and point clouds alignment. In the first stage, we introduce a variational normal estimation method coupling the total variation model with a second-order operator, which can effectively remove the noise of input point clouds while maintaining sharp geometric features and smooth transition regions. In the second stage, we design a correspondence extraction method with the help of the RGB information, which can robustly extract corresponding points from noisy point clouds. Finally, we utilize a fast optimization-based method to compute rigid transformation matrices for aligning pairs of point clouds. Experiments on multiple open-source RGB-D datasets demonstrate the superiority of our method, especially its robustness against low-quality point cloud data.
Specifically, this paper has the following main contributions:
• We present a point normal estimation method by coupling total variation with second-order variation. The method is capable of effectively removing noise while simultaneously keeping sharp geometric features and smooth transition regions.
• We present a robust correspondence point extraction method, based on a descriptor (TexGeo) encoding both texture and geometry information. With the help of the TexGeo descriptor, the proposed method is robust when handling low-quality point clouds.
• We design a point-to-plane registration method based on a nonconvex regularizer. The method can automatically ignore the influence of false correspondences and produce an accurate rigid transformation between a pair of noisy point clouds.
• We verify the robustness of our approach on a variety of low-quality RGB-D point clouds. Intensive experiments demonstrate that our approach outperforms the selected state-of-the-art methods visually and numerically.

Related Work
As a fundamental problem in many geometric modeling applications, the rigid registration of 3D point clouds has drawn great attention in the past decades, and a large body of related work exists. For a more comprehensive review of rigid 3D point cloud registration, readers are referred to [31]. Since a discussion of the full literature is beyond the scope of this study, we focus on rigid point cloud registration and mainly review techniques closely related to our work.
Point normal estimation. As an important signal indicating the direction field of the scanned surface, the point normal field has been widely applied in constructing 3D point descriptors, such as the FPFH [7]. Note that the 3D point descriptor is fundamental to correspondence extraction. However, it is challenging to robustly estimate point normals, since captured point clouds are inevitably corrupted by noise and outliers. To address this issue, researchers have proposed many valuable methods; here, we only review remarkable ones related to our work. Avron et al. [32] applied ℓ1 regularization to recover the point normal field. In order to preserve sharp edges and corners, Sun et al. [33] derived a sparsity-based method that uses ℓ0 minimization to effectively process point clouds whose underlying surfaces are piecewise constant. These two methods can keep sharp geometric features while removing noise effectively. However, both of them inevitably suffer from serious staircase artifacts in smooth transition regions [34][35][36][37][38][39][40]. To alleviate these artifacts, Liu et al. [41] recently introduced a point cloud denoising framework, which presents an anisotropic second-order regularizer to remove noise while preserving sharp geometric features as well as smooth transition regions.
Correspondence extraction. Correspondence extraction consists of matching points to determine a coarse alignment. Existing methods construct descriptors based on either point features or structure information. Gelfand et al. [8] identified features and extracted correspondences using a novel integral volume descriptor. Similar to [7], Zhou et al. [23] utilized the FPFH descriptor to match points efficiently. In order to register RGB-D point clouds, the authors of [9,14] applied texture information to extract correspondences. Though the above methods can extract correspondences effectively, they are easily disturbed by large noise. Different from the above methods, the methods based on structure information need to construct some meaningful structures. For example, Aiger et al. [15] matched points by comparing approximately congruent coplanar four-point sets selected from a pair of point clouds. Their approach, called 4PCS, can robustly register point clouds without any assumption about their initial poses. However, it is time-consuming when handling large-scale point clouds, because it performs a RANSAC-style random iteration process [42,43]. To improve the efficiency of 4PCS, Mellado et al. [44] derived Super4PCS. Moreover, by using structural information including planes and lines as well as their interrelationships, Chen et al. [16] matched points for pairs of point clouds with small overlap ratios. Zhang et al. [17] introduced a registration framework that computes correspondences using middle-level structural features.
Point clouds alignment. Point clouds alignment estimates the rigid transformation for registering a pair of point clouds given the extracted correspondences. To this end, the ICP algorithm iteratively minimizes the sum of the ℓ2 distances (i.e., point-to-point or point-to-plane distances) between corresponding points [4,18]. Though the ICP algorithm is simple, it is not only sensitive to noise and outliers, but also computationally expensive. To overcome these limitations, Chetverikov et al. [21] introduced a trimmed ICP algorithm, which can robustly register incomplete point clouds with noise. By utilizing a branch-and-bound scheme, Yang et al. [25] presented Go-ICP, a global algorithm for point cloud registration. Bouaziz et al. [22] presented the sparse ICP algorithm, which formulates the registration problem as an ℓp minimization problem. Though their method can limit the effect of noise and outliers on the aligned results by adjusting the value of the parameter p, solving the nonconvex optimization problem is time-consuming. To overcome this issue, Mavridis et al. [45] improved sparse ICP to solve the nonconvex optimization problem more efficiently. Wu et al. [24] eliminated the interference of outliers and noise by using the maximum correntropy criterion. Zhou et al. [23] introduced a robust global approach utilizing a scaled Geman-McClure function, which can automatically reduce wrong correspondences. To improve the convergence of the ICP algorithm, Rusinkiewicz [46] presented a symmetric objective function. Furthermore, to speed up the ICP algorithm, Pavlov et al. [19] proposed AA-ICP, a modification of the ICP algorithm based on Anderson acceleration, which substantially reduces the number of iterations at a negligible cost. However, when the ground-truth rotation is close to a gimbal lock [47], the AA-ICP method cannot produce the desired result.
To alleviate this issue, Zhang et al. [20] have recently proposed a fast and robust variant of the ICP algorithm using Welsch's function. Moreover, an Anderson-accelerated majorization-minimization algorithm has been proposed to solve their problem.
In summary, although existing point cloud registration algorithms perform well when processing point clouds corrupted by small-scale noise, their performance is significantly degraded when point clouds are corrupted by comparatively large noise. This situation becomes even worse when processing RGB-D point clouds. Thus, this paper proposes a robust registration framework for 3D indoor scene point clouds based on RGB-D information to address this issue. Figure 1 shows the pipeline of our method. In the first step, given a pair of point clouds A, B, we estimate point normals to provide a smoothed normal field, which represents the orientation of the underlying surface (see Section 3.1); note that we do not change point positions at this step. In the second step, correspondence point extraction (Section 3.2), we first compute a novel TexGeo descriptor for each point, based on the filtered point normals and the texture information of RGB images, and then produce the point matching results (i.e., correspondence points) for the pair of point clouds. In the third step, point clouds alignment (Section 3.3), based on the correspondence points obtained in step two, we compute the final rigid transformation, i.e., R, t.

Figure 1. The pipeline of the proposed framework. Given input point clouds A and B, we perform point normal estimation (rough normal computation and point normal filtering), TexGeo descriptor computation and point matching, and point clouds alignment, and output the rigid transformation R, t.

Derived from RGB-D cameras, the RGB-D information consists of pairs of RGB and depth images, which capture the textural and geometric information of the scanned scene. To our knowledge, RGB-D information has been widely used in many fields, such as 3D reconstruction [9,48], simultaneous localization and mapping (abbreviated as SLAM) [49,50], and so on. Specifically, we assume that all the RGB and depth images are well registered. Moreover, since we produce point clouds from depth images with a pin-hole model, each depth image and its corresponding point cloud have the same size. Thus, we assume the size of each depth image to be M × N, and denote each point cloud as P = {p_{i,j} : 1 ≤ i ≤ M, 1 ≤ j ≤ N}.

Point Normal Estimation
Point normals are frequently used to construct 3D point descriptors, e.g., the FPFH feature [7] and the Harris feature [51], depicting the geometric characteristics of local surface regions. However, even with high-fidelity devices, real-scanned point clouds are inevitably corrupted by noise; moreover, there are even some bumps on RGB-D point clouds [52]. Thus, it is necessary to remove the noise of point normals, so that the constructed 3D point descriptors can accurately reflect the geometric characteristics of scanned surfaces. To this end, we first compute rough point normals, and then filter them using a novel point normal filter. Details are elaborated as follows.
Rough point normal computation. We can easily compute rough point normals from the corresponding depth image. Formally, for each point p_{i,j}, we compute its rough normal N_{i,j} as

N_{i,j} = Φ((p_{i+1,j} − p_{i,j}) × (p_{i,j+1} − p_{i,j})),    (1)

where Φ(x) = x/‖x‖ is the normalization function. Due to the effects of both intrinsic factors (i.e., device errors) and extrinsic factors (e.g., ambient light), the points generated from the depth image tend to be noisy. Therefore, the rough normals given by (1) are inevitably corrupted by noise.
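The cross-product construction can be illustrated on an organized M × N point grid. The following is a minimal NumPy sketch under our assumptions (forward differences along the two grid directions; boundary handling and variable names are ours, not the authors' implementation):

```python
import numpy as np

def rough_normals(P):
    """Rough per-point normals for an organized point grid P of shape (M, N, 3).

    For each point p[i, j] with valid forward neighbors, the normal is the
    normalized cross product of the forward differences along the two grid
    directions; boundary points are simply left as zero vectors here.
    """
    normals = np.zeros_like(P, dtype=float)
    du = P[1:, :-1] - P[:-1, :-1]   # p[i+1, j] - p[i, j]
    dv = P[:-1, 1:] - P[:-1, :-1]   # p[i, j+1] - p[i, j]
    n = np.cross(du, dv)
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    safe = np.where(length > 1e-12, length, 1.0)  # avoid division by zero
    normals[:-1, :-1] = np.where(length > 1e-12, n / safe, 0.0)
    return normals

# A flat grid in the z = 0 plane has normals along +z.
u, v = np.meshgrid(np.arange(4.0), np.arange(5.0), indexing="ij")
grid = np.stack([u, v, np.zeros_like(u)], axis=-1)
n = rough_normals(grid)
```

In practice, invalid depth pixels would also have to be masked out before taking differences; the sketch omits this for brevity.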
Point normal filtering. To filter out the noise of the rough point normals, we present a sparsity-inspired global optimization method, whose goal is to find the smoothed point normals that best fit the input rough normals and the given sparsity constraints. We first design a point normal filtering model by using the total variation model and a second-order operator. Then, we present an iterative algorithm to minimize the point normal filtering model.
First of all, we briefly give the definition of a novel second-order operator defined on point clouds for measuring the second-order variations of local regions. Figure 2 shows the auxiliary geometric elements for constructing the second-order operator. For a point p_{i,j}, we define its anticlockwise ordered 1-ring neighborhood as Ω(i,j) = {p_{i−1,j}, p_{i+1,j}, p_{i,j−1}, p_{i,j+1}}. Let e(i,j) = {e = (p_{i,j}, p) : p ∈ Ω(i,j)} be the set of edges connecting to p_{i,j}, and denote the set of all edges of a point cloud as E. Let l be a line connecting p_{i,j} with a midpoint (e.g., m_1) of two consecutive neighbors p_j, p_k ∈ Ω(i,j) (see Figure 2b), and let L be the set of all such auxiliary lines of the input point cloud. Based on the above auxiliary elements, the operator can be defined as:

(Du)_l = w_j (u(p_j) − u(p_{i,j})) + w_k (u(p_k) − u(p_{i,j})),    (2)

where u is the signal field defined on the point cloud, and w_j, w_k are two positive weighting parameters. If both w_j and w_k are set to 1, the operator (2) degenerates into an isotropic second-order operator. Recently, using the operator (2), Liu et al. [41] proposed a feature-preserving point denoising framework, which can remove small-scale noise and simultaneously keep sharp geometric features as well as nonlinear smooth transition regions. However, it is hard for this framework to keep geometric features when the noise level is high. Motivated by this situation, we present a point normal filtering model combining the total variation with the second-order variation, formulated as:

min_{N ∈ C}  α Σ_{e ∈ E} len(e) ‖(∇N)_e‖ + β Σ_{l ∈ L} len(l) ‖(DN)_l‖ + Σ_{i,j} disk(p_{i,j}) ‖N_{i,j} − N^in_{i,j}‖²,    (3)

where (∇N)_e = N_p − N_{p_{i,j}} for each edge e = (p_{i,j}, p); len(e), len(l) are the lengths of edge e and line l, respectively; disk(p_{i,j}) represents the area of a circle of radius r centered at p_{i,j}, where r is the average length of the edges contained in e(i,j); N^in denotes the input rough normals; C = {N_i : ‖N_i‖ = 1}; and α, β are parameters for balancing the terms in (3). To obtain filtered point normals, we need to solve the minimization problem (3).
However, since problem (3) is non-differentiable and has a nonlinear constraint, we cannot solve it directly. Inspired by studies [34,35,41], we propose an iterative algorithm based on the augmented Lagrangian method. First, we define two auxiliary variables X = {X_e}, Y = {Y_l}, and rewrite the minimization problem (3) as:

min_{N ∈ C, X, Y}  α Σ_{e ∈ E} len(e) ‖X_e‖ + β Σ_{l ∈ L} len(l) ‖Y_l‖ + Σ_{i,j} disk(p_{i,j}) ‖N_{i,j} − N^in_{i,j}‖²,  s.t.  X_e = (∇N)_e,  Y_l = (DN)_l.    (4)

To minimize the above problem, we first define the augmented Lagrangian function

L(N, X, Y; µ, η) = α Σ_e len(e) ‖X_e‖ + β Σ_l len(l) ‖Y_l‖ + Σ_{i,j} disk(p_{i,j}) ‖N_{i,j} − N^in_{i,j}‖² + Σ_e [µ_e · (X_e − (∇N)_e) + (t_e/2) ‖X_e − (∇N)_e‖²] + Σ_l [η_l · (Y_l − (DN)_l) + (t_l/2) ‖Y_l − (DN)_l‖²],    (5)

where µ = {µ_e}, η = {η_l} are two Lagrangian multipliers and t_e, t_l are two positive penalty coefficients. Note that minimizing problem (3) is equivalent to minimizing function (5). The solving procedure of (5) mainly consists of three subproblems, i.e., the N-subproblem, the X-subproblem, and the Y-subproblem.
(1) N-subproblem: This problem is quadratic if we ignore the non-quadratic constraint term ψ(N). Thus, we first obtain the solution of the quadratic problem, and then project the solution onto the unit sphere.
(2) X-subproblem: Obviously, we can solve each X_e independently. Specifically, for each X_e, we only solve the problem:

min_{X_e}  α len(e) ‖X_e‖ + µ_e · (X_e − (∇N)_e) + (t_e/2) ‖X_e − (∇N)_e‖².    (7)

Given the definition shrink(v, w) = max(0, 1 − w/‖v‖) v, the solution of this problem can be formulated as:

X_e = shrink((∇N)_e − µ_e/t_e, α len(e)/t_e).    (8)

(3) Y-subproblem: Similarly, the Y-subproblem can also be spatially decomposed. In other words, for each Y_l, we solve the problem:

min_{Y_l}  β len(l) ‖Y_l‖ + η_l · (Y_l − (DN)_l) + (t_l/2) ‖Y_l − (DN)_l‖²,    (9)

whose solution is:

Y_l = shrink((DN)_l − η_l/t_l, β len(l)/t_l).    (10)

Last, we show the overall procedure for solving problem (3) in Algorithm 1. As shown, the proposed approach iteratively minimizes the above three subproblems and updates the two Lagrangian multipliers.
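The shrinkage step shared by the X- and Y-subproblems can be sketched as follows, assuming the standard soft-shrinkage form shrink(v, w) = max(0, 1 − w/‖v‖) v (a minimal illustration, not the authors' code):

```python
import numpy as np

def shrink(v, w):
    """Soft-shrinkage: shrink(v, w) = max(0, 1 - w / ||v||) * v.

    Vectors whose norm does not exceed the threshold w are set exactly to
    zero, which is what lets the subproblems suppress small (noise-induced)
    normal variations while shortening, but keeping, large ones.
    """
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm <= w:
        return np.zeros_like(v)
    return (1.0 - w / norm) * v
```

The exact zeroing of sub-threshold inputs is the source of the sparsity that preserves sharp features while flattening noise.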
Effectiveness of our point normal estimation. To test the effectiveness of our point normal filter, we ran it on indoor point clouds, and compared it to state-of-the-art methods including RIMLS [53], MRPCA [54], and L0 [33]. Figure 3 shows the comparison results. Note that, for better visualization, we adopt the strategy presented in [41] for updating point positions after filtering point normals. Apparently, all these tested methods can effectively filter out noise. However, RIMLS blurs sharp geometric features (e.g., edges and corners), though it recovers smooth transition regions well (see Figure 3c). On the contrary, MRPCA and L0 can preserve sharp geometric features. Nevertheless, L0 inevitably causes staircase artifacts in smooth transition regions (see Figure 3d), and MRPCA tends to oversharpen smooth features (see Figure 3c). Unlike the above methods, our normal filtering method can simultaneously keep sharp geometric features and smooth transition regions (see Figure 3e). The comparisons in Figure 3 demonstrate that our approach outperforms the other ones in handling indoor point clouds, especially those containing sharp features and smooth regions.

Algorithm 1. Point normal filtering:
1. Minimize the N-subproblem: given (µ, η, X, Y), compute N according to (6); normalize N.
2. Minimize the X-subproblem: given (µ, η, N, Y), compute X according to (8).
3. Minimize the Y-subproblem: given (µ, η, N, X), compute Y via its closed-form shrinkage solution.
4. Update the Lagrangian multipliers µ, η; repeat steps 1-4 until convergence.

Correspondence Extraction
As a key ingredient of point cloud registration, correspondence point extraction aims at accurately estimating the matching points between a pair of point clouds. In recent years, researchers have developed many remarkable methods to extract correspondence points [7,9,14,16,17]. Nevertheless, since existing methods usually only utilize the geometry information of the input clouds, they suffer from ambiguity problems when handling point clouds containing noise or weak texture regions. To solve this issue, we design a robust correspondence extraction method using both texture and geometry information. Specifically, our method has two stages, i.e., 3D point descriptor computation followed by point matching. Details of each stage are given as follows.
First, we design a new 3D point descriptor, named TexGeo, based on both texture and geometry information. Formally, given a point p_{i,j}, we write the descriptor as:

f(p_{i,j}) = (c f_2d(p_{i,j}), (1 − c) f_3d(p_{i,j})),    (11)

where f_2d(p_{i,j}) is the textural feature term, f_3d(p_{i,j}) is the geometric feature term, and 0 ≤ c ≤ 1 is a weighting parameter. Here, for efficiency, we adopt the SURF descriptor [55] to extract the textural features and the FPFH descriptor [7] to extract the geometric features. Note that the textural and geometric features are both normalized. In particular, when c = 0, the TexGeo descriptor degenerates into the FPFH descriptor; when c = 1, it degenerates into the SURF descriptor.
With the TexGeo descriptor, we can robustly extract correspondences between point clouds with repetitive structures, textureless regions, or large holes. Specifically, when the corresponding RGB images are textureless, we can set c = 0 to only use geometric features to extract correspondences; when point clouds have repetitive structures, we can set c = 1 to only use texture features to extract correspondences. In other cases, we empirically set c = 0.5, which means that both texture and geometric features contribute equally to the following point matching process. Moreover, since FPFH descriptors are computed using the filtered point normals, the proposed TexGeo descriptor can accurately describe local regions when point clouds are corrupted by noise.
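Reading the descriptor as a weighted concatenation of the two normalized feature vectors, the combination can be sketched as follows (a toy illustration under our assumptions; the feature dimensions below are arbitrary placeholders, not real SURF/FPFH sizes):

```python
import numpy as np

def texgeo(f_2d, f_3d, c=0.5):
    """Weighted concatenation of a normalized texture feature f_2d
    (e.g., SURF) and a normalized geometric feature f_3d (e.g., FPFH).

    c = 1 keeps only the texture part and c = 0 only the geometry part,
    so Euclidean distances between descriptors interpolate between
    texture-only and geometry-only matching.
    """
    f_2d = np.asarray(f_2d, dtype=float)
    f_3d = np.asarray(f_3d, dtype=float)
    return np.concatenate([c * f_2d, (1.0 - c) * f_3d])
```

With this layout, the parameter c directly controls how much each modality contributes to the nearest-neighbor search in feature space.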
To compute the correspondences of a pair of point clouds, we first need to compute a TexGeo descriptor for each point, and then match all the points in feature space. Let P be a point cloud, and Γ(P) = { f(p_{i,j}) : ∀p_{i,j} ∈ P} be all the computed TexGeo descriptor vectors of P. In practice, for the input point clouds P_a, P_b, we can easily compute their descriptor vectors Γ(P_a) and Γ(P_b), respectively. Then, we build two k-D trees, T_a, T_b, from the descriptor vectors Γ(P_a) and Γ(P_b), respectively. Finally, we match all points by the following cross-validation strategy: (i) for each point p ∈ P_a, we compute its candidate matching point p_1 by searching tree T_b for the point whose descriptor vector is closest to f(p); note that we measure the differences between descriptor vectors by the Euclidean distance; (ii) for the point p_1, we compute its candidate point p_2 by searching tree T_a in a similar way; (iii) if the point p_2 and the point p are the same, p and p_1 are matched. After computing all the matched points, we obtain the correspondence points of the input point clouds.
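The cross-validation strategy amounts to mutual-nearest-neighbor matching in descriptor space. A minimal sketch using SciPy k-D trees (array and function names are ours):

```python
import numpy as np
from scipy.spatial import cKDTree

def cross_validated_matches(desc_a, desc_b):
    """Mutual-nearest-neighbor matching in descriptor space.

    desc_a, desc_b: (n, d) arrays of descriptor vectors.
    Returns index pairs (i, j) such that desc_b[j] is the Euclidean
    nearest neighbor of desc_a[i] AND desc_a[i] is the nearest neighbor
    of desc_b[j], mirroring the cross-validation strategy above.
    """
    tree_a, tree_b = cKDTree(desc_a), cKDTree(desc_b)
    _, nn_ab = tree_b.query(desc_a)   # candidate match for each a-point
    _, nn_ba = tree_a.query(desc_b)   # candidate match for each b-point
    return [(i, j) for i, j in enumerate(nn_ab) if nn_ba[j] == i]
```

Dropping one-directional matches in this way discards many ambiguous pairs before the alignment stage, at the cost of fewer (but cleaner) correspondences.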

Point Clouds Alignment
Let P and Q be two point clouds. Here, we aim to compute the rigid transformation that aligns these two point clouds. To this end, ICP-like methods iteratively perform the following process: (1) for each point in Q, compute one correspondence by finding the closest point in P; (2) compute an intermediate transformation T by minimizing the energy:

E(T) = Σ_i w_i ‖p_i − T q_i‖²,    (12)

where (p_i, q_i) are the two points of the i-th correspondence, which is weighted by w_i ≥ 0; if the energy converges, the algorithm ends; (3) transform Q using the above T, and return to step (1). Given a good enough initial state, ICP can compute an accurate rigid transformation. However, since the ICP algorithm needs to update correspondences iteratively, its performance is greatly reduced when handling large-scale point clouds. Moreover, if the input point clouds are noisy, the ICP algorithm easily converges to a local minimum.
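Step (2) of this loop has a well-known closed-form solution via the weighted Kabsch/Umeyama algorithm. A minimal sketch of that solve (our own variable names, not the authors' implementation):

```python
import numpy as np

def best_rigid_transform(P, Q, w=None):
    """Closed-form minimizer of sum_i w_i ||p_i - (R q_i + t)||^2.

    P, Q: (n, 3) arrays of corresponding points; w: optional weights.
    Returns (R, t) with R a 3x3 rotation and t a 3-vector, computed with
    the weighted Kabsch/Umeyama SVD solution.
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    w = np.ones(len(P)) if w is None else np.asarray(w, float)
    w = w / w.sum()
    p_bar, q_bar = w @ P, w @ Q                       # weighted centroids
    H = (Q - q_bar).T @ ((P - p_bar) * w[:, None])    # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))            # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = p_bar - R @ q_bar
    return R, t
```

The reflection guard (the sign of the determinant) is what keeps R a proper rotation rather than a mirror transform.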
To overcome these limitations, Zhou et al. [23] used a robust penalty function to rewrite energy (12) as:

E(T) = Σ_i g(‖p_i − T q_i‖),    (13)

where g(x) = µx²/(µ + x²) is the scaled Geman-McClure function, and µ is a positive parameter. Because the Geman-McClure function can automatically validate and prune correspondences, it does not need to recompute correspondences during the optimization. Compared to traditional ICP-like methods, benefiting from the robust penalty function, the method in [23] is robust against noise. However, since the point clouds captured by RGB-D sensors are usually corrupted by large noise, this method may still converge to a local minimum. To resolve this problem, instead of using the point-to-point distance, we use the point-to-plane distance to reformulate energy (13) as:

E(T) = Σ_i g(N_{p_i} · (p_i − T q_i)),    (14)

where N_{p_i} is the filtered normal of point p_i. Due to its nonlinearity, it is difficult to minimize energy (14) directly. Inspired by [23], we present an iterative algorithm by using the method introduced in [56]. First, we introduce the set L = {l_{p,q}} as a line process over all the computed correspondence points. Problem (14) can be reformulated as:

E(T, L) = Σ_i l_{p_i,q_i} (N_{p_i} · (p_i − T q_i))² + Σ_i ψ(l_{p_i,q_i}),    (15)

where ψ(l_{p,q}) = µ(√(l_{p,q}) − 1)². The procedure of minimizing energy (15) can be separated into two subproblems, i.e., the T-subproblem and the L-subproblem. The optimization fixes L when optimizing T, and vice versa. We minimize each subproblem as follows:

(i) T-subproblem:

min_T Σ_i l_{p_i,q_i} (N_{p_i} · (p_i − T q_i))².    (16)

To solve this problem, we first linearize T as a vector ζ = (α, β, γ, x, y, z). As a result, T can be approximated as a function of ζ:

T ≈ [[1, −γ, β, x], [γ, 1, −α, y], [−β, α, 1, z], [0, 0, 0, 1]] · T_i,    (17)

where T_i is the rigid transformation obtained in the previous step, and T_0 is initialized as I. Then, the energy function in Equation (16) can be regarded as a least-squares function of ζ, which can be solved as:

J_r^T J_r ζ = −J_r^T r,    (18)

where r is the residual vector and J_r is the Jacobian matrix of r. After that, the current rigid transformation T can be calculated using Equation (17).
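Under this small-angle linearization, one T-subproblem update reduces to a 6 × 6 linear solve. A minimal sketch with the line-process weights held fixed (the residual/Jacobian layout and sign conventions are our assumptions):

```python
import numpy as np

def point_to_plane_step(P, Q, N, l=None):
    """One linearized point-to-plane step: solve for
    zeta = (alpha, beta, gamma, x, y, z) minimizing
    sum_i l_i * (n_i . (p_i - (R(zeta) q_i + t)))^2
    under the small-angle approximation R ~ I + [omega]_x.

    P, Q: (n, 3) corresponding points; N: (n, 3) unit normals of P;
    l: optional line-process weights. Returns a 4x4 transform.
    """
    P, Q, N = (np.asarray(a, float) for a in (P, Q, N))
    l = np.ones(len(P)) if l is None else np.asarray(l, float)
    # Residual r_i = n_i . (p_i - q_i); Jacobian rows are [q_i x n_i, n_i].
    r = np.einsum("ij,ij->i", N, P - Q)
    J = np.hstack([np.cross(Q, N), N])
    zeta = np.linalg.solve(J.T @ (J * l[:, None]), J.T @ (r * l))
    a, b, g, x, y, z = zeta
    return np.array([[1.0,  -g,   b, x],
                     [  g, 1.0,  -a, y],
                     [ -b,   a, 1.0, z],
                     [0.0, 0.0, 0.0, 1.0]])
```

In a full solver this matrix would be re-orthonormalized (or composed with the previous estimate T_i) and the step repeated as µ is decreased.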
(ii) L-subproblem: This problem is spatially decomposable. Specifically, for each pair of corresponding points (p_i, q_i), we solve:

min_{l_{p_i,q_i}}  l_{p_i,q_i} (N_{p_i} · (p_i − T q_i))² + µ(√(l_{p_i,q_i}) − 1)²,    (19)

which has the closed-form solution:

l_{p_i,q_i} = (µ / (µ + (N_{p_i} · (p_i − T q_i))²))².    (20)

Similar to [23], to produce good results, we initialize µ = D², where D is the diameter of the largest surface. During the optimization, we also decrease µ in each iteration until µ = δ², where δ is the threshold for pruning correspondences. We outline the whole algorithm in Algorithm 2.
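With the point-to-plane residuals precomputed, the line-process update can be sketched in one line (a minimal sketch, assuming the standard Geman-McClure update l = (µ/(µ + r²))²):

```python
import numpy as np

def line_process_update(residuals, mu):
    """Closed-form L-subproblem update: l = (mu / (mu + r^2))^2.

    Correspondences with small residuals r keep l close to 1, while
    gross outliers are smoothly driven toward l = 0, i.e., pruned,
    without any hard thresholding.
    """
    r2 = np.square(np.asarray(residuals, dtype=float))
    return (mu / (mu + r2)) ** 2
```

Decreasing µ over the iterations tightens this soft gate, so correspondences that looked plausible early on are progressively discarded as the alignment improves.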

Experimental Results
We tested the effectiveness of our approach on real-scanned indoor scene point clouds, which were generated from RGB-D images. Note that the RGB-D images used in this paper were derived from open-source datasets [49,57], and all of them have a uniform size of 640 × 480. Example RGB images of the tested point clouds are listed in Figure 4. Some of the tested point clouds are corrupted by zero-mean Gaussian noise whose standard deviation is σ times the diagonal length of the minimum bounding box of the point cloud. We also give visual and numerical comparisons of our point cloud registration method with existing approaches, including PCL [7], S4PCS [44], GICP [58], GICPT [58], and FGR [11]. We implemented our method in C++; for the other compared methods, we used the source codes kindly provided by their authors.
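Our reading of this synthetic noise model can be sketched as follows (a minimal sketch; the function and parameter names are ours):

```python
import numpy as np

def add_gaussian_noise(P, sigma_ratio, rng=None):
    """Corrupt a point cloud P (shape (n, 3)) with zero-mean Gaussian
    noise whose standard deviation is sigma_ratio times the diagonal
    length of the cloud's axis-aligned minimum bounding box."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.asarray(P, dtype=float)
    diag = np.linalg.norm(P.max(axis=0) - P.min(axis=0))
    return P + rng.normal(scale=sigma_ratio * diag, size=P.shape)
```

Scaling the noise by the bounding-box diagonal makes the corruption level σ comparable across scenes of different physical extents.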

Qualitative Comparison
We qualitatively compared our point cloud registration method to the other methods on indoor scene point clouds. During this process, we generated visually appealing results by carefully tuning their parameters. Note that we performed the comparison on indoor scene point clouds corrupted by synthetic noise followed by point clouds contaminated by real noise.   Figure 5 shows a comparison of a pair of indoor scene point clouds Lr1, which have a small rotation and translation between them. Note that these two point clouds are corrupted by a relatively small noise. As we can see, the results of GICP and GICPT have a comparatively large translation between the two point clouds (Figure 5c,d). The reason is that these two methods cannot accurately extract correspondences in the presence of noise. Moreover, although methods PCL, S4PCS, and FGR compute point features to construct the correspondences, their results still have small translations between the two point clouds (Figure 5b,e,f). The reason is that these methods produce some wrong correspondences between two input point clouds. On the contrary, our method generates the visually best registration result; see Figure 5g.  Figure 6 demonstrates the registration results of indoor scene point clouds Lr2 with a large translation and rotation. As can be seen, all the tested methods can produce the right rotation information. However, due to the noise interference, the results of GICP and GICPT have a relatively large translation between the two point clouds; see Figure 6c,d. Apart from that, from Figure 6b,e,f,g we can see that PCL, S4PCS, FGR, and our approach generate visually good registration results. However, the numerical comparison in Table 1 shows that the registration error of our method is the smallest one. Thus, our method outperforms the others.  Figure 7 shows the comparisons on a pair of indoor scene point clouds Lr3, which has large textureless regions. 
As can be seen, PCL, GICP, GICPT, and S4PCS do not generate satisfactory results (Figure 7b-e). Besides, FGR produces a better registration result due to its robustness against noise. However, the result of FGR still has a small translation between the two point clouds (Figure 7f). Compared to them, our approach can align two point clouds more exactly (Figure 7g).  Figure 8 shows comparison results on indoor scene point clouds Of1, which has a large translation and a large rotation between them. As can be seen, GICP and GICPT fail to obtain the right rotation information (Figure 8c,d). PCL can get the right rotation information, but its result has a relatively large translation between the two point clouds (Figure 8b). Similarly, although both S4PCS and FGR can produce the right rotation information, their results still have a small translation between the two point clouds (Figure 8e,f). In contrast, from Figure 8g, we can see that our method not only generates the right rotation information, but also effectively alleviates the translation between the input point clouds.  Figure 9 shows the registration results of indoor scene point clouds Of2 consisting of local regions from which it is difficult to extract geometric features. In other words, this indoor scene lacks geometric features. As we can see, GICPT generates the wrong translation and rotation information (Figure 9c). Except GICPT, the other methods register input point clouds well (Figure 9b,e,f,g). Moreover, from the comparison presented in the next subsection, we can find that our method is the best one with the least registration error.  Figure 10 shows registration results of indoor scene point clouds Teddy, containing a comparatively large real noise. Note that these point clouds also contain holes, because of the scene occlusion problem. 
As we can see, due to the noise interference, the results of GICP and S4PCS have relatively large translations between the transformed point cloud and the target point cloud (Figure 10c,e). Moreover, PCL, GICPT, FGR, and our approach can effectively align the input point clouds (Figure 10b,d,f,g). However, the numerical comparison listed in Table 1 shows that our method has the smallest registration error.

Figure 11 compares the registration results produced by our framework and the recent method SymICP [46]. As can be seen, both methods produce visually good registration results (Figure 11b,c). However, the zoomed views show that our method achieves a higher registration accuracy than SymICP.
In summary, since indoor scene point clouds tend to be noisy, contain large textureless regions, or contain regions from which it is difficult to extract geometric features, it is challenging for the compared state-of-the-art methods to achieve desirable registration results on these low-quality point clouds. A key reason is that these tested methods cannot accurately estimate correspondences between indoor scene point clouds. In contrast, our method, which is based on filtered point normals, is robust against noise. Moreover, by using the TexGeo descriptor that combines both geometry and texture information, our method can effectively process point clouds with holes or large textureless regions. Therefore, our method outperforms the other compared methods when processing low-quality indoor scene point clouds.

Quantitative Comparison
The above qualitative comparisons show that our method produces visually better registration results than the other selected methods. To objectively assess our method, we further quantitatively compared it to the other methods on point clouds corrupted by synthetic noise. To this end, we used an error metric E_reg, which measures the average point-to-plane distance over all pairs of corresponding points. Specifically, assuming the number of correspondences is M and the obtained rigid transformation is T, the metric E_reg is formulated as:

E_{reg} = \frac{1}{M}\sum_{i=1}^{M}\left|\mathbf{n}_{q_i}^{\top}\left(\mathbf{T}\mathbf{p}_i-\mathbf{q}_i\right)\right|,

where p_i and q_i are two corresponding points and n_{q_i} is the unit normal at q_i. We computed E_reg for the examples shown in Figures 5-10 and list the errors in Table 1. Obviously, our approach has the smallest error for all the examples, which demonstrates that our approach quantitatively outperforms the other compared ones.
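The metric is straightforward to evaluate in practice. The following is a minimal numpy sketch, assuming T is given as a 4 × 4 homogeneous matrix and the target normals are unit length; the function name and argument layout are illustrative, not part of the paper's implementation:

```python
import numpy as np

def point_to_plane_error(P, Q, Nq, T):
    """Average point-to-plane distance over M correspondences.

    P, Q : (M, 3) arrays of corresponding source/target points p_i, q_i.
    Nq   : (M, 3) unit normals at the target points q_i.
    T    : (4, 4) homogeneous rigid transformation.
    """
    R, t = T[:3, :3], T[:3, 3]
    TP = P @ R.T + t                               # apply T to every p_i
    residuals = np.einsum('ij,ij->i', TP - Q, Nq)  # n_{q_i} . (T p_i - q_i)
    return np.mean(np.abs(residuals))
```

For example, a source cloud floating 0.5 units above a target plane with normals (0, 0, 1) yields an error of exactly 0.5 under the identity transform.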

Ablation Study
To clearly show the contribution of each stage of the proposed framework, we conducted a series of controlled experiments, each testing the specific design choice made in one stage. These studies were also performed on the above-mentioned data sets.
We first conducted an ablation study on the point normal estimation. Figure 12 compares the registration results produced using noisy point normals and using estimated point normals. As can be seen, the result produced using estimated point normals is much better than that produced using noisy point normals (Figure 12b,c). This suggests that point normal estimation is essential in the proposed framework.
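For context, the "noisy point normals" above are the kind typically obtained by plain PCA over local neighborhoods, which transfers any point noise directly into the normal field. The sketch below shows this standard PCA baseline (brute-force neighbor search, illustrative only); it is not the paper's total-variation-based filtering model:

```python
import numpy as np

def pca_normals(points, k=10):
    """Estimate one normal per point as the eigenvector of the local
    covariance with the smallest eigenvalue (standard PCA baseline)."""
    normals = np.empty_like(points)
    for i, p in enumerate(points):
        d = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(d)[:k]]   # k nearest neighbors (incl. p)
        cov = np.cov(nbrs.T)               # 3x3 local covariance
        w, v = np.linalg.eigh(cov)         # eigenvalues in ascending order
        normals[i] = v[:, 0]               # smallest-variance direction
    return normals
```

On a clean planar patch this recovers the plane normal exactly; under noise, each normal inherits the perturbation of its k neighbors, which is what the filtering stage is designed to suppress.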
Then, we conducted an ablation study on the correspondence extraction. In this study, we tested our method with two 3D point descriptors: the FPFH descriptor and the proposed TexGeo descriptor. Figure 13 compares the registration results produced using these descriptors. As can be seen, our framework cannot produce the desired result when the proposed TexGeo descriptor is replaced with the FPFH descriptor (Figure 13b,c). This means the proposed correspondence extraction is important for achieving a robust registration of indoor scene point clouds.

Finally, we conducted an ablation study on the transformation computation scheme. In this study, we tested our framework with the ICP scheme and with our computation method. Figure 14 shows the registration results. As can be seen, given the extracted correspondences, the ICP algorithm cannot produce an exact rigid transformation matrix for aligning the input point clouds (Figure 14b). On the contrary, our computation method generates an accurate rigid transformation matrix (Figure 14c), because it is robust against the remaining false correspondences. Thus, the proposed transformation computation scheme is important to the framework.
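Whichever descriptor is used, the extraction step ultimately matches descriptors between the two clouds. A common generic scheme, shown here as a hedged sketch rather than the exact TexGeo pipeline, is mutual nearest-neighbor matching in descriptor space; pairs that are not each other's nearest neighbor are discarded as likely false correspondences:

```python
import numpy as np

def match_descriptors(desc_src, desc_tgt):
    """Mutual nearest-neighbor matching in descriptor space (brute force).
    Returns index pairs (i, j) meaning desc_src[i] <-> desc_tgt[j]."""
    # pairwise squared Euclidean distances, shape (n_src, n_tgt)
    d = ((desc_src[:, None, :] - desc_tgt[None, :, :]) ** 2).sum(-1)
    fwd = d.argmin(axis=1)   # best target index for each source descriptor
    bwd = d.argmin(axis=0)   # best source index for each target descriptor
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]
```

The reciprocal check is what prunes one-sided matches; any remaining false pairs are handled later by the robust transformation computation.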

Discussion
In this section, we discuss the main features of the proposed method based on its performance in the above experiments.
(1) 3D indoor scene point clouds scanned by RGB-D cameras are of low quality. For this reason, existing point cloud registration approaches are usually designed around the RANSAC scheme or some robust estimators. From the experiments, we found that an accurate point normal field can further improve the registration accuracy, because accurate point normals are not only useful for constructing point descriptors but also helpful for the rigid transformation computation.
(2) The distinctiveness of 3D point features computed by common geometry descriptors is low in indoor scenes, and the 2D features extracted from the corresponding RGB images can improve it. From the experiments, we found that, with the help of the proposed descriptor, our method still performs well when point clouds contain large-scale noise or weak texture features. This is because the geometry and texture information are complementary to each other.
(3) The rigid transformation computation using a nonconvex regularizer is important for accurate point cloud registration because it can further prune some false correspondences remaining from the first two stages. Moreover, the point-to-plane distance metric can enhance the registration accuracy.
(4) The proposed registration framework is flexible. First, the method used in each stage can be replaced by more advanced approaches to pursue a higher accuracy of indoor scene point cloud registration. Second, the framework can be adapted to process point clouds from other sources, e.g., laser-scanned point clouds.
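The interplay of points (1) and (3) can be sketched as an iteratively reweighted point-to-plane solver, where a redescending weight suppresses residuals from false correspondences. The sketch below uses a Welsch weight standing in for the nonconvex regularizer; the solver structure, the weight, and the parameter `nu` are illustrative assumptions, not the authors' optimization method:

```python
import numpy as np

def robust_point_to_plane(P, Q, Nq, iters=20, nu=0.1):
    """IRLS point-to-plane alignment: returns R (3x3), t (3,) such that
    n_i . (R p_i + t - q_i) is small for inlier correspondences."""
    R, t = np.eye(3), np.zeros(3)
    for _ in range(iters):
        TP = P @ R.T + t
        r = np.einsum('ij,ij->i', TP - Q, Nq)    # point-to-plane residuals
        w = np.exp(-(r / nu) ** 2)               # Welsch weights: large
                                                 # residuals -> near-zero weight
        # Gauss-Newton step for x = [omega, dt] with the usual
        # small-angle linearization dR ~ I + [omega]_x
        J = np.hstack([np.cross(TP, Nq), Nq])    # (M, 6) Jacobian
        A = (J * w[:, None]).T @ J
        b = -(J * w[:, None]).T @ r
        omega, dt = np.split(np.linalg.solve(A, b), [3])
        # exact rotation update via Rodrigues' formula
        theta = np.linalg.norm(omega)
        if theta > 1e-12:
            k = omega / theta
            K = np.array([[0, -k[2], k[1]],
                          [k[2], 0, -k[0]],
                          [-k[1], k[0], 0]])
            dR = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
        else:
            dR = np.eye(3)
        R, t = dR @ R, dR @ t + dt
    return R, t
```

Because the Welsch weight decays exponentially with the residual, correspondences that remain inconsistent after a few iterations contribute almost nothing to the normal equations, which is the pruning effect described in point (3).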

Conclusions
This work presented a robust registration framework for indoor scene point clouds based on RGB-D information, which consists of three stages. Specifically, we first introduced a point normal filtering model combining the total variation with a second-order variation, which can effectively filter out noise while simultaneously maintaining sharp geometric features and nonlinear smooth transition regions. Then, we designed a novel point feature descriptor (TexGeo) encoding both texture and geometry information. Using this descriptor, we can robustly establish correspondences between a pair of point clouds, even in the presence of large noise or local regions from which it is difficult to extract geometric features. Finally, we utilized a global optimization method to efficiently compute the rigid transformation between a pair of point clouds. Intensive registration experiments verified that our method outperforms the state-of-the-art methods visually and numerically. In addition, our method is robust against noise, textureless regions, and regions from which it is difficult to extract geometric features.
There are still some open problems. For example, many parameters of our method need to be tuned manually, which can be tedious. Moreover, our method is time-consuming when dealing with large-scale scene point clouds. Strategies for speeding up our method will be explored in the future.

Data Availability Statement:
The data presented in this study are openly available in [59,60].

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations and symbols are used in this manuscript:
          The auxiliary line connecting point p_{i,j} with some midpoint
len(·)    The length of (·)
disk(·)   The area of (·)
D_1       The first-order operator