RGB-D SLAM Using Point–Plane Constraints for Indoor Environments

Pose estimation and map reconstruction are basic requirements for robotic autonomous behavior. In this paper, we propose a point–plane-based method to simultaneously estimate the robot’s poses and reconstruct the current environment’s map using RGB-D cameras. First, we detect and track the point and plane features from color and depth images, and reliable constraints are obtained, even for low-texture scenes. Then, we construct cost functions from these features, and we utilize the plane’s minimal representation to minimize these functions for pose estimation and local map optimization. Furthermore, we extract the Manhattan World (MW) axes on the basis of the plane normals and vanishing directions of parallel lines for the MW scenes, and we add the MW constraint to the point–plane-based cost functions for more accurate pose estimation. The results of experiments on public RGB-D datasets demonstrate the robustness and accuracy of the proposed algorithm for pose estimation and map reconstruction, and we show its advantages compared with alternative methods.


Introduction
This article is an extension of a recent conference paper [1] that presented the exploitation of plane features to estimate sensors' poses for low-texture indoor environments. Robust pose estimation and environment mapping are of great significance in the execution of robotic tasks, such as motion control and navigation. The robot's pose and the scene's map can be obtained by utilizing robotic sensors, such as wheel encoders, inertial measurement units [2][3][4], lasers [5,6], and cameras [7][8][9]. Among these solutions, the visual-based method is one of the more effective approaches because cameras can conveniently capture informative images to estimate the robot's poses and perceive its surroundings. Although there have been plenty of methods using monocular, stereo, or RGB-D cameras for pose estimation and 3D mapping, daunting challenges remain for structural and low-texture environments for several reasons. For instance, in existing point-based methods [10,11], key steps in pose estimation, such as image aligning and computing the transformation matrix, heavily rely on feature points or high-contrast pixels. However, feature points are generally absent in structural and low-texture environments, and these methods can fail to estimate poses or result in low-accuracy estimations. To solve this problem, high-level features, such as lines and planes, are required.
Most indoor environments have many parallel and orthogonal lines and planes (called the Manhattan World [12]), and these high-level features can be exploited to improve the performance of pose estimation. Since these line and plane features can be easily calculated by using RGB-D cameras, which provide both depth information and a color image, the RGB-D camera has become a popular alternative to monocular and stereo cameras for the purpose of simultaneous localization and mapping (SLAM) tasks in indoor environments. These structural regularities have been exploited in studies [13][14][15][16] to estimate drift-free rotation, and by decoupling the rotation and translation components, pose estimation accuracy and map quality have been markedly improved by using the Manhattan World (MW) assumption with an RGB-D camera in scenes that satisfy the MW assumption.
In this paper, we propose a robust and accurate approach to pose estimation and 3D mapping using an RGB-D camera. We detected and matched point features from the color images by using the oriented fast and rotated brief (ORB) descriptor, and we detected and tracked multiple planes from the depth images on the basis of a motion model. Then, we exploited these features to construct cost functions to solve the pose of each captured frame and point-plane landmarks in local and global maps. Meanwhile, we added an orientation constraint to the loop detection process to reduce the drift error and avoid mismatches using an appearance constraint. Furthermore, we extracted the MW axes in the first captured frame of MW scenes, and we added the MW constraint to the previous point-plane-based cost functions to improve their performance.
Our algorithm exploits point and plane features and adds the MW constraint for pose estimation and scene reconstruction, which can perform well in harsh environments with low texture as well as general indoor environments. The contributions of this work are as follows: • We exploited point and plane features, which provide reliable constraints for the estimation of poses and reconstruction of the scene's map for the majority of indoor environments.

•
We added the MW constraint to point-plane-based cost functions, resulting in the provision of fixed-plane normals as global landmarks for more accurate pose estimation.

•
We evaluated our proposed approach on two public available datasets, and we obtained robust and accurate performance.

Related Work
The existing RGB-D SLAM methods for structural and low-texture environments can be divided into three classes: plane-based methods, dense methods, and MW-based methods.
Plane-based methods use plane features to construct and solve the optimization function for pose estimation. Lee et al. [17] presented a fast plane extraction and matching method for indoor mapping, and Taguchi et al. [18] used both points and planes as primitives to realize the registration of different 3D data. Khoshelham et al. [19] proposed a no-iteration pose estimation method based on point-plane correspondences. Thomas et al. [20] presented a structured 3D representation with a point-to-plane relationship to correct the deformations, and local and global mapping were processed to reduce the accumulation error and obtain an accurate large-scale 3D model. Kaess [21] presented a minimal representation for planar features and introduced a relative plane formulation that improved the convergence properties for faster pose optimization. The plane-based methods mentioned above require plane extraction and matching for each frame to construct the optimization function. Since there are no plane descriptors to perform plane matching, it is achieved by utilizing additional odometry sensors, such as wheel encoders or an inertial measurement unit. However, these additional sensors increase the complexity of the SLAM system and may not be available in some circumstances, so plane matching methods that use only the RGB-D frame need to be developed.
In dense methods, almost all pixels are used to estimate the pose. Newcombe et al. [22] presented a frame-to-global method that maintained the single-scene model with a global volumetric so that each new frame would be integrated into the volumetric. Whelan et al. [23] used a rolling cyclical buffer to operate in large environments and used place recognition for loop closing. Kerl et al. [24] proposed a dense visual SLAM method for RGB-D cameras that minimized both the photometric and depth error of all pixels, and Prisacariu et al. [25] presented a robust dense RGB-D SLAM method (InfiniTAM) that had low computational cost with RGB and depth constraints. These dense methods solve the pose with a dense vision front-end and are robust in low-texture environments. However, the number of points processed for each frame is large (typically hundreds of thousands), which makes the optimization computationally infeasible in real time without GPU implementation.
MW-based methods estimate the pose by decoupling the rotation and translation components. These methods utilize line and plane features to achieve a drift-free rotation matrix with the MW constraint, and the translational accuracy can be improved by using drift-free rotation. Zhou et al. [26] developed a mean shift paradigm to extract and track planar modes to achieve drift-free rotation, and they estimated the translation using three simple 1D density alignments in man-made environments. In the work of Kim et al. [27], lines and planes were exploited to estimate drift-free rotation, and the translation was recovered by minimizing the de-rotated reprojection error. Kim et al. [28] also proposed a linear SLAM method based on the Bayesian filtering framework for MW scenes. These methods have produced good SLAM performance results in MW scenes, but if the MW assumption is invalid, MW-based methods fail to estimate the pose or reconstruct the map.

Proposed Method
We propose a point-plane-based RGB-D SLAM system that exploits point and plane features to estimate the camera pose and generate the 3D global map for indoor environments. Our proposed system has two main parts: (1) we detect and track the point and plane features with respect to the local map for each new captured frame, and we estimate the current pose by solving the cost function that is constituted by the tracked features (tracking part); and (2) we update the local map that consists of point-plane landmarks and keyframes for each new inserted keyframe, and we process the full bundle adjustment to obtain the global map if a loop is detected (map management part). If the current environment satisfies the MW assumption, we add the MW direction to constrain the normal of plane landmarks for more accurate pose estimation. The overview of our proposed system is shown in Figure 1.

Preliminaries
In this section, we first introduce the representations for the point and plane features that are extracted from the color and depth images, respectively. Then, we detail the state transformation and distance measurement, which are essential for constructing the cost functions and solving the nonlinear graph optimization problem for pose estimation.

Point and Plane Representation
We extract ORB features for point tracking, as these features are computed extremely quickly, and they present good invariance to the camera's auto-gain, auto-exposure, and changes in illumination. The point feature's 2D pixel coordinate in the color image domain is defined as u c = (u c , v c ) T , where c represents the current processing frame, and u and v represent the coordinate values on the x-axis and y-axis, respectively. For the aligned color and depth images, the point feature's 2D coordinates are the same in the depth image and color image, so the corresponding value of u c in the depth image is represented by d(u c ). The point feature's 3D position X c is reconstructed by using the inverse projection function P −1 (·): where f x and f y are the focal lengths on the x-axis and y-axis, and (c x , c y ) T is the camera's central coordinate.
We detect the planes from the depth images using a fast plane extraction algorithm [29], which has three steps: generate initial blocks, merge the blocks on the basis of agglomerative clustering, and refine the border pixels. With this approach, we can obtain the plane: n T · X p = d, where n = (n x , n y , n z ) T represents the unit normal vector of the plane, X p represents the 3D point lying on this plane, and d is the distance to the origin of the camera coordinate system. The plane can also be represented by a homogeneous vector, π π π = (π x , π y , π z , π w ) T : where Q(·) is the normalized transfer function for a 4-dimensional vector, S 3 represents the unit sphere, which can be identified with a set of unit quaternions, so the operations on the quaternions are suited to the plane's homogeneous representation [21].

State Transformation
The 3D point X c in the current frame is transformed to X w = R w,c · X c + t w,c in the global coordinate, where R w,c ∈ SO(3) is the rotation matrix and t w,c ∈ R 3 is the translation vector. When we use the rigid-body transformation matrix that defined as T w,c = R w,c t w,c 0 1×3 1 , the transformation relationship is expressed as The plane π π π c in the current coordinate is transformed to the global coordinate by π π π w = Q( In terms of the plane's normal-distance representation, its state transformation is represented by

Distance Measurement
In the graph-based pose estimation problem, the error function is constructed from an edge that connects multiple nodes [30]. As the binary edge only connects two nodes, the error function measures the distance between these two nodes. For two 3D points X 1 and X 2 , we use the 2-norm function · 2 to define their relative distance: e X = X 1 − X 2 2 .

Pose Estimation with Point and Plane Features
We match the points and planes detected in the current frame with the point and plane landmarks in the local map, and we utilize the tracked point and plane features to construct a cost function for estimating the current pose. By combining the point and plane constraints, we can accurately estimate the pose in scenes, even those with less texture.

Point and Plane Feature Tracking
As mentioned in Section 3.1.1, we extract the ORB point features and plane features from the current frame F c . For the point features, we get the initial point matches between the current frame and the last reference keyframe by using the ORB descriptors. Then, we project the corresponding map points onto the current frame and discard some outlier matches on the basis of the projection error. The set of optimized point matches is defined as X = {(X c i , X L i ), i = 1, 2, ...m}, where X c represents the point feature's 3D position in the current frame coordinate, and X L represents the 3D position of the point landmark in the local map.
Since there are no plane descriptors to perform plane matching, we search for plane matches by the motion-model-based distance constraint. If the previous two frames F c−2 and F c−1 were tracked successfully, we use the constant velocity motion model [10] to predict the current pose: and we obtain the plane matches when they meet: where (n L , d L ) is the normal-distance representation of the plane landmark in the local map. The set of plane matches is defined as P = {(π π π c i , π π π L i ), i = 1, 2, ...n}, where π π π c represents the plane's homogeneous representation extracted from the current frame, and π π π L represents the plane landmark in the local map.
To avoid incorrect plane matches in cluttered environments, we only select extracted planes that have enough points (more than 5000) lying on them to match with the plane landmarks in the local map. For parallel planes that satisfy the previous condition, we select the plane with the largest number of on-plane points. In this way, the number of incorrect matches can be effectively reduced.

Robust Pose Estimation
We jointly utilize the tracked points and planes to construct a cost function for the current pose estimation. Pose T w,c can be computed by solving where ρ(·) is the robust Huber cost function, λ j represents the number of pixels in the jth tracked plane, and 3D point X c . The cost function (6) contains two parts that correspond to point and plane constraints. The accurate pose can be solved by minimizing Equation (6), even in a texture-less environment in which few points tracked. The point constraints ensure that the pose estimation is reliable even in scenes in which there are not enough planes to be visible.
Keyframe Selection: By the previous process, we always know the number of tracked points and planes for each frame. If there is only one tracked plane but the number of detected planes is larger than 2 or the number of the tracked points is less than a threshold, this frame is selected as a keyframe. By inserting the keyframes and updating the local map, the drift error of the pose estimation can be markedly reduced.

Map Management and Loop Detection
In this section, we describe the operation for the local map when a new keyframe is inserted, and we detect the closing loop on the basis of both the appearance and orientation constraints. If loop detection is successful, the full bundle adjustment is performed to generate the final global map.
Similar to the co-visibility graph and essential graph in ORB-SLAM [10], we denote the set of co-visible keyframes by K L ; all points seen in K L are represented by S L1 , and all planes seen in K L are represented by S L2 . All other keyframes K F 1 that are not in K L , as well as the observation points in S L1 , contribute to the cost function but remain fixed in the optimization. All other keyframes K F 2 that are not in K L , as well as the observation planes in S L2 , contribute to the cost function and also remain fixed in the optimization. We define the set of point matches as X L m between the points in S L1 and keypoints in keyframe m, and we define the set of plane matches as P L n between the planes in S L2 and the keyplanes in keyframe n. The local map is updated by solving {X L i , π π π L j , R k w,l , t k w,l |i ∈ S L1 , j ∈ S L2 , l ∈ K L } = arg min where λ q represents the number of pixels contained in the plane.

Local Map Update
The local map contains three elements: the keyframe, point landmark, and plane landmark. These elements are represented by nodes in a factor graph, which is shown in Figure 2. There are two kinds of binary edges in this point-plane-based factor graph: one connects the keyframe node and point landmark node, and the other one connects the keyframe node and plane landmark node. When a new keyframe is inserted, the poses of all elements in the local map are optimized by the local bundle adjustment (BA).

Loop Detection based on Appearance and Orientation Constraints
In the point-based SLAM system, loop detection is performed by using a bag-of-words (BoW) place recognition module with DBoW [31]. This visual vocabulary is an appearance constraint for loop detection, and it is created by pretraining with a large set of pictures. Since places that appear similar in indoor environments are familiar, as shown in Figure 3, we add an orientation constraint to complement the appearance constraint in determining the loop keyframes. In our proposed system, if the similarity score (using the visual vocabulary) between a new inserted keyframe K i and an existing keyframe K L p is higher than a threshold, we select this keyframe as the potential loop keyframe, and we compute their orientation distance in degrees by Equation (8). We confirm loop detection if the potential loop keyframe meets d O p,i < 90deg. By adding this orientation constraint, the mismatch (based on the appearance constraint) of these two loop images is revised.
where tr(·) denotes the trace of a matrix, and R w,i and R w,p represent the rotation matrix component for keyframe K i and K L p , respectively.

Global Map Generation
If a loop is detected in the previous step, all poses of the elements are optimized by the full bundle adjustment, and the first keyframe is fixed in the process of full BA. We represent the set of all keyframes by K G : all point landmarks are represented by S G1 , and all plane landmarks are represented by S G2 . We define the set of point matches as X G m between points in S G1 and keypoints in keyframe m, and we define the set of plane matches as P G n between planes in S G2 and planes in keyframe n. The global map is optimized by solving where λ q represents the number of pixels in the plane. After the full BA, the global map with point and plane landmarks is generated, which can be applied to robotic localization, navigation, and path planning.

Pose and Plane Optimization with the MW Constraint
For the environment that satisfies the MW assumption, we exploit the parallel and orthogonal lines and planes to extract the MW axes, and we add the MW constraint to construct the cost function to optimize the poses of keyframes and landmarks.

MW Axes Extraction
We extract the MW axes from the first frame by utilizing the plane normal vectors and the parallel lines' vanishing directions (VDs), the details of which are given in our previous work [32]. To extract the accurate plane normals, we use the normal vectors obtained by the previous fast plane extraction method as the initial value and then perform the mean shift algorithm in the tangent plane of the unit sphere to get the final plane normal vectors, as shown in Figure 4. As the normals of parallel planes are regularly distributed and more likely to be around the ground MW axes on the unit sphere, the final extracted results are obtained by utilizing all normals of parallel planes, which are more accurate than the initial plane normals.  The geometric relationship between the VDs and parallel lines is shown in Figure 5. To extract accurate VDs (d v k , k = 1, 2, 3), we use the simplified Expectation-Maximization (EM) clustering method to group image lines and compute their corresponding 3D direction vectors. We use the linear-time Line Segment Detector (LSD) [33] to extract 2D line segments from the color image and roughly cluster the lines using the K-means method. Then, we compute the VDs by solving the weighted objective function for each d v k : where l (k) represents the ith line cluster obtained by the K-means method, length(l (k) i ) represents the length of the ith line, max(length(l (k) )) represents the maximum line length in cluster k, and K represents the internal camera parameters.

Unit sphere
Center of projection

MW Axes Seeking:
In the scenes that satisfy the MW assumption, there are three fixed axes (r g 1 r g 2 r g 3 ). It should be noted that we treat r g and −r g as the same direction. To determine the MW axes, we first get a redundant set by using the plane normals and VDs obtained using the previous method. Then, we seek the plane that contains the most pixels and set its plane normal as the first MW axis r 1 . The other two MW axes r 2 and r 3 are sought on the basis of two principles: the number of pixels belonging to the plane or line and the orthogonal constraint. The larger the number of pixels, the higher the priority of the plane normal or VD. The final global MW axes are obtained by using singular value decomposition (SVD): where [U, D, V] = SVD([λ 1 r 1 λ 2 r 2 λ 3 r 3 ]), and factor λ i represents the number of pixels belonging to a plane or line.

Optimization with Fixed Plane Normal
In the previous section, we present the extraction of the global MW axes (r g 1 r g 2 r g 3 ), which are used to fix the normals of plane landmarks during optimization. We add the MW constraint to construct the cost functions (6), (7) and (9), in which the plane landmarks are represented by For MW environments, this MW constraint can effectively improve the accuracy of the SLAM system because adding the MW axes to the cost function is equivalent to setting three global directions in the optimization for poses and landmarks, so the drift can be reduced. If the MW axes are not detected in the current environment, we simply use the point-plane-based SLAM without the MW constraint to perform the localization and mapping tasks.

Results
We evaluated our proposed approach on a synthetic dataset (ICL-NUIM [34]) and a real-world dataset (TUM RGB-D [35]). All experiments were run on a desktop computer with an Intel Core i7, 16 GB memory, and Ubuntu 16.04 platform. Our proposed system was built on ORB-SLAM2 [10], and our system is executed in the same manner as ORB-SLAM2.

•
The ICL-NUIMdataset comprises images from a hand-held RGB-D camera in synthetically generated environments. These sequences were captured in a living room and an office with perfect ground-truth poses to fully quantify the accuracy of a given visual odometry or SLAM system. Depth and RGB noise models were used to alter the ground images to simulate realistic sensor noise. Some image sequences are in low-texture environments, which makes it difficult to estimate the poses of the whole images in these sequences.

•
The TUM RGB-D dataset is a famous benchmark that is used to evaluate the accuracy of a given visual odometry or visual SLAM system. It contains various indoor sequences captured by a Kinect RGB-D sensor. The sequences were recorded in real environments at a frame rate of 30 Hz with a 640 × 480 resolution, and their ground-truth trajectories were obtained from a high-accuracy motion-capture system. The TUM dataset is more challenging than the ICL dataset because includes some blurred images and inaccurate alignment image pairs that make it difficult to estimate the camera poses.
We compared our proposed approach with five methods: ORB-SLAM2 [10], DVO [24], InfiniTAM [25], LPVO [27], and L-SLAM [28]. ORB-SLAM2 is a state-of-the-art point-based SLAM system; DVO estimates the robust poses with photometric and depth error by using the color and depth images together; InfiniTAM estimates the camera poses from the RGB and depth images with a GPU in real time; LPVO exploits the line and plane to estimate the zero-drift rotation and then estimates the 3D poses with tracked points in the MW scenes; L-SLAM estimates the camera position and plane landmarks with a linear SLAM formulation in the MW environments. We use the root-mean-square error (RMSE) of the absolute translational error (ATE) as the performance metric for the entire sequences: where R g,p and t g,p represent the rotation matrix and translational matrix that transform the trajectory coordinate obtained by our proposed method to the ground-truth coordinate; three-dimensional points X p i and X g i denote the traces of the proposed method and ground truth, respectively; and N represents the number of frames in the tested sequence.

Evaluation on Synthetic Dataset
We first evaluated our proposed method on the ICL-NUIM dataset. The estimated trajectories and point-plane landmarks are shown in Figure 6, and the measured RMSE values of the ATE for each sequence are shown in Table 1. The smallest values are bolded and indicate the most accurate result for the pose estimation. For example, in 'Living Room 0', the ATE.RMSE value of our proposed method is 0.006 m, while those of ORB-SLAM2, DVO, LPVO, and L-SLAM are 0.010, 0.108, 0.015, and 0.012 m, respectively. The 'Living Room 1' sequence includes images that are mostly composed of a texture-less wall, so the accuracy of the point-based ORB-SLAM2 method is poor. As the DVO method does not have an efficient loop-closing process, the drift error cannot be avoided, and its ATE.RMSE is large.
InfiniTAM fails to estimate the whole frames' poses in three sequences ("Living Room 0", "Living Room 3", and "Office Room 2") because there are some frames with only one visible plane in the depth image and low texture in the color image. We marked the result as '×'. Although the LPVO method can provide drift-free rotation, it estimates the 3D pose using only the tracked points; thus, if there are not enough points, the accuracy decreases. L-SLAM is a linear SLAM method that uses the MW constraint and does not need to estimate the 3 degrees of freedom rotation. L-SLAM performs well with the MW scenes, and our method is comparable to it. The last column in Table 1 shows the number of frames in the current sequence. The MW assumption is sufficiently suitable for the ICL-NUIM benchmark. To clearly show the effect of the MW constraint, we measured the ATE.RMSE for all sequences obtained by our method without the MW constraint; this corresponds to the 'No MW' column in Table 1. We recorded the values of the absolute translational error (ARE) for each frame in the 'Living Room 0' sequence, and the ATE values with and without the MW constraint are shown in Figure 7: in this figure, the smaller the ATE value, the more accurate the pose estimation. This demonstrates that the MW constraint can improve the accuracy of pose estimation for MW environments.

Evaluation on Real-World Data
We then evaluated our proposed method on the TUM RGB-D dataset. The trajectories generated by the poses of the whole captured frames and the point-plane landmarks in the map are shown in Figure 8. We compared the performance of our proposed algorithm with that of the other five methods on six real-world TUM RGB-D sequences that contain structural regularities. The comparison results are shown in Table 2. We provide the ATE.RMSE for 3D pose estimation, and the smallest values are indicated in bold. Our proposed method performs better in low-texture environments because it uses the point and plane features to estimate poses. In 'fr3_cabinet', the ORB-SLAM2 method failed to estimate the poses for the entire sequence because there are not enough reliable tracked points for some frames; we marked the result as '×' in Table 2. The last column in Table 2 also represents the number of frames in the current TUM RGB-D sequence.
The performance results of the MW constraint on the 'fr3_struc_notex_far' sequence is shown in Figure 9. The final translational drift obtained by our proposed method with and without the MW constraint is 0.031 and 0.072, respectively. It is clear that the MW constraint can effectively reduce the drift error for the MW scenes. (d) 'fr3_large_cabinet'. For each sequence, four images from left to right represent, respectively, one color image in the current sequence, the point landmarks (black dots) obtained by our proposed method, the plane landmarks obtained by our proposed method, and the trajectory comparison between the ground truth and our proposed method. The estimated keyframe trace (blue boxes) and connection graph between them (green lines) were added to the middle images that show point and plane landmarks.

Conclusions
We proposed a point-plane-based method to estimate robot poses and reconstruct the maps of scenes of indoor environments using an RGB-D camera. We exploited point and plane features to generate reliable constraints, which we applied to the constructed cost function for solving the transformation matrix, and we used minimal representation for planes in the nonlinear optimization process. We developed a vanishing direction extraction method based on parallel lines and combined it with the detected plane normals to seek the MW axes in the current environment. Then, we added the MW constraint to further improve accuracy for MW environments. The proposed algorithm was tested on both synthetic and real-world publicly available RGB-D datasets, and we compared the pose estimation performance of our method with that of five existing methods. The results demonstrate the accuracy and robustness of the proposed method. Our approach can be used for a robot's tasks in indoor environments. In future work, we will extend our approach to the point-line-plane feature fusion SLAM system, which may provide robust pose estimation in more general environments and generate structural maps.