Robust Visual Odometry Leveraging Mixture of Manhattan Frames in Indoor Environments

We propose a robust RGB-Depth (RGB-D) Visual Odometry (VO) system to improve localization performance in indoor scenes by using geometric features, including point and line features. Previous VO/Simultaneous Localization and Mapping (SLAM) algorithms estimate low-drift camera poses under the Manhattan World (MW)/Atlanta World (AW) assumption, which limits the applications of such systems. In this paper, we divide indoor environments into two kinds of scenes: MW and non-MW scenes. The Manhattan scenes are modeled as a Mixture of Manhattan Frames, in which each Manhattan Frame defines a Manhattan World of a specific orientation. Moreover, we provide a method to detect Manhattan Frames (MFs) using the dominant directions extracted from parallel lines. Our approach has lower computational complexity than existing techniques that use planes to detect MFs. For MW scenes, we estimate rotational and translational motion separately. A novel method is proposed to estimate the drift-free rotation using MF observations, unit direction vectors of lines, and surface normal vectors; the translation is then recovered from point-line tracking. In non-MW scenes, the tracked and matched dominant directions are combined with the point and line features to estimate the full 6-degree-of-freedom (DoF) camera pose. Additionally, we exploit the rotation constraints generated from multi-view dominant direction observations. These constraints are combined with the reprojection errors of points and lines to refine the camera pose through local map bundle adjustment. Evaluations on both synthesized and real-world datasets demonstrate that our approach outperforms state-of-the-art methods. On synthesized datasets, the average localization accuracy is 1.5 cm, which is on par with state-of-the-art methods. On real-world datasets, the average localization accuracy is 1.7 cm, outperforming the state-of-the-art methods by 43%, while our time consumption is reduced by 36%.


Introduction
Visual simultaneous localization and mapping (Visual SLAM) and Visual Odometry (VO) estimate the 6 DoF camera pose from a sequence of camera images. They have various applications, such as autonomous robots and virtual and augmented reality (VR/AR).
Indoor environments contain low-texture surfaces such as the floor, walls, and ceiling, which leads to performance degradation for pure point-based methods [1]. Robust pose estimation performance can be improved by adding geometric structural features present in indoor scenes, such as lines and planes, to the systems [2][3][4][5][6][7]. These works extend the working scenarios to low-textured environments.
A technique to leverage the structural regularity of indoor scenes is based on the MW/AW assumption, which can reduce rotational drift; it has been employed by [8][9][10][11][12][13]. These systems decouple the rotational and translational motion estimation and recover drift-free rotation from the structural regularities of man-made environments, which reduces the accumulated rotational drift. The main contributions of this work are as follows:
• A robust and general RGB-D VO framework for indoor environments. It is more suitable for real-world scenes because it can choose different tracking methods (decoupled and non-decoupled pose estimation) for different scenes.
• A novel drift-free rotation estimation approach. We detect the dominant directions of every frame by clustering parallel lines; these dominant directions are tracked to detect MFs, and a mean-shift algorithm then yields the rotation estimate.
• An accurate and efficient local map bundle adjustment strategy that combines point and line reprojection errors with rotation constraints from multi-view dominant direction observations.
We compare the proposed method with other works in the literature, as shown in Table 1. All works are open source. To verify the effectiveness of the proposed method, we evaluate it on synthetic and real-world RGB-D benchmark datasets.

System Overview
In this work, we use {R_kw, t_kw} to represent the camera pose of the kth frame, where R_kw ∈ SO(3) and t_kw ∈ R^3 denote the rotation and translation from the world frame to the camera frame, respectively. We also use a set of unit vectors {d^w_i} to represent the dominant directions in the global map; these vectors constitute all MFs saved in the Manhattan map G, and each MF contains three mutually orthogonal dominant directions. These concepts are visualized in Figure 2. In addition, we use {d^{c_k}_i} to represent the dominant directions in the kth frame. The rotation matrix R_{c_k m_j} ∈ SO(3) represents the orientation from the jth MF to the kth camera frame.

With the RGB-D camera as the sensor input, the proposed system is built on top of the tracking and local mapping components of Oriented FAST and Rotated BRIEF SLAM2 (ORB-SLAM2) [22]. The overall framework is shown in Figure 3. We then describe each module of the proposed VO system.
The tracking thread estimates the pose of each frame and selects appropriate keyframes as input to the local mapping thread. In the tracking thread, for each frame, we extract point and line features from the RGB image and surface normals from the depth image; these steps are performed in parallel. Then, we extract the dominant directions from parallel lines to estimate the MFs in the current frame. The points, lines, and dominant directions are tracked and matched to estimate the camera pose. We divide the scenes into MW scenes and non-MW scenes. For MW scenes, we use a decoupled method to estimate the rotational and translational motion. For non-MW scenes, we combine point and line features with the dominant direction observations to estimate the full 6 DoF camera pose. Based on the initial pose estimation, the camera motion is refined with the matched landmarks from the local map. Finally, we decide whether a new keyframe should be inserted, taking both point and line features into account. Instead of a fixed threshold, a ratio-based method is used to create new keyframes [20].
Map points, map lines, dominant directions, a set of keyframes, a covisibility graph, and a spanning tree jointly make up the stored map. The covisibility graph links any two keyframes observing common landmarks. Whenever a keyframe is inserted, the local mapping thread processes the new keyframe and updates the covisibility graph with the number of covisible landmarks. Map point and map line culling are performed to improve tracking performance by retaining only high-quality map points and map lines. Furthermore, we merge nearby dominant directions to maintain the orientation difference between any two directions. A local map bundle adjustment procedure is then performed to estimate the keyframe poses, together with the map points, map lines, and dominant directions observed by these keyframes. Finally, keyframe culling removes redundant keyframes: a keyframe is removed when more than 90% of its map points can be observed by other keyframes (usually at least 3).

Feature Detection and Matching
In this paper, we use ORB features [23], which are robust to rotation, scale, and illumination changes and can be extracted and matched quickly. Lines are extracted by the Line Segment Detector (LSD) [24] and represented by the Line Band Descriptor (LBD) [25]. The unit surface normal vectors are extracted from the depth image [9]. These procedures are conducted in parallel.
After extracting 2D features in frame F_k, we use p_i = (u_i, v_i) to represent a 2D point feature and l_j = (s_j, e_j) to represent a line segment in image coordinates, where s_j and e_j denote the start point and end point of the segment. Once the 2D features have been detected and described, their 3D positions in camera coordinates are obtained from the camera intrinsic parameters and the depth image. The 3D points and lines are denoted as P^c_i and L^c_j = (P^c_{j,start}, P^c_{j,end}), respectively. Point features are matched with the same strategy as ORB-SLAM2; line features are matched between consecutive frames using both the LBD descriptor and geometric constraints.
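To make the lifting step concrete, the pixel-to-3D back-projection can be sketched as follows. This is an illustrative sketch assuming a standard pinhole model; the function name and intrinsic values are placeholders, not the authors' code.

```python
import numpy as np

def back_project(u, v, depth, fx, fy, cx, cy):
    """Lift a pixel (u, v) with measured depth to a 3D point in camera coordinates."""
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return np.array([x, y, z])

# A 3D line endpoint is obtained by lifting the 2D segment endpoint with its depth:
P_start = back_project(420.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```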

Dominant Direction
After obtaining the 3D positions of lines, we cluster the 3D line direction vectors into parallel line clusters and extract a dominant direction from each cluster. The dominant directions are then tracked and matched to detect the MFs and estimate the camera pose. For every parallel line cluster, we determine its dominant direction by solving a least squares problem:

d* = argmax_{||d||=1} ||S^T d||^2,

where S = [s_1, ..., s_n] ∈ R^{3×n}, n is the number of lines in the cluster, and each column s_i is the unit direction vector of a line in the cluster; the solution is the singular vector of S associated with the largest singular value. This yields the initial set of dominant directions d^{c_k}_i for the current frame. To match them against the map, we select those pairs (d^{c_k}_i, d^w_j) whose angular difference is below a given threshold (3° in this work) as candidate matches, and among the candidates we choose the pair whose absolute cosine is closest to 1 as the correct match. After local map BA, the angular difference between two dominant directions in the global map may fall below the threshold; in that case, we merge the two dominant directions by an iterative procedure to maintain the orientation difference between any two directions.
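The per-cluster least-squares step can be illustrated as follows. This is a sketch assuming the standard SVD solution (largest singular vector, which is robust to inconsistent line direction signs), not the authors' implementation.

```python
import numpy as np

def dominant_direction(line_dirs):
    """Estimate the common direction of a cluster of roughly parallel 3D lines.

    line_dirs: (n, 3) array of unit direction vectors (signs may be inconsistent).
    Returns the unit vector maximizing ||S^T d||, i.e. the left singular vector
    of S = line_dirs^T associated with the largest singular value.
    """
    S = np.asarray(line_dirs, dtype=float).T   # 3 x n, columns are line directions
    U, _, _ = np.linalg.svd(S)
    d = U[:, 0]
    return d / np.linalg.norm(d)
```

Because the objective depends on (s_i^T d)^2, flipping the sign of any line direction does not change the estimate, which matters since extracted line segments have no canonical orientation.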

Manhattan Frame Detection
An MF M_i in the kth frame is represented by three mutually orthogonal dominant directions in R^3. To detect an MF M_i in F_k, we compute the angular difference between pairs of dominant directions in d^{c_k}_i. Two dominant directions are considered orthogonal if the angular difference meets the orthogonality threshold (at least 87° in this work). Any three mutually orthogonal dominant directions constitute an MF. If only two perpendicular dominant directions are found, the third direction is obtained as the cross product of the two, and this newly created virtual direction is added to the current frame's dominant direction set d^{c_k}_i. The rotation matrix from this MF M_i to the current frame is denoted R_{c_k m_i}. Like the method in [19], we save the MFs in the scene to a Manhattan map G. Through the Manhattan map G, we can obtain the full and partial MF observations and the corresponding frames that first observed each MF.
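The MF search described above (pairwise orthogonality test plus cross-product completion) can be sketched as follows; the function structure is an illustrative assumption, not the authors' code.

```python
import numpy as np

ORTHO_THRESH_DEG = 87.0  # minimum pairwise angle, per the paper

def detect_manhattan_frame(dirs):
    """Search a frame's dominant directions for a Manhattan Frame.

    dirs: list of 3D unit vectors. Returns three mutually orthogonal unit
    vectors (completing a pair with a cross product if needed), or None.
    """
    def angle_deg(a, b):
        # Angle in [0, 90] degrees; abs() makes direction signs irrelevant.
        return np.degrees(np.arccos(np.clip(abs(np.dot(a, b)), 0.0, 1.0)))

    n = len(dirs)
    # Prefer a full triple of mutually orthogonal directions.
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if all(angle_deg(x, y) >= ORTHO_THRESH_DEG
                       for x, y in [(dirs[i], dirs[j]),
                                    (dirs[i], dirs[k]),
                                    (dirs[j], dirs[k])]):
                    return [dirs[i], dirs[j], dirs[k]]
    # Fall back to a pair plus the cross product as a virtual third direction.
    for i in range(n):
        for j in range(i + 1, n):
            if angle_deg(dirs[i], dirs[j]) >= ORTHO_THRESH_DEG:
                third = np.cross(dirs[i], dirs[j])
                return [dirs[i], dirs[j], third / np.linalg.norm(third)]
    return None
```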

Pose Estimation
Two different strategies are used to estimate the camera pose T cw = {R cw , t cw } from world coordinates W to camera coordinates C, depending on whether the scenes conform to the MW assumption. For non-MW scenes, we directly estimate the 6 DoF camera pose with a feature tracking method. In MW scenes, we decouple the camera pose to separately estimate the rotational and translational motion.

Non-MW Scenes
In non-MW scenes, the tracked dominant directions are combined with point-line tracking to estimate the camera motion. The dominant directions only provide orientation constraints, independent of translation. The full camera pose is then estimated by minimizing the following cost function:

{R*_cw, t*_cw} = argmin Σ_{i∈P} ρ(||e^p_i||²) + Σ_{j∈L} ρ(||e^l_j||²) + Σ_{k∈D} ρ(||e^d_k||²),

where P, L, and D are the sets of all point, line, and dominant direction matches, respectively, and ρ denotes the robust Huber cost function. The point reprojection error between an observed 2D feature and its matched 3D feature is defined as

e^p_i = p_i − π(R_cw P^w_i + t_cw),

where P^w_i ∈ R^3 is the 3D map point in world coordinates corresponding to the 2D point feature p_i ∈ R^2 in the image plane. The projection function π transforms a 3D point P^c = (X, Y, Z)^T in camera coordinates into the image plane:

π(P^c) = (f_x X/Z + c_x, f_y Y/Z + c_y)^T,

where the focal lengths f_x, f_y and principal point (c_x, c_y) are camera intrinsic parameters. The line reprojection error is formulated as the point-to-line distance between the observed 2D line segment l_j and the projections of the 3D endpoints P^w_{j,start} and P^w_{j,end} of the matched 3D line L^w_j:

e^l_j = ( l_j^T π̃(R_cw P^w_{j,start} + t_cw), l_j^T π̃(R_cw P^w_{j,end} + t_cw) )^T,

where l_j denotes the normalized homogeneous coefficients of the 2D line and π̃(·) the projection in homogeneous coordinates. The dominant direction observation error is defined by the 3D-3D correspondence:

e^d_k = d^{c_k}_k − R_cw d^w_k,

where d^w_k and d^{c_k}_k are the matched dominant directions in world and camera coordinates, respectively. These data associations are employed to optimize the current camera pose using the Levenberg-Marquardt (LM) algorithm implemented in g2o [26].
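For concreteness, the projection function π and the point reprojection error can be sketched in Python; this is an illustrative sketch (intrinsic values are placeholders), not the authors' implementation.

```python
import numpy as np

def project(P_c, fx, fy, cx, cy):
    """Pinhole projection pi: 3D point in camera coordinates -> image plane."""
    X, Y, Z = P_c
    return np.array([fx * X / Z + cx, fy * Y / Z + cy])

def point_reproj_error(p_obs, P_w, R_cw, t_cw, fx, fy, cx, cy):
    """2D residual between an observed pixel and the projected map point."""
    P_c = R_cw @ P_w + t_cw   # world -> camera
    return p_obs - project(P_c, fx, fy, cx, cy)
```

In the actual optimization, such residuals would be wrapped in a robust Huber cost and minimized jointly over R_cw and t_cw.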

MW Scenes
Compared to estimating the camera pose directly from frame-to-frame tracking, the pose estimation can be decoupled in MW scenes. To reduce the drift caused by frame-to-frame tracking, we leverage the structural constraints in scenes to estimate the drift-free rotation. The translation estimation is recovered from the feature tracking. The process is shown in Figure 4.

Figure 4. Overview of rotation estimation in MW scenes. The MF is detected from the dominant directions to obtain the initial rotation from the MF to the current frame, together with the frame that first observed this MF. A mean shift-based tracking strategy then refines the rotation, and the drift-free rotation is finally obtained using that frame as the reference frame. The green dashed arrow indicates the virtual dominant direction created by the cross-product between the two extracted dominant directions.
For the rotation estimation, the set of dominant directions is obtained using the method described in Section 2.3, and all MFs in the current frame are detected using the method described in Section 2.4. To check whether an MF M_i = {d_{i,1}, d_{i,2}, d_{i,3}} in the current frame is present in the Manhattan map G, we match the dominant directions in the current frame with the dominant directions in the global map using the method described in Section 2.3. If at least two of the three dominant directions that constitute M_i match dominant directions in the global map and M_i is present in G, we retrieve the corresponding frame F_j in which M_i was first observed. If F_k does not contain any previously observed MF, we use the feature-tracking method (Section 2.5.1) instead of the decoupled method to solve the camera pose.
We use the popular mean shift algorithm [8,9,14] for MF tracking to estimate the rotation matrix. Firstly, we calculate the initial relative rotation R^{init}_{c_k m_i} from MF M_i to the current frame F_k using the reference frame F_j and the last frame F_l:

R^{init}_{c_k m_i} = R_{c_l w} R^T_{c_j w} R_{c_j m_i},

where the last frame's pose serves as the prediction for the current frame. Secondly, we transform the unit direction vectors of lines and the surface normal vectors in the current frame to MF M_i using the transposed initial rotation matrix R^{init}_{m_i c_k}. We project these vectors onto the tangent planes of the unit sphere to compute a mean shift, and the mean shift result is transformed back from the tangent plane to the unit sphere. Finally, we obtain the updated rotation matrix R_{c_k m_i} = [r_1 r_2 r_3]. To make R_{c_k m_i} satisfy the orthogonality constraint, we project it onto the SO(3) manifold using singular value decomposition (SVD):

R_{c_k m_i} ← U V^T, where U Σ V^T = SVD([r_1 r_2 r_3]).

Then, we obtain the rotation matrix R_{c_k w} from world coordinates to the current camera frame F_k through the reference frame F_j:

R_{c_k w} = R_{c_k m_i} R^T_{c_j m_i} R_{c_j w}.

More details on the sphere mean-shift method can be found in [8,9,14].
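The SVD-based projection onto the SO(3) manifold is a standard construction and can be sketched as follows (illustrative, not the authors' code):

```python
import numpy as np

def project_to_so3(M):
    """Project a near-rotation 3x3 matrix onto SO(3) via SVD.

    Replacing the singular values of M = U diag(s) V^T with ones gives the
    closest rotation matrix in the Frobenius sense; the determinant
    correction guards against reflections (det = -1).
    """
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt
```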
Once we obtain the drift-free rotation estimation, the 3 DoF translation is recovered using the point-line reprojection errors. Note that we do not use the dominant direction observation errors in this step, since they only provide rotational constraints. With the rotation fixed, we simplify the original non-linear optimization problem into a linear one:

t*_cw = argmin Σ_{i∈P} ρ(||e^p_i||²) + Σ_{j∈L} ρ(||e^l_j||²),

where e^p_i and e^l_j are the rotation-assisted point and line errors, which are linear in t_cw:

e^p_i = ( f_x [R_cw P^w_i + t_cw]^(1) − (u_i − c_x)[R_cw P^w_i + t_cw]^(3),
          f_y [R_cw P^w_i + t_cw]^(2) − (v_i − c_y)[R_cw P^w_i + t_cw]^(3) )^T,

e^l_j = ( l_j^T K (R_cw P^w_{j,start} + t_cw), l_j^T K (R_cw P^w_{j,end} + t_cw) )^T,

where we refer to [·]^(k) as the kth row of a vector, P^w_{j,x}, x = {start, end}, represents the endpoints of the 3D line L^w_j, and K is the intrinsic matrix. We solve this problem using the LM algorithm. After estimating the camera pose, we project the points, lines, and dominant directions in the local map to the current frame to obtain more correspondences, and the current camera pose is optimized again with the resulting matches.
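Assuming the rotation-assisted point error takes the standard linear-in-translation form, recovering the translation with a fixed rotation reduces to an ordinary least-squares problem. The sketch below uses synthetic data; all names and intrinsics are illustrative placeholders, not the authors' implementation.

```python
import numpy as np

def solve_translation(R_cw, pixels, points_w, fx, fy, cx, cy):
    """Recover t_cw linearly once R_cw is fixed (the decoupled MW case).

    Each pixel/3D-point pair (u, v) <-> P_w yields two equations linear in t:
        fx*t_x - (u - cx)*t_z = (u - cx)*q_z - fx*q_x
        fy*t_y - (v - cy)*t_z = (v - cy)*q_z - fy*q_y
    with q = R_cw @ P_w. Stack them and solve by least squares.
    """
    A, b = [], []
    for (u, v), P_w in zip(pixels, points_w):
        q = R_cw @ P_w
        A.append([fx, 0.0, -(u - cx)])
        b.append((u - cx) * q[2] - fx * q[0])
        A.append([0.0, fy, -(v - cy)])
        b.append((v - cy) * q[2] - fy * q[1])
    t, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return t
```

Multiplying the projection equation through by the point's depth is what removes the division by Z and makes the system linear in the translation.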

Local Map Bundle Adjustment
When a new keyframe K is inserted, a local map BA procedure refines the camera poses and landmarks in the local map. The set of variables to be optimized is Γ = {P^w_i, L^w_j, d^w_k, R_l, t_l | i ∈ P, j ∈ L, k ∈ D, l ∈ K_c}. K_c contains all keyframes to be optimized, including the newly inserted keyframe and all local keyframes connected to it in the covisibility graph. P, L, and D represent all the map points, map lines, and dominant directions observed by these keyframes, respectively. We also fix the keyframes that observe these points, lines, and dominant directions but do not belong to K_c, denoted by K_f. We minimize the following cost function to estimate Γ:

Γ* = argmin Σ_{l∈K_c∪K_f} ( Σ_{i∈P} ρ(||e^p_{i,l}||²) + Σ_{j∈L} ρ(||e^l_{j,l}||²) + Σ_{k∈D} ρ(||e^d_{k,l}||²) ),

where the point, line, and dominant direction error terms are defined as in Section 2.5.1.

Results
To evaluate the performance of the proposed method, we conduct experiments on synthesized and real-world sequences and compare it with other state-of-the-art approaches. All experiments were performed on an Intel Core i5-10400 CPU @ 2.90 GHz with 16 GB RAM, without GPU parallelization. We disable the bundle adjustment and loop closure modules of ORB-SLAM2 and SP-SLAM to make a fair comparison.
ORB-SLAM2 [22] is a feature-point-based RGB-D SLAM system, on which our method is built. MSC-VO is an RGB-D VO system using point, line, and MW constraints with a non-decoupled pose estimation method. ManhattanSLAM is an RGB-D SLAM system using point, line, plane, and MMF constraints with a decoupled pose estimation method. RGB-D SLAM is a SLAM system using point, line, plane, and MW constraints with a decoupled pose estimation method. SP-SLAM is an RGB-D SLAM system using point and plane constraints with a non-decoupled pose estimation method. This information is also summarized in Table 1.

ICL-NUIM Dataset
Imperial College London and National University of Ireland Maynooth (ICL-NUIM) [27] is a synthesized dataset containing two low-texture scenes with ground-truth trajectories: living room and office, as shown on the left side of Figure 1. The scenes are rendered based on a rigid Manhattan World model, and the dataset contains large structured areas and low-textured surfaces such as floors, walls, and ceilings. Table 2 shows the performance of our method based on the translation root mean square error (RMSE) of the absolute trajectory error (ATE). We compare the proposed method with the state-of-the-art systems, including MSC-VO, ManhattanSLAM, RGB-D SLAM, SP-SLAM, and ORB-SLAM2. The comparison of the RMSE is also shown in Figure 5. Figure 6 shows the percentage of MFs detected in each sequence of the ICL-NUIM dataset.

The best result for each sequence is shown in bold.

TUM RGB-D Dataset
Technical University of Munich (TUM) RGB-D Benchmark [28] is a popular dataset for evaluating RGB-D VO/SLAM systems. Unlike the ICL-NUIM dataset, it consists of real-world camera sequences covering different indoor scenes, such as cluttered scenes and scenes with varying structure and texture, as shown in Figure 7. It can therefore evaluate our system's robustness and accuracy in both MW and non-MW scenes.
Figure 7. Sequences in the TUM RGB-D dataset. The sorting is consistent with that in Table 3.

The best result for each sequence is shown in bold.
We selected 11 sequences from the TUM RGB-D dataset and divided them into three groups, distinguished by the amount of texture, structure, and planes, and by whether they strictly follow the MW assumption. Table 3 shows the differences between the sequences. Table 4 compares the translation RMSE (ATE) of our method with the other systems, including MSC-VO, ManhattanSLAM, RGB-D SLAM, SP-SLAM, and ORB-SLAM2. The local map for the fr3-longoffice sequence is shown in Figure 8. Relevant data are shown in Figures 9-11.

Time Consumption
The average running time of each operation of the proposed method and ManhattanSLAM can be found in Table 5. We obtained the average results by running on seven different sequences in the TUM RGB-D benchmark.


Drift
We evaluated our system on the Texas A&M University (TAMU) RGB-D dataset [29] to test the amount of accumulated drift and the robustness over time. Unlike the ICL-NUIM and TUM RGB-D datasets, the TAMU dataset does not provide ground-truth poses and contains long indoor sequences. Because the camera trajectory forms a loop, we can calculate the Trajectory Endpoint Drift (TED) [29], the Euclidean distance between the starting and end points of the trajectory, to represent the accumulated drift. The output trajectory is shown on the right side of Figure 12.
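TED itself is a one-line computation; a sketch follows, assuming the estimated trajectory is given as an (N, 3) array of camera positions (an illustrative helper, not part of the dataset's tooling):

```python
import numpy as np

def trajectory_endpoint_drift(positions):
    """Euclidean distance between the first and last estimated camera positions.

    positions: (N, 3) array of translations along a loop trajectory; for a
    drift-free estimate on a closed loop the value is approximately zero.
    """
    positions = np.asarray(positions, dtype=float)
    return float(np.linalg.norm(positions[-1] - positions[0]))
```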

ICL-NUIM Dataset
The results are shown in Figure 5 and Table 2. The ICL-NUIM dataset offers rich structural regularities (enough lines and planes) and largely satisfies the MW assumption, which benefits the MW-based approaches. ManhattanSLAM shows the best quantitative results on average, and our method is second best, with a difference of 0.001 m. MSC-VO combines the structural constraints and MA alignment with the point-line reprojection errors to optimize camera poses and shows the best quantitative results in four sequences. However, sequence lr-kt3 contains a viewpoint very close to the wall, which strongly affects MW detection and degrades the performance of MSC-VO. Our method and ManhattanSLAM are more robust, as they can switch tracking strategies and adaptively estimate the camera motion. Figure 6 shows the percentage of MFs detected in each sequence of the ICL-NUIM dataset, which contains large structured areas. Since ManhattanSLAM uses plane features, it detects MFs in 88% of all frames, compared with 42% for our method. However, plane-based detection also increases the time consumption by 23 ms, while the average accuracy differs by only 0.001 m (6.6%). Time consumption data are described in Section 3.3.

TUM RGB-D Dataset
The results are shown in Table 4. On the TUM RGB-D dataset, our method shows the best quantitative results. Only our method and ManhattanSLAM obtain results in all sequences.
As shown in Table 4, in the fr1 and fr2 sequences, the environments are cluttered and few or no MFs can be detected using planes, which causes RGB-D SLAM, which relies on a decoupled pose estimation method, to fail in tracking. ManhattanSLAM can robustly estimate a pose in these scenes by switching to a feature-tracking method, achieving results comparable to the feature-based ORB-SLAM2 and SP-SLAM. However, the scenes still contain some structural elements, such as lines, which allows our method to achieve higher accuracy by using the dominant directions extracted from parallel lines.
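As a rough illustration of how dominant directions might be obtained from parallel lines, the following sketch greedily clusters 3D line-direction vectors; each cluster of nearly parallel lines yields one dominant direction. The function name, the greedy clustering strategy, and the 5° threshold are our own assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def dominant_directions(line_dirs, angle_thresh_deg=5.0):
    """Greedily cluster unit line-direction vectors; each cluster of
    (nearly) parallel lines yields one dominant direction (its mean).

    A line direction is sign-ambiguous (d and -d describe the same line),
    so vectors are sign-normalized before clustering.
    """
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    dirs = [d if d[np.argmax(np.abs(d))] >= 0 else -d
            for d in np.asarray(line_dirs, dtype=float)]
    clusters = []  # each entry: list of member direction vectors
    for d in dirs:
        for members in clusters:
            mean = np.mean(members, axis=0)
            mean /= np.linalg.norm(mean)
            if abs(np.dot(d, mean)) >= cos_thresh:
                members.append(d)
                break
        else:
            clusters.append([d])
    means = [np.mean(m, axis=0) for m in clusters]
    return [m / np.linalg.norm(m) for m in means]
```

In an MW scene, three such dominant directions that are mutually orthogonal would then form a Manhattan Frame candidate.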
For the fr3 sequences, the scenes contain varying degrees of structure and texture. The proposed method obtains the highest performance in six of the seven sequences, the exception being cabinet. In four of the seven sequences, texture is sparse; the point-based ORB-SLAM2 cannot find enough corresponding points, resulting in tracking failure. As shown in Figure 8, after the camera completes a loop, the trajectory of our method does not drift significantly and achieves higher accuracy than the other methods.
Next, we further discuss why our method is more accurate than ManhattanSLAM on the TUM RGB-D dataset. Relevant data are shown in Figures 9-11.
The sequences of group 1 record a typical office scene, including desks, a computer monitor, a keyboard, a telephone, chairs, etc. The environments are cluttered, and few or no MFs can be detected using planes (less than 1%, as shown in Figure 9). Our method can still extract some structural elements, such as lines, so it achieves higher accuracy by using the dominant directions extracted from parallel lines.
The sequences of group 2 consist of multiple planes, and many MFs can be detected using planes, as shown in Figure 10. Nevertheless, our method achieves higher accuracy. Although ManhattanSLAM can extract enough MFs, the planes in the first four sequences do not strictly satisfy parallel or orthogonal relationships, and forcing the MW assumption introduces extra error. Our method filters out non-orthogonal lines by their directions, keeping the observations consistent with the assumption.
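The line filtering described above could be sketched as follows: a line is kept only if its direction is closely aligned with one of the three axes of a detected Manhattan Frame. The function name `filter_lines_by_mf`, the axis representation, and the angular threshold are hypothetical choices for illustration.

```python
import numpy as np

def filter_lines_by_mf(line_dirs, mf_axes, angle_thresh_deg=5.0):
    """Keep only line directions aligned with one of the three MF axes.

    Using |cos| handles the d / -d sign ambiguity of line directions.
    """
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    return [d for d in line_dirs
            if max(abs(np.dot(d, a)) for a in mf_axes) >= cos_thresh]
```

Lines rejected by this test (e.g., the edge of a tilted object at 45° to the walls) would otherwise violate the orthogonality that the MW assumption imposes.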
As shown in Figure 11, sequence fr3/l-cabinet contains some planes, but ManhattanSLAM does not extract enough MFs. Since the sequence contains rich texture and structure, our method can extract enough MFs and therefore achieves higher accuracy.

Time Consumption
Although the extraction of lines and surface normals is time-consuming, the proposed method uses multiple threads to reduce the overall system time consumption, requiring an average of only 24.39 ms for feature extraction. The local map BA procedure takes 183.34 ms on average, but it runs in a parallel thread. The whole tracking thread works at around 25 Hz. ManhattanSLAM takes, on average, 40 ms for superpixel extraction and surfel fusion and 67 ms for tracking; its tracking thread works at around 15 Hz.
The proposed method therefore works in real time. Compared with ManhattanSLAM, our time consumption decreases by 36%, while the accuracy is maintained at the same level or better.

Drift
We employ the Corridor-A and Entry-Hall sequences to evaluate the final trajectory drift. This dataset contains noisy depth data and low-texture floors and walls, as shown on the left side of Figure 12, which severely affect camera pose estimation. As shown in Table 6, ManhattanSLAM achieves the best estimation results by incorporating plane features into the tracking process. The improvements of our method over the whole trajectory lengths of Corridor-A and Entry-Hall are 74.4% and 65.8%, respectively, compared to ORB-SLAM2. Compared with MSC-VO, which also uses point and line features, the improvements of our method are 12.1% and 29.0%.
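A minimal sketch of the two quantities discussed here, assuming drift is measured as the end-point error of a loop trajectory and improvement as the relative drift reduction against a baseline; the function names and exact definitions are our assumptions, not necessarily the evaluation protocol used in Table 6.

```python
import numpy as np

def final_drift(positions):
    """End-point drift of a loop trajectory: distance between the first
    and last estimated camera positions (which coincide in ground truth)."""
    positions = np.asarray(positions, dtype=float)
    return float(np.linalg.norm(positions[-1] - positions[0]))

def improvement(ours, baseline):
    """Relative drift reduction of one estimate over a baseline, in percent."""
    return 100.0 * (baseline - ours) / baseline
```

For example, an estimated drift of 0.3 m against a baseline drift of 1.0 m corresponds to a 70% improvement under this definition.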

Conclusions
In this letter, we propose an accurate and efficient RGB-D Visual Odometry system that leverages the structural regularity of indoor environments and runs robustly in general indoor scenes. This is achieved by exploiting the dominant directions extracted from parallel lines to improve localization accuracy. On the one hand, the dominant directions are used for drift-free rotation estimation in MW scenes. On the other hand, they provide a rotation constraint that is combined with point and line reprojection errors to optimize the camera pose. As shown in our experiments, these contributions improve the accuracy of the estimated trajectory. Furthermore, our pipeline is designed to handle two different types of scenes, MW and non-MW, which allows our system to work in a wider range of environments.
The accuracy of line estimation affects the computation of the dominant directions. If the uncertainty of the 3D coordinates of a recovered line is too large, the computation and matching of the dominant direction are affected, and the corresponding MF cannot be matched. In the future, we would like to add a loop closure module and improve the dominant direction detection to further discard unstable observations. We will also try to implement the proposed method with a monocular camera and an IMU, which is beneficial for Manhattan Frame detection, and possibly extend it to outdoor environments.